Hail is an open-source Python library that simplifies genomic data analysis. It provides powerful, easy-to-use data science tools for interrogating even biobank-scale genomic data (e.g. UK Biobank, gnomAD, TOPMed, FinnGen, and Biobank Japan).
Modern data science is driven by numeric matrices (see NumPy) and tables (see R and pandas). While sufficient for many tasks, none of these tools adequately captures the structure of genetic data. Genetic data combines multiple axes (variants and samples), like matrices, with structured entries (genotypes), like tables or dataframes. To support genomic analysis, Hail introduces MatrixTable, a powerful, distributed data structure that combines features of matrices and dataframes.
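To make the shape of such a structure concrete, here is a minimal pure-Python sketch. It is a toy model for illustration only and is not Hail's API: the class `ToyMatrixTable`, its methods, and the field names (`locus`, `alleles`, `s`, `n_alt`, `dp`) are all assumptions made for this example. It shows two axes (variants as rows, samples as columns) with a structured record at each row/column intersection.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ToyMatrixTable:
    """Toy, in-memory illustration of the idea. NOT Hail's MatrixTable."""
    rows: List[Dict[str, Any]]           # one record per variant (row fields)
    cols: List[Dict[str, Any]]           # one record per sample (column fields)
    entries: List[List[Dict[str, Any]]]  # entries[i][j]: genotype record for variant i, sample j

    def count(self):
        """Return (number of variants, number of samples)."""
        return len(self.rows), len(self.cols)

    def filter_rows(self, predicate: Callable[[Dict[str, Any]], bool]) -> "ToyMatrixTable":
        """Keep only variants whose row record satisfies the predicate."""
        keep = [i for i, row in enumerate(self.rows) if predicate(row)]
        return ToyMatrixTable(
            rows=[self.rows[i] for i in keep],
            cols=self.cols,
            entries=[self.entries[i] for i in keep],
        )

    def alt_allele_frequency(self) -> List[Optional[float]]:
        """Per-variant alternate allele frequency over called diploid genotypes."""
        freqs: List[Optional[float]] = []
        for variant_entries in self.entries:
            called = [e["n_alt"] for e in variant_entries if e["n_alt"] is not None]
            # Each called diploid genotype contributes two alleles.
            freqs.append(sum(called) / (2 * len(called)) if called else None)
        return freqs


# Hypothetical data: two variants, three samples.
mt = ToyMatrixTable(
    rows=[
        {"locus": "1:1000", "alleles": ["A", "T"]},
        {"locus": "1:2000", "alleles": ["G", "C"]},
    ],
    cols=[{"s": "sample1"}, {"s": "sample2"}, {"s": "sample3"}],
    entries=[
        [{"n_alt": 0, "dp": 30}, {"n_alt": 1, "dp": 25}, {"n_alt": 2, "dp": 18}],
        [{"n_alt": 0, "dp": 22}, {"n_alt": None, "dp": 0}, {"n_alt": 1, "dp": 40}],
    ],
)

freqs = mt.alt_allele_frequency()
only_first = mt.filter_rows(lambda row: row["locus"] == "1:1000")
print(mt.count(), freqs, only_first.count())
```

Neither a plain numeric matrix nor a single flat table fits this data comfortably: the matrix has no place for per-variant and per-sample annotations or multi-field genotypes, and the flat table loses the two-axis layout. The sketch keeps everything in local memory, whereas the real MatrixTable is distributed.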
The Hail MatrixTable unifies a wide range of input formats (e.g. VCF, BGEN, PLINK, TSV, GTF, and BED files) and supports scalable queries, even on petabyte-size datasets. By leveraging MatrixTable, Hail provides an integrated, scalable platform for genomic analysis.