Genetics

This page tailored how-to guides for small but commonly-used patterns appearing in genetics pipelines. For documentation on the suite of genetics functions built into Hail, see the genetics methods page.

Formatting

Convert variants in string format to separate locus and allele fields

code:

>>> ht = ht.key_by(**hl.parse_variant(ht.variant))

dependencies:

parse_variant(), key_by()

understanding:

If your variants are strings of the format ‘chr:pos:ref:alt’, you may want to convert them to separate locus and allele fields. This is useful if you have imported a table with variants in string format and you would like to join this table with other Hail tables that are keyed by locus and alleles.

hl.parse_variant(ht.variant) constructs a StructExpression containing two nested fields for the locus and alleles. The ** syntax unpacks this struct so that the resulting table has two new fields, locus and alleles.

Liftover variants from one coordinate system to another

tags:

liftover

description:

Liftover a Table or MatrixTable from one reference genome to another.

code:

First, we need to set up the two reference genomes (source and destination):

>>> rg37 = hl.get_reference('GRCh37')  
>>> rg38 = hl.get_reference('GRCh38')  
>>> rg37.add_liftover('gs://hail-common/references/grch37_to_grch38.over.chain.gz', rg38)  

In order to use the above verbatim, you may need to install the GCS connector.

Then we can liftover the locus coordinates in a Table or MatrixTable (here, ht) from reference genome 'GRCh37' to 'GRCh38':

>>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38'))  
>>> ht = ht.filter(hl.is_defined(ht.new_locus))  
>>> ht = ht.key_by(locus=ht.new_locus)  

Note that this approach does not retain the old locus, nor does it verify that the allele has not changed strand. We can keep the old one for reference and filter out any liftover that changed strands using:

>>> ht = ht.annotate(new_locus=hl.liftover(ht.locus, 'GRCh38', include_strand=True),
...                  old_locus=ht.locus)  
>>> ht = ht.filter(hl.is_defined(ht.new_locus) & ~ht.new_locus.is_negative_strand)  
>>> ht = ht.key_by(locus=ht.new_locus.result)  

dependencies:

liftover(), add_liftover(), get_reference()

Filtering and Pruning

Filter loci by a list of locus intervals

From a table of intervals

tags:

genomic region, genomic range

description:

Import a text file of locus intervals as a table, then use this table to filter the loci in a matrix table.

code:

>>> interval_table = hl.import_locus_intervals('data/gene.interval_list', reference_genome='GRCh37')
>>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))

dependencies:

import_locus_intervals(), MatrixTable.filter_rows()

understanding:

We have a matrix table mt containing the loci we would like to filter, and a list of locus intervals stored in a file. We can import the intervals into a table with import_locus_intervals().

Hail supports implicit joins between locus intervals and loci, so we can filter our dataset to the rows defined in the join between the interval table and our matrix table.

interval_table[mt.locus] joins the matrix table with the table of intervals based on locus and interval<locus> matches. This is a StructExpression, which will be defined if the locus was found in any interval, or missing if the locus is outside all intervals.

To do our filtering, we can filter to the rows of our matrix table where the struct expression interval_table[mt.locus] is defined.

This method will also work to filter a table of loci, as well as a matrix table.

From a UCSC BED file

description:

Import a UCSC BED file as a table of intervals, then use this table to filter the loci in a matrix table.

code:

>>> interval_table = hl.import_bed('data/file1.bed', reference_genome='GRCh37')
>>> filtered_mt = mt.filter_rows(hl.is_defined(interval_table[mt.locus]))

dependencies:

import_bed(), MatrixTable.filter_rows()

Using `hl.filter_intervals`

description:

Filter using an interval table, suitable for a small list of intervals.

code:

>>> filtered_mt = hl.filter_intervals(mt, interval_table['interval'].collect())

dependencies:

methods.filter_intervals()

Declaring intervals with `hl.parse_locus_interval`

description:

Filter to declared intervals.

code:

>>> intervals = ['1:100M-200M', '16:29.1M-30.2M', 'X']
>>> filtered_mt = hl.filter_intervals(
...     mt,
...     [hl.parse_locus_interval(x, reference_genome='GRCh37') for x in intervals])

dependencies:

methods.filter_intervals(), parse_locus_interval()

Pruning Variants in Linkage Disequilibrium

tags:

LD Prune

description:

Remove correlated variants from a matrix table.

code:

>>> biallelic_mt = mt.filter_rows(hl.len(mt.alleles) == 2)
>>> pruned_variant_table = hl.ld_prune(mt.GT, r2=0.2, bp_window_size=500000)
>>> filtered_mt = mt.filter_rows(
...     hl.is_defined(pruned_variant_table[mt.row_key]))

dependencies:

ld_prune()

understanding:

Hail’s ld_prune() method takes a matrix table and returns a table with a subset of variants which are uncorrelated with each other. The method requires a biallelic dataset, so we first filter our dataset to biallelic variants. Next, we get a table of independent variants using ld_prune(), which we can use to filter the rows of our original dataset.

Note that it is more efficient to do the final filtering step on the original dataset, rather than on the biallelic dataset, so that the biallelic dataset does not need to be recomputed.

Analysis

Linear Regression

Single Phenotype

tags:

Linear Regression

description:

Compute linear regression statistics for a single phenotype.

code:

Approach #1: Use the linear_regression_rows() method

>>> ht = hl.linear_regression_rows(y=mt.pheno.height,
...                                x=mt.GT.n_alt_alleles(),
...                                covariates=[1])

Approach #2: Use the aggregators.linreg() aggregator

>>> mt_linreg = mt.annotate_rows(linreg=hl.agg.linreg(y=mt.pheno.height,
...                                                   x=[1, mt.GT.n_alt_alleles()]))

dependencies:

linear_regression_rows(), aggregators.linreg()

understanding:

The linear_regression_rows() method is more efficient than using the aggregators.linreg() aggregator. However, the aggregators.linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics.

Multiple Phenotypes

tags:

Linear Regression

description:

Compute linear regression statistics for multiple phenotypes.

code:

Approach #1: Use the linear_regression_rows() method for all phenotypes simultaneously

>>> ht_result = hl.linear_regression_rows(y=[mt.pheno.height, mt.pheno.blood_pressure],
...                                       x=mt.GT.n_alt_alleles(),
...                                       covariates=[1])

Approach #2: Use the linear_regression_rows() method for each phenotype sequentially

>>> ht1 = hl.linear_regression_rows(y=mt.pheno.height,
...                                 x=mt.GT.n_alt_alleles(),
...                                 covariates=[1])

>>> ht2 = hl.linear_regression_rows(y=mt.pheno.blood_pressure,
...                                 x=mt.GT.n_alt_alleles(),
...                                 covariates=[1])

Approach #3: Use the aggregators.linreg() aggregator

>>> mt_linreg = mt.annotate_rows(
...     linreg_height=hl.agg.linreg(y=mt.pheno.height,
...                                 x=[1, mt.GT.n_alt_alleles()]),
...     linreg_bp=hl.agg.linreg(y=mt.pheno.blood_pressure,
...                             x=[1, mt.GT.n_alt_alleles()]))

dependencies:

linear_regression_rows(), aggregators.linreg()

understanding:

The linear_regression_rows() method is more efficient than using the aggregators.linreg() aggregator, especially when analyzing many phenotypes. However, the aggregators.linreg() aggregator is more flexible (multiple covariates can vary by entry) and returns a richer set of statistics. The linear_regression_rows() method drops samples that have a missing value for any of the phenotypes. Therefore, Approach #1 may not be suitable for phenotypes with differential patterns of missingness. Approach #2 will do two passes over the data while Approaches #1 and #3 will do one pass over the data and compute the regression statistics for each phenotype simultaneously.

Using Variants (SNPs) as Covariates

tags:

sample genotypes covariate

description:

Use sample genotype dosage at specific variant(s) as covariates in regression routines.

code:

Create a sample annotation from the genotype dosage for each variant of interest by combining the filter and collect aggregators:

>>> mt_annot = mt.annotate_cols(
...     snp1 = hl.agg.filter(hl.parse_variant('20:13714384:A:C') == mt.row_key,
...                          hl.agg.collect(mt.GT.n_alt_alleles()))[0],
...     snp2 = hl.agg.filter(hl.parse_variant('20:17479730:T:C') == mt.row_key,
...                          hl.agg.collect(mt.GT.n_alt_alleles()))[0])

Run the GWAS with linear_regression_rows() using variant dosages as covariates:

>>> gwas = hl.linear_regression_rows(  
...     x=mt_annot.GT.n_alt_alleles(),
...     y=mt_annot.pheno.blood_pressure,
...     covariates=[1, mt_annot.pheno.age, mt_annot.snp1, mt_annot.snp2])

dependencies:

linear_regression_rows(), aggregators.collect(), parse_variant(), variant_str()

Stratified by Group

tags:

Linear Regression

description:

Compute linear regression statistics for a single phenotype stratified by group.

code:

Approach #1: Use the linear_regression_rows() method for each group

>>> female_pheno = (hl.case()
...                   .when(mt.pheno.is_female, mt.pheno.height)
...                   .or_missing())

>>> linreg_female = hl.linear_regression_rows(y=female_pheno,
...                                           x=mt.GT.n_alt_alleles(),
...                                           covariates=[1])

>>> male_pheno = (hl.case()
...                 .when(~mt.pheno.is_female, mt.pheno.height)
...                 .or_missing())

>>> linreg_male = hl.linear_regression_rows(y=male_pheno,
...                                         x=mt.GT.n_alt_alleles(),
...                                         covariates=[1])

Approach #2: Use the aggregators.group_by() and aggregators.linreg() aggregators

>>> mt_linreg = mt.annotate_rows(
...     linreg=hl.agg.group_by(mt.pheno.is_female,
...                            hl.agg.linreg(y=mt.pheno.height,
...                                          x=[1, mt.GT.n_alt_alleles()])))

dependencies:

linear_regression_rows(), aggregators.group_by(), aggregators.linreg()

understanding:

We have presented two ways to compute linear regression statistics for each value of a grouping variable. The first approach utilizes the linear_regression_rows() method and must be called separately for each group even though it can compute statistics for multiple phenotypes simultaneously. This is because the linear_regression_rows() method drops samples that have a missing value for any of the phenotypes. When the groups are mutually exclusive, such as ‘Male’ and ‘Female’, no samples remain! Note that we cannot define male_pheno = ~female_pheno because we subsequently need male_pheno to be an expression on the mt_linreg matrix table rather than mt. Lastly, the argument to root must be specified for both cases – otherwise the ‘Male’ output will overwrite the ‘Female’ output.

The second approach uses the aggregators.group_by() and aggregators.linreg() aggregators. The aggregation expression generates a dictionary where a key is a group (value of the grouping variable) and the corresponding value is the linear regression statistics for those samples in the group. The result of the aggregation expression is then used to annotate the matrix table.

The linear_regression_rows() method is more efficient than the aggregators.linreg() aggregator and can be extended to multiple phenotypes, but the aggregators.linreg() aggregator is more flexible (multiple covariates can be vary by entry) and returns a richer set of statistics.

PLINK Conversions

Polygenic Score Calculation

plink:

>>> plink --bfile data --score scores.txt sum 

tags:

PRS

description:

This command is analogous to plink’s –score command with the sum option. Biallelic variants are required.

code:

>>> mt = hl.import_plink(
...     bed="data/ldsc.bed", bim="data/ldsc.bim", fam="data/ldsc.fam",
...     quant_pheno=True, missing='-9')
>>> mt = hl.variant_qc(mt)
>>> scores = hl.import_table('data/scores.txt', delimiter=' ', key='rsid',
...                          types={'score': hl.tfloat32})
>>> mt = mt.annotate_rows(**scores[mt.rsid])
>>> flip = hl.case().when(mt.allele == mt.alleles[0], True).when(
...     mt.allele == mt.alleles[1], False).or_missing()
>>> mt = mt.annotate_rows(flip=flip)
>>> mt = mt.annotate_rows(
...     prior=2 * hl.if_else(mt.flip, mt.variant_qc.AF[0], mt.variant_qc.AF[1]))
>>> mt = mt.annotate_cols(
...     prs=hl.agg.sum(
...         mt.score * hl.coalesce(
...             hl.if_else(mt.flip, 2 - mt.GT.n_alt_alleles(),
...                        mt.GT.n_alt_alleles()), mt.prior)))

dependencies:

import_plink(), variant_qc(), import_table(), coalesce(), case(), cond(), Call.n_alt_alleles()

Genetics

Formatting

Convert variants in string format to separate locus and allele fields

Liftover variants from one coordinate system to another

Filtering and Pruning

Remove related individuals from a dataset

Filter loci by a list of locus intervals

From a table of intervals

From a UCSC BED file

Using hl.filter_intervals

Declaring intervals with hl.parse_locus_interval

Pruning Variants in Linkage Disequilibrium

Analysis

Linear Regression

Single Phenotype

Multiple Phenotypes

Using Variants (SNPs) as Covariates

Stratified by Group

PLINK Conversions

Polygenic Score Calculation

Using `hl.filter_intervals`

Declaring intervals with `hl.parse_locus_interval`