Change Log And Version Policy
Python Version Compatibility Policy
Hail complies with NumPy’s compatibility policy on Python versions. In particular, Hail officially supports:
All minor versions of Python released in the 42 months prior to a Hail release, and at minimum the two latest minor versions.
All minor versions of NumPy released in the 24 months prior to a Hail release, and at minimum the last three minor versions.
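As a rough illustration of the windowing rule above, the sketch below picks the versions that fall inside a support window on a given date, while never returning fewer than a minimum number of recent versions. The function names, version labels, and release dates are made up for the example; this is not how Hail's build configuration is actually computed.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months elapsed from `earlier` to `later`."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    if later.day < earlier.day:
        months -= 1  # the final month is not yet complete
    return months


def supported_versions(releases, today, window_months=42, minimum=2):
    """Versions released within the window, but never fewer than the
    `minimum` most recent ones."""
    # Consider only versions that already exist on `today`, oldest first.
    released = sorted(
        (v for v in releases if releases[v] <= today), key=releases.get
    )
    in_window = [
        v for v in released
        if months_between(releases[v], today) <= window_months
    ]
    newest = released[-minimum:]
    return [v for v in released if v in in_window or v in newest]


# Made-up version names and release dates, for illustration only.
releases = {
    "1.0": date(2020, 1, 15),
    "1.1": date(2021, 1, 15),
    "1.2": date(2022, 1, 15),
    "1.3": date(2023, 1, 15),
}
print(supported_versions(releases, today=date(2024, 6, 1)))
```

With `window_months=24` and `minimum=3` the same function models the NumPy half of the policy.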
Frequently Asked Questions
With a version like 0.x, is Hail ready for use in publications?
Yes. The semantic versioning standard uses 0.x (development) versions to refer to software that is either “buggy” or “partial”. While we don’t view Hail as particularly buggy (especially compared to one-off untested scripts pervasive in bioinformatics!), Hail 0.2 is a partial realization of a larger vision.
What is the difference between the Hail Python library version and the native file format version?
The Hail Python library version, the version you see on
PyPI, in pip, or in
hl.version() changes every time we release the Python library. The
Hail native file format version only changes when we change the format
of Hail Table and MatrixTable files. If a version of the Python library
introduces a new native file format version, we note that in the change
log. All subsequent versions of the Python library can read the new file
format version.
The native file format version changes much more slowly than the Python library version. It is not currently possible to view the file format version of a Hail Table or MatrixTable.
What stability is guaranteed?
The Hail file formats and Python API are backwards compatible. This means that a script developed to run on Hail 0.2.5 should continue to work in every subsequent release within the 0.2 major version. This also means any file written by Python library versions 0.2.1 through 0.2.5 can be read by 0.2.5.
Forward compatibility of file formats and the Python API is not guaranteed. In particular, a new file format version is only readable by library versions released after the file format. For example, Python library version 0.2.119 introduces a new file format version: 1.7.0. All library versions before 0.2.119, for example 0.2.118, cannot read file format version 1.7.0. All library versions after and including 0.2.119 can read file format version 1.7.0.
Each version of the Hail Python library can only write files using the latest file format version it supports.
The hl.experimental package and other methods marked experimental in the docs are exempt from this policy. Their functionality or even existence may change without notice. Please contact us if you critically depend on experimental functionality.
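The read-compatibility rule in the answer above can be sketched as a simple version comparison. The helper names below are hypothetical and are not part of the Hail API; the only fact taken from this document is that file format 1.7.0 was introduced by Python library version 0.2.119.

```python
def parse_version(text):
    """Turn '0.2.119' into (0, 2, 119) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def can_read(library_version, format_introduced_in):
    """A library version can read a file format only if it was released
    at or after the library version that introduced that format."""
    return parse_version(library_version) >= parse_version(format_introduced_in)


# File format 1.7.0 first appeared in Python library version 0.2.119.
FORMAT_1_7_0_INTRODUCED_IN = "0.2.119"

for library in ["0.2.118", "0.2.119", "0.2.120"]:
    print(library, can_read(library, FORMAT_1_7_0_INTRODUCED_IN))
```

Versions are compared as integer tuples rather than strings because a string comparison would wrongly rank "0.2.20" above "0.2.119".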
Version 0.2.136
Released 2025-08-26
New Features
(#14918) Upgrade default Python version to 3.11
(#14877) Adds vds.read_dense_mt, which is equivalent to read_vds followed by to_dense_mt, but much more efficient, requiring only a single pass over the VDS instead of two.
(#14966) Fix a memory leak in BlockMatrix.diagonal in QoB.
Deprecations
(#14918) Removes support for Python <= 3.9
Version 0.2.135
Released 2025-06-26
New Features
Bug Fixes
(#14905) Fix an error when importing PLINK files with very large numbers of variants
(#14913) Fix a bug that appears as a MatchError of class TDict
(#14907) Fix a bug that caused FileNotFound exceptions when converting between tables and spark dataframes.
(#14869) Fix a bug in the optimizer that incorrectly removed round-trip casts, e.g. float->int->float.
(#14857) Fix a rare bug in the optimizer that produces invalid IR and most likely manifests as assertion failed: type mismatch. This bug can only occur in certain cases at the very beginning of a hail session (right after hl.init).
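To see why removing a float->int->float round trip, as in (#14869) above, changes results, consider this plain-Python sketch. It uses Python's own int(), which truncates toward zero; Hail's cast semantics are not shown here and are only assumed to be similarly lossy.

```python
def round_trip(x: float) -> float:
    """Cast float -> int -> float; the fractional part is discarded."""
    return float(int(x))


for value in [2.0, 2.7, -1.5]:
    # The round trip is the identity only for values that are already whole.
    print(value, round_trip(value), round_trip(value) == value)
```

An optimizer may only drop such a pair of casts when it can prove that the intermediate type represents every input value exactly.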
Version 0.2.134
Released 2025-02-25
New Features
(#14675) VDS now uses a LEN field for reference block size, rather than the old END field, in order to align with VCF version 4.5. Reading a VDS will automatically make sure that both LEN and END are present so as not to break existing code. Writing a VDS will drop the now superfluous END field.
(#14743) Add vds.export_vcf and vds.import_vcf methods to import and export SVCR VCFs.
(#14806) Add hail.query_matrix_table_rows, a matrix table analogue to hail.query_table.
Bug Fixes
Version 0.2.133
Released 2024-09-25
New Features
(#14619) Teach hailctl dataproc submit to use the --project argument as an argument to gcloud dataproc rather than the submitted script.
Bug Fixes
(#14673) Fix typo in Interpret rule for TableAggregate.
(#14697) Set QUAL="." to missing rather than htsjdk’s sentinel value.
(#14292) Prevent GCS cold storage check from throwing an error when reading from a public access bucket.
(#14651) Remove jackson string length restriction for all backends.
(#14653) Add --public-ip-address argument to gcloud dataproc start command built by hailctl dataproc start, fixing creation of dataproc 2.2 clusters.
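The QUAL handling described in (#14697) above amounts to treating "." as an absent value instead of mapping it to a numeric placeholder. A minimal sketch in plain Python, where None stands in for Hail's missing value and parse_qual is a hypothetical helper, not Hail's importer:

```python
from typing import Optional


def parse_qual(field: str) -> Optional[float]:
    """Parse a VCF QUAL column; '.' denotes a missing value."""
    if field == ".":
        return None  # missing, rather than a sentinel number
    return float(field)


print(parse_qual("."))
print(parse_qual("29.5"))
```

Returning None keeps missing qualities out of numeric aggregations, which a sentinel number would silently distort.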
Version 0.2.132
Released 2024-07-08
New Features
(#14572) Added StringExpression.find for finding substrings in a Hail str.
Bug Fixes
(#14574) Fixed TypeError bug when initializing Hail Query with backend='batch'.
(#14571) Fixed a deficiency that caused certain pipelines that construct Hail NDArrays from streams to run out of memory.
(#14579) Fix serialization bug that broke some Query-on-Batch pipelines with many complex expressions.
(#14567) Fix Jackson configuration that broke some Query-on-Batch pipelines with many complex expressions.
Version 0.2.131
Released 2024-05-30
New Features
(#14560) The gvcf import stage of the VDS combiner now preserves the GT of reference blocks. Some datasets have haploid calls on sex chromosomes, and the fact that the reference was haploid should be preserved.
Bug Fixes
(#14563) The version of notebook installed in Hail Dataproc clusters has been upgraded from 6.5.4 to 6.5.6 in order to fix a bug where Jupyter Notebooks wouldn’t start on clusters. The workaround involving creating a cluster with --packages='ipython<8.22' is no longer necessary.
Deprecations
(#14158) Hail now supports and primarily tests against Dataproc 2.2.5, Spark 3.5.0, and Java 11. We strongly recommend updating to Spark 3.5.0 and Java 11. You should also update your GCS connector after installing Hail: curl https://broad.io/install-gcs-connector | python3. Do not try to update before installing Hail 0.2.131.
Version 0.2.130
Released 2024-10-02
0.2.129 contained test configuration artifacts that prevented users from
starting dataproc clusters with hailctl. Please upgrade to 0.2.130
if you use dataproc.
New Features
(#14447) Added copy_spark_log_on_error initialization flag that, when set, copies the hail driver log to the remote tmpdir if query execution raises an exception.
Bug Fixes
(#14452) Fixes a bug that prevents users from starting dataproc clusters with hailctl
Version 0.2.129
Released 2024-04-02
Documentation
New Features
(#14406) Performance improvements for reading structured data from (Matrix)Tables
(#14255) Added Cochran-Mantel-Haenszel test for association (cochran_mantel_haenszel_test). Our thanks to @Will-Tyler for generously contributing this feature.
(#14393) hail no longer depends on protobuf; users may choose their own version of protobuf.
(#14360) Exposed previously internal _num_allele_type as numeric_allele_type and deprecated it. Add new AlleleType enumeration for users to be able to easily use the values returned by numeric_allele_type.
(#14297) vds.sample_qc now uses independent aggregators. Users may now import these functions and use them directly.
(#14405) VariantDataset.validate now checks that all ref blocks are no longer than the ref_block_max_length field, if it exists.
Bug Fixes
(#14420) Fixes a serious, but likely rare, bug in the Table/MatrixTable reader, which has been present since Sep 2020. It manifests as many (around half or more) of the rows being dropped. This could only happen when 1) reading a (matrix)table whose partitioning metadata allows rows with the same key to be split across neighboring partitions, and 2) reading it with a different partitioning than it was written. 1) would likely only happen by reading data keyed by locus and alleles, and rekeying it to only locus before writing. 2) would likely only happen by using the _intervals or _n_partitions arguments to read_(matrix)_table, or possibly repartition. Please reach out to us if you’re concerned you may have been affected by this.
(#14330) Fixes erroneous error in export_vcf with unphased haploid Calls.
(#14303) Fix missingness error when sampling entries from a MatrixTable.
(#14288) Contigs may now be compared for inequality while filtering rows.
Deprecations
(#14386) MatrixTable.make_table is deprecated. Use .localize_entries instead.
Version 0.2.128
Released 2024-02-16
In GCP, the Hail Annotation DB and Datasets API have moved from multi-regional US and EU buckets to regional US-CENTRAL1 and EUROPE-WEST1 buckets. These buckets are requester pays which means unless your cluster is in the US-CENTRAL1 or EUROPE-WEST1 region, you will pay a per-gigabyte rate to read from the Annotation DB or Datasets API. We must make this change because reading from a multi-regional bucket into a regional VM is no longer free. Unfortunately, cost constraints require us to choose only one region per continent and we have chosen US-CENTRAL1 and EUROPE-WEST1.
Documentation
New Features
(#14206) Introduce hailctl config set http/timeout_in_seconds which Batch and QoB users can use to increase the timeout on their laptops. Laptops tend to have flaky internet connections and a timeout of 300 seconds produces a more robust experience.
(#14178) Reduce VDS Combiner runtime slightly by computing the maximum ref block length without executing the combination pipeline twice.
(#14207) VDS Combiner now verifies that every GVCF path and sample name is unique.
Bug Fixes
(#14300) Require orjson<3.9.12 to avoid a segfault introduced in orjson 3.9.12
(#14071) Use indexed VEP cache files for GRCh38 on both dataproc and QoB.
(#14232) Allow use of large numbers of fields on a table without triggering ClassTooLargeException: Class too large:.
(#14246)(#14245) Fix a bug, introduced in 0.2.114, in which Table.multi_way_zip_join and Table.aggregate_by_key could throw “NoSuchElementException: Ref with name __iruid_...” when one or more of the tables had a number of partitions substantially different from the desired number of output partitions.
(#14202) Support coercing {} (the empty dictionary) into any Struct type (with all missing fields).
(#14239) Remove an erroneous statement from the MatrixTable tutorial.
(#14176) hailtop.fs.ls can now list a bucket, e.g. hailtop.fs.ls("gs://my-bucket").
(#14258) Fix import_avro to not raise NullPointerException in certain rare cases (e.g. when using _key_by_assert_sorted).
(#14285) Fix a broken link in the MatrixTable tutorial.
Deprecations
(#14293) Support for the hail-az:// scheme, deprecated in 0.2.116, is now gone. Please use the standard https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH.
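For scripts that still contain old-style URLs, a small converter along the following lines may help with the migration. It assumes the old scheme had the shape hail-az://ACCOUNT/CONTAINER/PATH, which is an assumption not stated in this change log; the function name is hypothetical and not part of Hail.

```python
PREFIX = "hail-az://"


def hail_az_to_https(url: str) -> str:
    """Rewrite hail-az://ACCOUNT/CONTAINER/PATH (assumed layout) into
    https://ACCOUNT.blob.core.windows.net/CONTAINER/PATH."""
    if not url.startswith(PREFIX):
        raise ValueError(f"not a hail-az URL: {url!r}")
    rest = url[len(PREFIX):]
    account, _, rest = rest.partition("/")
    container, _, path = rest.partition("/")
    if not account or not container:
        raise ValueError(f"missing account or container: {url!r}")
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


print(hail_az_to_https("hail-az://myaccount/mycontainer/data/table.ht"))
```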
Version 0.2.127
Released 2024-01-12
If you have an Apple M1 laptop, verify that
file $JAVA_HOME/bin/java
returns a message including the phrase “arm64”. If it instead includes the phrase “x86_64” then you must upgrade to a new version of Java. You may find such a version of Java here.
New Features
Bug Fixes
(#14110) Fix hailctl hdinsight start, which has been broken since 0.2.118.
(#14098)(#14090)(#14118) Fix (#14089), which makes hailctl dataproc connect work in Windows Subsystem for Linux.
(#14048) Fix (#13979), affecting Query-on-Batch and manifesting most frequently as “com.github.luben.zstd.ZstdException: Corrupted block detected”.
(#14066) Since 0.2.110, hailctl dataproc set the heap size of the driver JVM dangerously high. It is now set to an appropriate level. This issue manifests in a variety of inscrutable ways including RemoteDisconnectedError and socket closed. See issue (#13960) for details.
(#14057) Fix (#13998) which appeared in 0.2.58 and prevented reading from a networked filesystem mounted within the filesystem of the worker node for certain pipelines (those that did not trigger “lowering”).
(#14006) Fix (#14000). Hail now supports identity_by_descent on Apple M1 and M2 chips; however, your Java installation must be an arm64 installation. Using x86_64 Java with Hail on Apple M1 or M2 will cause SIGILL errors. If you have an Apple M1 or Apple M2 and /usr/libexec/java_home -V does not include (arm64), you must switch to an arm64 version of the JVM.
(#14022) Fix (#13937) caused by faulty library code in the Google Cloud Storage API Java client library.
(#13812) Permit hailctl batch submit to accept relative paths. Fix (#13785).
(#13885) Hail Query-on-Batch previously used Class A Operations for all interaction with blobs. This change ensures that QoB only uses Class A Operations when necessary.
(#14127) hailctl dataproc start ... --dry-run now uses shell escapes such that, after being copied and pasted into a shell, the gcloud command works as expected.
(#14062) Fix (#14052) which caused incorrect results for identity by descent in Query-on-Batch.
(#14122) Ensure that stack traces are transmitted from workers to the driver to the client.
(#14105) When a VCF contains missing values in array fields, Hail now suggests using array_elements_required=False.
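The shell-escaping behavior described in (#14127) above can be illustrated with Python's standard shlex module. This is a generic sketch of the technique, not hailctl's implementation, and the argument list is made up for the example.

```python
import shlex


def render_command(argv):
    """Render an argument list as one line that can be pasted into a
    POSIX shell and parsed back into the same arguments."""
    return shlex.join(argv)


# Arguments containing spaces, quotes, or $ need escaping to survive a paste.
argv = ["echo", "it's a test", "$HOME", "two words"]
line = render_command(argv)
print(line)

# Splitting the rendered line with shell rules recovers the original list.
print(shlex.split(line) == argv)
```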
Deprecations
(#13987) Deprecate default_reference parameter to hl.init. Users should instead call hl.default_reference with an argument to set a new default reference, usually shortly after hl.init.
Version 0.2.126
Released 2023-10-30
Bug Fixes
(#13939) Fix a bug introduced in 0.2.125 which could cause dict literals created in python to be decoded incorrectly, causing runtime errors or, potentially, incorrect results.
(#13751) Correct the broadcasting of ndarrays containing at least one dimension of length zero. This previously produced incorrect results.
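For readers unsure what shape a broadcast involving a zero-length dimension should have, here is a pure-Python sketch of NumPy-style shape broadcasting. It assumes Hail ndarrays follow the same rule as NumPy, which this change log does not state explicitly; the function name is hypothetical.

```python
def broadcast_shape(a, b):
    """Broadcast two shapes: dimensions are aligned from the right and are
    compatible when they are equal or when one of them is 1."""
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        x = a[-i] if i <= len(a) else 1  # pad the shorter shape with 1s
        y = b[-i] if i <= len(b) else 1
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
    return tuple(reversed(result))


# A length-1 dimension stretches to match a length-0 dimension.
print(broadcast_shape((0, 3), (1, 3)))
print(broadcast_shape((2, 1), (2, 0)))
```

Under this rule a zero-length dimension yields an empty result. It is not an error unless the other dimension is neither 0 nor 1.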
Version 0.2.125
Released 2023-10-26
New Features
(#13682) hl.export_vcf now clearly reports all Table or Matrix Table fields which cannot be represented in a VCF.
(#13355) Improve the Hail compiler to more reliably rewrite Table.filter and MatrixTable.filter_rows to use hl.filter_intervals. Before this change some queries required reading all partitions even though only a small number of partitions match the filter.
(#13787) Improve speed of reading hail format datasets from disk. Simple pipelines may see as much as a halving in latency.
(#13849) Fix (#13788), improving the error message when hl.logistic_regression_rows is provided row or entry annotations for the dependent variable.
(#13888) hl.default_reference can now be passed an argument to change the default reference genome.
Bug Fixes
(#13702) Fix (#13699) and (#13693). Since 0.2.96, pipelines that combined random functions (e.g. hl.rand_unif) with index(..., all_matches=True) could fail with a ClassCastException.
(#13707) Fix (#13633). hl.maximum_independent_set now accepts strings as the names of individuals. It has always accepted structures containing a single string field.
(#13713) Fix (#13704), in which Hail could encounter an IllegalArgumentException if there are too many transient errors.
(#13730) Fix (#13356) and (#13409). In QoB pipelines with 10K or more partitions, transient “Corrupted block detected” errors were common. This was caused by incorrect retry logic. That logic has been fixed.
(#13732) Fix (#13721) which manifested with the message “Missing Range header in response”. The root cause was a bug in the Google Cloud Storage SDK on which we rely. The fix is to update to a version without this bug. The buggy version of GCS SDK was introduced in 0.2.123.
(#13759) Since Hail 0.2.123, Hail would hang in Dataproc Notebooks due to (#13690).
(#13755) Ndarray concatenation now works with arrays with size zero dimensions.
(#13817) Mitigate new transient error from Google Cloud Storage which manifests as aiohttp.client_exceptions.ClientOSError: [Errno 1] [SSL: SSLV3_ALERT_BAD_RECORD_MAC] sslv3 alert bad record mac (_ssl.c:2548).
(#13715) Fix (#13697), a long standing issue with QoB. When a QoB driver or worker fails, the corresponding Batch Job will also appear as failed.
(#13829) Fix (#13828). The Hail combiner now properly imports PGT fields from GVCFs.
(#13805) Fix (#13767). hailctl dataproc submit now expands ~ in the --files and --pyfiles arguments.
(#13797) Fix (#13756). Operations that collect large results such as to_pandas may require up to 3x less memory.
(#13826) Fix (#13793). Ensure hailctl describe -u overrides the gcs_requester_pays/project config variable.
(#13814) Fix (#13757). Pipelines that are memory-bound by copious use of hl.literal, such as hl.vds.filter_intervals, require substantially less memory.
(#13894) Fix (#13837) in which Hail could break a Spark installation if the Hail JAR appears on the classpath before the Scala JARs.
(#13919) Fix (#13915) which prevented using a glob pattern in hl.import_vcf.
Version 0.2.124
Released 2023-09-21
New Features
(#13608) Change default behavior of hl.ggplot.geom_density to use a new method. The old method is still available using the flag smoothed=True. The new method is typically a much more accurate representation, and works well for any distribution, not just smooth ones.
Version 0.2.123
Released 2023-09-19
New Features
(#13610) Additional setup is no longer required when using hail.plot or hail.ggplot in a Jupyter notebook (calling bokeh.io.output_notebook or hail.plot.output_notebook and/or setting plotly.io.renderers.default = 'iframe' is no longer necessary).
Bug Fixes
(#13634) Fix a bug which caused Query-on-Batch pipelines with a large number of partitions (close to 100k) to run out of memory on the driver after all partitions finish.
(#13619) Fix an optimization bug that, on some pipelines, since at least 0.2.58 (commit 23813af), resulted in Hail using essentially unbounded amounts of memory.
(#13609) Fix a bug in hail.ggplot.scale_color_continuous that sometimes caused errors by generating invalid colors.
Version 0.2.122
Released 2023-09-07
New Features
(#13508) The n parameter of MatrixTable.tail is deprecated in favor of a new n_rows parameter.
Bug Fixes
(#13498) Fix a bug where field names can shadow methods on the StructExpression class, e.g. “items”, “keys”, “values”. Now the only way to access such fields is through the getitem syntax, e.g. some_struct['items']. It’s possible this could break existing code that uses such field names.
(#13585) Fix bug introduced in 0.2.121 where Query-on-Batch users could not make requests to batch.hail.is without a domain configuration set.
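The shadowing rule from (#13498) above can be pictured with a toy class in plain Python. MiniStruct is a hypothetical stand-in, not Hail's StructExpression; it only shows why a field named "items" must be reached through the getitem syntax once methods take precedence over fields.

```python
class MiniStruct:
    """Toy struct whose fields are reachable as attributes and as items."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def items(self):
        """A real method; it wins over any field that is also named 'items'."""
        return list(self._fields.items())

    def __getattr__(self, name):
        # Called only when normal lookup fails, so methods are never shadowed.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name) from None

    def __getitem__(self, name):
        return self._fields[name]


s = MiniStruct(items=[1, 2], x=3)
print(s.x)           # attribute access works for ordinary field names
print(s["items"])    # the field named 'items'
print(callable(s.items))  # the method, not the field
```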
Version 0.2.121
Released 2023-09-06
New Features
(#13385) The VDS combiner now supports arbitrary custom call fields via the call_fields parameter.
(#13224) hailctl config get, set, and unset now support shell auto-completion. Run hailctl --install-completion zsh to install the auto-completion for zsh. You must already have completion enabled for zsh.
(#13279) Add hailctl batch init which helps new users interactively set up hailctl for Query-on-Batch and Batch use.
Bug Fixes
(#13573) Fix (#12936) in which VEP frequently failed (due to Docker not starting up) on clusters with a non-trivial number of workers.
(#13485) Fix (#13479) in which hl.vds.local_to_global could produce invalid values when the LA field is too short. There were and are no issues when the LA field has the correct length.
(#13340) Fix copy_log to correctly copy relative file paths.
(#13364) hl.import_gvcf_interval now treats PGT as a call field.
(#13333) Fix interval filtering regression: filter_rows or filter mentioning the same field twice or using two fields incorrectly read the entire dataset. In 0.2.121, these filters will correctly read only the relevant subset of the data.
(#13368) In Azure, Hail now uses fewer “list blobs” operations. This should reduce cost on pipelines that import many files, export many files, or use file glob expressions.
(#13414) Resolves (#13407) in which uses of union_rows could reduce parallelism to one partition resulting in severely degraded performance.
(#13405) MatrixTable.aggregate_cols no longer forces a distributed computation. This should be what you want in the majority of cases. In case you know the aggregation is very slow and should be parallelized, use mt.cols().aggregate instead.
(#13460) In Query-on-Spark, restore hl.read_table optimization that avoids reading unnecessary data in pipelines that do not reference row fields.
(#13447) Fix (#13446). In all three submit commands (batch, dataproc, and hdinsight), Hail now allows and encourages the use of -- to separate arguments meant for the user script from those meant for hailctl. In hailctl batch submit, option-like arguments, for example “--foo”, are now supported before “--” if and only if they do not conflict with a hailctl option.
(#13422) hailtop.hail_frozenlist.frozenlist now has an eval-able repr.
(#13523) hl.Struct is now pickle-able.
(#13505) Fix bug introduced in 0.2.117 by commit c9de81108 which prevented the passing of keyword arguments to Python jobs. This manifested as “ValueError: too many values to unpack”.
(#13536) Fixed (#13535) which prevented the use of Python jobs when the client (e.g. your laptop) Python version is 3.11 or later.
(#13434) In QoB, Hail’s file systems now correctly list all files in a directory, not just the first 1000. This could manifest in an import_table or import_vcf which used a glob expression. In such a case, only the first 1000 files would have been included in the resulting Table or MatrixTable.
(#13550) hl.utils.range_table(n) now supports all valid 32-bit signed integer values of n.
(#13500) In Query-on-Batch, the client-side Python code will not try to list every job when a QoB batch fails. This could take hours for long-running pipelines or pipelines with many partitions.
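The double-dash convention adopted by the submit commands in (#13447) above is a common command-line idiom: everything before the separator belongs to the tool, everything after it to the user script. A minimal sketch of that split follows; the function name and the example arguments are illustrative, and hailctl's real parsing is more involved.

```python
def split_on_double_dash(argv):
    """Split arguments at the first '--' into (tool_args, script_args)."""
    if "--" in argv:
        i = argv.index("--")
        return argv[:i], argv[i + 1:]
    return list(argv), []


tool_args, script_args = split_on_double_dash(
    ["--files", "a.txt", "script.py", "--", "--foo", "bar"]
)
print(tool_args)    # arguments intended for the submit command
print(script_args)  # arguments passed through to the user script
```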
Deprecations
Version 0.2.120
Released 2023-07-27
New Features
(#13206) The VDS Combiner now works in Query-on-Batch.
Bug Fixes
(#13313) Fix bug introduced in 0.2.119 which causes a serialization error when using Query-on-Spark to read a VCF which is sorted by locus, with split multi-allelics, in which the records sharing a single locus do not appear in the dictionary ordering of their alternate alleles.
(#13264) Fix bug which ignored the partition_hint of a Table group-by-and-aggregate.
(#13239) Fix bug which ignored the HAIL_BATCH_REGIONS argument when determining in which regions to schedule jobs when using Query-on-Batch.
(#13253) Improve hadoop_ls and hfs.ls to quickly list globbed files in a directory. The speed improvement is proportional to the number of files in the directory.
(#13226) Fix the comparison of an hl.Struct to an hl.struct or field of type tstruct. Resolves (#13045) and (#13046).
(#12995) Fixed bug causing poor performance and memory leaks for MatrixTable.annotate_rows aggregations.
Version 0.2.119
Released 2023-06-28
New Features
(#12081) Hail now uses Zstandard as the default compression algorithm for table and matrix table storage, reducing file size by around 20% in most cases.
(#12988) Arbitrary aggregations can now be used on arrays via ArrayExpression.aggregate. This method is useful for accessing functionality that exists in the aggregator library but not the basic expression library, for instance, call_stats.
(#13166) Add an eigh ndarray method, for finding eigenvalues of symmetric matrices (“h” is for Hermitian, the complex analogue of symmetric).
Bug Fixes
(#13184) vds.to_dense_mt no longer densifies past the end of contig boundaries. A logic bug in to_dense_mt could lead to reference data towards the end of one contig being applied to the following contig up until the first reference block of the contig.
(#13173) Fix globbing in scala blob storage filesystem implementations.
File Format
The native file format version is now 1.7.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.118
Released 2023-06-13
New Features
Bug Fixes
(#13126) Query-on-Batch pipelines with one partition are now retried when they encounter transient errors.
(#13113) hail.ggplot.geom_point now displays a legend group for a column even when it has only one value in it.
(#13075) (#13074) Add a new transient error plaguing pipelines in Query-on-Batch in Google: java.net.SocketTimeoutException: connect timed out.
(#12569) The documentation for hail.ggplot.facets is now correctly included in the API reference.
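Several entries in this change log concern retrying transient errors. The general pattern is sketched below as a generic retry loop with exponential backoff. TransientError and retry_transient are hypothetical names for illustration; this is not Hail's retry implementation, and the set of errors Hail treats as transient is not reproduced here.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_transient(fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff.
    The final failure is re-raised once max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: fail twice with a simulated timeout, then succeed.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("connect timed out")
    return "ok"


recorded_delays = []
print(retry_transient(flaky, base_delay=1, sleep=recorded_delays.append))
print(recorded_delays)
```

Passing the sleep function as a parameter keeps the demo instant and makes the backoff schedule easy to inspect.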
Version 0.2.117
Released 2023-05-22
New Features
(#12875) Parallel export modes now write a manifest file. These manifest files are text files with one filename per line, containing the name of each shard written successfully to the directory. These filenames are relative to the export directory.
(#13007) In Query-on-Batch and hailtop.batch, memory and storage request strings may now be optionally terminated with a B for bytes.
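A manifest of the kind described in (#12875) above is easy to consume: read one filename per line and resolve each against the export directory. The sketch below takes the manifest's filename as a parameter because this change log does not name it; shard_paths and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def shard_paths(export_dir, manifest_name):
    """Return the full path of every shard listed in the manifest.
    Entries are relative to the export directory; blank lines are skipped."""
    export_dir = Path(export_dir)
    lines = (export_dir / manifest_name).read_text().splitlines()
    return [export_dir / line.strip() for line in lines if line.strip()]


# Demo with a throwaway directory and a made-up manifest name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "manifest.txt").write_text("part-0.tsv\npart-1.tsv\n")
    for shard in shard_paths(tmp, "manifest.txt"):
        print(shard.name)
```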
Bug Fixes
(#13065) In Azure Query-on-Batch, fix a resource leak that prevented running pipelines with >500 partitions and created flakiness with >250 partitions.
(#13067) In Query-on-Batch, driver and worker logs no longer buffer so messages should arrive in the UI after a fixed delay rather than proportional to the frequency of log messages.
(#13028) Fix crash in hl.vds.filter_intervals when using a table to filter a VDS that stores the max ref block length.
(#13060) Prevent 500 Internal Server Error in Jupyter Notebooks of Dataproc clusters started by hailctl dataproc.
(#13051) In Query-on-Batch and hailtop.batch, Azure Blob Storage https URLs are now supported.
(#13042) In Query-on-Batch, naive_coalesce no longer performs a full write/read of the dataset. It now operates identically to the Query-on-Spark implementation.
(#13031) In hl.ld_prune, an informative error message is raised when a dataset does not contain diploid calls instead of an assertion error.
(#13032) In Query-on-Batch, in Azure, Hail now uses a newer version of the Azure blob storage libraries to reduce the frequency of “Stream is already closed” errors.
(#13011) In Query-on-Batch, the driver will use ~1/2 as much memory to read results as it did in 0.2.115.
(#13013) In Query-on-Batch, transient errors while streaming from Google Storage are now automatically retried.
Version 0.2.116
Released 2023-05-08
New Features
(#12917) ABS blob URIs in the format of https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH> are now supported.
(#12731) Introduced hailtop.fs that makes public a filesystem module that works for local fs, gs, s3 and abs. This is now used as the Backend.fs for hail query but can be used standalone for Hail Batch users by import hailtop.fs as hfs.
Deprecations
Bug Fixes
Version 0.2.115
Released 2023-04-25
New Features
(#12731) Introduced hailtop.fs that makes public a filesystem module that works for local fs, gs, s3 and abs. This can be used by import hailtop.fs as hfs but has also replaced the underlying implementation of the hl.hadoop_* methods. This means that the hl.hadoop_* methods now support these additional blob storage providers.
(#12917) ABS blob URIs in the form of https://<ACCOUNT_NAME>.blob.core.windows.net/<CONTAINER_NAME>/<PATH> are now supported when running in Azure.
Deprecations
(#12917) The hail-az scheme for referencing ABS blobs in Azure is deprecated in favor of the https scheme and will be removed in a future release.
Bug Fixes
(#12919) An interactive hail session is no longer unusable after hitting CTRL-C during a batch execution in Query-on-Batch
(#12913) Fixed bug in hail.ggplot where all legend entries would have the same text if one column had exactly one value for all rows and was mapped to either the shape or the color aesthetic for geom_point.
Version 0.2.114
Released 2023-04-19
New Features
(#12880) Added hl.vds.store_ref_block_max_len to patch old VDSes to make interval filtering faster.
Bug Fixes
(#12860) Fixed memory leak in shuffles in Query-on-Batch.
Version 0.2.113
Released 2023-04-07
New Features
(#12798) Query-on-Batch now supports BlockMatrix.write(..., stage_locally=True).
(#12793) Query-on-Batch now supports hl.poisson_regression_rows.
(#12801) Hitting CTRL-C while interactively using Query-on-Batch cancels the underlying batch.
(#12810) hl.array can now convert 1-d ndarrays into the equivalent list.
(#12851) hl.variant_qc no longer requires a locus field.
(#12816) In Query-on-Batch, hl.logistic_regression('firth', ...) is now supported.
(#12854) In Query-on-Batch, simple pipelines with large numbers of partitions should be substantially faster.
Bug Fixes
(#12783) Fixed bug where logs were not properly transmitted to Python.
(#12812) Fixed bug where Table/MT._calculate_new_partitions returned unbalanced intervals with whole-stage code generation runtime.
(#12839) Fixed hailctl dataproc jupyter notebooks to be compatible with Spark 3.3, which have been broken since 0.2.110.
(#12855) In Query-on-Batch, allow writing to requester pays buckets, which was broken before this release.
Version 0.2.112
Released 2023-03-15
Bug Fixes
(#12784) Removed an internal caching mechanism in Query on Batch that caused stalls in pipelines with large intermediates
Version 0.2.111
Released 2023-03-13
New Features
(#12581) In Query on Batch, users can specify which regions to have jobs run in.
Bug Fixes
(#12772) Fix hailctl hdinsight submit to pass args to the files
Version 0.2.110
Released 2023-03-08
New Features
(#12643) In Query on Batch, hl.skat(..., logistic=True) is now supported.
(#12643) In Query on Batch, hl.liftover is now supported.
(#12629) In Query on Batch, hl.ibd is now supported.
(#12722) Add hl.simulate_random_mating to generate a population from founders under the assumption of random mating.
(#12701) Query on Spark now officially supports Spark 3.3.0 and Dataproc 2.1.x
Performance Improvements
(#12679) In Query on Batch, hl.balding_nichols_model is slightly faster. Also added hl.utils.genomic_range_table to quickly create a table keyed by locus.
Bug Fixes
(#12711) In Query on Batch, fix null pointer exception (manifesting as scala.MatchError: null) when reading data from requester pays buckets.
(#12739) Fix hl.plot.cdf, hl.plot.pdf, and hl.plot.joint_plot which were broken by changes in Hail and changes in bokeh.
(#12735) Fix (#11738) by allowing user to override default types in to_pandas.
(#12760) Mitigate some JVM bytecode generation errors, particularly those related to too many method parameters.
(#12766) Fix (#12759) by loosening parsimonious dependency pin.
(#12732) In Query on Batch, fix bug that sometimes prevented terminating a pipeline using Control-C.
(#12771) Use a version of jgscm whose version complies with PEP 440.
Version 0.2.109
Released 2023-02-08
New Features
(#12605) Add hl.pgenchisq, the cumulative distribution function of the generalized chi-squared distribution.
(#12637) Query-on-Batch now supports hl.skat(..., logistic=False).
(#12645) Added hl.vds.truncate_reference_blocks to transform a VDS to checkpoint reference blocks in order to drastically improve interval filtering performance. Also added hl.vds.merge_reference_blocks to merge adjacent reference blocks according to user criteria to better compress reference data.
Bug Fixes
(#12650) Hail will now throw an exception on hl.export_bgen when there is no GP field, instead of exporting null records.
(#12635) Fix bug where hl.skat did not work on Apple M1 machines.
(#12571) When using Query-on-Batch, hl.hadoop* methods now properly support creation and modification time.
(#12566) Improve error message when combining incompatibly indexed fields in certain operations including array indexing.
Version 0.2.108
Released 2023-01-12
New Features
(#12576) hl.import_bgen and hl.export_bgen now support compression with Zstd.
Bug Fixes
(#12585) hail.ggplot plots that have more than one legend group or facet are now interactive. If such a plot has enough legend entries that the legend would be taller than the plot, the legend will now be scrollable. Legend entries for such plots can be clicked to show/hide traces on the plot, but this does not work and is a known issue that will only be addressed if hail.ggplot is migrated off of plotly.
(#12584) Fixed bug which arose as an assertion error about type mismatches. This was usually triggered when working with tuples.
(#12583) Fixed bug which showed an empty table for ht.col_key.show().
(#12582) Fixed bug where matrix tables with duplicate col keys do not show properly. Also fixed bug where tables and matrix tables with HTML unsafe column headers are rendered wrong in Jupyter.
(#12574) Fixed a memory leak when processing tables. Could trigger unnecessarily high memory use and out of memory errors when there are many rows per partition or large key fields.
(#12565) Fixed a bug that prevented exploding on a field of a Table whose value is a random value.
Version 0.2.107
Released 2022-12-14
Bug fixes
(#12543) Fixed hl.vds.local_to_global error when LA array contains non-ascending allele indices.
Version 0.2.106
Released 2022-12-13
New Features
(#12522) Added hailctl config setting 'batch/backend' to specify the default backend to use in batch scripts when not specified in code.
(#12497) Added support for scales, nrow, and ncol arguments, as well as grouped legends, to hail.ggplot.facet_wrap.
(#12471) Added hailctl batch submit command to run local scripts inside batch jobs.
(#12525) Add support for passing arguments to hailctl batch submit.
(#12465) Batch jobs’ status now contains the region the job ran in. The job itself can access which region it is in through the HAIL_REGION environment variable.
(#12464) When using Query-on-Batch, all jobs for a single hail session are inserted into the same batch instead of one batch per action.
(#12457) pca and hwe_normalized_pca are now supported in Query-on-Batch.
(#12376) Added hail.query_table function for reading tables with indices from Python.
(#12139) Random number generation has been updated, but shouldn’t affect most users. If you need to manually set seeds, see https://hail.is/docs/0.2/functions/random.html for details.
(#11884) Added Job.always_copy_output when using the ServiceBackend. The default behavior is False, which is a breaking change from the previous behavior to always copy output files regardless of the job’s completion state.
Bug Fixes
(#12487) Fixed a bug causing rare but deterministic job failures deserializing data in Query-on-Batch.
(#12535) QoB will now error if the user reads from and writes to the same path. QoB also now respects the user’s configuration of disable_progress_bar. When disable_progress_bar is unspecified, QoB only disables the progress bar for non-interactive sessions.
(#12517) Fix a performance regression that appears when using hl.split_multi_hts among other methods.
Version 0.2.105
Released 2022-10-31 🎃
New Features
(#12293) Added support for hail.MatrixTables to hail.ggplot.
Bug Fixes
(#12384) Fixed a critical bug that disabled tree aggregation and scan executions in 0.2.104, leading to out-of-memory errors.
(#12265) Fix long-standing bug wherein hl.agg.collect_as_set and hl.agg.counter error when applied to types which, in Python, are unhashable. For example, hl.agg.counter(t.list_of_genes) will not error when t.list_of_genes is a list. Instead, the counter dictionary will use FrozenList keys from the frozenlist package.
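The hashability problem behind (#12265) can be illustrated in plain Python, without Hail. Python lists are unhashable, so they cannot serve as dictionary keys, while an immutable copy can. In this sketch a tuple is only a stand-in for the FrozenList keys mentioned above, and count_list_values and gene_lists are hypothetical names.

```python
from collections import Counter


def count_list_values(values):
    """Count list-valued items by freezing each list into a hashable tuple."""
    return Counter(tuple(v) for v in values)


# Hypothetical data: each element is a list, which Python cannot hash.
gene_lists = [["BRCA1", "TP53"], ["BRCA1", "TP53"], ["EGFR"]]

try:
    Counter(gene_lists)  # lists cannot be dictionary keys
    raised = False
except TypeError:
    raised = True

counts = count_list_values(gene_lists)
print(raised, counts[("BRCA1", "TP53")])
```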
Version 0.2.104
Released 2022-10-19
New Features
(#12346) Introduced new progress bars which include total time elapsed and look cool.
Version 0.2.103
Released 2022-10-18
Bug Fixes
(#12305) Fixed a rare crash reading tables/matrixtables with _intervals.
Version 0.2.102
Released 2022-10-06
New Features
(#12218) Missing values are now supported in primitive columns in Table.to_pandas.
(#12254) Cross-product-style legends for data groups have been replaced with factored ones (consistent with ggplot2’s implementation) for hail.ggplot.geom_point, and support has been added for custom legend group labels.
(#12268) VariantDataset now implements union_rows for combining datasets with the same samples but disjoint variants.
Bug Fixes
Version 0.2.101
Released 2022-10-04
New Features
(#12218) Support missing values in primitive columns in Table.to_pandas.
(#12195) Add impute_sex_chr_ploidy_from_interval_coverage to impute sex ploidy directly from a coverage MT.
(#12222) Query-on-Batch pipelines now add worker jobs to the same batch as the driver job instead of producing a new batch per stage.
(#12244) Added support for custom labels for per-group legends to hail.ggplot.geom_point via the legend_format keyword argument.
Deprecations
(#12230) The python-dill Batch images in gcr.io/hail-vdc are no longer supported. Use hailgenetics/python-dill instead.
Bug fixes
(#12215) Fix search bar in the Hail Batch documentation.
Version 0.2.100
Released 2022-09-23
New Features
(#12207) Add support for the shape aesthetic to hail.ggplot.geom_point.
Deprecations
(#12213) The batch_size parameter of vds.new_combiner is deprecated in favor of gvcf_batch_size.
Bug fixes
Version 0.2.99
Released 2022-09-13
New Features
Performance Improvements
(#12159) Improve performance of MatrixTable reads when using the _intervals argument.
Bug fixes
Version 0.2.98
Released 2022-08-22
New Features
(#12062) hl.balding_nichols_model now supports an optional boolean parameter, phased, to control the phasedness of the generated genotypes.
Performance improvements
Bug fixes
(#12115) When using use_new_shuffle=True, fix a bug when there are more than 2^31 rows.
(#12074) Fix bug where hl.init could silently overwrite the global random seed.
(#12079) Fix bug in handling of missing (aka NA) fields in grouped aggregation and distinct by key.
(#12056) Fix hl.export_vcf to actually create tabix files when requested.
(#12020) Fix bug in hl.experimental.densify which manifested as an AssertionError about dtypes.
Version 0.2.97
Released 2022-06-30
New Features
(#11756) hb.BatchPoolExecutor and Python jobs both now also support async functions.
Bug fixes
(#11962) Fix error (logged as (#11891)) in VCF combiner when exactly 10 or 100 files are combined.
(#11969) Fix import_table and import_lines to use multiple partitions when force_bgz is used.
(#11964) Fix erroneous “Bucket is a requester pays bucket but no user project provided.” errors in Google Dataproc by updating to the latest Dataproc image version.
Version 0.2.96
Released 2022-06-21
New Features
(#11833) hl.rand_unif now has default arguments of 0.0 and 1.0.
Bug fixes
(#11905) Fix erroneous FileNotFoundError in glob patterns.
(#11921) and (#11910) Fix file clobbering during text export with speculative execution.
(#11920) Fix array out of bounds error when tree aggregating a multiple of 50 partitions.
(#11937) Fixed correctness bug in scan order for Table.annotate and MatrixTable.annotate_rows in certain circumstances.
(#11887) Escape VCF description strings when exporting.
(#11886) Fix an error in an example in the docs for hl.split_multi.
Version 0.2.95
Released 2022-05-13
New features
(#11809) Export dtypes_from_pandas in expr.types.
(#11807) Teach smoothed_pdf to add a plot to an existing figure.
(#11746) The ServiceBackend, in interactive mode, will print a link to the currently executing driver batch.
(#11759) hl.logistic_regression_rows, hl.poisson_regression_rows, and hl.skat all now support configuration of the maximum number of iterations and the tolerance.
(#11835) Add hl.ggplot.geom_density which renders a plot of an approximation of the probability density function of its argument.
Bug fixes
(#11815) Fix incorrectly missing entries in to_dense_mt at the position of ref block END.
(#11828) Fix hl.init to not ignore its sc argument. This bug was introduced in 0.2.94.
(#11830) Fix an error and relax a timeout which caused hailtop.aiotools.copy to hang.
(#11778) Fix a (different) error which could cause hangs in hailtop.aiotools.copy.
Version 0.2.94
Released 2022-04-26
Deprecation
(#11765) Deprecated and removed linear mixed model functionality.
Beta features
(#11782) hl.import_table is up to twice as fast for small tables.
New features
hailctl dataproc
(#11710) Support pass-through arguments to connect.
Bug fixes
(#11792) Resolved issue where corrupted tables could be created with whole-stage code generation enabled.
Version 0.2.93
Released 2022-03-27
Beta features
Several issues with the beta version of Hail Query on Hail Batch are addressed in this release.
Version 0.2.92
Released 2022-03-25
New features
(#11613) Add hl.ggplot support for scale_fill_hue, scale_color_hue, scale_fill_manual, and scale_color_manual. This allows for an infinite number of discrete colors.
(#11608) Add all remaining and all versions of extant public gnomAD datasets to the Hail Annotation Database and Datasets API. Current as of March 23rd 2022.
(#11662) Add the weight aesthetic to geom_bar.
Beta features
This version of Hail includes all the necessary client-side infrastructure to execute Hail Query pipelines on a Hail Batch cluster. This effectively enables a “serverless” version of Hail Query which is independent of Apache Spark. Broad affiliated users should contact the Hail team for help using Hail Query on Hail Batch. Unaffiliated users should also contact the Hail team to discuss the feasibility of running your own Hail Batch cluster. The Hail team is accessible at both https://hail.zulipchat.com and https://discuss.hail.is .
Version 0.2.91
Released 2022-03-18
Bug fixes
(#11614) Update hail.utils.tutorial.get_movie_lens to use https instead of http. Movie Lens has stopped serving data over insecure HTTP.
(#11563) Fix issue hail-is/hail#11562.
(#11611) Fix a bug that prevents the display of hl.ggplot.geom_hline and hl.ggplot.geom_vline.
Version 0.2.90
Released 2022-03-11
Critical BlockMatrix from_numpy correctness bug
(#11555) BlockMatrix.from_numpy did not work correctly. Version 1.0 of org.scalanlp.breeze, a dependency of Apache Spark that hail also depends on, has a correctness bug that results in BlockMatrices that repeat the top left block of the block matrix for every block. This affected anyone running Spark 3.0.x or 3.1.x.
Bug fixes
(#11556) Fixed assertion error occasionally being thrown by valid joins where the join key was a prefix of the left key.
Versioning
(#11551) Support Python 3.10.
Version 0.2.89
Released 2022-03-04
(#11452) Fix impute_sex_chromosome_ploidy docs.
Version 0.2.88
Released 2022-03-01
This release addresses the deploy issues in the 0.2.87 release of Hail.
Version 0.2.87
Released 2022-02-28
An error in the deploy process required us to yank this release from PyPI. Please do not use this release.
Bug fixes
(#11401) Fixed bug where from_pandas didn’t support missing strings.
Version 0.2.86
Released 2022-02-25
Bug fixes
Performance improvements
(#11306) Newly written tables that have no duplicate keys will be faster to join against.
Version 0.2.85
Released 2022-02-14
Bug fixes
New features
(#11332) Added geom_ribbon and geom_area to hail ggplot.
Version 0.2.84
Released 2022-02-10
Bug fixes
(#11328) Fix bug where occasionally files written to disk would be unreadable.
(#11331) Fix bug that potentially caused files written to disk to be unreadable.
(#11312) Fix aggregator memory leak.
(#11340) Fix bug where repeatedly annotating same field name could cause failure to compile.
(#11342) Fix to possible issues about having too many open file handles.
New features
Version 0.2.83
Released 2022-02-01
Bug fixes
New features
(#11274) Added geom_col to hail.ggplot.
hailctl dataproc
(#11280) Updated dataproc image version to one not affected by log4j vulnerabilities.
Version 0.2.82
Released 2022-01-24
Bug fixes
(#11209) Significantly improved usefulness and speed of Table.to_pandas, resolved several bugs with output.
New features
Performance Improvements
(#11216) Significantly improve performance of parse_locus_interval.
Python and Java Support
File Format
The native file format version is now 1.6.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
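The compatibility rule behind this note can be sketched as a simple version comparison. This is a minimal illustration, assuming file format versions compare component by component as integers; the helper names are hypothetical and this is not how Hail itself performs the check.

```python
def parse_format_version(text):
    """Turn a dotted version such as '1.6.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def can_read(newest_supported, file_version):
    """Return True if a library whose newest known file format is
    `newest_supported` can read a file written as `file_version`."""
    return parse_format_version(file_version) <= parse_format_version(newest_supported)


# A library that knows format 1.6.0 reads older files, but a library that
# only knows 1.5.0 cannot read a 1.6.0 file.
print(can_read("1.6.0", "1.5.0"), can_read("1.5.0", "1.6.0"))
```

Comparing integer tuples rather than raw strings matters once a component reaches two digits: as strings, "1.10.0" would sort before "1.9.0".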
Version 0.2.81
Released 2021-12-20
hailctl dataproc
(#11182) Updated Dataproc image version to mitigate yet more Log4j vulnerabilities.
Version 0.2.80
Released 2021-12-15
New features
(#11077) hl.experimental.write_matrix_tables now returns the paths of the written matrix tables.
hailctl dataproc
(#11157) Updated Dataproc image version to mitigate the Log4j vulnerability.
(#10900) Added --region parameter to hailctl dataproc submit.
(#11090) Teach hailctl dataproc describe how to read URLs with the protocols s3 (Amazon S3), hail-az (Azure Blob Storage), and file (local file system) in addition to gs (Google Cloud Storage).
Version 0.2.79
Released 2021-11-17
Bug fixes
(#11023) Fixed bug in call decoding that was introduced in version 0.2.78.
New features
(#10993) New function p_value_excess_het.
Version 0.2.78
Released 2021-10-19
Bug fixes
New features
(#10855) Arbitrary aggregations can be implemented using hl.agg.fold.
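As a rough plain-Python analogy for a fold-style aggregation (an assumption about the general technique, not Hail's implementation): each partition is folded with a sequencing function starting from an initial value, and the per-partition results are then merged with a combining function. The name fold_aggregate, its parameter names, and the sample data are hypothetical.

```python
from functools import reduce


def fold_aggregate(partitions, initial_value, seq_op, comb_op):
    """Fold every partition with seq_op, then merge the partial results with comb_op.

    initial_value should be neutral for comb_op (for example 0 for addition),
    because it seeds both the per-partition folds and the final merge.
    """
    partials = [reduce(seq_op, part, initial_value) for part in partitions]
    return reduce(comb_op, partials, initial_value)


# Sum of squares over data split across three partitions, one of them empty.
partitions = [[1, 2], [3], []]
sum_of_squares = fold_aggregate(
    partitions,
    0,
    lambda acc, x: acc + x * x,  # add one element to the running result
    lambda a, b: a + b,          # merge two partial results
)
print(sum_of_squares)
```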
Performance Improvements
(#10971) Substantially improve the speed of Table.collect when collecting large amounts of data.
Version 0.2.77
Released 2021-09-21
Bug fixes
Version 0.2.76
Released 2021-09-15
Bug fixes
Version 0.2.75
Released 2021-09-10
Bug fixes
(#10733) Fix a bug in tabix parsing when the size of the list of all sequences is large.
(#10765) Fix rare bug where valid pipelines would fail to compile if intervals were created conditionally.
(#10746) Various compiler improvements, decrease likelihood of ClassTooLarge errors.
(#10829) Fix a bug where hl.missing and CaseBuilder.or_error failed if their type was a struct containing a field starting with a number.
New features
(#10768) Support multiplying StringExpressions to repeat them, as with normal Python strings.
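For reference, this is the plain-Python string behaviour that (#10768) refers to. The sketch uses ordinary str values only; the Hail expression form is not shown, and the variable names are illustrative.

```python
# Multiplying a Python string by an integer repeats it.
unit = "ab"
repeated = unit * 3

# Multiplying by zero yields the empty string.
empty = unit * 0

print(repeated, repr(empty))
```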
Performance improvements
Version 0.2.74
Released 2021-07-26
Bug fixes
Version 0.2.73
Released 2021-07-22
Bug fixes
Version 0.2.72
Released 2021-07-19
New Features
Bug fixes
Version 0.2.71
Released 2021-07-08
New Features
Bug fixes
hailctl dataproc
(#10633) Added --scopes parameter to hailctl dataproc start.
Version 0.2.70
Released 2021-06-21
Version 0.2.69
Released 2021-06-14
New Features
Bug fixes
hailctl dataproc
(#10574) Hail logs will now be stored in /home/hail by default.
Version 0.2.68
Released 2021-05-27
Version 0.2.67
Released 2021-05-06
Critical performance fix
(#10451) Fixed a memory leak / performance bug triggered by hl.literal(...).contains(...).
Version 0.2.66
Released 2021-05-03
New features
Version 0.2.65
Released 2021-04-14
Default Spark Version Change
Starting from version 0.2.65, Hail uses Spark 3.1.1 by default. This will also allow the use of all python versions >= 3.6. By building hail from source, it is still possible to use older versions of Spark.
New features
Performance improvements
(#10233) Loops created with hl.experimental.loop will now clean up unneeded memory between iterations.
Bug fixes
(#10227) hl.nd.qr now supports ndarrays that have 0 rows or columns.
Version 0.2.64
Released 2021-03-11
New features
(#10164) Add source_file_field parameter to hl.import_table to allow lines to be associated with their original source file.
Bug fixes
(#10182) Fixed serious memory leak in certain uses of filter_intervals.
(#10133) Fix bug where some pipelines incorrectly infer missingness, leading to a type error.
(#10134) Teach hl.king to treat filtered entries as missing values.
(#10158) Fixes hail usage in latest versions of jupyter that rely on asyncio.
(#10174) Fixed bad error message when incorrect return type specified with hl.loop.
Version 0.2.63
Released 2021-03-01
(#10105) Hail will now return frozenset and hail.utils.frozendict instead of normal sets and dicts.
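One practical consequence of returning immutable collections is that they are hashable, so they can be used as dictionary keys or set members. The sketch below uses only Python's built-in frozenset; hail.utils.frozendict is not imported here, and the sample names and kinship value are made up for illustration.

```python
# Two frozensets with the same members are equal and hash the same,
# regardless of the order in which the members were written.
pair_a = frozenset({"sample1", "sample2"})
pair_b = frozenset({"sample2", "sample1"})

# A frozenset can be a dictionary key; a regular set would raise TypeError.
kinship = {pair_a: 0.25}
lookup = kinship[pair_b]

# A frozenset has no mutating methods such as add().
try:
    pair_a.add("sample3")
    mutated = True
except AttributeError:
    mutated = False

print(lookup, mutated)
```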
Bug fixes
Performance Improvements
Version 0.2.62
Released 2021-02-03
New features
(#9936) Deprecated hl.null in favor of hl.missing for naming consistency.
(#9973) hl.vep now includes a vep_proc_id field to aid in debugging unexpected output.
(#9839) Hail now eagerly deletes temporary files produced by some BlockMatrix operations.
(#9835) hl.any and hl.all now also support a single collection argument and a varargs of Boolean expressions.
(#9816) hl.pc_relate now includes values on the diagonal of kinship, IBD-0, IBD-1, and IBD-2.
(#9736) Let NDArrayExpression.reshape take varargs instead of mandating a tuple.
(#9766) hl.export_vcf now warns if INFO field names are invalid according to the VCF 4.3 spec.
Bug fixes
(#9976) Fixed show() representation of Hail dictionaries.
Performance improvements
(#9909) Improved performance of hl.experimental.densify by approximately 35%.
Version 0.2.61
Released 2020-12-03
New features
(#9749) Add or_error method to SwitchBuilder (hl.switch).
Bug fixes
Version 0.2.60
Released 2020-11-16
New features
(#9696) hl.experimental.export_elasticsearch will now support Elasticsearch versions 6.8 - 7.x by default.
Bug fixes
(#9641) Showing hail ndarray data now always prints in correct order.
hailctl dataproc
(#9610) Support interval fields in hailctl dataproc describe.
Version 0.2.59
Released 2020-10-22
Datasets / Annotation DB
(#9605) The Datasets API and the Annotation Database now support AWS, and users are required to specify what cloud platform they’re using.
hailctl dataproc
(#9609) Fixed bug where hailctl dataproc modify did not correctly print corresponding gcloud command.
Version 0.2.58
Released 2020-10-08
New features
(#9524) Hail should now be buildable using Spark 3.0.
(#9549) Add ignore_in_sample_frequency flag to hl.de_novo.
(#9501) Configurable cache size for BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9474) Add ArrayExpression.first and ArrayExpression.last.
(#9459) Add StringExpression.join, an analogue to Python’s str.join.
(#9398) Hail will now throw HailUserErrors if the or_error branch of a CaseBuilder is hit.
Bug fixes
(#9503) NDArrays can now hold arbitrary data types, though only ndarrays of primitives can be collected to Python.
(#9501) Remove memory leak in BlockMatrix.to_matrix_table_row_major and BlockMatrix.to_table_row_major.
(#9424) hl.experimental.writeBlockMatrices didn’t correctly support overwrite flag.
Performance improvements
(#9506) hl.agg.ndarray_sum will now do a tree aggregation.
hailctl dataproc
Deprecations
(#9482) ArrayExpression.head has been deprecated in favor of ArrayExpression.first.
Version 0.2.57
Released 2020-09-03
New features
(#9343) Implement the KING method for relationship inference as hl.methods.king.
Version 0.2.56
Released 2020-08-31
New features
Performance
Bug fixes
(#9304) Fix crash in run_combiner caused by inputs where VCF lines and BGZ blocks align.
hailctl dataproc
Version 0.2.55
Released 2020-08-19
Performance
(#9264) Table.checkpoint now uses a faster LZ4 compression scheme.
Bug fixes
(#9250) hailctl dataproc no longer uses deprecated gcloud flags. Consequently, users must update to a recent version of gcloud.
(#9294) The “Python 3” kernel in notebooks in clusters started by hailctl dataproc now features the same Spark monitoring widget found in the “Hail” kernel. There is now no reason to use the “Hail” kernel.
File Format
The native file format version is now 1.5.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.54
Released 2020-08-07
VCF Combiner
New features
(#9209) Add hl.agg.ndarray_sum aggregator.
Bug fixes
Version 0.2.53
Released 2020-07-30
Bug fixes
Version 0.2.52
Released 2020-07-29
Bug fixes
Version 0.2.51
Released 2020-07-28
Bug fixes
Version 0.2.50
Released 2020-07-23
Bug fixes
(#9114) Fixed crash when using repeated calls to hl.filter_intervals.
New features
Version 0.2.49
Released 2020-07-08
Bug fixes
(#9058) Fixed memory leak affecting Table.aggregate, MatrixTable.annotate_cols aggregations, and hl.sample_qc.
Version 0.2.48
Released 2020-07-07
Bug fixes
(#9029) Fix crash when using hl.agg.linreg with no aggregated data records.
(#9028) Fixed memory leak affecting Table.annotate with scans, hl.experimental.densify, and Table.group_by / aggregate.
(#8978) Fixed aggregation behavior of MatrixTable.{group_rows_by, group_cols_by} to skip filtered entries.
Version 0.2.47
Released 2020-06-23
Bug fixes
Version 0.2.46
Released 2020-06-17
Site
(#8955) Natural language documentation search
Bug fixes
(#8981) Fix BlockMatrix OOM triggered by the MatrixWriteBlockMatrix WriteBlocksRDD method
Version 0.2.45
Released 2020-06-15
Bug fixes
hailctl dataproc
Version 0.2.44
Released 2020-06-06
New Features
Bug fixes
(#8883) Fix an issue related to failures in pipelines with force_bgz=True.
Performance
(#8887) Substantially improve the performance of hl.experimental.import_gtf.
Version 0.2.43
Released 2020-05-28
Bug fixes
Version 0.2.42
Released 2020-05-27
New Features
Bug fixes
Version 0.2.41
Released 2020-05-15
Bug fixes
hailctl dataproc
(#8790) Use configured compute zone as default for hailctl dataproc connect and hailctl dataproc modify.
Version 0.2.40
Released 2020-05-12
VCF Combiner
(#8706) Add option to key by both locus and alleles for final output.
Bug fixes
Version 0.2.39
Released 2020-04-29
Bug fixes
(#8615) Fix contig ordering in the CanFam3 (dog) reference genome.
(#8622) Fix bug that causes inscrutable JVM Bytecode errors.
(#8645) Ease unnecessarily strict assertion that caused errors when aggregating by key (e.g. hl.experimental.spread).
(#8621) hl.nd.array now supports arrays with no elements (e.g. hl.nd.array([]).reshape((0, 5))) and, consequently, matmul with an inner dimension of zero.
New features
(#8571) hl.init(skip_logging_configuration=True) will skip configuration of Log4j. Users may use this to configure their own logging.
(#8588) Users who manually build Python wheels will experience less unnecessary output when doing so.
(#8572) Add hl.parse_json which converts a string containing JSON into a Hail object.
Performance Improvements
Documentation
Version 0.2.38
Released 2020-04-21
Critical Linreg Aggregator Correctness Bug
(#8575) Fixed a correctness bug in the linear regression aggregator. This was introduced in version 0.2.29. See https://discuss.hail.is/t/possible-incorrect-linreg-aggregator-results-in-0-2-29-0-2-37/1375 for more details.
Performance improvements
(#8558) Make hl.experimental.export_entries_by_col more fault tolerant.
Version 0.2.37
Released 2020-04-14
Bug fixes
(#8487) Fix incorrect handling of badly formatted data for hl.gp_dosage.
(#8497) Fix handling of missingness for hl.hamming.
(#8537) Fix compile-time error.
(#8539) Fix compiler error in Table.multi_way_zip_join.
(#8488) Fix hl.agg.call_stats to appropriately throw an error for badly-formatted calls.
New features
(#8327) Attempting to write to the same file being read from in a pipeline will now throw an error instead of corrupting data.
Version 0.2.36
Released 2020-04-06
Critical Memory Management Bug Fix
(#8463) Reverted a change (separate to the bug in 0.2.34) that led to a memory leak in version 0.2.35.
Bug fixes
Version 0.2.35
Released 2020-04-02
Critical Memory Management Bug Fix
(#8412) Fixed a serious per-partition memory leak that causes certain pipelines to run out of memory unexpectedly. Please update from 0.2.34.
New features
(#8404) Added “CanFam3” (a reference genome for dogs) as a bundled reference genome.
Bug fixes
Performance Improvements
hailctl dataproc
Version 0.2.34
Released 2020-03-12
New features
Bug fixes
hailctl dataproc
(#8253) hailctl dataproc now supports new flags --requester-pays-allow-all and --requester-pays-allow-buckets. This will configure your hail installation to be able to read from requester pays buckets. The charges for reading from these buckets will be billed to the project that the cluster is created in.
(#8268) The data sources for VEP have been moved to gs://hail-us-vep, gs://hail-eu-vep, and gs://hail-uk-vep, which are requester-pays buckets in Google Cloud. hailctl dataproc will automatically infer which of these buckets you should pull data from based on the region your cluster is spun up in. If you are in none of those regions, please contact us on discuss.hail.is.
File Format
The native file format version is now 1.4.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.33
Released 2020-02-27
New features
(#8173) Added new method hl.zeros.
Bug fixes
(#8153) Fixed compiler bug causing MatchError in import_bgen.
(#8123) Fixed an issue with multiple Python HailContexts running on the same cluster.
(#8150) Fixed an issue where output from VEP about failures was not reported in error message.
(#8152) Fixed an issue where the row count of a MatrixTable coming from import_matrix_table was incorrect.
(#8175) Fixed a bug where persist did not actually do anything.
hailctl dataproc
(#8079) Using connect to open the jupyter notebook browser will no longer crash if your project contains requester-pays buckets.
Version 0.2.32
Released 2020-02-07
Critical performance regression fix
(#7989) Fixed performance regression leading to a large slowdown when hl.variant_qc was run after filtering columns.
Performance
Bug fixes
(#7976) Fixed divide-by-zero error in hl.concordance with no overlapping rows or cols.
(#7965) Fixed optimizer error leading to crashes caused by MatrixTable.union_rows.
(#8035) Fix compiler bug in Table.multi_way_zip_join.
(#8021) Fix bug in computing shape after BlockMatrix.filter.
(#7986) Fix error in NDArray matrix/vector multiply.
New features
(#8007) Add hl.nd.diagonal function.
Cheat sheets
Version 0.2.31
Released 2020-01-22
New features
(#7787) Added transition/transversion information to hl.summarize_variants.
(#7792) Add Python stack trace to array index out of bounds errors in Hail pipelines.
(#7832) Add spark_conf argument to hl.init, permitting configuration of Spark runtime for a Hail session.
(#7823) Added datetime functions hl.experimental.strptime and hl.experimental.strftime.
(#7888) Added hl.nd.array constructor from nested standard arrays.
File size
(#7923) Fixed compression problem since 0.2.23 resulting in larger-than-expected matrix table files for datasets with few entry fields (e.g. GT-only datasets).
Performance
Bug fixes
Version 0.2.30
Released 2019-12-20
Performance
New features
(#7614) Added experimental support for loops with hl.experimental.loop.
Miscellaneous
(#7745) Changed export_vcf to only use scientific notation when necessary.
Version 0.2.29
Released 2019-12-17
Bug fixes
(#7229) Fixed hl.maximal_independent_set tie breaker functionality.
(#7732) Fixed incompatibility with old files leading to incorrect data read when filtering intervals after read_matrix_table.
(#7642) Fixed crash when constant-folding functions that throw errors.
(#7611) Fixed hl.hadoop_ls to handle glob patterns correctly.
(#7653) Fixed crash in ld_prune by unfiltering missing GTs.
Performance improvements
New features
(#7686) Added comment argument to import_matrix_table, allowing lines with certain prefixes to be ignored.
(#7688) Added experimental support for NDArrayExpressions in new hl.nd module.
(#7608) hl.grep now has a show argument that allows users to either print the results (default) or return a dictionary of the results.
hailctl dataproc
(#7717) Throw error when misspelling arguments instead of silently quitting.
Version 0.2.28
Released 2019-11-22
Critical correctness bug fix
(#7588) Fixes a bug where filtering old matrix tables in newer versions of hail did not work as expected. Please update from 0.2.27.
Bug fixes
New Features
hailctl dataproc
(#7586) hailctl dataproc now supports --gcloud_configuration option.
Documentation
(#7570) Hail has a cheatsheet for Tables now.
Version 0.2.27
Released 2019-11-15
New Features
(#7379) Add delimiter argument to hl.import_matrix_table.
(#7389) Add force and force_bgz arguments to hl.experimental.import_gtf.
(#7467) Added hl.if_else as an alias for hl.cond; deprecated hl.cond.
(#7453) Add hl.parse_int{32, 64} and hl.parse_float{32, 64}, which can parse strings to numbers and return missing on failure.
(#7475) Add row_join_type argument to MatrixTable.union_cols to support outer joins on rows.
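The "return missing on failure" behaviour described for the parse functions in (#7453) can be pictured with a plain-Python analogy. This is only a sketch: None stands in for a Hail missing value, parse_int_or_missing is a hypothetical helper, and Python's int() accepts some inputs (such as surrounding whitespace) that Hail's parsers may treat differently.

```python
def parse_int_or_missing(text):
    """Parse text as an integer, returning None (a stand-in for 'missing')
    instead of raising when the text is not a valid integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


# Well-formed input parses; malformed input becomes missing rather than an error.
values = [parse_int_or_missing(s) for s in ["42", "-7", "4x", ""]]
print(values)
```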
Bug fixes
hailctl dataproc
(#7460) The Spark monitor widget now automatically collapses after a job completes.
Version 0.2.26
Released 2019-10-24
New Features
Bug Fixes
(#7361) Fix AD calculation in sparse_split_multi.
Performance Improvements
(#7355) Improve performance of IR copying.
File Format
The native file format version is now 1.3.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.25
Released 2019-10-14
New features
(#7240) Add interactive schema widget to {MatrixTable, Table}.describe. Use this by passing the argument widget=True.
(#7250) {Table, MatrixTable, Expression}.summarize() now summarizes elements of collections (arrays, sets, dicts).
(#7271) Improve hl.plot.qq by increasing point size, adding the unscaled p-value to hover data, and printing lambda-GC on the plot.
(#7280) Add HTML output for {Table, MatrixTable, Expression}.summarize().
(#7294) Add HTML output for hl.summarize_variants().
Bug fixes
Performance improvements
File Format
The native file format version is now 1.2.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.24
Released 2019-10-03
hailctl dataproc
(#7185) Resolve issue in dependencies that led to a Jupyter update breaking cluster creation.
New features
(#7071) Add permit_shuffle flag to hl.{split_multi, split_multi_hts} to allow processing of datasets with both multiallelics and duplicate loci.
(#7121) Add hl.contig_length function.
(#7130) Add window method on LocusExpression, which creates an interval around a locus.
(#7172) Permit hl.init(sc=sc) with pip-installed packages, given the right configuration options.
Bug fixes
Version 0.2.23
Released 2019-09-23
hailctl dataproc
Bug fixes
New features
(#7009) Introduced analysis pass in Python that mostly obviates the hl.bind and hl.rbind operators; idiomatic Python that generates Hail expressions will perform much better.
(#7076) Improved memory management in generated code, add additional log statements about allocated memory to improve debugging.
(#7085) Warn only once about schema mismatches during JSON import (used in VEP, Nirvana, and sometimes import_table).
(#7106) hl.agg.call_stats can now accept a number of alleles for its alleles parameter, useful when dealing with biallelic calls without the alleles array at hand.
Performance
Version 0.2.22
Released 2019-09-12
New features
(#7013) Added contig_recoding to import_bed and import_locus_intervals.
Performance
hailctl dataproc
(#7003) Pass through extra arguments for hailctl dataproc list and hailctl dataproc stop.
Version 0.2.21
Released 2019-09-03
Bug fixes
New features
Performance
hailctl dataproc
Version 0.2.20
Released 2019-08-19
Critical memory management fix
(#6824) Fixed memory management inside annotate_cols with aggregations. This was causing memory leaks and segfaults.
Bug fixes
New features
(#6847) Added hl.nanmin and hl.nanmax functions.
Version 0.2.19
Released 2019-08-01
Critical performance bug fix
Bug fixes
(#6757) Fixed correctness bug in optimizations applied to the combination of Table.order_by with hl.desc arguments and show(), leading to tables sorted in ascending, not descending order.
(#6770) Fixed assertion error caused by Table.expand_types(), which was used by Table.to_spark and Table.to_pandas.
Performance Improvements
(#6666) Slightly improve performance of hl.pca and hl.hwe_normalized_pca.
(#6669) Improve performance of hl.split_multi and hl.split_multi_hts.
(#6644) Optimize core code generation primitives, leading to across-the-board performance improvements.
(#6775) Fixed a major performance problem related to reading block matrices.
hailctl dataproc
(#6760) Fixed the address pointed at by ui in connect, after Google changed proxy settings that rendered the UI URL incorrect. Also added new address hist/spark-history.
Version 0.2.18
Released 2019-07-12
Critical performance bug fix
(#6605) Resolved code generation issue leading to a performance regression of 1-3 orders of magnitude in Hail pipelines using constant strings or literals. This includes almost every pipeline! This issue exists in versions 0.2.15, 0.2.16, and 0.2.17, and any users on those versions should update as soon as possible.
Bug fixes
(#6598) Fixed code generated by MatrixTable.unfilter_entries to improve performance. This will slightly improve the performance of hwe_normalized_pca and relatedness computation methods, which use unfilter_entries internally.
Version 0.2.17
Released 2019-07-10
New features
(#6349) Added compression parameter to export_block_matrices, which can be 'gz' or 'bgz'.
(#6405) When a matrix table has string column-keys, matrixtable.show uses the column key as the column name.
(#6345) Added an improved scan implementation, which reduces the memory load on master.
(#6462) Added export_bgen method.
(#6473) Improved performance of hl.agg.array_sum by about 50%.
(#6498) Added method hl.lambda_gc to calculate the genomic control inflation factor.
(#6456) Dramatically improved performance of pipelines containing long chains of calls to Table.annotate, or MatrixTable equivalents.
(#6506) Improved the performance of the generated code for the Table.annotate(**thing) pattern.
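For readers unfamiliar with the genomic control inflation factor mentioned in (#6498), the following is a minimal sketch of the textbook definition in plain Python. It does not call Hail and is not necessarily how hl.lambda_gc is implemented; the function name lambda_gc_sketch and the assumption of two-sided p-values from 1-degree-of-freedom tests are illustrative only.

```python
from statistics import NormalDist, median


def lambda_gc_sketch(p_values):
    """Genomic control inflation factor from two-sided p-values (1 d.f.).

    Hypothetical helper for illustration; not part of Hail.
    """
    std_normal = NormalDist()
    # Convert each p-value to a 1-d.f. chi-squared statistic: chi2 = z**2,
    # where z is the standard normal quantile at 1 - p/2.
    chi2_stats = [std_normal.inv_cdf(1.0 - p / 2.0) ** 2 for p in p_values]
    # Median of the chi-squared(1) distribution, roughly 0.455.
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median


# Evenly spaced p-values mimic a well-calibrated (null) test.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
print(lambda_gc_sketch(null_p))                     # close to 1
print(lambda_gc_sketch([p / 2 for p in null_p]))    # noticeably above 1
```

A value near 1 suggests the test statistics are well calibrated; values well above 1 suggest inflation, for example from population stratification.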
Bug fixes
(#6404) Added n_rows and n_cols parameters to Expression.show for consistency with other show methods.
(#6408)(#6419) Fixed an issue where the filter_intervals optimization could make scans return incorrect results.
(#6459)(#6458) Fixed rare correctness bug in the filter_intervals optimization which could result in too many rows being kept.
(#6496) Fixed html output of show methods to truncate long field contents.
(#6478) Fixed the broken documentation for the experimental approx_cdf and approx_quantiles aggregators.
(#6504) Fix Table.show collecting data twice while running in Jupyter notebooks.
(#6571) Fixed the message printed in hl.concordance to print the number of overlapping samples, not the full list of overlapping sample IDs.
(#6583) Fixed hl.plot.manhattan for non-default reference genomes.
Experimental
(#6488) Exposed table.multi_way_zip_join. This takes a list of tables of identical types, and zips them together into one table.
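To make the idea of zipping several tables into one more concrete, here is a rough plain-Python model of the operation described in (#6488). It does not use Hail. Tables are modeled as dicts from key to row, and the choice to keep every key and fill absent rows with None (an outer-join style) is an assumption of this sketch, not something the entry above states.

```python
def zip_join_sketch(tables):
    """Zip tables (dicts mapping key -> row) into one dict with the same keys.

    Each output value is a list with one slot per input table. A slot is
    None when that table has no row for the key (assumed behavior).
    Hypothetical helper for illustration; not part of Hail.
    """
    all_keys = sorted(set().union(*(t.keys() for t in tables)))
    return {k: [t.get(k) for t in tables] for k in all_keys}


t1 = {"s1": {"x": 1}, "s2": {"x": 2}}
t2 = {"s2": {"x": 20}, "s3": {"x": 30}}
joined = zip_join_sketch([t1, t2])
for key, rows in joined.items():
    print(key, rows)
```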
File Format
The native file format version is now 1.1.0. Older versions of Hail will not be able to read tables or matrix tables written by this version of Hail.
Version 0.2.16
Released 2019-06-19
hailctl
(#6357) Accommodated Google Dataproc bug causing cluster creation failures.
Bug fixes
(#6378) Fixed problem in how entry_float_type was being handled in import_vcf.
Version 0.2.15
Released 2019-06-14
After some infrastructural changes to our development process, we should be getting back to frequent releases.
hailctl
Starting in 0.2.15, pip installations of Hail come bundled with a command-line tool, hailctl. This tool subsumes the functionality of cloudtools, which is now deprecated. See the release thread on the forum for more information.
New features
(#5932)(#6115) hl.import_bed and hl.import_locus_intervals now accept keyword arguments to pass through to hl.import_table, which is used internally. This permits parameters like min_partitions to be set.
(#5980) Added log option to hl.plot.histogram2d.
(#5937) Added all_matches parameter to Table.index and MatrixTable.index_{rows, cols, entries}, which produces an array of all rows in the indexed object matching the index key. This makes it possible to, for example, annotate all intervals overlapping a locus.
(#5913) Added functionality that makes arrays of structs easier to work with.
(#6089) Added HTML output to Expression.show when running in a notebook.
(#6172) hl.split_multi_hts now uses the original GQ value if the PL is missing.
(#6123) Added hl.binary_search to search sorted numeric arrays.
(#6224) Moved implementation of hl.concordance from backend to Python. Performance directly from read() is slightly worse, but inside larger pipelines this function will be optimized much better than before, and it will benefit from improvements to general infrastructure.
(#6214) Updated Hail Python dependencies.
(#5979) Added optimizer pass to rewrite filter expressions on keys as interval filters where possible, leading to massive speedups for point queries. See the blog post for examples.
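The speedup described in (#5979) comes from replacing a row-by-row predicate test with a lookup on key-ordered data. The sketch below illustrates that general idea in plain Python with a sorted list and binary search. It is not Hail's optimizer; the function names and the assumption that rows are stored sorted by key are illustrative.

```python
from bisect import bisect_left, bisect_right


def filter_scan(sorted_keys, lo, hi):
    """Baseline: evaluate the predicate lo <= key <= hi on every row."""
    return [k for k in sorted_keys if lo <= k <= hi]


def filter_interval(sorted_keys, lo, hi):
    """Rewritten form: binary-search the sorted keys for the interval [lo, hi]."""
    start = bisect_left(sorted_keys, lo)   # first position with key >= lo
    stop = bisect_right(sorted_keys, hi)   # first position with key > hi
    return sorted_keys[start:stop]


keys = list(range(0, 1_000_000, 3))
# Both forms return the same rows; the interval form touches far fewer keys.
assert filter_scan(keys, 300, 330) == filter_interval(keys, 300, 330)
```

The scan does work proportional to the number of rows, while the interval lookup does work proportional to the logarithm of the number of rows plus the size of the result, which is why point queries benefit most.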
Bug fixes
(#5895) Fixed crash caused by -0.0 floating-point values in hl.agg.hist.
(#6013) Turned off feature in HTSJDK that caused crashes in hl.import_vcf due to header fields being overwritten with different types, if the field had a different type than the type in the VCF 4.2 spec.
(#6117) Fixed problem causing Table.flatten() to be quadratic in the size of the schema.
(#6228)(#5993) Fixed MatrixTable.union_rows() to join distinct keys on the right, preventing an unintentional cartesian product.
(#6235) Fixed an issue related to aggregation inside MatrixTable.filter_cols.
(#6226) Restored lost behavior where Table.show(x < 0) shows the entire table.
(#6267) Fixed cryptic crashes related to hl.split_multi and MatrixTable.entries() with duplicate row keys.
Version 0.2.14
Released 2019-04-24
A backward-incompatible patch update to PySpark, 2.4.2, has broken fresh pip installs of Hail 0.2.13. To fix this, either downgrade PySpark to 2.4.1 or upgrade to the latest version of Hail.
New features
Version 0.2.13
Released 2019-04-18
Hail is now using Spark 2.4.x by default. If you build Hail from source, you will need to acquire this version of Spark and update your build invocations accordingly.
New features
(#5828) Remove dependency on htsjdk for VCF INFO parsing, enabling faster import of some VCFs.
(#5860) Improve performance of some column annotation pipelines.
(#5858) Add unify option to Table.union which allows unification of tables with different fields or field orderings.
(#5799) mt.entries() is four times faster.
(#5756) Hail now uses Spark 2.4.x by default.
(#5677) MatrixTable now also supports show.
(#5793)(#5701) Add array.index(x) which finds the first index of array whose value is equal to x.
(#5790) Add array.head() which returns the first element of the array, or missing if the array is empty.
(#5690) Improve performance of ld_matrix.
(#5743) mt.compute_entry_filter_stats computes statistics about the number of filtered entries in a matrix table.
(#5758) Failure to parse an interval will now produce a much more detailed error message.
(#5723) hl.import_matrix_table can now import a matrix table with no columns.
(#5724) hl.rand_norm2d samples from a two dimensional random normal.
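The semantics of array.index(x) and array.head() from (#5793)(#5701) and (#5790) can be modeled on ordinary Python lists, as in the sketch below. It does not use Hail expressions, and None stands in for a missing value. Returning missing when no element equals x is an assumption of this sketch; the entry above only specifies the empty-array case for head.

```python
def index_sketch(array, x):
    """Return the first index i with array[i] == x, else None (missing).

    Hypothetical helper for illustration; not part of Hail.
    """
    for i, value in enumerate(array):
        if value == x:
            return i
    return None  # assumption: no match is reported as missing


def head_sketch(array):
    """Return the first element, or None (missing) when the array is empty."""
    return array[0] if array else None


print(index_sketch([5, 7, 7], 7))
print(head_sketch([]))
```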
Bug fixes
(#5885) Fix Table.to_spark in the presence of fields of tuples.
(#5882)(#5886) Fix BlockMatrix conversion methods to correctly handle filtered entries.
(#5884)(#4874) Fix longstanding crash when reading Hail data files under certain conditions.
(#5855)(#5786) Fix hl.mendel_errors incorrectly reporting children counts in the presence of entry filtering.
(#5773) Fix hl.sample_qc to use correct number of total rows when calculating call rate.
(#5763)(#5764) Fix hl.agg.array_agg to work inside mt.annotate_rows and similar functions.
(#5770) Hail now uses the correct unicode string encoding which resolves a number of issues when a Table or MatrixTable has a key field containing unicode characters.
(#5692) When keyed is True, hl.maximal_independent_set now does not produce duplicates.
(#5725) Docs now consistently refer to hl.agg not agg.
(#5730)(#5782) Taught import_bgen to optimize its variants argument.
Experimental
Version 0.2.12
Released 2019-03-28
New features
Bug fixes
Experimental
(#5524) Add summarize functions to Table, MatrixTable, and Expression.
(#5570) Add hl.agg.approx_cdf aggregator for approximate density calculation.
(#5571) Add log parameter to hl.plot.histogram.
(#5601) Add hl.plot.joint_plot, extend functionality of hl.plot.scatter.
(#5608) Add LD score simulation framework.
(#5628) Add hl.experimental.full_outer_join_mt for full outer joins on MatrixTables.
Version 0.2.11
Released 2019-03-06
New features
(#5374) Add default arguments to hl.add_sequence for running on GCP.
(#5481) Added sample_cols method to MatrixTable.
(#5501) Exposed MatrixTable.unfilter_entries. See filter_entries documentation for more information.
(#5480) Added n_cols argument to MatrixTable.head.
(#5529) Added Table.{semi_join, anti_join} and MatrixTable.{semi_join_rows, semi_join_cols, anti_join_rows, anti_join_cols}.
(#5528) Added {MatrixTable, Table}.checkpoint methods as wrappers around write / read_{matrix_table, table}.
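As a reminder of what the semi-join and anti-join operations named in (#5529) mean in general, the sketch below models them on lists of dict rows in plain Python. It does not call Hail; the helper names and the representation of the other table as a collection of key values are illustrative assumptions.

```python
def semi_join_sketch(rows, other_keys, key):
    """Keep rows whose key value appears among other_keys."""
    other = set(other_keys)
    return [r for r in rows if r[key] in other]


def anti_join_sketch(rows, other_keys, key):
    """Keep rows whose key value does NOT appear among other_keys."""
    other = set(other_keys)
    return [r for r in rows if r[key] not in other]


samples = [{"s": "A", "age": 30}, {"s": "B", "age": 41}, {"s": "C", "age": 25}]
keep_list = ["A", "C", "Z"]
print(semi_join_sketch(samples, keep_list, "s"))
print(anti_join_sketch(samples, keep_list, "s"))
```

Neither operation adds fields from the other table; each only filters rows, and together the two results partition the original rows.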
Bug fixes
(#5416) Resolved issue wherein VEP and certain regressions were recomputed on each use, rather than once.
(#5419) Resolved issue with import_vcf force_bgz and file size checks.
(#5427) Resolved issue with Table.show and dictionary field types.
(#5468) Resolved ordering problem with Expression.show on key fields that are not the first key.
(#5492) Fixed hl.agg.collect crashing when collecting float32 values.
(#5525) Fixed hl.trio_matrix crashing when complete_trios is False.
Version 0.2.10
Released 2019-02-15
New features
(#5272) Added a new ‘delimiter’ option to Table.export.
(#5251) Add utility aliases to hl.plot for output_notebook and show.
(#5249) Add histogram2d function to hl.plot module.
(#5247) Expose MatrixTable.localize_entries method for converting to a Table with an entries array.
(#5300) Add new filter and find_replace arguments to hl.import_table and hl.import_vcf to apply regex and substitutions to text input.
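The kind of text preprocessing described in (#5300) can be pictured with Python's re module, as in the sketch below. It is not Hail's importer: the helper preprocess_lines is hypothetical, filter_regex stands in for the filter argument, and both the "drop lines that match" reading and the order of the two steps are assumptions of this sketch.

```python
import re


def preprocess_lines(lines, filter_regex=None, find_replace=None):
    """Drop lines matching filter_regex, then apply a regex substitution.

    filter_regex: pattern; lines containing a match are dropped (assumed).
    find_replace: (pattern, replacement) pair applied to each kept line.
    Hypothetical helper for illustration; not part of Hail.
    """
    out = []
    for line in lines:
        if filter_regex is not None and re.search(filter_regex, line):
            continue
        if find_replace is not None:
            pattern, replacement = find_replace
            line = re.sub(pattern, replacement, line)
        out.append(line)
    return out


raw = ["#comment", "chr1\t100", "chr2\t200"]
cleaned = preprocess_lines(raw, filter_regex=r"^#", find_replace=(r"^chr", ""))
print(cleaned)
```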
Performance improvements
(#5298) Reduce size of exported VCF files by exporting missing genotypes without trailing fields.
Bug fixes
(#5306) Fix ReferenceGenome.add_sequence causing a crash.
(#5268) Fix Table.export writing a file called ‘None’ in the current directory.
(#5265) Fix hl.get_reference raising an exception when called before hl.init().
(#5250) Fix crash in pc_relate when called on a MatrixTable field other than ‘GT’.
(#5278) Fix crash in Table.order_by when sorting by fields whose names are not valid Python identifiers.
(#5294) Fix crash in hl.trio_matrix when sample IDs are missing.
(#5295) Fix crash in Table.index related to key field incompatibilities.
Version 0.2.9
Released 2019-01-30
New features
Performance improvements
Bug fixes
(#5144) Fix crash caused by hl.index_bgen (since 0.2.7).
(#5177) Fix bug causing Table.repartition(n, shuffle=True) to fail to increase partitioning for unkeyed tables.
(#5173) Fix bug causing Table.show to throw an error when the table is empty (since 0.2.8).
(#5210) Fix bug causing Table.show to always print types, regardless of types argument (since 0.2.8).
(#5211) Fix bug causing MatrixTable.make_table to unintentionally discard non-key row fields (since 0.2.8).
Version 0.2.8
Released 2019-01-15
New features
Performance improvements
Bug fixes
Version 0.2.7
Released 2019-01-03
New features
(#5046)(experimental) Added option to BlockMatrix.export_rectangles to export as NumPy-compatible binary.
Performance improvements
(#5050) Short-circuit iteration in logistic_regression_rows and poisson_regression_rows if NaNs appear.
Version 0.2.6
Released 2018-12-17
New features
(#4962) Expanded comparison operators (==, !=, <, <=, >, >=) to support expressions of every type.
(#4927) Expanded functionality of Table.order_by to support ordering by arbitrary expressions, instead of just top-level fields.
(#4926) Expanded default GRCh38 contig recoding behavior in import_plink.
Performance improvements
Bug fixes
(#4941) Fixed variable scoping error in regression methods.
(#4857) Fixed bug in maximal_independent_set appearing when nodes were named something other than i and j.
(#4932) Fixed possible error in export_plink related to tolerance of writer process failure.
(#4920) Fixed bad error message in Table.order_by.
Version 0.2.5
Released 2018-12-07
New features
(#4845) The or_error method in hl.case and hl.switch statements now takes a string expression rather than a string literal, allowing more informative messages for errors and assertions.
(#4865) We use this new or_error functionality in methods that require biallelic variants to include an offending variant in the error message.
(#4820) Added hl.reversed for reversing arrays and strings.
(#4895) Added include_strand option to the hl.liftover function.
Performance improvements
Bug fixes
(#4754)(#4799) Fixed optimizer assertion errors related to certain types of pipelines using group_rows_by.
(#4888) Fixed assertion error in BlockMatrix.sum.
(#4871) Fixed possible error in locally sorting nested collections.
(#4889) Fixed break in compatibility with extremely old MatrixTable/Table files.
(#4527)(#4761) Fixed optimizer assertion error sometimes encountered with hl.split_multi[_hts].
Version 0.2.4: Beginning of history!
We didn’t start manually curating information about user-facing changes until version 0.2.4.
The full commit history is available here.