===========================
Use Hail on Google Dataproc
===========================

Requirements
------------

Before you can run Hail on Dataproc, you must have both Hail and the Google
Cloud CLI installed on your machine.

Installing Hail
---------------

First, install Hail on your Mac OS X or Linux laptop or desktop. The Hail pip
package includes a tool called ``hailctl dataproc`` which starts, stops, and
manipulates Hail-enabled Dataproc clusters.

Installing and configuring the Google Cloud SDK
-----------------------------------------------

We recommend that you follow the Google Cloud SDK documentation to install the
Google Cloud SDK.

You will need to configure your Google Cloud SDK after installation. This is
the time to set up your Google Cloud project and billing, if you have not
already done so.

Running Hail on Dataproc requires passing in a Dataproc region. If you'd like
to set your Dataproc region globally, you can do so by running:

.. code-block:: sh

    gcloud config set dataproc/region <region>

Otherwise, you can set your Dataproc region using the ``hailctl`` ``--region``
command line flag.

Starting your first Dataproc cluster
------------------------------------

Start a Dataproc cluster named "my-first-cluster". Cluster names may only
contain a mix of lowercase letters and dashes. Starting a cluster can take as
long as two minutes.

.. code-block:: sh

    hailctl dataproc start my-first-cluster

Create a file called "hail-script.py" and place in it the following analysis
of a randomly generated dataset with 500 samples and 500,000 variants.

.. code-block:: python3

    import hail as hl

    # Simulate genotypes for 500 samples and 500,000 variants.
    mt = hl.balding_nichols_model(n_populations=3,
                                  n_samples=500,
                                  n_variants=500_000,
                                  n_partitions=32)
    # Assign each sample a random boolean phenotype.
    mt = mt.annotate_cols(drinks_coffee=hl.rand_bool(0.33))
    # Regress the phenotype on alternate allele count for every variant.
    gwas = hl.linear_regression_rows(y=mt.drinks_coffee,
                                     x=mt.GT.n_alt_alleles(),
                                     covariates=[1.0])
    # Show the 25 variants with the smallest p-values.
    gwas.order_by(gwas.p_value).show(25)

Submit the analysis to the cluster and wait for the results.
You should not have to wait more than a minute.

.. code-block:: sh

    hailctl dataproc submit my-first-cluster hail-script.py

When the script is done running, you'll see 25 rows of variant association
results.

You can also start a Jupyter Notebook running on the cluster:

.. code-block:: sh

    hailctl dataproc connect my-first-cluster notebook

When you are finished with the cluster, stop it:

.. code-block:: sh

    hailctl dataproc stop my-first-cluster

Next Steps
""""""""""

- Read more about Hail on `Google Cloud <../cloud/google_cloud.rst>`__
- Get the `Hail cheatsheets <../cheatsheets.rst>`__
- Follow the Hail `GWAS Tutorial <../tutorials/01-genome-wide-association-study.rst>`__
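
The start, submit, and stop steps on this page can also be driven from a single
Python script. The following is a minimal sketch and not part of Hail:
``run_on_cluster`` and its ``runner`` parameter are hypothetical names chosen
for illustration, and the function only shells out to the three
``hailctl dataproc`` commands shown above. The ``try``/``finally`` block means
the stop command is still attempted if the submission fails.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical type alias: anything that can execute one command line.
Runner = Callable[[Sequence[str]], None]


def _run(cmd: Sequence[str]) -> None:
    # Raise CalledProcessError if the command exits with a non-zero status.
    subprocess.run(list(cmd), check=True)


def run_on_cluster(cluster: str, script: str, runner: Runner = _run) -> None:
    """Start a cluster, submit one script, then stop the cluster."""
    runner(["hailctl", "dataproc", "start", cluster])
    try:
        runner(["hailctl", "dataproc", "submit", cluster, script])
    finally:
        # Attempt to stop the cluster even if the submission raised an error.
        runner(["hailctl", "dataproc", "stop", cluster])
```

Calling ``run_on_cluster("my-first-cluster", "hail-script.py")`` would run the
same three commands as the walkthrough. If the start command itself fails, the
sketch does not attempt a stop, so a partially created cluster may need to be
cleaned up by hand.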