{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Table Tutorial\n", "\n", "[Table](https://hail.is/docs/0.2/hail.Table.html) is Hail's distributed analogue of a data frame or SQL table. It will be familiar if you've used R or `pandas`, but `Table` differs in 3 important ways:\n", "\n", "- It is distributed. Hail tables can store far more data than can fit on a single computer.\n", "- It carries global fields.\n", "- It is keyed.\n", "\n", "A `Table` has two different kinds of fields:\n", "\n", "- global fields\n", "- row fields" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Importing and Reading\n", "\n", "Hail can [import](https://hail.is/docs/0.2/methods/impex.html) data from many sources: TSV and CSV files, JSON files, FAM files, databases, Spark, etc. It can also *read* (and *write*) a native Hail format.\n", "\n", "You can read a dataset with [hl.read_table](https://hail.is/docs/0.2/methods/impex.html#hail.methods.read_table). It take a path and returns a `Table`. `ht` stands for Hail Table.\n", "\n", "We've provided a method to download and import [the MovieLens dataset](https://grouplens.org/datasets/movielens/100k/) of movie ratings in the Hail native format. Let's read it!\n", "\n", "F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=https://dx.doi.org/10.1145/2827872." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hail as hl\n", "hl.init()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "hl.utils.get_movie_lens('data/')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "users = hl.read_table('data/users.ht')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exploring Tables\n", "\n", "The [describe](https://hail.is/docs/0.2/hail.Table.html#hail.Table.describe) method prints the structure of a table: the fields and their types." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "users.describe()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can view the first few rows of the table using [show](https://hail.is/docs/0.2/hail.Table.html#hail.Table.show).\n", "\n", "10 rows are displayed by default. Try changing the code in the cell below to `users.show(5)`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "users.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can [count](https://hail.is/docs/0.2/hail.Table.html#hail.Table.count) the rows of a table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "users.count()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "You can access fields of tables with the Python attribute notation `table.field`, or with index notation `table['field']`. The latter is useful when the field names are not valid Python identifiers (if a field name includes a space, for example)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "users.occupation.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "users['occupation'].describe()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "`users.occupation` and `users['occupation']` are [Hail Expressions](https://hail.is/docs/0.2/expressions.html)\n", "\n", "Lets peak at their using `show`. Notice that the key is shown as well!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "users.occupation.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise\n", "\n", "The movie dataset has two other tables: `movies.ht` and `ratings.ht`. Load these tables and have a quick look around." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }