{ "cells": [ { "cell_type": "markdown", "id": "3343aabc", "metadata": {}, "source": [ "## GGPlot Tutorial\n" ] }, { "cell_type": "code", "execution_count": null, "id": "popular-budapest", "metadata": {}, "outputs": [], "source": [ "import hail as hl\n", "from hail.ggplot import *\n", "\n", "import plotly" ] }, { "cell_type": "markdown", "id": "controlling-class", "metadata": {}, "source": [ "The Hail team has implemented a plotting module for hail based on the very popular `ggplot2` package from R's tidyverse. That library is very fully featured and we will never be quite as flexible as it, but with just a subset of its functionality we can make highly customizable plots." ] }, { "cell_type": "markdown", "id": "literary-geology", "metadata": {}, "source": [ "### The Grammar of Graphics\n", "\n", "The key idea here is that there's not one magic function to make the plot you want. Plots are built up from a set of core primitives that allow for extensive customization. Let's start with an example. We are going to plot y = x^2 for x from 0 to 10. First we make a hail table representing that data:" ] }, { "cell_type": "code", "execution_count": null, "id": "opponent-financing", "metadata": {}, "outputs": [], "source": [ "ht = hl.utils.range_table(10)\n", "ht = ht.annotate(squared = ht.idx**2)" ] }, { "cell_type": "markdown", "id": "floppy-dodge", "metadata": {}, "source": [ "Every plot starts with a call to `ggplot`, and then requires adding a `geom` to specify what kind of plot you'd like to create." ] }, { "cell_type": "code", "execution_count": null, "id": "negative-chart", "metadata": {}, "outputs": [], "source": [ "fig = ggplot(ht, aes(x=ht.idx, y=ht.squared)) + geom_line()\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "elect-warner", "metadata": {}, "source": [ "`aes` creates an \"aesthetic mapping\", which maps hail expressions to aspects of the plot. There is a predefined list of aesthetics supported by every `geom`. Most take an `x` and `y` at least. \n", "\n", "With this interface, it's easy to change out our plotting representation separate from our data. We can plot bars:" ] }, { "cell_type": "code", "execution_count": null, "id": "innovative-peninsula", "metadata": {}, "outputs": [], "source": [ "fig = ggplot(ht, aes(x=ht.idx, y=ht.squared)) + geom_col()\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "sustained-brook", "metadata": {}, "source": [ "Or points:" ] }, { "cell_type": "code", "execution_count": null, "id": "random-inquiry", "metadata": {}, "outputs": [], "source": [ "fig = ggplot(ht, aes(x=ht.idx, y=ht.squared)) + geom_point()\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "skilled-tiger", "metadata": {}, "source": [ "There are optional aesthetics too. If we want, we could color the points based on whether they're even or odd:" ] }, { "cell_type": "code", "execution_count": null, "id": "spatial-camera", "metadata": {}, "outputs": [], "source": [ "fig = ggplot(ht, aes(x=ht.idx, y=ht.squared, color=hl.if_else(ht.idx % 2 == 0, \"even\", \"odd\"))) + geom_point()\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "blessed-brighton", "metadata": {}, "source": [ "Note that the `color` aesthetic by default just takes in an expression that evaluates to strings, and it assigns a discrete color to each string.\n", "\n", "Say we wanted to plot the line with the colored points overlayed on top of it. We could try:" ] }, { "cell_type": "code", "execution_count": null, "id": "middle-nirvana", "metadata": {}, "outputs": [], "source": [ "fig = (ggplot(ht, aes(x=ht.idx, y=ht.squared, color=hl.if_else(ht.idx % 2 == 0, \"even\", \"odd\"))) + \n", " geom_line() + \n", " geom_point() \n", " )\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "material-empire", "metadata": {}, "source": [ "But that is coloring the line as well, causing us to end up with interlocking blue and orange lines, which isn't what we want. For that reason, it's possible to define aesthetics that only apply to certain geoms." ] }, { "cell_type": "code", "execution_count": null, "id": "activated-apollo", "metadata": {}, "outputs": [], "source": [ "fig = (ggplot(ht, aes(x=ht.idx, y=ht.squared)) + \n", " geom_line() + \n", " geom_point(aes(color=hl.if_else(ht.idx % 2 == 0, \"even\", \"odd\"))) \n", " )\n", "fig.show()" ] }, { "cell_type": "markdown", "id": "australian-paradise", "metadata": {}, "source": [ "All geoms can take in their own aesthetic mapping, which lets them specify aesthetics specific to them. And `geom_point` still inherits the `x` and `y` aesthetics from the mapping defined in `ggplot()`." ] }, { "cell_type": "markdown", "id": "animal-cleveland", "metadata": {}, "source": [ "### Geoms that group\n", "\n", "Some geoms implicitly do an aggregation based on the `x` aesthetic, and so don't take a `y` value. Consider this dataset from gapminder with information about countries around the world, with one datapoint taken per country in the years 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, and 2007." ] }, { "cell_type": "code", "execution_count": null, "id": "cardiovascular-paraguay", "metadata": {}, "outputs": [], "source": [ "gp = hl.Table.from_pandas(plotly.data.gapminder())\n", "gp.describe()" ] }, { "cell_type": "markdown", "id": "democratic-proxy", "metadata": {}, "source": [ "Let's filter the data to 2007 for our first experiments" ] }, { "cell_type": "code", "execution_count": null, "id": "identical-living", "metadata": {}, "outputs": [], "source": [ "gp_2007 = gp.filter(gp.year == 2007)" ] }, { "cell_type": "markdown", "id": "sharp-honey", "metadata": {}, "source": [ "If we want to see how many countries from each continent we have, we can use `geom_bar`, which just takes in an x aesthetic and then implicitly counts how many values of each x there are." ] }, { "cell_type": "code", "execution_count": null, "id": "historic-navigation", "metadata": {}, "outputs": [], "source": [ "ggplot(gp_2007, aes(x=gp_2007.continent)) + geom_bar()" ] }, { "cell_type": "markdown", "id": "dental-democrat", "metadata": {}, "source": [ "To make it a little prettier, let's color per continent as well. We use `fill` to specify color of shapes (as opposed to `color` for points and lines. `color` on a bar chart sets the color of the bar outline.)" ] }, { "cell_type": "code", "execution_count": null, "id": "religious-latvia", "metadata": {}, "outputs": [], "source": [ "ggplot(gp_2007, aes(x=gp_2007.continent)) + geom_bar(aes(fill=gp_2007.continent))" ] }, { "cell_type": "markdown", "id": "powerful-jesus", "metadata": {}, "source": [ "Maybe we instead want to see not the number of countries per continent, but the number of people living on each continent. We can do this with `geom_bar` as well by specifying a weight." ] }, { "cell_type": "code", "execution_count": null, "id": "prescription-worry", "metadata": {}, "outputs": [], "source": [ "ggplot(gp_2007, aes(x=gp_2007.continent)) + geom_bar(aes(fill=gp_2007.continent, weight=gp_2007.pop))" ] }, { "cell_type": "markdown", "id": "turned-convenience", "metadata": {}, "source": [ "Histograms are similar to bar plots, except they break a continuous x axis into bins. Let's import the `iris` dataset for this.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "collective-parts", "metadata": {}, "outputs": [], "source": [ "iris = hl.Table.from_pandas(plotly.data.iris())\n", "iris.describe()" ] }, { "cell_type": "markdown", "id": "julian-blank", "metadata": {}, "source": [ "Let's make a histogram:" ] }, { "cell_type": "code", "execution_count": null, "id": "private-preliminary", "metadata": {}, "outputs": [], "source": [ "ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + geom_histogram()" ] }, { "cell_type": "markdown", "id": "australian-explosion", "metadata": {}, "source": [ "By default histogram plots groups stacked on top of each other, which is not always easy to interpret. We can specify the `position` argument to histogram to get different behavior. `\"dodge\"` puts the bars next to each other:" ] }, { "cell_type": "code", "execution_count": null, "id": "needed-skirt", "metadata": {}, "outputs": [], "source": [ "ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + geom_histogram(position=\"dodge\")" ] }, { "cell_type": "markdown", "id": "continental-compatibility", "metadata": {}, "source": [ "And `\"identity\"` plots them over each other. It helps to set an `alpha` value to make them slightly transparent in these cases" ] }, { "cell_type": "code", "execution_count": null, "id": "sensitive-vertical", "metadata": {}, "outputs": [], "source": [ "ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + geom_histogram(position=\"identity\", alpha=0.8)" ] }, { "cell_type": "markdown", "id": "equal-munich", "metadata": {}, "source": [ "### Labels and Axes\n", "\n", "It's always a good idea to label your axes. This can be done most easily with `xlab` and `ylab`. We can also use `ggtitle` to add a title. Let's pull in the same plot from above, and add labels." ] }, { "cell_type": "code", "execution_count": null, "id": "falling-gallery", "metadata": {}, "outputs": [], "source": [ "(ggplot(iris, aes(x=iris.sepal_length, fill=iris.species)) + \n", " geom_histogram(position=\"identity\", alpha=0.8) + \n", " xlab(\"Sepal Length\") + ylab(\"Number of samples\") + ggtitle(\"Sepal length by flower type\")\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.17" } }, "nbformat": 4, "nbformat_minor": 5 }