Aggregation Tutorial

In the last section, we inspected the structure of the data and displayed a few example values.

How do we get a deeper feel for the data? One of the most natural things to do is to create a summary of a large number of values. For example, you could ask:

  • How many women are in the dataset? How many men?

  • What is the average age? Youngest age? Oldest age?

  • What are all the occupations that appear, and how many times does each appear?

We can answer these questions with aggregation. Aggregation combines many values together to create a summary.

To start, we’ll aggregate all the values in a table. (Later, we’ll learn how to aggregate over subsets.)

We can do this with the Table.aggregate method.

A call to aggregate has two parts:

  • The expression to aggregate over (e.g. a field of a Table).

  • The aggregator to combine the values into the summary.

Hail has a large suite of aggregators for summarizing data. Let’s see some in action!

count

Aggregators live in the hl.agg module. The simplest aggregator is count. It takes no arguments and returns the number of values aggregated.

[1]:
import hail as hl
from bokeh.io import output_notebook,show
output_notebook()
hl.init()


hl.utils.get_movie_lens('data/')
users = hl.read_table('data/users.ht')
Loading BokehJS ...
Loading BokehJS ...
24/04/10 20:19:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Running on Apache Spark version 3.3.2
SparkUI available at http://hostname-98608ef83a:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.130-bea04d9c79b5
LOGGING: writing to /io/hail/python/hail/docs/tutorials/hail-20240410-2019-0.2.130-bea04d9c79b5.log
2024-04-10 20:19:48.400 Hail: INFO: Movie Lens files found!
[2]:
users.aggregate(hl.agg.count())
[2]:
943
[3]:
users.count()
[3]:
943

stats

stats computes useful statistics about a numeric expression at once. There are also aggregators for mean, min, max, sum, product and array_sum.

[4]:
users.show()
id
age
sex
occupation
zipcode
int32int32strstrstr
124"M""technician""85711"
253"F""other""94043"
323"M""writer""32067"
424"M""technician""43537"
533"F""other""15213"
642"M""executive""98101"
757"M""administrator""91344"
836"M""administrator""05201"
929"M""student""01002"
1053"M""lawyer""90703"

showing top 10 rows

[5]:
users.aggregate(hl.agg.stats(users.age))
[5]:
Struct(mean=34.05196182396607, stdev=12.186273150937211, min=7.0, max=73.0, n=943, sum=32111.0)

counter

What about non-numeric data, like the occupation field?

counter is modeled on the Python Counter object: it counts the number of times each distinct value occurs in the collection of values being aggregated.

[6]:
users.aggregate(hl.agg.counter(users.occupation))
[6]:
{'administrator': 79,
 'artist': 28,
 'doctor': 7,
 'educator': 95,
 'engineer': 67,
 'entertainment': 18,
 'executive': 32,
 'healthcare': 16,
 'homemaker': 7,
 'lawyer': 12,
 'librarian': 51,
 'marketing': 26,
 'none': 9,
 'other': 105,
 'programmer': 66,
 'retired': 14,
 'salesman': 12,
 'scientist': 31,
 'student': 196,
 'technician': 27,
 'writer': 45}

filter

You can filter elements of a collection before aggregation by using filter.

[7]:
users.aggregate(hl.agg.filter(users.sex == 'M', hl.agg.count()))
[7]:
670

The argument to filter should be a Boolean expression.

[8]:
users.aggregate(hl.agg.count_where(users.sex == 'M'))
[8]:
670

Boolean expressions can be compound, but be sure to use parentheses appropriately. A single ‘&’ denotes logical AND and a single ‘|’ denotes logical OR.

[9]:
users.aggregate(hl.agg.filter((users.occupation == 'writer') | (users.occupation == 'executive'), hl.agg.count()))
[9]:
77
[10]:
users.aggregate(hl.agg.filter((users.sex == 'F') | (users.occupation == 'executive'), hl.agg.count()))
[10]:
302

hist

As we saw in the first tutorial, hist can be used to build a histogram over numeric data.

[11]:
hist = users.aggregate(hl.agg.hist(users.age, 10, 70, 60))
hist
[11]:
Struct(bin_edges=[10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0, 43.0, 44.0, 45.0, 46.0, 47.0, 48.0, 49.0, 50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0, 58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0, 66.0, 67.0, 68.0, 69.0, 70.0], bin_freq=[1, 1, 0, 5, 3, 6, 5, 14, 18, 23, 32, 27, 37, 28, 33, 38, 34, 35, 36, 32, 39, 25, 28, 26, 17, 27, 21, 19, 17, 22, 21, 10, 21, 13, 23, 15, 12, 14, 20, 19, 20, 20, 6, 12, 4, 11, 6, 9, 3, 3, 9, 3, 2, 3, 2, 3, 1, 0, 2, 5], n_smaller=1, n_larger=1)
[12]:
p = hl.plot.histogram(hist, legend='Age')
show(p)

take and collect

There are a few aggregators for collecting values.

  • take localizes a few values into an array. It has an optional ordering.

  • collect localizes all values into an array.

  • collect_as_set localizes all unique values into a set.

[13]:
users.aggregate(hl.agg.take(users.occupation, 5))
[13]:
['technician', 'other', 'writer', 'technician', 'other']
[14]:
users.aggregate(hl.agg.take(users.age, 5, ordering=-users.age))
[14]:
[73, 70, 70, 70, 69]

Warning! Aggregators like collect and counter return Python objects and can fail with out of memory errors if you apply them to collections that are too large (e.g. all 50 trillion genotypes in the UK Biobank dataset).

[ ]: