Book 4 — Data Analysis with Python

Python for All

Chapter Six — Groupby and Aggregation

Thanasis Troboukis  ·  All Books

Book Four · Chapter Six

Groupby and Aggregation

The single most powerful pattern in data analysis: split your data into groups, compute a statistic for each group, and compare. pandas makes this a one-liner with .groupby().

Split–Apply–Combine

In Chapter 5 you computed average prices by filtering for each category individually — one filter call per category. That works for five categories. It does not scale to fifty, or five hundred. .groupby() solves this elegantly.

The idea behind groupby is called split–apply–combine:

  1. Split — divide the DataFrame into groups based on the values of one column.
  2. Apply — compute a statistic (mean, sum, count…) for each group independently.
  3. Combine — gather the results into a new table, one row per group.
Python · Try it — groupby mean

      

In one line you get the mean price for every category at once. The result is a Series with the category names as the index. From this alone you can already report: meat and fish are by far the most expensive categories; grains and vegetables are the most affordable.

Different Aggregations

You can call any of the statistical methods from Chapter 5 on a grouped Series: .mean(), .median(), .sum(), .min(), .max(), .count(), .std().

Python · Try it

      

.agg() — Multiple Statistics at Once

When you want several statistics per group simultaneously, .agg() (short for aggregate) lets you pass a list of statistic names. The result is a clean table — perfect for a published data summary.

Python · Try it — .agg() with a list

      

This is publication-ready output. You can see at a glance that the Fish category has the widest range (€1.59 to €8.99) and the Meat category has the highest mean price per item. This type of table is the backbone of most data-driven food price stories.

Naming your columns: You can give the aggregated columns custom names inside .agg():
df.groupby("category")["price"].agg(items="count", avg="mean", max_price="max")
Python · Try it — named aggregations

      

Groupby Multiple Columns

You can group by more than one column at a time by passing a list to .groupby(). The result has a multi-level index — one level per grouping column. This lets you cross-tabulate, for example: average price by category and organic status.

A MultiIndex is just an index made of more than one label. Instead of each row being identified by a single label such as Dairy, it is identified by a pair such as Dairy, True or Dairy, False. In other words, pandas is using both category and organic to label each result row.

Python · Try it — two-column groupby

      

The output shows a MultiIndex — categories on the outer level, True/False on the inner level. For categories where only organic or only non-organic items exist (like Bakery), only one row appears. This kind of cross-tabulation is extremely useful for investigating whether the organic premium varies by food category.

Flattening a MultiIndex with .reset_index()

If you want an ordinary table again, the simplest method is .reset_index(). It turns the index levels back into normal columns.

Python · Try it — flatten with reset_index()

      

Now category and organic are regular columns again, which is often easier to read, sort, or save.

Flattening by spreading one level into columns with .unstack()

Another method is .unstack(). It takes one index level and spreads it across the columns. Here, the organic values become column headers.

Python · Try it — flatten with unstack()

      
Two useful patterns: use .reset_index() when you want a normal DataFrame with ordinary columns. Use .unstack() when you want one of the index levels to become columns.

Your Turn — Category Price Report

Use .groupby() and .agg() to build a summary table that a journalist could use in a food price article. Include: number of items per category, average price, and the range (max − min) per category.

Python · Your turn

      
What you learned in this chapter: the split–apply–combine concept; .groupby() with a single column and with multiple columns; computing statistics per group; using .agg() to get multiple statistics in one table; and naming aggregated columns. Next chapter: sorting results and adding computed columns to your DataFrame.

Chapter Navigation

Move between chapters.

Loading Python environment — this may take a moment…