Book 4 · Chapter Six — Groupby and Aggregation

Part One

Split–Apply–Combine

In Chapter 5 you computed average prices by filtering for each category individually — one filter call per category. That works for five categories. It does not scale to fifty, or five hundred. .groupby() solves this elegantly.

The idea behind groupby is called split–apply–combine:

Split — divide the DataFrame into groups based on the values of one column.
Apply — compute a statistic (mean, sum, count…) for each group independently.
Combine — gather the results into a new table, one row per group.

Python · Try it — groupby mean

In one line you get the mean price for every category at once. The result is a Series with the category names as the index. From this alone you can already report: meat and fish are by far the most expensive categories; grains and vegetables are the most affordable.

Part Two

Different Aggregations

You can call any of the statistical methods from Chapter 5 on a grouped Series: .mean(), .median(), .sum(), .min(), .max(), .count(), .std().

Python · Try it

Part Three

.agg() — Multiple Statistics at Once

When you want several statistics per group simultaneously, .agg() (short for aggregate) lets you pass a list of statistic names. The result is a clean table — perfect for a published data summary.

Python · Try it — .agg() with a list

This is publication-ready output. You can see at a glance that the Fish category has the widest range (€1.59 to €8.99) and the Meat category has the highest mean price per item. This type of table is the backbone of most data-driven food price stories.

Naming your columns: You can give the aggregated columns custom names inside .agg():
df.groupby("category")["price"].agg(items="count", avg="mean", max_price="max")

Python · Try it — named aggregations

Part Four

Groupby Multiple Columns

You can group by more than one column at a time by passing a list to .groupby(). The result has a multi-level index — one level per grouping column. This lets you cross-tabulate, for example: average price by category and organic status.

A MultiIndex is just an index made of more than one label. Instead of each row being identified by a single label such as Dairy, it is identified by a pair such as Dairy, True or Dairy, False. In other words, pandas is using both category and organic to label each result row.

Python · Try it — two-column groupby

The output shows a MultiIndex — categories on the outer level, True/False on the inner level. For categories where only organic or only non-organic items exist (like Bakery), only one row appears. This kind of cross-tabulation is extremely useful for investigating whether the organic premium varies by food category.

Flattening a MultiIndex with `.reset_index()`

If you want an ordinary table again, the simplest method is .reset_index(). It turns the index levels back into normal columns.

Python · Try it — flatten with reset_index()

Now category and organic are regular columns again, which is often easier to read, sort, or save.

Flattening by spreading one level into columns with `.unstack()`

Another method is .unstack(). It takes one index level and spreads it across the columns. Here, the organic values become column headers.

Python · Try it — flatten with unstack()

Two useful patterns: use .reset_index() when you want a normal DataFrame with ordinary columns. Use .unstack() when you want one of the index levels to become columns.

Part Five

Your Turn — Category Price Report

Use .groupby() and .agg() to build a summary table that a journalist could use in a food price article. Include: number of items per category, average price, and the range (max − min) per category.

Python · Your turn

What you learned in this chapter: the split–apply–combine concept; .groupby() with a single column and with multiple columns; computing statistics per group; using .agg() to get multiple statistics in one table; and naming aggregated columns. Next chapter: sorting results and adding computed columns to your DataFrame.

Chapter Navigation

Move between chapters.

Previous: Chapter 5 — Summary Statistics Next: Chapter 7 — Sorting and New Columns

Groupby and Aggregation

Split–Apply–Combine

Different Aggregations

.agg() — Multiple Statistics at Once

Groupby Multiple Columns

Flattening a MultiIndex with .reset_index()

Flattening by spreading one level into columns with .unstack()

Your Turn — Category Price Report

Flattening a MultiIndex with `.reset_index()`

Flattening by spreading one level into columns with `.unstack()`