Part One
Split–Apply–Combine
In Chapter 5 you computed average prices by filtering for each category individually — one filter call per category. That works for five categories. It does not scale to fifty, or five hundred. .groupby() solves this elegantly.
The idea behind groupby is called split–apply–combine:
- Split — divide the DataFrame into groups based on the values of one column.
- Apply — compute a statistic (mean, sum, count…) for each group independently.
- Combine — gather the results into a new table, one row per group.
In one line you get the mean price for every category at once. The result is a Series with the category names as the index. From this alone you can already report: meat and fish are by far the most expensive categories; grains and vegetables are the most affordable.
Part Two
Different Aggregations
You can call any of the statistical methods from Chapter 5 on a grouped Series: .mean(), .median(), .sum(), .min(), .max(), .count(), .std().
Part Three
.agg() — Multiple Statistics at Once
When you want several statistics per group simultaneously, .agg() (short for aggregate) lets you pass a list of statistic names. The result is a clean table — perfect for a published data summary.
This is publication-ready output. You can see at a glance that the Fish category has the widest range (€1.59 to €8.99) and the Meat category has the highest mean price per item. This type of table is the backbone of most data-driven food price stories.
.agg():df.groupby("category")["price"].agg(items="count", avg="mean", max_price="max")
Part Four
Groupby Multiple Columns
You can group by more than one column at a time by passing a list to .groupby(). The result has a multi-level index — one level per grouping column. This lets you cross-tabulate, for example: average price by category and organic status.
A MultiIndex is just an index made of more than one label. Instead of each row being identified by a single label such as Dairy, it is identified by a pair such as Dairy, True or Dairy, False. In other words, pandas is using both category and organic to label each result row.
The output shows a MultiIndex — categories on the outer level, True/False on the inner level. For categories where only organic or only non-organic items exist (like Bakery), only one row appears. This kind of cross-tabulation is extremely useful for investigating whether the organic premium varies by food category.
Flattening a MultiIndex with .reset_index()
If you want an ordinary table again, the simplest method is .reset_index(). It turns the index levels back into normal columns.
Now category and organic are regular columns again, which is often easier to read, sort, or save.
Flattening by spreading one level into columns with .unstack()
Another method is .unstack(). It takes one index level and spreads it across the columns. Here, the organic values become column headers.
.reset_index() when you want a normal DataFrame with ordinary columns. Use .unstack() when you want one of the index levels to become columns.
Part Five
Your Turn — Category Price Report
Use .groupby() and .agg() to build a summary table that a journalist could use in a food price article. Include: number of items per category, average price, and the range (max − min) per category.
.groupby() with a single column and with multiple columns; computing statistics per group; using .agg() to get multiple statistics in one table; and naming aggregated columns. Next chapter: sorting results and adding computed columns to your DataFrame.
Chapter Navigation
Move between chapters.