Book 4 · Chapter Two — Exploring Your Dataset

Part One

The Dataset We Will Work With

Throughout this chapter, and the rest of Book 4, we will work with a grocery price survey. Imagine a market reporter has recorded, for each of 15 common food items, its name, price, category, unit of measurement, and whether it is certified organic. This kind of data is the raw material of food price journalism and consumer research.

Run the cell below to create our dataset. You will use this same table for all the exercises in this chapter.

Python · Run this first

That is 15 rows and 5 columns — a modest but realistic slice of a supermarket price list. Now let us learn how to interrogate it.
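The chapter's own 15-row table is not reproduced in this text. As a stand-in, here is a minimal sketch of how such a table could be built. Only the Onions and Salmon prices come from the chapter; the other four rows, and all five column names, are assumptions made for illustration, so the numbers you get from this table will differ from the ones quoted later.

```python
import pandas as pd

# Stand-in for the chapter's survey table (6 rows instead of 15).
# Onions (0.69) and Salmon (8.99) are taken from the text; every other
# value and all column names are hypothetical.
df = pd.DataFrame(
    {
        "item": ["Onions", "Milk", "Bread", "Butter", "Chicken", "Salmon"],
        "category": ["Vegetables", "Dairy", "Bakery", "Dairy", "Meat", "Fish"],
        "price": [0.69, 1.09, 1.59, 2.49, 4.39, 8.99],
        "unit": ["kg", "litre", "loaf", "250 g", "kg", "kg"],
        "organic": [False, True, False, False, True, False],
    }
)

# Display the whole table; fine for a table this small.
print(df)
```

Each dictionary key becomes a column and each list supplies that column's values, so all the lists must have the same length.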

Part Two

Shape, Columns, and Data Types

The first two questions to ask about any dataset are "How big is it?" and "What kind of data does it contain?" pandas answers each in a single line.

.shape — how many rows and columns?

Python · Try it

df.shape returns a tuple: (rows, columns). Note that shape is an attribute, not a method, so you write it without parentheses. On a real dataset with hundreds of thousands of rows, this is the first thing to check, because real-world data files often contain far more rows than you expect.
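To see the tuple in action, the sketch below uses a small stand-in table (hypothetical column names, and hypothetical values apart from the Onions and Salmon prices) rather than the chapter's 15-row dataset.

```python
import pandas as pd

# Small stand-in table; column names and most values are hypothetical.
df = pd.DataFrame(
    {
        "item": ["Onions", "Milk", "Bread", "Butter", "Chicken", "Salmon"],
        "category": ["Vegetables", "Dairy", "Bakery", "Dairy", "Meat", "Fish"],
        "price": [0.69, 1.09, 1.59, 2.49, 4.39, 8.99],
        "unit": ["kg", "litre", "loaf", "250 g", "kg", "kg"],
        "organic": [False, True, False, False, True, False],
    }
)

# shape is an attribute, so there are no parentheses after it.
print(df.shape)  # (6, 5)

# Because it is a plain tuple, it can be unpacked into two variables.
rows, columns = df.shape
print(f"{rows} rows and {columns} columns")
```

Unpacking the tuple is handy when you want to use the row count in a later calculation or message.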

.columns and .dtypes — what are the fields?

Python · Try it

The dtypes output tells you how pandas has interpreted each column. object is the dtype pandas has traditionally used for text (strings); newer versions may report a dedicated string dtype instead. float64 means decimal numbers. bool means True/False. Knowing the data type of each column matters because calculations such as averages and sums only make sense on numeric columns.

Common data types: int64 — whole numbers · float64 — decimals · object — text · bool — True/False · datetime64 — dates and times
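Here is a minimal sketch of both attributes on a stand-in table (hypothetical column names, and hypothetical values apart from the Onions and Salmon prices). The exact label printed for text columns depends on your pandas version, so the last two lines check the type in a version-independent way.

```python
import pandas as pd

# Small stand-in table; column names and most values are hypothetical.
df = pd.DataFrame(
    {
        "item": ["Onions", "Milk", "Bread", "Butter", "Chicken", "Salmon"],
        "category": ["Vegetables", "Dairy", "Bakery", "Dairy", "Meat", "Fish"],
        "price": [0.69, 1.09, 1.59, 2.49, 4.39, 8.99],
        "unit": ["kg", "litre", "loaf", "250 g", "kg", "kg"],
        "organic": [False, True, False, False, True, False],
    }
)

# Column names as a plain Python list.
print(df.columns.tolist())

# One dtype per column: price is float64, organic is bool,
# and the text columns appear as object (or a string dtype on newer pandas).
print(df.dtypes)

# Asking pandas directly whether a column is numeric.
print(pd.api.types.is_numeric_dtype(df["price"]))  # True
print(pd.api.types.is_numeric_dtype(df["item"]))   # False
```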

Part Three

.head() and .tail() — Peeking at the Data

When a dataset has thousands of rows, printing the whole thing is unhelpful. pandas provides two methods for a quick look at the top or bottom of the table.

Python · Try it

By default, .head() shows the first 5 rows and .tail() shows the last 5. Pass a number to get a different count: .head(10), .tail(3). Professional data analysts almost always start an exploration session by calling .head() immediately after loading a file.
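The sketch below tries the default and the custom counts on a six-row stand-in table (hypothetical column names, and hypothetical values apart from the Onions and Salmon prices). With only six rows, the default .head() leaves out just the last one.

```python
import pandas as pd

# Small stand-in table; column names and most values are hypothetical.
df = pd.DataFrame(
    {
        "item": ["Onions", "Milk", "Bread", "Butter", "Chicken", "Salmon"],
        "category": ["Vegetables", "Dairy", "Bakery", "Dairy", "Meat", "Fish"],
        "price": [0.69, 1.09, 1.59, 2.49, 4.39, 8.99],
        "unit": ["kg", "litre", "loaf", "250 g", "kg", "kg"],
        "organic": [False, True, False, False, True, False],
    }
)

# Default: the first 5 rows.
print(df.head())

# The first 2 rows only.
print(df.head(2))

# The last 3 rows.
print(df.tail(3))
```

Both methods return a new, smaller DataFrame and leave df itself unchanged.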

Part Four

.describe() — Instant Summary Statistics

One of the most powerful quick-look tools in pandas is .describe(). Call it on a DataFrame and you get a full statistical summary of every numeric column in a single table.

Python · Try it

Read the output row by row:

count — how many non-missing values exist (useful for spotting missing data)
mean — the average value
std — short for standard deviation, a measure of how widely the values are spread around the mean. Here it is 2.638145 (about €2.64), which is large next to an average price of about €2.94. That makes sense: some items are very cheap, like Onions at €0.69, while others are very expensive, like Salmon at €8.99. So the prices are spread out, not bunched close together.
min / max — the smallest and largest values
25% — 25% of the prices are at or below this value. Here, about a quarter of the items cost €1.09 or less.
50% — the median. Half of the items cost €1.59 or less, and half cost more.
75% — 75% of the prices are at or below this value. Here, three quarters of the items cost €4.39 or less, and only the most expensive quarter cost more.

This one table already tells a simple story: the average price is €2.94, but the median is lower at €1.59. That means a few expensive items — especially fish and meat — are pulling the average upward.
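The same mean-versus-median comparison can be sketched on a six-row stand-in table. The column names and most values are hypothetical, so the statistics printed here will not match the figures quoted above for the full 15-row survey; the pattern of a mean pulled above the median by one expensive item is what to look for.

```python
import pandas as pd

# Small stand-in table; column names and most values are hypothetical.
df = pd.DataFrame(
    {
        "item": ["Onions", "Milk", "Bread", "Butter", "Chicken", "Salmon"],
        "category": ["Vegetables", "Dairy", "Bakery", "Dairy", "Meat", "Fish"],
        "price": [0.69, 1.09, 1.59, 2.49, 4.39, 8.99],
        "unit": ["kg", "litre", "loaf", "250 g", "kg", "kg"],
        "organic": [False, True, False, False, True, False],
    }
)

# Summary statistics for the numeric columns (here, only price).
summary = df.describe()
print(summary)

# Individual values can be pulled out by row label and column name.
mean_price = summary.loc["mean", "price"]
median_price = summary.loc["50%", "price"]
print(f"mean: {mean_price:.2f}, median: {median_price:.2f}")

# A mean above the median hints that a few high prices are pulling it up.
print(mean_price > median_price)  # True
```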

Include text columns: By default, .describe() summarises only numeric columns. To include text columns too, use df.describe(include="all"). For text columns this adds count, unique (number of distinct values), top (most common value), and freq (how often the top value appears). Statistics that do not apply to a column are shown as NaN.

Python · Try it — describe all columns

The top and freq rows for the category column tell you that Dairy is the most common category, appearing 4 times — something you might not have noticed by eye.
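As a sketch of reading top and freq, here is the same call on a six-row stand-in table (hypothetical column names and mostly hypothetical values). In this smaller table Dairy appears twice, not 4 times as in the chapter's survey.

```python
import pandas as pd

# Small stand-in table; column names and most values are hypothetical.
df = pd.DataFrame(
    {
        "item": ["Onions", "Milk", "Bread", "Butter", "Chicken", "Salmon"],
        "category": ["Vegetables", "Dairy", "Bakery", "Dairy", "Meat", "Fish"],
        "price": [0.69, 1.09, 1.59, 2.49, 4.39, 8.99],
        "unit": ["kg", "litre", "loaf", "250 g", "kg", "kg"],
        "organic": [False, True, False, False, True, False],
    }
)

# Summarise every column, numeric or not.
summary = df.describe(include="all")
print(summary)

# Pick out the text-column statistics for category.
print(summary.loc["unique", "category"])  # number of distinct categories
print(summary.loc["top", "category"])     # most common category
print(summary.loc["freq", "category"])    # how often it appears
```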

Part Five

Your Turn — First Look at a New Dataset

Below is a second dataset — weekly prices at a different market. Use the exploration tools you have learned to answer these questions without reading the raw data by eye:

How many rows and columns does the dataset have?
What data types are the columns?
What is the most expensive item?
What is the average price?

Python · Your turn
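The second dataset is not reproduced in this text, so here is one possible approach on an invented stand-in. The table name market2, its columns, and all its values are hypothetical. Finding the item with the highest price uses idxmax() and .loc, which go slightly beyond this chapter's six tools; .describe() on its own gives you the maximum price but not the item it belongs to.

```python
import pandas as pd

# Hypothetical stand-in for the second market's table; all names and
# values are invented for illustration.
market2 = pd.DataFrame(
    {
        "item": ["Apples", "Rice", "Eggs", "Cheese"],
        "price": [2.20, 1.80, 3.10, 6.50],
        "organic": [True, False, False, True],
    }
)

# 1. How many rows and columns?
print(market2.shape)

# 2. What data types are the columns?
print(market2.dtypes)

# 3. What is the most expensive item?
# idxmax() returns the row label of the highest price; .loc fetches that row.
priciest = market2.loc[market2["price"].idxmax()]
print(priciest["item"], priciest["price"])

# 4. What is the average price?
average_price = market2["price"].mean()
print(round(average_price, 2))
```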

What you learned in this chapter: how to inspect a new dataset with .shape, .columns, .dtypes, .head(), .tail(), and .describe(). These six tools make a reliable first pass over any new file. In the next chapter you will learn to select specific rows and columns.
