Part One
The Dataset We Will Work With
Throughout this chapter — and the rest of Book 4 — we will work with a grocery price survey. Imagine a market reporter has recorded the price of 15 common food items, their category, unit of measurement, and whether they are organically certified. This kind of data is the raw material of food price journalism and consumer research.
Run the cell below to create our dataset. You will use this same table for all the exercises in this chapter.
That is 15 rows and 5 columns — a modest but realistic slice of a supermarket price list. Now let us learn how to interrogate it.
Part Two
Shape, Columns, and Data Types
The first questions you ask about any dataset are "How big is it?" and "What kind of data does it contain?" pandas answers each of them in a single line.
.shape — how many rows and columns?
df.shape returns a tuple: (rows, columns). On a real dataset with hundreds of thousands of rows, this is the very first thing you check. Real-world data files often contain far more rows than you expect.
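A minimal sketch of .shape in action. It uses a small hypothetical stand-in table (6 rows rather than the chapter's 15); the column names and most of the values are assumptions made for illustration, not the survey data itself.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's table: 6 rows instead of 15.
# Column names and values are illustrative assumptions.
df = pd.DataFrame({
    "item": ["Onions", "Milk", "Butter", "Yoghurt", "Chicken", "Salmon"],
    "category": ["Vegetables", "Dairy", "Dairy", "Dairy", "Meat", "Fish"],
    "price": [0.69, 1.09, 2.49, 1.59, 5.49, 8.99],
    "unit": ["kg", "litre", "250g", "500g", "kg", "kg"],
    "organic": [False, False, True, False, False, True],
})

# .shape is an attribute, not a method, so it takes no parentheses.
n_rows, n_cols = df.shape
print(df.shape)  # (6, 5)
print(f"{n_rows} rows, {n_cols} columns")
```

Unpacking the tuple into two names, as above, is handy when you want to use the row count later, for example to report what fraction of rows have missing values.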
.columns and .dtypes — what are the fields?
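The dtypes output tells you how pandas has interpreted each column. object is the dtype pandas has traditionally used for text (strings); strictly speaking it means "any Python object", and newer pandas versions may report a dedicated string dtype instead. float64 means decimal numbers. bool means True/False. Knowing the data type of each column matters because many pandas operations, such as calculating an average, only work on numeric columns.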
int64 — whole numbers · float64 — decimals · object — text · bool — True/False · datetime64 — dates and times
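A short sketch of inspecting field names and types, again on a hypothetical stand-in table (the column names and values are assumptions). The last step uses select_dtypes, which is not covered in this chapter, purely to show which columns pandas treats as numeric.

```python
import pandas as pd

# Hypothetical stand-in table; names and values are illustrative assumptions.
df = pd.DataFrame({
    "item": ["Onions", "Milk", "Salmon"],
    "price": [0.69, 1.09, 8.99],
    "organic": [False, False, True],
})

# List the column names.
print(df.columns.tolist())

# One dtype per column. The text column shows as object
# (or a string dtype, depending on the pandas version).
print(df.dtypes)

# Keep only the numeric columns. Boolean columns are not counted as numeric.
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())
```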
Part Three
.head() and .tail() — Peeking at the Data
When a dataset has thousands of rows, printing the whole thing is unhelpful. pandas provides two methods for a quick look at the top or bottom of the table.
By default, .head() shows the first 5 rows and .tail() shows the last 5. Pass a number to get a different count: .head(10), .tail(3). Professional data analysts almost always start an exploration session by calling .head() immediately after loading a file.
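A minimal sketch of both methods on a hypothetical 8-row table (the items and prices are assumptions for illustration):

```python
import pandas as pd

# Hypothetical 8-row table; values are illustrative assumptions.
df = pd.DataFrame({
    "item": ["Onions", "Carrots", "Milk", "Bread",
             "Yoghurt", "Butter", "Chicken", "Salmon"],
    "price": [0.69, 0.99, 1.09, 1.49, 1.59, 2.49, 5.49, 8.99],
})

first_rows = df.head()   # default: the first 5 rows
last_rows = df.tail(3)   # the last 3 rows
everything = df.head(10) # asking for more rows than exist returns all of them

print(first_rows)
print(last_rows)
```

Note that asking for more rows than the table holds is not an error: pandas simply returns what is there.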
Part Four
.describe() — Instant Summary Statistics
One of the most powerful quick-look tools in pandas is .describe(). Call it on a DataFrame and you get a full statistical summary of every numeric column in a single table.
Read the output row by row:
- count — how many non-missing values exist (useful for spotting missing data)
- mean — the average value
- std — short for standard deviation. It tells you whether the prices are all in roughly the same range or very mixed. Here it is 2.638145, which is quite large for a dataset with an average price of about €2.94. That makes sense: some items are very cheap, like Onions at €0.69, while others are very expensive, like Salmon at €8.99. So the prices are spread out, not bunched close together.
- min / max — the smallest and largest values
- 25% — 25% of the prices are at or below this value. Here, about a quarter of the items cost €1.09 or less.
- 50% — the median. Half of the items cost €1.59 or less, and half cost more.
- 75% — 75% of the prices are at or below this value. Here, most items cost €4.39 or less, and only the most expensive few are above that.
This one table already tells a simple story: the average price is €2.94, but the median is lower at €1.59. That means a few expensive items — especially fish and meat — are pulling the average upward.
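The same mean-versus-median pattern can be reproduced on a tiny example. This is a sketch with five hypothetical prices (an assumption, not the chapter's 15-row table), so the numbers it prints differ from the ones discussed above.

```python
import pandas as pd

# Five hypothetical prices; one expensive item sits far above the rest.
df = pd.DataFrame({
    "item": ["Onions", "Milk", "Yoghurt", "Cheese", "Salmon"],
    "price": [0.69, 1.09, 1.59, 4.39, 8.99],
})

# describe() on a single numeric column returns a labelled Series.
summary = df["price"].describe()
print(summary)

mean_price = summary["mean"]    # pulled upward by the expensive item
median_price = summary["50%"]   # the middle value of the sorted prices

print(f"mean:   {mean_price:.2f}")    # mean:   3.35
print(f"median: {median_price:.2f}")  # median: 1.59
```

When the mean sits well above the median like this, it is a hint that a few large values are stretching the upper end of the distribution.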
By default, .describe() only summarises numeric columns. To include text columns too, use df.describe(include="all"). This adds count, unique (the number of distinct values), top (the most common value), and freq (how often that value appears) for text columns.
The top and freq rows for the category column tell you that Dairy is the most common category, appearing 4 times — something you might not have noticed by eye.
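A sketch of reading top and freq programmatically, using a hypothetical 6-row stand-in table in which Dairy appears 3 times (the column names and values are assumptions, so the count differs from the chapter's table):

```python
import pandas as pd

# Hypothetical stand-in table; names and values are illustrative assumptions.
df = pd.DataFrame({
    "item": ["Onions", "Milk", "Butter", "Yoghurt", "Chicken", "Salmon"],
    "category": ["Vegetables", "Dairy", "Dairy", "Dairy", "Meat", "Fish"],
    "price": [0.69, 1.09, 2.49, 1.59, 5.49, 8.99],
})

# include="all" summarises text and numeric columns in one table.
full = df.describe(include="all")
print(full)

# Rows are statistics, columns are fields, so .loc[row, column] picks one cell.
most_common = full.loc["top", "category"]
how_often = full.loc["freq", "category"]
print(most_common, how_often)  # Dairy 3
```

Cells that do not apply to a column, such as the mean of a text column, are shown as NaN in the combined table.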
Part Five
Your Turn — First Look at a New Dataset
Below is a second dataset — weekly prices at a different market. Use the exploration tools you have learned to answer these questions without reading the raw data by eye:
- How many rows and columns does the dataset have?
- What data types are the columns?
- What is the most expensive item?
- What is the average price?
You have now met .shape, .columns, .dtypes, .head(), .tail(), and .describe(). These six tools are among the first things a data analyst runs on any new file. In the next chapter you will learn to select specific rows and columns.