Book 5 · Chapter Seven — From Dictionaries to DataFrames

Part One

One Card Becomes One Dictionary

The natural unit of scraped data is a dictionary. One article card becomes one dictionary with named fields like title, link, author, and published_at.

Python · Try it
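
A minimal sketch of the idea, assuming a single card copied from the snapshot used later in this chapter. The snapshot carries only a title and a link, so author and published_at are left out here rather than invented.

```python
from bs4 import BeautifulSoup

# One article card, copied from the chapter's snapshot HTML.
html = """
<a class="py-4 mainlink" href="/economy/business/564187087/epekteinetai-i-cordia-me-exagora-tis-ellinikis-energeiakis-zeb/">
  <span class="card-title">Επεκτείνεται η CORDIA με εξαγορά της ελληνικής ενεργειακής ZEB</span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches).
card = soup.find("a", class_="mainlink")
title_tag = card.find("span", class_="card-title")

# One card becomes one dictionary with named fields.
record = {
    "title": title_tag.get_text(" ", strip=True),
    "link": card.get("href"),
}

print(record)
```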

This uses basic Python you already know: strings, variables, dictionaries, and function calls. Scraping is not a new language. It is ordinary Python applied to HTML.

Part Two

Many Cards Become a List of Dicts

The next step is a loop: find all article cards, create one dictionary per card, and append() each dictionary to a list.

Python · Try it

from bs4 import BeautifulSoup

html = """
<div class="latest-news">
  <div class="design_one article-item is-flex is-flex-direction-column author-article">
    <a class="py-4 mainlink" href="/economy/business/564187087/epekteinetai-i-cordia-me-exagora-tis-ellinikis-energeiakis-zeb/">
      <span class="card-title">Επεκτείνεται η CORDIA με εξαγορά της ελληνικής ενεργειακής ZEB</span>
    </a>
  </div>
  <div class="design_one article-item is-flex is-flex-direction-column author-article">
    <a class="py-4 mainlink" href="/society/dikastiko/564187411/sto-ypoyrgeio-dikaiosynis-i-kovesi-synantisi-me-floridi/">
      <span class="card-title">Στο υπουργείο Δικαιοσύνης η Κοβέσι – Συνάντηση με Φλωρίδη</span>
    </a>
  </div>
  <div class="design_one article-item is-flex is-flex-direction-column author-article">
    <a class="py-4 mainlink" href="/culture/music/564187225/pou-vgainoume-simera-tetarti-protaseis-gia-theatro-synaylies-ektheseis-ekdiloseis/">
      <span class="card-title">Πού βγαίνουμε σήμερα Τετάρτη; Προτάσεις για θέατρο, συναυλίες, εκθέσεις, εκδηλώσεις</span>
    </a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
cards = soup.find_all("a", class_="mainlink")

records = []

for card in cards:
    title_tag = card.find("span", class_="card-title")
    if not title_tag:
        continue

    record = {
        "title": title_tag.get_text(" ", strip=True),
        "link": card.get("href"),
    }
    records.append(record)

print("Records:", len(records))
for record in records:
    print(record)

This is the core data-collection loop of the whole book: find many containers, loop over them, build dictionaries, append them to a list.

The tools from Books 1, 2, and 3 all appear here: variables, list creation, for loops, if guards, dictionaries, and append(). The scraper works because these basic tools combine cleanly, with each one doing a single small job.
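
The if guard is the easiest of these tools to overlook, so here is a small sketch of what it does. The HTML below is made up for illustration: the second card has no card-title span, and the guard skips it instead of crashing the loop.

```python
from bs4 import BeautifulSoup

# Illustrative HTML (not from the real site): the middle card has no title span.
html = """
<a class="mainlink" href="/a/"><span class="card-title">Has a title</span></a>
<a class="mainlink" href="/b/"></a>
<a class="mainlink" href="/c/"><span class="card-title">Also has a title</span></a>
"""

soup = BeautifulSoup(html, "html.parser")
records = []

for card in soup.find_all("a", class_="mainlink"):
    title_tag = card.find("span", class_="card-title")
    if not title_tag:
        # find() returned None, so there is nothing to extract from this card.
        continue

    records.append({
        "title": title_tag.get_text(" ", strip=True),
        "link": card.get("href"),
    })

print("Cards found:", len(soup.find_all("a", class_="mainlink")))
print("Records kept:", len(records))
```

Without the guard, calling get_text() on None would raise an AttributeError at the second card and the loop would stop there.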

Part Three

List of Dicts to DataFrame

Pandas reads a list of dictionaries directly. Each dictionary becomes one row. Each key becomes one column.

Python · Try it
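
A minimal sketch of the conversion, using three placeholder records (the titles and links are made up) in place of scraped data.

```python
import pandas as pd

# Placeholder records standing in for the list built by the scraping loop.
records = [
    {"title": "First headline", "link": "/section/1/"},
    {"title": "Second headline", "link": "/section/2/"},
    {"title": "Third headline", "link": "/section/3/"},
]

# Each dictionary becomes one row; each key becomes one column.
df = pd.DataFrame(records)

print("Shape  :", df.shape)
print("Columns:", list(df.columns))
print(df)
```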

Why DataFrames matter: once the data is in pandas, filtering, sorting, exporting, and later analysis become much easier.
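
As a rough illustration of the filtering and sorting mentioned above, the sketch below works on a small made-up DataFrame. The titles and links are placeholders, not scraped data.

```python
import pandas as pd

# Placeholder rows for illustration only.
df = pd.DataFrame([
    {"title": "Markets rally", "link": "/economy/1/"},
    {"title": "Court ruling", "link": "/society/2/"},
    {"title": "Bank merger", "link": "/economy/3/"},
])

# Filtering: keep only rows whose link starts with /economy/.
economy = df[df["link"].str.startswith("/economy/")]

# Sorting: order all rows alphabetically by title.
by_title = df.sort_values("title")

print(economy)
print()
print(by_title)
```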

Part Four

Saving the First CSV

A DataFrame can be written to disk with to_csv(). This is the point where scraping becomes useful outside the notebook.

Python · Copy to your notebook

import requests
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://www.kathimerini.gr/epikairothta/"
BASE_URL = "https://www.kathimerini.gr"

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "el-GR,el;q=0.9,en-US;q=0.8",
}

response = requests.get(URL, headers=headers, timeout=30)
response.raise_for_status()  # stop on a 4xx/5xx response instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

records = []

for card in soup.find_all("a", class_="mainlink"):
    title_tag = card.find("span", class_="card-title")
    if not title_tag:
        continue

    href = card.get("href", "")
    if href.startswith("/"):
        href = BASE_URL + href

    records.append({
        "title": title_tag.get_text(" ", strip=True),
        "link": href,
    })

df = pd.DataFrame(records)
df.to_csv("kathimerini_page_1.csv", index=False)

print("Saved:", "kathimerini_page_1.csv")
print("Rows :", len(df))

The CSV file will contain one row per article and two columns: title and link.
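
One way to check that claim is to read the file back. The sketch below does a round trip with two placeholder rows and a temporary file; the file name sample_page.csv is hypothetical and the rows are not live data.

```python
import os
import tempfile

import pandas as pd

# Two placeholder rows, one with a Greek title to exercise non-ASCII text.
df = pd.DataFrame([
    {"title": "Πρώτος τίτλος", "link": "https://www.kathimerini.gr/a/"},
    {"title": "Second title", "link": "https://www.kathimerini.gr/b/"},
])

# Write to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_page.csv")  # hypothetical file name
    df.to_csv(path, index=False, encoding="utf-8")
    check = pd.read_csv(path, encoding="utf-8")

print("Columns:", list(check.columns))
print("Rows   :", len(check))
```

Passing index=False keeps the DataFrame's row numbers out of the file, which is why only the two named columns come back.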

Make links absolute: many sites use relative links like /society/... that only resolve on the site itself. Prepend the base URL before saving so every link in the CSV is complete and still opens from a spreadsheet or another script.
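
The scraper above joins the strings by hand. As an alternative sketch (a swapped-in technique, not what the chapter's code does), the standard library's urljoin handles both relative and already-absolute links, so no startswith check is needed.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.kathimerini.gr"

hrefs = [
    # A relative link taken from the chapter's snapshot.
    "/society/dikastiko/564187411/sto-ypoyrgeio-dikaiosynis-i-kovesi-synantisi-me-floridi/",
    # An already-absolute link; urljoin returns it unchanged.
    "https://www.kathimerini.gr/epikairothta/",
]

absolute = [urljoin(BASE_URL, href) for href in hrefs]

for link in absolute:
    print(link)
```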

Part Five

Your Turn — Scrape One Page

The snapshot below contains four article cards. Extract the titles and links, build a DataFrame, and preview the CSV text.

Python · Your turn

from bs4 import BeautifulSoup
import pandas as pd

html = """
<div class="latest-news">
  <a class="py-4 mainlink" href="/economy/business/564187087/epekteinetai-i-cordia-me-exagora-tis-ellinikis-energeiakis-zeb/">
    <span class="card-title">Επεκτείνεται η CORDIA με εξαγορά της ελληνικής ενεργειακής ZEB</span>
  </a>
  <a class="py-4 mainlink" href="/society/dikastiko/564187411/sto-ypoyrgeio-dikaiosynis-i-kovesi-synantisi-me-floridi/">
    <span class="card-title">Στο υπουργείο Δικαιοσύνης η Κοβέσι – Συνάντηση με Φλωρίδη</span>
  </a>
  <a class="py-4 mainlink" href="/athletics/football/564186910/soyper-ligk-1-krisimi-agonistiki-gia-ta-plei-aoyt/">
    <span class="card-title">Σούπερ Λιγκ 1: Κρίσιμη αγωνιστική για τα πλέι άουτ</span>
  </a>
  <a class="py-4 mainlink" href="/culture/music/564187225/pou-vgainoume-simera-tetarti-protaseis-gia-theatro-synaylies-ektheseis-ekdiloseis/">
    <span class="card-title">Πού βγαίνουμε σήμερα Τετάρτη; Προτάσεις για θέατρο, συναυλίες, εκθέσεις, εκδηλώσεις</span>
  </a>
</div>
"""

BASE_URL = "https://www.kathimerini.gr"

soup = BeautifulSoup(html, "html.parser")
records = []

for card in soup.find_all("a", class_="mainlink"):
    title = card.find("span", class_="card-title").get_text(" ", strip=True)
    link = BASE_URL + card.get("href")
    records.append({"title": title, "link": link})

df = pd.DataFrame(records)
print(df.to_string(index=False))
print()
print(df.to_csv(index=False))

What you learned in this chapter: how one article becomes one dictionary, how many dictionaries become a list, how pandas turns that list into a DataFrame, and how to save a one-page scraper to CSV. In the final chapter you will repeat the same idea across page 1, page 2, page 3, and beyond.
