Book 5 — Web Scraping with Python

Python for All

Chapter Eight — Pagination and the Full Pipeline

Thanasis Troboukis


This final chapter puts everything together: string cleaning, range(), loops, lists, dictionaries, DataFrames, and CSV export. The goal is simple: scrape article titles and links from page 1, page 2, page 3, and save the combined result.

Cleaning Titles and Links

Raw text and raw URLs often need one small cleaning step before you store them. Titles may contain extra whitespace. Links may be relative paths that need the site's base URL.
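A minimal sketch of both cleaning steps, using `str.strip()` for whitespace and the standard-library `urljoin()` for relative paths. The base URL and raw values here are hypothetical stand-ins; substitute the site you are actually scraping:

```python
from urllib.parse import urljoin

# Hypothetical base URL for illustration.
BASE_URL = "https://example.com"

raw_title = "   Breaking: Python for All   \n"
raw_link = "/articles/python-for-all/"

# Habit 1: clean the text before storing it.
title = raw_title.strip()

# Habit 2: store full absolute links, not relative paths.
# urljoin() leaves a link unchanged if it is already absolute.
link = urljoin(BASE_URL, raw_link)

print(title)  # Breaking: Python for All
print(link)   # https://example.com/articles/python-for-all/
```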

Two useful habits: clean the text before storing it, and save full absolute links in your CSV whenever possible.

Building the Logic Inline

You do not need extra abstractions to scrape multiple pages. You can build the page URL, clean the link, and create the article dictionaries directly inside the loop.
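Here is one way that inline approach can look. The `(title, href)` pairs below are stand-ins for values extracted from a page, and `example.com` is a hypothetical base URL; everything happens inside the loop with only basic tools:

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com"  # hypothetical site

# Stand-ins for the (title, href) pairs you would extract from one page.
raw_results = [
    ("  First story ", "/post-1/"),
    ("Second story", "/post-2/"),
]

articles = []
for position, (raw_title, raw_link) in enumerate(raw_results, start=1):
    title = raw_title.strip()           # clean the text inline
    link = urljoin(BASE_URL, raw_link)  # build the absolute link inline
    # Create the article dictionary directly inside the loop.
    articles.append({"position": position, "title": title, "link": link})

print(articles)
```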


This version uses only the basic tools from the earlier books: variables, if/else, for loops, enumerate(), dictionaries, and list.append().

Basic Python review: you can build a real multi-page scraper with just loops, conditionals, dictionaries, and lists.

The Complete One-Page Scraper

Before looping through many pages, make sure page 1 works cleanly from end to end.
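A sketch of the full single-page pipeline. The static HTML stands in for `requests.get(BASE_URL).text` so the steps run without a network connection, and the `article > h2 > a` structure and `example.com` are assumptions; swap in the real site's URL and tags:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd

BASE_URL = "https://example.com"  # hypothetical site

# Static HTML standing in for a live fetch:
# html = requests.get(BASE_URL).text
html = """
<article><h2><a href="/story-1/">  First headline </a></h2></article>
<article><h2><a href="/story-2/">Second headline</a></h2></article>
"""

soup = BeautifulSoup(html, "html.parser")

articles = []
for tag in soup.select("article h2 a"):
    articles.append({
        "title": tag.get_text().strip(),         # clean the text
        "link": urljoin(BASE_URL, tag["href"]),  # absolute link
    })

# From list of dictionaries to DataFrame to CSV.
df = pd.DataFrame(articles)
df.to_csv("articles.csv", index=False)
print(df)
```

Once this runs cleanly end to end, the multi-page version is only a loop away.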

Build in stages: do not start with a multi-page loop. First make page 1 work. Then generalise.

Looping Through Page 1, Page 2, Page 3

Pagination is just a loop. Use range() to visit page 1, then page 2, then page 3, and collect all results into one big list.
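A sketch of that loop. The live requests/BeautifulSoup calls are left as comments so the URL-building pattern runs anywhere; `example.com` and the `/page/N/` scheme are hypothetical stand-ins for the real site:

```python
BASE_URL = "https://example.com"  # hypothetical paginated site

all_articles = []  # one big list for every page's results

for page_number in range(1, 4):  # pages 1, 2, 3
    # Special case: page 1 usually has no /page/1/ suffix.
    if page_number == 1:
        page_url = BASE_URL
    else:
        page_url = f"{BASE_URL}/page/{page_number}/"

    # response = requests.get(page_url)
    # soup = BeautifulSoup(response.text, "html.parser")
    # ...extract titles and links exactly as on page 1, then:
    # all_articles.append({"title": title, "link": link, "page": page_number})

    print(page_url)
```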


That is the full pagination pattern: the special-case URL logic sits at the top of the loop, and the extraction logic runs right beneath it.

What changed from Chapter 7? Only two things: a small if/else block that builds the correct page URL, and a for page_number in range(...) loop that repeats the same scraper across multiple pages.

Your Turn — Simulate Pagination

The cell below simulates page 1 and page 2 with static HTML, so you can run the multi-page logic directly in the browser.
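A sketch of that simulation: two static HTML strings play the role of page 1 and page 2, and the loop collects everything into one list. The tag structure and `example.com` base URL are assumptions for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"  # hypothetical site

# Two static "pages" standing in for the live site.
pages = {
    1: '<article><h2><a href="/a/">Story A</a></h2></article>',
    2: '<article><h2><a href="/b/">Story B</a></h2></article>',
}

all_articles = []
for page_number in range(1, 3):  # pages 1 and 2
    html = pages[page_number]  # live version: requests.get(page_url).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.select("article h2 a"):
        all_articles.append({
            "title": tag.get_text().strip(),
            "link": urljoin(BASE_URL, tag["href"]),
            "page": page_number,
        })

print(all_articles)
```

Swapping the `pages` dictionary for real requests.get() calls turns this into the live scraper.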

What you learned in this book: how to fetch live HTML with requests; how to search the page with BeautifulSoup; how to extract titles and links; how to build dictionaries and DataFrames; how to write CSV files; and how to use loops plus range() to scrape multiple pages like .../page/2/ and .../page/3/. You now have a complete template for scraping any paginated news feed.
