Book 5 — Web Scraping with Python

Python for All

Chapter One — What is Web Scraping?

Thanasis Troboukis


Millions of public web pages contain useful data with no download button. In this book you will scrape the latest-news feed at https://www.kathimerini.gr/epikairothta/, collect article titles and links, turn them into dictionaries and DataFrames, and save them to CSV files.

The Web as a Data Source

News websites, public institutions, and company sites publish data every day, but they publish it as webpages for humans to read. A page of latest news already contains structured information: article titles, links, dates, authors, and categories. The job of a scraper is to read that structure automatically.

In this book, the live source is the latest-news page of Kathimerini. By the end of Chapter 8, you will have a script that visits the first page, then page 2, page 3, and so on, extracts the article titles and links, and stores everything in CSV format.

What you need: Two Python libraries — requests to fetch the page and beautifulsoup4 to read the HTML. Install them once from the terminal with pip install requests beautifulsoup4 pandas, or run !pip install requests beautifulsoup4 pandas in a Jupyter notebook cell. They are already loaded in this browser environment.

HTML Is Just Text

When your browser opens a webpage, what it receives is a text file full of HTML tags. A title is not a magical object. It is just text inside tags like <a>, <span>, and <time>.

Here is a small realistic slice of the Kathimerini page as raw HTML:

Python · Try it

      
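A sketch of what such a slice could look like as a Python string. The tag structure (an `<a>` for the link, a `<span>` for the title, a `<time>` for the timestamp) mirrors the chapter, but the class names and the URL below are invented for illustration, not Kathimerini's actual markup:

```python
# Illustrative HTML for one news card. The class names and URL are
# invented; the real Kathimerini markup will differ in its details.
html = """
<article class="latest-news-item">
  <a href="https://www.kathimerini.gr/example-article/">
    <span class="title">Example article title</span>
  </a>
  <time datetime="2024-05-01T09:30:00">01.05.2024</time>
</article>
"""

print(html)  # to Python, the page is just this string
```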

The browser turns that text into a nicely formatted news card. Python sees the same thing as a string, which is exactly what makes it machine-readable.

Tags and attributes: The visible text is inside the tags. Extra information such as the article URL or publication timestamp lives in the attributes: href="..." and datetime="...".

BeautifulSoup to the Rescue

Reading HTML with raw string slicing is fragile. BeautifulSoup solves this by turning the HTML string into a navigable Python object that you can search.

Python · Try it

      
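A minimal sketch of the pattern, using a toy HTML string (the class name and URL are invented for illustration): parse the string, find a tag, read its text, and read an attribute.

```python
from bs4 import BeautifulSoup

# Toy HTML for one news card; illustrative markup, not the real page.
html = (
    '<article class="latest-news-item">'
    '<a href="https://www.kathimerini.gr/example-article/">'
    '<span class="title">Example article title</span></a>'
    '<time datetime="2024-05-01T09:30:00"></time>'
    '</article>'
)

soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")                 # find the tag
print(link.get_text())                # get the text
print(link["href"])                   # get the attribute value
print(soup.find("time")["datetime"])  # attributes work like dictionary keys
```

In a real scraper, the only change is where `html` comes from: instead of a hand-written string, it will be `response.text` from a `requests` call.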

That is the core idea of scraping: find the tag, get the text, get the attribute value. Later chapters will repeat this pattern with loops, dictionaries, and functions.

The line soup = BeautifulSoup(html, "html.parser") takes two arguments. The first is the HTML string you want to parse — in your scraper this will be response.text. The second is the parser: the engine that reads the raw text and builds the internal tree structure. BeautifulSoup supports several parsers. The one you pass determines how the HTML is interpreted and how tolerant the parser is of malformed markup.

Parser       | How to use                          | Install                          | Notes
html.parser  | BeautifulSoup(html, "html.parser")  | Built into Python — no install   | Good default. Handles real-world pages well.
lxml         | BeautifulSoup(html, "lxml")         | pip install lxml                 | Very fast. Best choice for large pages or speed-critical scripts.
lxml-xml     | BeautifulSoup(xml, "lxml-xml")      | pip install lxml                 | For XML documents, not HTML.
html5lib     | BeautifulSoup(html, "html5lib")     | pip install html5lib             | Parses exactly like a browser. Slowest, but most lenient with broken HTML.

Note on load time: The first time you press Run in a Book 5 chapter, the page loads Python, BeautifulSoup, and pandas. This can take a few extra seconds. After that, cells run instantly.

What You Can and Cannot Scrape

Before scraping a site, check whether there is an API or a direct download first. If not, look at the site's robots.txt file and be polite with your request rate.

A latest-news page is a classic beginner use case: public, regularly updated, and clearly structured. You still have to behave responsibly and avoid making unnecessary requests.

Most well-maintained sites publish a robots.txt file at their root — for example https://www.kathimerini.gr/robots.txt. It tells crawlers which paths they may and may not visit. Here is a simplified example:

User-agent: *
Allow: /epikairothta/
Allow: /epikairothta/page/
Disallow: /wp-admin/
Crawl-delay: 10

User-agent: * means the rules apply to all bots. Allow lines list paths that are explicitly permitted — scraping /epikairothta/ and its paginated sub-pages is fine. Disallow: /wp-admin/ means the admin backend is off-limits. Crawl-delay: 10 asks crawlers to wait at least 10 seconds between requests — a rule you should respect when running a script that fetches many pages in a row.
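Python's standard library can read these rules for you. The sketch below parses the simplified robots.txt from the text above, pasted in as a string so the example runs offline; a real script would instead point the parser at the live file with `set_url(...)` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# The simplified robots.txt from the text, supplied as lines of text.
rules = """\
User-agent: *
Allow: /epikairothta/
Allow: /epikairothta/page/
Disallow: /wp-admin/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# May a generic bot ("*") fetch these paths?
print(rp.can_fetch("*", "https://www.kathimerini.gr/epikairothta/"))  # allowed
print(rp.can_fetch("*", "https://www.kathimerini.gr/wp-admin/"))      # disallowed
print(rp.crawl_delay("*"))                                            # seconds to wait
```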

The short checklist: Is there an API? Is there a CSV download? Does robots.txt allow the path? Are you using delays and small page counts while learning? If yes, you are on the right track.

Your Turn — Read the Structure

prettify() re-formats HTML so you can inspect its structure before writing selectors. This is one of the best debugging habits in scraping.

Python · Your turn

      
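One way to try this, using a toy one-line HTML string (illustrative markup, not the real page): the cramped input makes the nesting hard to see, and `prettify()` re-indents it with one tag per line.

```python
from bs4 import BeautifulSoup

# A cramped one-line string: the nesting is hard to read by eye.
html = '<div><article><a href="/example/"><span>Title</span></a></article></div>'

soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())  # re-indented, one tag per line
```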
What you learned in this chapter: that web pages are structured text; that article titles and links live in predictable tags and attributes; that BeautifulSoup turns raw HTML into a searchable object; and that prettify() helps you read the structure before you scrape it. In the next chapter you will fetch the real Kathimerini page with requests.
