Book 5 · Chapter Six — Extracting Text and Attributes

Part One

stripped_strings — Line by Line

When a tag contains nested text on separate lines, stripped_strings yields the pieces one by one, each with its surrounding whitespace removed, and skips pieces that are only whitespace. This is useful when a card contains the date, title, and author as separate text nodes.

Python · Try it
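A minimal sketch of the idea, assuming a small hypothetical card (the HTML, class names, and text are made up for illustration). It prints the individual pieces from stripped_strings, then joins the title into one line with get_text(" ", strip=True).

```python
from bs4 import BeautifulSoup

# Hypothetical card: date, title, and author live in separate text nodes.
# The title is split in two by a <br> line break.
html = """
<div class="card">
  <span class="date">22.04.2026</span>
  <span class="title">Weekend guide:<br>
    theatre and concerts</span>
  <span class="author">A. Writer</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="card")

# Each non-empty text node, trimmed, in document order.
pieces = list(card.stripped_strings)
print(pieces)
# ['22.04.2026', 'Weekend guide:', 'theatre and concerts', 'A. Writer']

# Join only the title's pieces with a single space.
title_tag = card.find("span", class_="title")
one_line = title_tag.get_text(" ", strip=True)
print(one_line)
# Weekend guide: theatre and concerts
```

Note that strip=True trims each text node separately. It does not collapse whitespace that sits inside a single text node.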

This gives you separate text nodes instead of one long messy string. You will often use it together with get_text(" ", strip=True) when a title contains line breaks.

Good default: First inspect list(tag.stripped_strings). If the text is already nicely separated, keep it. If you want one clean sentence, use tag.get_text(" ", strip=True).

Part Two

Reading Attributes

The visible title is text. The article URL and publication timestamp are attributes. Use tag.get("href") and tag.get("datetime") to read them.

Python · Try it
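A short sketch, assuming a hypothetical snippet with one time tag and one link (the URL and title are placeholders). It reads the visible text and the two attributes, then parses the timestamp with the standard library to show why the ISO value is convenient.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical snippet; the href and title are placeholders.
html = """
<article>
  <time datetime="2026-04-22T08:53:00+03:00">22.04.2026 • 08:53</time>
  <a href="/culture/example-article/">Example article title</a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
time_tag = soup.find("time")
link_tag = soup.find("a")

title = link_tag.get_text(strip=True)      # text, meant for humans
url = link_tag.get("href")                 # attribute
published_at = time_tag.get("datetime")    # attribute

# The ISO 8601 string parses directly into a timezone-aware datetime.
published = datetime.fromisoformat(published_at)

print(title)
print(url)
print(published_at)
print(published.year, published.month, published.day)
```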

Attributes often hold the most useful machine-readable values on a page. The title is for humans. The link and the ISO datetime are perfect for data collection.

Use .get() for safety: tag["href"] raises a KeyError if the attribute is missing. tag.get("href") returns None instead.
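The difference can be seen on a link that has no href at all. This is a minimal sketch with a made-up tag; the empty-string fallback is just one possible default.

```python
from bs4 import BeautifulSoup

# Hypothetical link with no href attribute.
soup = BeautifulSoup('<a class="mainlink">No link here</a>', "html.parser")
tag = soup.find("a")

safe = tag.get("href")          # None, no exception
fallback = tag.get("href", "")  # optional default value, here ""

try:
    tag["href"]                 # dictionary-style access
    raised = False
except KeyError:
    raised = True

print(safe, repr(fallback), raised)
```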

Part Three

Navigating with .parent

Sometimes you find the smallest useful tag first, like span.card-title, and then need to move upward to reach the link or the full article card.

Python · Try it
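A sketch of climbing upward, assuming a simplified, hypothetical version of the article card. It starts at span.card-title, steps up to the link with .parent, and then up again to the card. find_parent() is shown as an alternative that searches upward for a matching tag instead of counting levels.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical card; the href and text are placeholders.
html = """
<div class="article-item">
  <a class="mainlink" href="/culture/example-article/">
    <span class="card-title">Example article title</span>
  </a>
  <span class="author-link">A. Writer</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start from the most specific tag.
title_tag = soup.find("span", class_="card-title")

link_tag = title_tag.parent   # one level up: the <a>
card = link_tag.parent        # two levels up: the <div>

# Alternative: search upward for the first matching ancestor.
card_again = title_tag.find_parent("div", class_="article-item")

print(link_tag.name, link_tag.get("href"))
print(card.name, card.get("class"))
print(card_again is card)
```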

This pattern is common in scraping: locate the most specific tag first, then climb to the surrounding container.

Think in containers: once you have the full article card, you can search inside it for the time, title, and author without mixing data from nearby cards.
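To illustrate why the container matters, here is a sketch with two hypothetical cards on the same page. For each title it climbs to its own card and looks up the author inside that card only, so the authors cannot get mixed up.

```python
from bs4 import BeautifulSoup

# Two hypothetical cards; all names and titles are placeholders.
html = """
<div class="article-item">
  <a class="mainlink" href="/first/"><span class="card-title">First title</span></a>
  <span class="author-link">Author One</span>
</div>
<div class="article-item">
  <a class="mainlink" href="/second/"><span class="card-title">Second title</span></a>
  <span class="author-link">Author Two</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for title_tag in soup.find_all("span", class_="card-title"):
    # Climb to the card that surrounds this particular title.
    card = title_tag.find_parent("div", class_="article-item")
    # Search inside that card only.
    author_tag = card.find("span", class_="author-link")
    pairs.append((title_tag.get_text(strip=True), author_tag.get_text(strip=True)))

print(pairs)
# [('First title', 'Author One'), ('Second title', 'Author Two')]
```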

Part Four

find_next_sibling()

The article card places related fields next to each other: time, then link, then author. find_next_sibling() lets you move across that structure.

Python · Try it
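A sketch of sideways movement, assuming a simplified, hypothetical card. One detail to watch: the time tag is wrapped in span.posted-on, so the siblings of the link are that wrapper and the author span, not the time tag itself. The example therefore starts from the wrapper.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical card; the href and text are placeholders.
html = """
<div class="article-item">
  <span class="posted-on">
    <time datetime="2026-04-22T08:53:00+03:00">22.04.2026 • 08:53</time>
  </span>
  <a class="mainlink" href="/culture/example-article/">
    <span class="card-title">Example article title</span>
  </a>
  <span class="author-link">A. Writer</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start at the wrapper around the time tag.
date_wrapper = soup.find("span", class_="posted-on")

# Move right to the next <a> on the same level.
link_tag = date_wrapper.find_next_sibling("a")

# Move right again to the author span.
author_tag = link_tag.find_next_sibling("span", class_="author-link")

print(link_tag.get("href"))
print(author_tag.get_text(strip=True))
```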

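Use find_next_sibling(), not .next_sibling. The .next_sibling attribute returns the very next node, which is often just a text node holding a newline or indentation. find_next_sibling() skips text nodes and returns the next tag.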
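The difference shows up as soon as the HTML is indented. A minimal sketch with a made-up snippet:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical snippet with ordinary indentation between the tags.
html = """
<div>
  <span class="posted-on">22.04.2026</span>
  <a href="/culture/example-article/">Example article title</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
date_tag = soup.find("span", class_="posted-on")

raw = date_tag.next_sibling              # the whitespace between the tags
next_tag = date_tag.find_next_sibling()  # the next real tag

print(type(raw).__name__, repr(raw))
print(type(next_tag).__name__, next_tag.name)
```

Here raw is a NavigableString containing only a newline and spaces, while next_tag is the a tag.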

Good use case: when a page puts fields side-by-side instead of nesting them inside the same tag, sibling navigation is usually the cleanest solution.

Part Five

Your Turn — Build One Record

Extract the date, title, author, and link from the card below, then store them in a dictionary.

Python · Your turn

from bs4 import BeautifulSoup

html = """
<div class="design_one article-item is-flex is-flex-direction-column author-article">
  <span class="posted-on">
    <time class="entry-date published" datetime="2026-04-22T08:53:00+03:00">22.04.2026 • 08:53</time>
  </span>
  <a class="py-4 mainlink" href="/culture/music/564187225/pou-vgainoume-simera-tetarti-protaseis-gia-theatro-synaylies-ektheseis-ekdiloseis/">
    <span class="card-title">Πού βγαίνουμε σήμερα Τετάρτη; Προτάσεις για θέατρο, συναυλίες, εκθέσεις, εκδηλώσεις</span>
  </a>
  <span class="mb-2 author-link">Ελένη Σαμπάνη</span>
</div>
"""

# Parse the snippet and isolate the card so every search stays inside it.
soup = BeautifulSoup(html, "html.parser")
article = soup.find("div", class_="article-item")

# Locate the tag that holds each field.
time_tag = article.find("time")
link_tag = article.find("a", class_="mainlink")
title_tag = article.find("span", class_="card-title")
author_tag = article.find("span", class_="author-link")

record = {
    "published_at": time_tag.get("datetime"),
    "title": title_tag.get_text(" ", strip=True),
    "author": author_tag.get_text(" ", strip=True),
    "link": link_tag.get("href"),
}

print(record)

What you learned in this chapter: how to extract text cleanly, how to read attributes like href and datetime, how to move upward with .parent, how to move sideways with find_next_sibling(), and how to turn one article card into one Python dictionary. In the next chapter you will do this for many cards and load the result into a DataFrame.
