Book 5 — Web Scraping with Python

Python for All

Chapter Six — Extracting Text and Attributes

Thanasis Troboukis

Extracting Text and Attributes

The latest-news page already contains the exact fields you want: publication time, article title, author, and link. In this chapter you will extract those values cleanly from the article cards.

stripped_strings — Line by Line

When a tag contains nested text on separate lines, stripped_strings gives you the clean pieces one by one. This is useful when a card contains the date, title, and author as separate text nodes.

Python · Try it
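A minimal sketch of this, assuming one article card shaped like the sample markup below. The `card` and `card-author` class names, the URL, and all the values are made up for illustration; only `span.card-title` is taken from this chapter.

```python
from bs4 import BeautifulSoup

# Hypothetical article card; the real page's markup may differ.
html = """
<article class="card">
  <time datetime="2024-05-01T09:30:00">1 May 2024, 09:30</time>
  <a href="/news/city-budget">
    <span class="card-title">City council approves new budget</span>
  </a>
  <span class="card-author">Maria Example</span>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("article")

# stripped_strings is a generator: it yields each text node with
# surrounding whitespace removed and skips whitespace-only nodes.
pieces = list(card.stripped_strings)
print(pieces)
# ['1 May 2024, 09:30', 'City council approves new budget', 'Maria Example']
```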

      

This gives you separate text nodes instead of one long messy string. You will often use it together with get_text(" ", strip=True) when a title contains line breaks.

Good default: First inspect list(tag.stripped_strings). If the text is already nicely separated, keep it. If you want one clean sentence, use tag.get_text(" ", strip=True).
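To make the difference concrete, here is a small sketch that assumes a title split by a `<br>` tag. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical title that the page breaks across two lines.
html = """<span class="card-title">City council approves
    <br>new budget</span>"""

title = BeautifulSoup(html, "html.parser").find("span", class_="card-title")

# Step 1: inspect the separate pieces.
parts = list(title.stripped_strings)
print(parts)     # ['City council approves', 'new budget']

# Step 2: join the stripped pieces with a single space.
one_line = title.get_text(" ", strip=True)
print(one_line)  # City council approves new budget
```

The first argument of `get_text()` is the separator placed between the pieces, so `" "` keeps words from running together where the line break was.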

Reading Attributes

The visible title is text. The article URL and publication timestamp are attributes. Use tag.get("href") and tag.get("datetime") to read them.

Python · Try it
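A short sketch of reading both attributes, assuming a card with a `<time>` tag and an `<a>` tag like the sample below. The URL and timestamp are placeholder values.

```python
from bs4 import BeautifulSoup

# Hypothetical card fragment with one link and one timestamp.
html = """
<article class="card">
  <time datetime="2024-05-01T09:30:00">1 May 2024, 09:30</time>
  <a href="/news/city-budget">
    <span class="card-title">City council approves new budget</span>
  </a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
time_tag = soup.find("time")

url = link.get("href")                # value of the href attribute
published = time_tag.get("datetime")  # value of the datetime attribute
visible_date = time_tag.get_text(strip=True)  # what a reader sees

print(url)           # /news/city-budget
print(published)     # 2024-05-01T09:30:00
print(visible_date)  # 1 May 2024, 09:30
```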

      

Attributes often hold the most useful machine-readable values on a page. The title is for humans. The link and the ISO datetime are perfect for data collection.

Use .get() for safety: tag["href"] raises a KeyError if the attribute is missing. tag.get("href") returns None instead.
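The difference is easy to check on a link that has no `href` at all. This sketch uses a made-up tag for that purpose.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor tag without an href attribute.
tag = BeautifulSoup("<a>No link here</a>", "html.parser").find("a")

# .get() returns None when the attribute is absent.
missing = tag.get("href")
print(missing)  # None

# .get() also accepts a default value, like dict.get().
fallback = tag.get("href", "")

# Square brackets raise KeyError for a missing attribute.
try:
    tag["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```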

Navigating with .parent

Sometimes you find the smallest useful tag first, like span.card-title, and then need to move upward to reach the link or the full article card.

Python · Try it
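A minimal sketch of climbing upward, assuming the title span sits inside the link and the link sits inside an `<article>` card. That nesting is an assumption for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical card: span.card-title inside <a>, inside <article>.
html = """
<article class="card">
  <a href="/news/city-budget">
    <span class="card-title">City council approves new budget</span>
  </a>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Start from the smallest useful tag.
title = soup.find("span", class_="card-title")

# One step up reaches the link, two steps up reaches the card.
link = title.parent
card = link.parent

print(link.name, link.get("href"))  # a /news/city-budget
print(card.name)                    # article

# find_parent() climbs until it meets a matching tag, so it still
# works if the page adds another wrapper between span and card.
same_card = title.find_parent("article")
```

Counting `.parent` steps is fine for a fixed layout. `find_parent("article")` is the more tolerant choice when the depth may vary.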

      

This pattern is common in scraping: locate the most specific tag first, then climb to the surrounding container.

Think in containers: once you have the full article card, you can search inside it for the time, title, and author without mixing data from nearby cards.
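The sketch below illustrates the point with two hypothetical cards. Searching the whole page for an author returns the first card's author, while searching inside the right container returns the one that belongs to the title. Class names other than `card-title` are assumptions.

```python
from bs4 import BeautifulSoup

# Two hypothetical cards on the same page.
html = """
<article class="card">
  <a href="/news/city-budget"><span class="card-title">City budget</span></a>
  <span class="card-author">Maria Example</span>
</article>
<article class="card">
  <a href="/news/new-tram"><span class="card-title">New tram line</span></a>
  <span class="card-author">Nikos Sample</span>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Pick the second title, then climb to its own card.
second_title = soup.find_all("span", class_="card-title")[1]
card = second_title.find_parent("article")

# Searching inside the card stays within that card.
author_in_card = card.find("span", class_="card-author").get_text(strip=True)

# Searching the whole page returns the first match on the page.
author_on_page = soup.find("span", class_="card-author").get_text(strip=True)

print(author_in_card)  # Nikos Sample
print(author_on_page)  # Maria Example
```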

find_next_sibling()

The article card places related fields next to each other: time, then link, then author. find_next_sibling() lets you move across that structure.

Python · Try it
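A small sketch of walking sideways, assuming the time, link, and author are direct children of the same card in that order. The markup and values are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical card with three sibling tags: time, a, span.
html = """
<article class="card">
  <time datetime="2024-05-01T09:30:00">1 May 2024, 09:30</time>
  <a href="/news/city-budget">
    <span class="card-title">City council approves new budget</span>
  </a>
  <span class="card-author">Maria Example</span>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

time_tag = soup.find("time")

# Move from the time to the next <a> at the same level.
link = time_tag.find_next_sibling("a")

# Move from the link to the next <span> at the same level.
author_tag = link.find_next_sibling("span")

print(link.get("href"))                 # /news/city-budget
print(author_tag.get_text(strip=True))  # Maria Example
```

Passing a tag name to `find_next_sibling()` skips any siblings that do not match, which keeps the code working if the page inserts an extra element between the fields.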

      

Use find_next_sibling(), not .next_sibling. The raw .next_sibling is often just a newline or whitespace text node rather than the next tag.
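You can see the whitespace node directly. This sketch assumes indented markup like the sample below, which is how most pages are written.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical card; note the newline and indentation between tags.
html = """<article class="card">
  <time datetime="2024-05-01T09:30:00">1 May 2024, 09:30</time>
  <a href="/news/city-budget">City council approves new budget</a>
</article>"""

soup = BeautifulSoup(html, "html.parser")
time_tag = soup.find("time")

# .next_sibling returns the very next node, here a text node that
# holds only the newline and indentation.
raw = time_tag.next_sibling
print(repr(raw))                         # '\n  '
print(isinstance(raw, NavigableString))  # True

# find_next_sibling() only considers tags, so it lands on <a>.
next_tag = time_tag.find_next_sibling()
print(next_tag.name)                     # a
```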

Good use case: when a page puts fields side-by-side instead of nesting them inside the same tag, sibling navigation is usually the cleanest solution.

Your Turn — Build One Record

Extract the date, title, author, and link from the card below, then store them in a dictionary.

Python · Your turn
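One possible solution, assuming a card shaped like the sample markup below. The `card` and `card-author` class names and all values are placeholders, so adjust the selectors to match the card you are working with.

```python
from bs4 import BeautifulSoup

# Hypothetical article card used for this exercise.
html = """
<article class="card">
  <time datetime="2024-05-01T09:30:00">1 May 2024, 09:30</time>
  <a href="/news/city-budget">
    <span class="card-title">City council approves new budget</span>
  </a>
  <span class="card-author">Maria Example</span>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Work inside one container so nothing leaks in from other cards.
card = soup.find("article")
time_tag = card.find("time")
link = card.find("a")
author_tag = card.find("span", class_="card-author")

# One card becomes one dictionary.
record = {
    "date": time_tag.get("datetime"),
    "title": link.get_text(" ", strip=True),
    "author": author_tag.get_text(" ", strip=True),
    "link": link.get("href"),
}

print(record)
```

The machine-readable `datetime` attribute is stored as the date here. If you prefer the visible date, use `time_tag.get_text(strip=True)` instead.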

      
What you learned in this chapter: how to extract text cleanly, how to read attributes like href and datetime, how to move upward with .parent, how to move sideways with find_next_sibling(), and how to turn one article card into one Python dictionary. In the next chapter you will do this for many cards and load the result into a DataFrame.
