Book 5 — Web Scraping with Python

Python for All

Chapter Three — Your First BeautifulSoup Object

Thanasis Troboukis

BeautifulSoup turns a raw HTML string into a Python object you can navigate like a document. In this chapter you learn to create that object, access any tag directly by name, and use get_text() to read the visible content of any element.

Creating a Soup Object

You create a BeautifulSoup object by passing two arguments: the HTML string and a parser name. The parser is the engine that reads the raw text and builds the internal tree. Use "html.parser": it ships with Python's standard library, so there is nothing extra to install, and it handles every page you will encounter in this course.
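A minimal sketch of that call is shown here. The HTML string is a hypothetical sample invented for illustration, not the page used in the course.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = """
<div class="tabcontent">
  <h3>Active Incidents</h3>
  <p>Updated hourly.</p>
</div>
"""

# Two arguments: the HTML string and the parser name.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as an indented string.
pretty = soup.prettify()
print(pretty)
```

Printing the prettified string shows each tag on its own line, indented by nesting depth.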

The prettify() output is a debugging view, not the data you extract. It is the parsed document rendered as an indented string, so you can see every tag, every attribute, and every level of nesting. Whenever you are unsure what a page contains, run print(soup.prettify()) first and read the structure before writing any selectors.

Note on load time: The first time you press Run in a Book 5 chapter, the page loads both Python and the BeautifulSoup library. This can take 5–15 seconds. After that, all cells run instantly.

Accessing Tags by Name

The simplest way to reach an element is to access it as an attribute of the soup object: type soup. followed by the tag name. soup.h3 gives you the first <h3> on the page, soup.p gives you the first <p>, and soup.table gives you the first <table>. If the page contains no such tag, the result is None.
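The short sketch below illustrates tag-name access. The markup, including the tabcontent and panel class names on the two divs, is a hypothetical stand-in for the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an outer "tabcontent" div wrapping an inner "panel" div.
html = """
<div class="tabcontent">
  <h3>Active Incidents</h3>
  <div class="panel">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
    <table><tr><td>Attica</td><td>2024-07-01</td></tr></table>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.h3)            # the first <h3> tag, markup included
print(soup.p)             # only the first of the two <p> tags
print(soup.div["class"])  # ['tabcontent'], the outermost div comes first
print(soup.span)          # None, because the page has no <span>
```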

Notice that soup.div gives you the outermost <div> — the tabcontent one — not the inner panel div. Tag-name access always returns the first match in document order. To reach specific elements deeper in the page you will use find() and find_all(), which you will learn in the next chapter.

Tags have a .name property: Every BeautifulSoup tag object knows its own name: soup.h3.name returns the string "h3". This is useful when you are iterating over mixed elements and need to check what type each one is.
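As a small illustration of .name, the sketch below loops over the direct children of a div and collects each tag's name. The markup is hypothetical, and .children is a BeautifulSoup attribute that this chapter does not otherwise cover.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three different tags and no whitespace between them.
html = "<div><h3>Title</h3><p>Body text</p><table></table></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h3.name)  # h3

# Collect the tag name of every direct child of the div.
names = [child.name for child in soup.div.children]
print(names)  # ['h3', 'p', 'table']
```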

Reading Text with get_text()

Every BeautifulSoup tag has a get_text() method. It strips away all the HTML tags inside the element and returns only the human-readable text. This is the main method you will use throughout the book to extract actual values from elements.

The difference matters. Without arguments, get_text() runs all the text together with only the original whitespace — which can produce strings like "\n Αττική\n Τύπος: ...\n". Passing " " as a separator and strip=True gives you a clean, readable string. Always use get_text(" ", strip=True) unless you specifically need the raw form.

get_text() vs .string: A tag also has a .string attribute, which returns the text only when the element holds a single piece of content: either plain text, or exactly one child tag that itself has a .string. If the element contains more than one thing, such as text mixed with child tags, .string returns None. Use get_text() because it always works, regardless of inner structure.
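A brief sketch of the difference, using hypothetical markup with one plain-text paragraph and one div that mixes text with a child tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <p> holds only text, <div> holds text plus a <b> tag.
html = "<p>Plain text</p><div>Mixed <b>content</b></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.p.string)    # Plain text
print(soup.div.string)  # None, because the div holds more than one thing

# get_text() works in both cases.
print(soup.p.get_text(" ", strip=True))    # Plain text
print(soup.div.get_text(" ", strip=True))  # Mixed content
```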

Navigating Into Nested Tags

You can chain tag-name access to navigate down into the document tree. soup.div.table.tr.td walks down the nesting: the outermost div, then the first table inside it, then the first row inside that table, then the first cell in that row. Each step returns the first matching descendant of the previous tag, so the match does not have to be a direct child.
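The chain can be sketched as follows. The markup is a hypothetical example in which the table sits inside a second, inner div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the table is nested inside an inner "panel" div.
html = """
<div class="tabcontent">
  <div class="panel">
    <table>
      <tr><td>Attica</td><td>2024-07-01</td></tr>
    </table>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Each step takes the first match found anywhere inside the previous tag.
cell = soup.div.table.tr.td
print(cell.name)                       # td
print(cell.get_text(" ", strip=True))  # Attica
```

Note that soup.div is the outer div here, and the chain still reaches the table nested one level further down.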

Chaining is convenient for simple, predictable structures. In practice, real pages have more variation — sometimes a tag is missing, sometimes there are extra wrappers. The find() method you will learn next handles these cases more robustly, so it is what you will use most often.

A missing tag breaks the chain: If any step in the chain finds nothing (because the tag does not exist), that step returns None, and the next .something on None raises an AttributeError. A single soup.find("td") call, which you will learn in the next chapter, avoids the intermediate steps, although you should still check its result for None before using it.
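The failure can be reproduced with a short sketch. The markup is hypothetical and deliberately contains no table; the guard at the end is one possible way to handle the missing tag, not the only one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no <table> at all.
html = "<div><p>No table here</p></div>"
soup = BeautifulSoup(html, "html.parser")

table = soup.div.table
print(table)  # None

# Continuing the chain past the missing tag raises AttributeError.
chain_failed = False
try:
    soup.div.table.tr
except AttributeError as exc:
    chain_failed = True
    print("Chain broke:", exc)

# One possible guard: test for None before going deeper.
cell_text = table.td.get_text(" ", strip=True) if table is not None else None
print(cell_text)  # None
```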

Your Turn — Read a Card's Contents

The cell below contains a complete incident card. Use soup. tag-name access and get_text() to extract three things: the section heading, the location from the first cell, and the start date from the second cell.
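The card in the sketch below is a hypothetical stand-in, and its structure is an assumption: the location is wrapped in a <b> tag and the date in a <span> so that each can be reached by tag name. It shows one possible solution under that assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical incident card; tag choices and values are invented for illustration.
html = """
<div class="card">
  <h3>Active Incidents</h3>
  <table>
    <tr>
      <td><b>Attica</b></td>
      <td><span>2024-07-01</span></td>
    </tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.h3.get_text(" ", strip=True)     # section heading
location = soup.td.get_text(" ", strip=True)    # first cell
start_date = soup.span.get_text(" ", strip=True)  # text inside the second cell

print(heading)     # Active Incidents
print(location)    # Attica
print(start_date)  # 2024-07-01
```

Tag-name access only ever returns the first match, which is why the second cell is reached here through its inner <span> rather than through a second soup.td.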

What you learned in this chapter: how to create a BeautifulSoup object with BeautifulSoup(html, "html.parser"); how prettify() helps you inspect the structure; how to access tags directly with dot notation; how get_text(" ", strip=True) extracts clean text; and how to chain tag access to navigate nested structures. In the next chapter you will use find() and find_all() to search for elements across the whole document.
