How to Extract Text with Formatting Using BeautifulSoup?

Extracting text from HTML while preserving its formatting can be very useful for web scraping and data mining projects. The Python BeautifulSoup module provides easy ways to target text based on HTML tags and attributes to extract formatted text.

In this comprehensive tutorial, you'll learn different methods to scrape text from a web page and retain original formatting like bold, italics, and newlines using BeautifulSoup.

We'll cover:

  • Installing and importing BeautifulSoup
  • Parsing HTML
  • Targeting elements by tag, class and CSS selectors
  • Retrieving text with .get_text() and .strings
  • Preserving formatting with .encode_contents()
  • Controlling whitespace and newlines
  • Best practices for nested formatting

And more. By the end, you'll have the skills to extract text with formatting from any web page for your web scraping and data mining projects.

Introduction to Web Scraping

Web scraping refers to the practice of automatically collecting information from the web. This could include:

  • Extracting financial data from a company's website into a spreadsheet
  • Gathering product descriptions and pricing from an online store
  • Compiling headlines and article text from news sites

Web scrapers allow you to harvest unstructured data from HTML pages and convert it into structured, usable data.

Some common uses:

  • Price monitoring and market research
  • Sentiment analysis on social media posts
  • Building datasets for machine learning
  • Monitoring websites for new content
  • Researching trends using search engine scrapers

Web scrapers use a variety of tools to programmatically retrieve and parse content from the web. In this tutorial, we'll focus specifically on using Python and BeautifulSoup.

Installing BeautifulSoup

Beautiful Soup is a popular Python library that parses HTML and XML into a tree you can navigate, search, and modify, which makes it well suited to web scraping tasks.

To follow along, you'll need Beautiful Soup 4 installed. The easiest way is via pip:

pip install beautifulsoup4

This installs the bs4 package along with its dependencies, including soupsieve, which provides the CSS selector support behind .select().

You'll also need the requests module to fetch page content:

pip install requests

Now let's see how to use them to extract formatted text…

Parsing HTML with BeautifulSoup

The first step is to fetch the HTML content of a web page using requests, and parse it with BeautifulSoup.

This creates a parsed document tree that we can then navigate and search:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.content, 'html.parser')

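This parses page.content with html.parser, the parser that ships with Python's standard library.

BeautifulSoup also supports other parsers, such as lxml for speed and html5lib for spec compliance. Both are separate packages that must be installed before you can use them.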
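
As a rough illustration of choosing a parser, the sketch below tries lxml first and falls back to html.parser when lxml is not installed. The make_soup helper and the inline sample markup are hypothetical, added only so the example runs without a network request.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical inline markup so the example needs no network access.
html = "<html><body><p>This is <b>bold</b> and <i>italic</i>.</p></body></html>"

def make_soup(markup):
    """Parse with lxml if it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is unavailable.
        return BeautifulSoup(markup, "html.parser")

soup = make_soup(html)
print(soup.p.get_text())  # This is bold and italic.
```

Either parser gives the same result for this well-formed snippet; differences tend to show up only with malformed markup.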

Now let's look at different methods to target elements and extract text while preserving formatting.

Targeting Elements by Tag

To extract text with formatting, we can target HTML tags like <b> for bold text:

bold = soup.find_all("b")

for item in bold:
  print(item.text)

This prints out all the text enclosed in <b> tags.

Some other common formatting tags:

  • <i> – Italic text
  • <em> – Emphasized text
  • <strong> – Important text
  • <br> – Line break

For example, to extract all italicized text:

italics = soup.find_all("i")

for item in italics:
  print(item.text)
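
find_all() also accepts a list of tag names, which can be handy when bold and italic text may be marked up with either the presentational or the semantic tag. A small sketch, using made-up sample markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup mixing all four formatting tags.
html = "<p><b>bold</b>, <strong>strong</strong>, <i>italic</i>, <em>emphasis</em></p>"
soup = BeautifulSoup(html, "html.parser")

# Passing a list matches any of the named tags, in document order.
formatted = [(tag.name, tag.get_text())
             for tag in soup.find_all(["b", "strong", "i", "em"])]

print(formatted)
# [('b', 'bold'), ('strong', 'strong'), ('i', 'italic'), ('em', 'emphasis')]
```

Keeping the tag name next to the text lets later code tell bold from italic fragments.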

find_all() searches recursively by default, so these examples match tags at any depth below the element you call it on. To restrict the search to direct children only, pass recursive=False.
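
A quick way to see how find_all() treats nesting is to run it with and without the recursive argument on a small document. The markup below is invented for the demonstration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <b> directly inside the <div>, one nested in a <p>.
html = "<div><b>top-level bold</b><p>Some <b>nested bold</b> text</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default behaviour: search all descendants of the div.
all_bold = [b.get_text() for b in div.find_all("b")]

# recursive=False: only look at the div's direct children.
direct_bold = [b.get_text() for b in div.find_all("b", recursive=False)]

print(all_bold)     # ['top-level bold', 'nested bold']
print(direct_bold)  # ['top-level bold']
```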

Targeting Elements by CSS Class

Along with HTML tags, we can use CSS selectors to target elements by class or id attributes.

For example, to extract text from all elements with class article-text:

text = soup.select(".article-text")

print(text[0].get_text())

Some other useful CSS selectors:

  • #some-id – Elements with id attribute some-id
  • [target] – Elements with attribute target
  • .some-class p – Paragraphs inside elements with class some-class

Chaining multiple selectors lets you precisely target nested formatting.
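
The selectors listed above can be tried out on a small inline document. This is only a sketch; the ids, class names, and link target are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment.
html = """
<div id="main">
  <div class="article-text">
    <p>First <b>bold</b> paragraph.</p>
    <p>Second paragraph with a <a href="/x" target="_blank">link</a>.</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: paragraphs inside elements with class article-text.
paragraphs = [p.get_text() for p in soup.select(".article-text p")]

# Chained id, class, and tag selectors to reach nested bold text.
bold_in_main = [b.get_text() for b in soup.select("#main .article-text b")]

# Attribute selector: links that have a target attribute.
links = [a["href"] for a in soup.select("a[target]")]

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one(".article-text p")

print(paragraphs)    # ['First bold paragraph.', 'Second paragraph with a link.']
print(bold_in_main)  # ['bold']
print(links)         # ['/x']
```

select_one() avoids the IndexError that indexing an empty select() result would raise, at the cost of needing a None check.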

Preserving Line Breaks with .get_text()

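By default, the .get_text() method strips out all HTML tags and concatenates the remaining text nodes with no separator. Any spaces in the output come from the markup itself:

html = "<p>This is <b>bold</b> <i>text</i><br>Text with a line break</p>"
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())

# Output: This is bold textText with a line break

The <br> tag contains no text, so the line break is lost. Passing strip=True and separator='\n' puts a newline between every text node, not only where a <br> appears:

print(soup.get_text(strip=True, separator="\n"))

# Output:
# This is
# bold
# text
# Text with a line break

This works well when each element should land on its own line. To break lines only where the page has <br> tags, replace each <br> with a newline string, for example with br.replace_with("\n"), before calling .get_text().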
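
To make the difference concrete, the sketch below compares separator="\n" with replacing <br> tags by newline strings before extracting the text. The sample markup is an assumption chosen to include one <br>.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with inline formatting and a single <br>.
html = "<p>This is <b>bold</b> <i>text</i><br>Text with a line break</p>"
soup = BeautifulSoup(html, "html.parser")

# separator="\n" puts a newline between every text node.
split_everywhere = soup.get_text(separator="\n", strip=True)

# Swap each <br> for a newline string, then extract text as usual.
for br in soup.find_all("br"):
    br.replace_with("\n")
only_br = soup.get_text()

print(repr(split_everywhere))
# 'This is\nbold\ntext\nText with a line break'
print(repr(only_br))
# 'This is bold text\nText with a line break'
```

Note that replace_with() modifies the parsed tree in place, so parse the markup again if you still need the original <br> tags.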

Keeping Formatting Tags with .encode_contents()

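To keep the HTML formatting tags inside an element, use .encode_contents() instead of .get_text(). It returns the element's inner HTML as bytes and leaves out the element's own tag. .decode_contents() returns the same markup as a str, and str(tag) includes the enclosing tag as well:

bold = soup.find("b")

print(bold.encode_contents())
# Output: b'Bold text'

print(str(bold))
# Output: <b>Bold text</b>

Looping through elements and converting each one to a string preserves the formatting tags:

formatted_text = []

for item in soup.select("b, i, em"):
  formatted_text.append(str(item))

print(formatted_text)

This stores each element's HTML, tags included, in a list.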
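
The ways of serializing an element are easy to mix up, so here is a side-by-side sketch on one nested element. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: italic text nested inside bold text.
html = "<p>Some <b>bold and <i>nested italic</i></b> text</p>"
soup = BeautifulSoup(html, "html.parser")
bold = soup.find("b")

outer_html = str(bold)                # element with its own tag
inner_html = bold.decode_contents()   # inner markup as str
inner_bytes = bold.encode_contents()  # inner markup as bytes (UTF-8 by default)
plain = bold.get_text()               # text only, all tags removed

print(outer_html)   # <b>bold and <i>nested italic</i></b>
print(inner_html)   # bold and <i>nested italic</i>
print(inner_bytes)  # b'bold and <i>nested italic</i>'
print(plain)        # bold and nested italic
```

Calling .decode_contents() on a parent such as the <p> keeps every nested formatting tag in one string, which is often simpler than collecting elements one by one.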

Controlling Whitespace

When printing or processing extracted text, you may want to normalize whitespace.

Add .strip() to remove leading/trailing whitespace:

text = soup.get_text().strip()

To replace all whitespace with single spaces:

" ".join(text.split())

Or to remove blank lines:

"\n".join(line for line in text.split("\n") if line.strip())

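get_text() also accepts strip=True, which strips leading and trailing whitespace from each text node and skips nodes that are empty afterwards. Whitespace inside a text node is left untouched. The .stripped_strings generator yields the same cleaned fragments one at a time.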
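
The following sketch puts these whitespace options next to each other on deliberately messy markup. The sample HTML is an assumption made for the demonstration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray spaces, blank lines, and indentation.
html = """
<div>
  <p>  First   line with   extra   spaces. </p>

  <p>Second <b>bold</b>
     line.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

raw = soup.get_text()

# Collapse every run of whitespace (spaces, tabs, newlines) to one space.
collapsed = " ".join(raw.split())

# stripped_strings yields each text node with its edges stripped.
fragments = list(soup.stripped_strings)

# strip=True trims each node; the separator is placed between nodes.
stripped = soup.get_text(" ", strip=True)

print(collapsed)
# First line with extra spaces. Second bold line.
print(fragments)
# ['First   line with   extra   spaces.', 'Second', 'bold', 'line.']
print(stripped)
# First   line with   extra   spaces. Second bold line.
```

Only the split-and-join approach removes the repeated spaces inside a text node; strip=True and .stripped_strings trim just the edges.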

Best Practices for Nested Formatting

When scraping text with mixed nested formatting, the ideal method depends on your end goal:

  • To fully preserve formatting, keep the markup with str(tag), .decode_contents(), or .encode_contents()
  • To get clean text, use .get_text()
  • To partially preserve newlines and lists, use options like separator

In general:

  • Target elements by tag when extracting specific formatting like <b>
  • Use .select() for class/id attributes to extract generic content
  • Handle nested tags by looping and encoding each child element
  • Work in stages when you need to strip all markup first and then reapply some of the formatting

With the above tools, you can handle even complex nested HTML formatting.
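
One possible way to handle nested tags by looping over child elements is a small recursive walk that turns bold and italic tags into Markdown-style markers. This is a sketch, not part of BeautifulSoup: the MARKERS mapping and the to_marked_text function are hypothetical names, and the choice of ** and * as markers is an assumption.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical mapping from formatting tags to Markdown-style markers.
MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}

def to_marked_text(node):
    """Render a node as text, wrapping bold/italic content in markers."""
    if isinstance(node, NavigableString):
        return str(node)      # plain text node: keep as-is
    if node.name == "br":
        return "\n"           # turn line breaks into newlines
    # Recurse into the children so nested formatting is preserved.
    inner = "".join(to_marked_text(child) for child in node.children)
    marker = MARKERS.get(node.name, "")
    return f"{marker}{inner}{marker}"

html = ("<p>Plain, <b>bold with <i>nested italic</i></b>,"
        "<br>and <em>emphasis</em>.</p>")
soup = BeautifulSoup(html, "html.parser")
result = to_marked_text(soup.p)

print(result)
# Plain, **bold with *nested italic***,
# and *emphasis*.
```

Tags that are not in the mapping contribute only their text, so the function degrades to plain text extraction for unformatted content.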

Recap and Summary

Let's recap what we've learned about extracting text with formatting using Python and BeautifulSoup:

  • Install BeautifulSoup and Requests for web scraping
  • Import BeautifulSoup and Requests
  • Fetch page HTML using Requests
  • Parse into a BeautifulSoup object
  • Find elements by tag, id, class, CSS selector
  • Extract text with .get_text() or .strings
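  • Put each text node on its own line by passing separator='\n', or replace <br> tags with newlines to keep only the page's own line breaks
  • Keep HTML tags with str(tag), .decode_contents(), or .encode_contents()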
  • Normalize whitespace for cleaner text
  • Use chaining and nesting to target elements
  • Loop through elements to extract mixed formatting

BeautifulSoup provides many tools to accurately scrape text from HTML while preserving original formatting. With the techniques covered here, you should be able to extract formatted text from any web page to build datasets for your projects!

John Rooney

I'm John Watson Rooney, a self-taught Python developer and content creator with a focus on web scraping, APIs, and automation. I love sharing my knowledge and expertise through my YouTube channel, which caters to all levels of developers, from beginners looking to get started in web scraping to experienced programmers seeking to advance their skills with modern techniques. I have worked in the e-commerce sector for many years, gaining extensive real-world experience in data handling, API integrations, and project management. I am passionate about teaching others and simplifying complex concepts to make them more accessible to a wider audience. In addition to my YouTube channel, I also maintain a personal website where I share my coding projects and other related content.
