How to Extract Text with Formatting Using BeautifulSoup

Extracting text from HTML while preserving its formatting can be very useful for web scraping and data mining projects. The Python BeautifulSoup module provides easy ways to target text based on HTML tags and attributes to extract formatted text.

In this comprehensive tutorial, you'll learn different methods to scrape text from a web page and retain original formatting like bold, italics, and newlines using BeautifulSoup.

We'll cover:

Installing and importing BeautifulSoup
Parsing HTML
Targeting elements by tag, class and CSS selectors
Retrieving text with .get_text() and .strings
Preserving formatting with .encode_contents()
Controlling whitespace and newlines
Best practices for nested formatting

And more. By the end, you'll have the skills to extract text with formatting from any web page for your web scraping and data mining projects.

Introduction to Web Scraping

Web scraping refers to the practice of automatically collecting information from the web. This could include:

Extracting financial data from a company's website into a spreadsheet
Gathering product descriptions and pricing from an online store
Compiling headlines and article text from news sites

Web scrapers allow you to harvest unstructured data from HTML pages and convert it into structured, usable data.

Some common uses:

Price monitoring and market research
Sentiment analysis on social media posts
Building datasets for machine learning
Monitoring websites for new content
Researching trends using search engine scrapers

Web scrapers use a variety of tools to programmatically retrieve and parse content from the web. In this tutorial, we'll focus specifically on using Python and BeautifulSoup.

Installing BeautifulSoup

Beautiful Soup is a popular Python library designed for navigating, searching, and modifying parse trees. It is extremely useful for web scraping tasks.

To follow along, you'll need Beautiful Soup 4 installed. The easiest way is via pip:

pip install beautifulsoup4

This installs the bs4 package together with its dependency soupsieve, which provides the CSS selector support.

You'll also need the requests module to fetch page content:

pip install requests

Now let's see how to use them to extract formatted text…

Parsing HTML with BeautifulSoup

The first step is to fetch the HTML content of a web page using requests, and parse it with BeautifulSoup.

This creates a parsed document tree that we can then navigate and search:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.content, 'html.parser')

This parses the page.content using the Python standard library html.parser.

BeautifulSoup supports other parsers like lxml for speed and html5lib for spec compliance.
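As a rough sketch of choosing a parser, the helper below tries a preferred parser first and falls back to the built-in one when it is not installed. The name make_soup and the sample markup are made up for illustration; FeatureNotFound is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("None of the requested parsers is available")


# Hypothetical sample markup, parsed without any network access.
soup = make_soup("<p>Hello <b>world</b></p>")
print(soup.b.get_text())
```

Since html.parser ships with Python, the fallback should always succeed. Different parsers can build slightly different trees for malformed markup, so it is worth sticking to one parser within a project.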

Now let's look at different methods to target elements and extract text while preserving formatting.

Targeting Elements by Tag

To extract text with formatting, we can target HTML tags like <b> for bold text:

bold = soup.find_all("b")

for item in bold:
  print(item.text)

This prints out all the text enclosed in <b> tags.

Some other common formatting tags:

<i> – Italic text
<em> – Emphasized text
<strong> – Important text
<br> – Line break

For example, to extract all italicized text:

italics = soup.find_all("i")

for item in italics:
  print(item.text)

find_all() searches all descendants of the element by default, so nested tags are matched as well. To restrict the search to direct children only, pass recursive=False.
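Here is a small sketch of how find_all() treats nesting, using made-up sample markup. Passing a list of tag names matches any of them, the default search covers all descendants, and recursive=False limits the search to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with formatting tags nested inside each other.
html = "<div><p>Plain <b>bold <i>both</i></b> and <em>emphasis</em></p></div>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any of those tags, at any depth, in document order.
formatting = soup.find_all(["b", "i", "em", "strong"])
pairs = [(tag.name, tag.get_text()) for tag in formatting]
print(pairs)

# recursive=False only looks at direct children of the element searched.
# <b> sits inside <p>, so it is not a direct child of <div>.
direct_bold = soup.div.find_all("b", recursive=False)
print(direct_bold)
```

Note that the text of the outer <b> includes the text of the nested <i>, so the same words can appear in more than one result.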

Targeting Elements by CSS Class

Along with HTML tags, we can use CSS selectors to target elements by class or id attributes.

For example, to extract text from all elements with class article-text:

text = soup.select(".article-text")

print(text[0].get_text())

Some other useful CSS selectors:

#some-id – Elements with id attribute some-id
[target] – Elements with attribute target
.some-class p – Paragraphs inside elements with class some-class

Chaining multiple selectors lets you precisely target nested formatting.
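The selectors listed above can be combined as in the following sketch. The markup, class name, and id are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment used only for illustration.
html = """
<div id="main" class="article-text">
  <p>First <b>bold</b> paragraph.</p>
  <p>Second paragraph with a <a href="/next" target="_blank">link</a>.</p>
</div>
<p>Outside the article: <b>ignored</b>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant selector: <b> tags inside paragraphs inside .article-text.
bold_in_article = [b.get_text() for b in soup.select(".article-text p b")]

# Attribute selector: elements that carry a target attribute.
links = [a.get_text() for a in soup.select("a[target]")]

# select_one() returns the first match (or None) instead of a list.
main = soup.select_one("#main")
paragraph_count = len(main.select("p"))

print(bold_in_article, links, paragraph_count)
```

Because select() always returns a list, indexing with [0] raises an IndexError when nothing matches; select_one() returning None can be easier to check for.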

Preserving Line Breaks with .get_text()

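By default, the .get_text() method strips out all HTML tags and concatenates the remaining text fragments exactly as they appear in the markup, without adding any separator:

html = "<p>This is <b>bold</b> <i>text</i></p>"
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())

# Output: This is bold text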
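To see which pieces .get_text() is joining, you can iterate over .strings and .stripped_strings. The sketch below reuses the same sample markup; the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

html = "<p>This is <b>bold</b> <i>text</i></p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text fragment in document order, whitespace included.
fragments = [str(s) for s in soup.strings]
print(fragments)

# .stripped_strings trims each fragment and skips whitespace-only ones.
clean_fragments = list(soup.stripped_strings)
print(clean_fragments)

# get_text() with no arguments is the plain concatenation of the fragments.
assert soup.get_text() == "".join(fragments)
```

The single-space fragment between </b> and <i> is what keeps "bold" and "text" apart in the default output.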

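To put each text fragment on its own line, pass strip=True and separator='\n'. The separator is inserted between every text fragment, not only at <br> tags, so inline tags such as <b> and <i> also start a new line. For the markup above:

print(soup.get_text(strip=True, separator="\n"))

# Output:
# This is
# bold
# text

This joins the fragments with newlines instead of running them together. To break only where the markup has a <br> tag, replace each <br> with a newline string (for example with .replace_with("\n")) before calling .get_text().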
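Since the separator applies between every text fragment, one way to break lines only at <br> tags is to swap each <br> for a newline string before extracting the text. A minimal sketch, assuming sample markup that contains one <br>:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single explicit line break.
html = "<p>This is <b>bold</b> text<br>Text with a line break</p>"
soup = BeautifulSoup(html, "html.parser")

# Replace every <br> element with a literal newline in the tree.
for br in soup.find_all("br"):
    br.replace_with("\n")

# The default get_text() now keeps inline text together
# and breaks only where a <br> used to be.
text = soup.get_text()
print(text)
```

This modifies the parsed tree in place, so parse the markup again (or work on a copy) if you also need the original <br> elements later.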

Keeping Formatting Tags with .encode_contents()

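To extract text while keeping the HTML formatting tags nested inside an element, use .encode_contents() instead of .get_text(). It returns the element's inner markup as bytes (.decode_contents() returns the same markup as a string). The element's own tag is not included; call str() on the element to get that as well:

bold = soup.find("b")

print(bold.encode_contents())
# Output: b'Bold text'

print(str(bold))
# Output: <b>Bold text</b>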

Looping through elements and converting each one to a string preserves the formatting tags:

formatted_text = []

for item in soup.select("b, i, em"):
  formatted_text.append(str(item))

print(formatted_text)

This stores each element's markup, including its own tag, as a string in a list.
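The following sketch compares the different ways of getting an element's content side by side, on made-up sample markup. It shows that .encode_contents() and .decode_contents() return the inner markup only, while str() includes the element's own tag.

```python
from bs4 import BeautifulSoup

html = "<p>This is <b>bold</b> and <i>italic</i> text.</p>"
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.p

plain = paragraph.get_text()               # tags removed
inner_str = paragraph.decode_contents()    # inner markup as str
inner_bytes = paragraph.encode_contents()  # inner markup as UTF-8 bytes
outer = str(paragraph)                     # markup including the <p> tag itself

print(plain)
print(inner_str)
print(inner_bytes)
print(outer)
```

For text processing, .decode_contents() is usually the more convenient of the two because it avoids a separate bytes-to-string decoding step.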

Controlling Whitespace

When printing or processing extracted text, you may want to normalize whitespace.

Add .strip() to remove leading/trailing whitespace:

text = soup.get_text().strip()

To replace all whitespace with single spaces:

" ".join(text.split())

Or to remove blank lines:

"\n".join([ll for ll in text.split('\n') if ll])

The .get_text() method also accepts strip=True, which strips leading and trailing whitespace (including tabs and newlines) from each text fragment and drops any fragment that is empty afterwards.
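The whitespace techniques above can be wrapped in small helpers. This is only a sketch: the function names collapse_whitespace and tidy_lines are made up, and the sample markup is invented to show messy indentation.

```python
from bs4 import BeautifulSoup


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())


def tidy_lines(text):
    """Collapse whitespace within each line and drop blank lines."""
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical markup with indentation, tabs, and blank lines.
html = "<div>\n  <p>  First   line </p>\n\n  <p>Second\tline</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")
raw = soup.get_text()

one_line = collapse_whitespace(raw)
per_line = tidy_lines(raw)

# strip=True trims the edges of each fragment but leaves inner whitespace alone.
stripped = soup.get_text(" ", strip=True)

print(repr(one_line))
print(repr(per_line))
print(repr(stripped))
```

Note that strip=True does not collapse whitespace inside a fragment, which is why a separate normalization step can still be useful.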

Best Practices for Nested Formatting

When scraping text with mixed nested formatting, the ideal method depends on your end goal:

To fully preserve formatting, use .encode_contents()
To get clean text, use .get_text()
To partially preserve newlines and lists, use options like separator

In general:

Target elements by tag when extracting specific formatting like <b>
Use .select() for class/id attributes to extract generic content
Handle nested tags by looping and encoding each child element
Work in stages when you want to strip most markup but keep some of it: extract first, then reapply the formatting you need

With the above tools, you can handle even complex nested HTML formatting.
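One possible way to handle nested formatting in stages is to walk the tree recursively and rewrite only the tags you care about. The sketch below is an illustration, not a built-in BeautifulSoup feature: the function to_marked_text and the Markdown-style markers are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumed mapping from formatting tags to Markdown-style markers.
MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}


def to_marked_text(node):
    """Return the node's text with bold/italic markers and <br> as newlines."""
    if isinstance(node, NavigableString):
        return str(node)
    if node.name == "br":
        return "\n"
    # Recurse into the children so nested formatting is handled at every level.
    inner = "".join(to_marked_text(child) for child in node.children)
    marker = MARKERS.get(node.name, "")
    return f"{marker}{inner}{marker}"


# Hypothetical markup with italic nested inside bold.
html = "<p>Mix of <b>bold and <i>bold italic</i></b> text<br>next line</p>"
soup = BeautifulSoup(html, "html.parser")
result = to_marked_text(soup.p)
print(result)
```

Tags that are not in the mapping contribute only their text, so the same function degrades to plain text extraction for markup it does not recognize.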

Recap and Summary

Let's recap what we've learned about extracting text with formatting using Python and BeautifulSoup:

Install BeautifulSoup and Requests for web scraping
Import BeautifulSoup and Requests
Fetch page HTML using Requests
Parse into a BeautifulSoup object
Find elements by tag, id, class, CSS selector
Extract text with .get_text() or .strings
Put text fragments on separate lines by passing separator='\n'
Keep HTML tags with .encode_contents()
Normalize whitespace for cleaner text
Use chaining and nesting to target elements
Loop through elements to extract mixed formatting

BeautifulSoup provides many tools to accurately scrape text from HTML while preserving original formatting. With the techniques covered here, you should be able to extract formatted text from any web page to build datasets for your projects!