Extracting text from HTML while preserving its formatting can be very useful for web scraping and data mining projects. The Python BeautifulSoup module provides easy ways to target text based on HTML tags and attributes to extract formatted text.
In this comprehensive tutorial, you'll learn different methods to scrape text from a web page and retain original formatting like bold, italics, and newlines using BeautifulSoup.
We'll cover:
- Installing and importing BeautifulSoup
- Parsing HTML
- Targeting elements by tag, class and CSS selectors
- Retrieving text with .get_text() and .strings
- Preserving formatting with .encode_contents()
- Controlling whitespace and newlines
- Best practices for nested formatting
And more. By the end, you'll have the skills to extract text with formatting from any web page for your web scraping and data mining projects.
Introduction to Web Scraping
Web scraping refers to the practice of automatically collecting information from the web. This could include:
- Extracting financial data from a company's website into a spreadsheet
- Gathering product descriptions and pricing from an online store
- Compiling headlines and article text from news sites
Web scrapers allow you to harvest unstructured data from HTML pages and convert it into structured, usable data.
Some common uses:
- Price monitoring and market research
- Sentiment analysis on social media posts
- Building datasets for machine learning
- Monitoring websites for new content
- Researching trends using search engine scrapers
Web scrapers use a variety of tools to programmatically retrieve and parse content from the web. In this tutorial, we'll focus specifically on using Python and BeautifulSoup.
Installing BeautifulSoup
Beautiful Soup is a popular Python library designed for navigating, searching, and modifying parse trees. It is extremely useful for web scraping tasks.
To follow along, you'll need Beautiful Soup 4 installed. The easiest way is via pip:
pip install beautifulsoup4
This installs the bs4 package along with soupsieve, the dependency that provides its CSS selector support.
You'll also need the requests module to fetch page content:
pip install requests
Now let's see how to use them to extract formatted text…
Parsing HTML with BeautifulSoup
The first step is to fetch the HTML content of a web page using requests, and parse it with BeautifulSoup.
This creates a parsed document tree that we can then navigate and search:
import requests
from bs4 import BeautifulSoup

page = requests.get("http://example.com")
soup = BeautifulSoup(page.content, 'html.parser')
This parses page.content using html.parser from the Python standard library.
BeautifulSoup also supports other parsers, such as lxml for speed and html5lib for spec compliance.
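For experimenting without a network request, the same constructor also accepts an HTML string directly. The following is a minimal sketch; the markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<html><body><h1>Title</h1><p>Hello, <b>world</b>!</p></body></html>"

# "html.parser" ships with Python; "lxml" and "html5lib" need separate installs.
soup = BeautifulSoup(html, "html.parser")

heading = soup.h1.get_text()
paragraph = soup.p.get_text()
print(heading)    # Title
print(paragraph)  # Hello, world!
```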
Now let's look at different methods to target elements and extract text while preserving formatting.
Targeting Elements by Tag
To extract text with formatting, we can target HTML tags like <b> for bold text:
bold = soup.find_all("b")
for item in bold:
    print(item.text)
This prints out all the text enclosed in <b> tags.
Some other common formatting tags:
- <i>: Italic text
- <em>: Emphasized text
- <strong>: Important text
- <br>: Line break
For example, to extract all italicized text:
italics = soup.find_all("i")
for item in italics:
    print(item.text)
By default, find_all() searches every descendant of the element it is called on, so nested tags are matched too. To restrict the search to direct children only, pass recursive=False.
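A small sketch of how the recursive argument behaves, using made-up markup: find_all() looks through all descendants unless recursive=False is passed, in which case only direct children are considered.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <b> directly under the <div>, one nested in a <p>.
html = "<div><b>top</b><p>text <b>nested</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# Default behaviour: search all descendants of the <div>.
all_bold = [b.get_text() for b in div.find_all("b")]

# recursive=False: only look at the <div>'s direct children.
direct_bold = [b.get_text() for b in div.find_all("b", recursive=False)]

print(all_bold)     # ['top', 'nested']
print(direct_bold)  # ['top']
```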
Targeting Elements by CSS Class
Along with HTML tags, we can use CSS selectors to target elements by class or id attributes.
For example, to extract text from all elements with class article-text:
text = soup.select(".article-text")
print(text[0].get_text())
Some other useful CSS selectors:
- #some-id: Elements with id attribute some-id
- [target]: Elements with attribute target
- .some-class p: Paragraphs inside elements with class some-class
Chaining multiple selectors lets you precisely target nested formatting.
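The selectors above can be tried on a small document. This is an illustrative sketch: the markup, the id main, and the variable names are assumptions, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = """
<div id="main" class="article-text">
  <p>First <a href="/a" target="_blank">link</a></p>
  <p>Second</p>
</div>
<p>Outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select_one("#main")             # element with id="main"
with_target = soup.select("[target]")        # elements that have a target attribute
paragraphs = soup.select(".article-text p")  # <p> tags inside .article-text

print(by_id["id"])                          # main
print([a.get_text() for a in with_target])  # ['link']
print([p.get_text() for p in paragraphs])   # ['First link', 'Second']
```

Note that the paragraph outside the div is not matched by the descendant selector.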
Preserving Line Breaks with .get_text()
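By default, the .get_text() method strips out all HTML tags and concatenates the remaining text fragments without adding any separator, so the spacing in the result is whatever spacing the source HTML contains:
<p>This is <b>bold</b> <i>text</i></p>
print(soup.get_text())
# Output: This is bold text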
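Passing separator='\n' inserts a newline between every text fragment, not only where a <br> tag appears, and strip=True trims each fragment and drops the empty ones:
print(soup.get_text(strip=True, separator="\n"))
# Output:
# This is
# bold
# text
Every tag boundary becomes a line break, so inline formatting such as <b> and <i> also ends up on its own line. To turn only <br> tags into newlines, replace each <br> with a "\n" string before calling .get_text().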
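One way to keep only the real line breaks is to swap each <br> tag for a newline character before extracting text. The sketch below assumes a made-up snippet containing a single <br>; replace_with() is the Beautiful Soup method for substituting a node in the tree.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one explicit line break.
html = "<p>This is <b>bold</b> <i>text</i><br>Text with a line break</p>"
soup = BeautifulSoup(html, "html.parser")

# Swap each <br> for a newline character, then extract text as usual.
for br in soup.find_all("br"):
    br.replace_with("\n")

text = soup.get_text()
print(text)
# This is bold text
# Text with a line break
```

Because no separator is passed, the inline <b> and <i> text stays on the same line as its surrounding words.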
Keeping Formatting Tags with .encode_contents()
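To extract text while keeping HTML formatting tags, use .encode_contents() instead of .get_text(). It returns the inner HTML of an element as bytes: tags nested inside the element are kept, but the element's own tag is not. To keep the element's own tag as well, convert the element with str():
bold = soup.find("b")
print(bold.encode_contents())  # Output: b'Bold text'
print(str(bold))               # Output: <b>Bold text</b>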
Looping through elements and converting each one to a string preserves the formatting tags:
formatted_text = []
for item in soup.select("b, i, em"):
    formatted_text.append(str(item))
print(formatted_text)
This stores each element's HTML, tags included, in a list. A match nested inside another match appears twice: once inside its parent and once on its own.
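To make the difference between inner and outer HTML concrete, here is a short sketch on made-up markup. It compares encode_contents(), its string counterpart decode_contents(), and str() on the same element.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with nested formatting.
html = "<p>Some <b>bold <i>nested</i></b> text</p>"
soup = BeautifulSoup(html, "html.parser")
bold = soup.find("b")

inner_bytes = bold.encode_contents()  # inner HTML as bytes (UTF-8 by default)
inner_str = bold.decode_contents()    # inner HTML as str
outer_str = str(bold)                 # the <b> tag itself plus its contents

print(inner_bytes)  # b'bold <i>nested</i>'
print(inner_str)    # bold <i>nested</i>
print(outer_str)    # <b>bold <i>nested</i></b>
```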
Controlling Whitespace
When printing or processing extracted text, you may want to normalize whitespace.
Add .strip() to remove leading/trailing whitespace:
text = soup.get_text().strip()
To replace all whitespace with single spaces:
" ".join(text.split())
Or to remove blank lines, including lines that contain only spaces or tabs:
"\n".join(line for line in text.split('\n') if line.strip())
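BeautifulSoup's get_text() also accepts a strip parameter. With strip=True, whitespace is removed from the start and end of each text fragment, and fragments that are empty after stripping are dropped. Whitespace inside a fragment is left unchanged.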
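The whitespace techniques above can be combined. The sketch below uses made-up markup to contrast get_text(strip=True), which only trims the ends of each fragment, with a manual pass that also collapses runs of whitespace inside each line.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with untidy indentation, double spaces and a tab.
html = "<div>\n  <p>  First   line </p>\n\n  <p>Second\tline</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each fragment and drops empty ones; inner whitespace stays.
stripped = soup.get_text(separator=" ", strip=True)

# Manual normalisation: collapse whitespace within lines, then drop blank lines.
raw = soup.get_text()
lines = (" ".join(line.split()) for line in raw.splitlines())
clean = "\n".join(line for line in lines if line)

print(repr(stripped))  # 'First   line Second\tline'
print(clean)
# First line
# Second line
```

The .stripped_strings generator yields the same trimmed fragments one at a time if a list is more convenient than a joined string.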
Best Practices for Nested Formatting
When scraping text with mixed nested formatting, the ideal method depends on your end goal:
- To fully preserve formatting, use .encode_contents()
- To get clean text, use .get_text()
- To partially preserve newlines and lists, use options like separator
In general:
- Target elements by tag when extracting specific formatting like <b>
- Use .select() for class/id attributes to extract generic content
- Handle nested tags by looping and encoding each child element
- Parse text in stages if completely stripping then partially reformatting
With the above tools, you can handle even complex nested HTML formatting.
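As one possible middle ground between raw HTML and plain text, nested inline formatting can be rendered as lightweight markers. The sketch below is an assumption-laden illustration rather than a Beautiful Soup feature: the helper to_marked_text and the Markdown-style marker mapping are hypothetical names chosen for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumed mapping from inline tags to Markdown-style markers (illustrative only).
MARKERS = {"b": "**", "strong": "**", "i": "*", "em": "*"}


def to_marked_text(node):
    """Recursively render a node as text, wrapping bold/italic runs in markers."""
    if isinstance(node, NavigableString):
        return str(node)
    if node.name == "br":
        return "\n"
    # Render children first so nested formatting ends up inside the parent's markers.
    inner = "".join(to_marked_text(child) for child in node.children)
    marker = MARKERS.get(node.name, "")
    return f"{marker}{inner}{marker}"


# Hypothetical markup with nested bold/italic and a line break.
html = "<p>Plain <b>bold <i>both</i></b> and <em>emphasis</em><br>next line</p>"
soup = BeautifulSoup(html, "html.parser")

result = to_marked_text(soup.p)
print(result)
# Plain **bold *both*** and *emphasis*
# next line
```

Tags not listed in the mapping contribute their text unchanged, so the function degrades to plain text extraction for unknown elements.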
Recap and Summary
Let's recap what we've learned about extracting text with formatting using Python and BeautifulSoup:
- Install BeautifulSoup and Requests for web scraping
- Import BeautifulSoup and Requests
- Fetch page HTML using Requests
- Parse into a BeautifulSoup object
- Find elements by tag, id, class, CSS selector
- Extract text with .get_text() or .strings
- Put text fragments on separate lines by passing separator='\n'
- Keep HTML tags with .encode_contents() (inner HTML) or str() (the whole element)
- Normalize whitespace for cleaner text
- Use chaining and nesting to target elements
- Loop through elements to extract mixed formatting
BeautifulSoup provides many tools to accurately scrape text from HTML while preserving original formatting. With the techniques covered here, you should be able to extract formatted text from any web page to build datasets for your projects!