BeautifulSoup is a very popular Python library used for parsing and extracting data from HTML and XML documents. It provides a simple way to navigate, search, and modify the parse tree of HTML documents that you encounter while web scraping.
One of the most common use cases of BeautifulSoup is to find and extract all the links from a web page. Links are defined using anchor (<a>) tags in HTML, so we just need to find all <a> tags, get the href attribute, and we have our list of links!
In this guide, I'll explain how to:
- Use BeautifulSoup to find all anchor tags
- Extract the href attribute to get the links
- Handle relative vs absolute links
- Make relative links absolute
- Filter out external links
By the end, you'll have a good understanding of how to find and extract all links from an HTML document using the power of BeautifulSoup and Python.
Brief Intro to BeautifulSoup
Before we dive into the code, let's briefly go over BeautifulSoup itself. BeautifulSoup is a Python library that makes it easy to parse and navigate HTML and XML documents. It creates a parse tree from the document that allows you to treat it as nested Python objects which you can interact with.
You pass the HTML document to the BeautifulSoup constructor to create a BeautifulSoup object. You can then treat this object like a tree and use methods like find(), find_all(), and select() to navigate through it and extract data.
For example:
```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<h1>Hello World</h1>
<p>This is a paragraph</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

soup.find('h1').text  # 'Hello World'
soup.find('p').text   # 'This is a paragraph'
```
This quick example shows how we can use BeautifulSoup to parse an HTML document stored in a string variable, and then interact with it like a Python object. With this basic understanding of BeautifulSoup, let's now see how we can use it to find all the links on a web page.
Finding All Anchor Tags with find_all()
The first step is to find all the <a> anchor tags in the document. This will get us all the links. BeautifulSoup provides a find_all() method for this exact purpose. It allows us to search through the parse tree and find all elements that match our criteria.
To find all <a> tags, we simply pass 'a' as the parameter:
```python
links = soup.find_all('a')
```
This will return a list of all the <a> tags present on the page. Each tag is a BeautifulSoup Tag object that we can further interact with. For example:
```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<h1>My Website</h1>
<p>Welcome to my site!</p>
<a href="/about">About</a>
<a href="/contact">Contact</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
print(links)
# [<a href="/about">About</a>, <a href="/contact">Contact</a>]
```
This finds both of the <a> tags present in our example HTML. With the anchor tags found, next up is extracting the link URLs.
Extracting the HREF Attribute to Get Link URLs
Each <a> tag uses the href attribute to specify the link URL. We can access tag attributes through the BeautifulSoup interface. To extract the link URLs, we loop through each tag and use get() to access the href attribute:
```python
for link in links:
    print(link.get('href'))

# /about
# /contact
```
This prints out the href URL for each link tag. Putting it together:
```python
from bs4 import BeautifulSoup

html = "..."  # HTML content of the page

soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
for link in links:
    print(link.get('href'))
```
And we have extracted all the link URLs from the page! One thing to note: the URLs come out exactly as they appear in the HTML, so most of the links here are relative to the current page, while others may be absolute. Let's discuss this distinction next.
Understanding Relative vs Absolute Links
There are two types of links you will encounter on web pages:
- Relative links – These are links relative to the current page, and start with /, ./ or no slash at all. For example:

```html
<a href="/about">About</a>
<a href="./contact">Contact</a>
```
- Absolute links – These are complete URLs that contain the protocol and domain. For example:

```html
<a href="https://www.example.com/blog">Blog</a>
<a href="http://wikipedia.org">Wikipedia</a>
```
When scraping a website, we usually want to convert relative links to absolute so we have the full URL paths to follow. BeautifulSoup extracts links exactly as they are defined in the HTML, so we will get a mix of relative and absolute URLs. To handle this, we can use Python's urllib.parse module to convert relative URLs to absolute ones.
Converting Relative Links to Absolute
We'll use urllib.parse.urljoin to convert relative URLs to absolute ones based on a provided base URL. For example:
```python
from urllib.parse import urljoin

base_url = 'https://example.com'
relative = '/about'

absolute = urljoin(base_url, relative)
print(absolute)
# https://example.com/about
```
To make all links absolute, we set the site's base URL, then apply urljoin to each relative link:
```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = "..."  # HTML content of the page
base_url = "https://example.com"

soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.find_all('a'):
    url = link.get('href')
    if url is None:
        continue  # skip anchors without an href
    if url.startswith('/'):
        url = urljoin(base_url, url)
    links.append(url)

print(links)
```
Now links will contain the absolute version of all relative URLs. This ensures we have complete URLs rather than partial paths.
When to Choose Relative vs Absolute Links?
For your own sites, relative links make pages more portable and flexible if you reorganize the site structure. They also work well during local development and testing.
For linking externally, absolute URLs are better since they contain the full destination. Relative paths won't work across different domains. Get in the habit of inspecting for both when scraping sites. Converting relatives to absolutes will ensure you have the complete URL path for further processing and extraction.
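As a quick illustration of why converting is safe, urljoin leaves absolute URLs untouched, so every extracted href can be run through it. A small sketch (the base URL and paths are just examples):

```python
from urllib.parse import urljoin

base_url = "https://example.com/blog/"

print(urljoin(base_url, "/about"))                      # https://example.com/about
print(urljoin(base_url, "./contact"))                   # https://example.com/blog/contact
print(urljoin(base_url, "https://wikipedia.org/wiki"))  # https://wikipedia.org/wiki
```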
Up next, we'll see how to filter out unwanted external links when scraping.
Filtering Out External Links
Sometimes you only want to extract links pointing to a certain domain, and ignore external links. For example, when scraping example.com, you may want to ignore links to other sites like Facebook or Twitter.
We can filter these out by checking the domain of each absolute URL and excluding the ones that don't match. The tldextract library makes this easy in Python: it can extract and compare registered domains for us.
First, install tldextract:

```
pip install tldextract
```

Then we can use it like this:

```python
import tldextract

allowed_domain = 'example.com'

for link in links:
    ext = tldextract.extract(link)
    domain = ext.registered_domain
    if domain != allowed_domain:
        continue  # external link, ignore it
    print(link)  # internal link, process it
```

This prints out only links to example.com, ignoring any external ones. The same technique works for allowing multiple domains:

```python
allowed_domains = ['wikipedia.org', 'wikimedia.org']

for link in links:
    ext = tldextract.extract(link)
    if ext.registered_domain not in allowed_domains:
        continue  # external link, ignore it
    print(link)  # internal link
```

Extend the list with domains like wiktionary.org or wikidata.org and links to other Wikimedia projects will also be included. Filtering by domain helps focus your scraper on the target site and avoids wasting resources downloading external content.
In addition to tldextract, you can use Python's built-in urllib.parse module to handle URLs and domains. With the basics covered, let's now dive into some more advanced link extraction and processing tactics.
Putting It All Together
To recap, here is the full script to extract all internal links from a page:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import tldextract

html = "..."  # page HTML
base_url = "https://example.com"
allowed_domain = "example.com"

soup = BeautifulSoup(html, 'html.parser')

links = []
for a_tag in soup.find_all('a'):
    href = a_tag.get('href')
    if not href:
        continue  # skip anchors without an href
    if href.startswith('/'):
        href = urljoin(base_url, href)
    ext = tldextract.extract(href)
    if ext.registered_domain != allowed_domain:
        continue  # external link
    links.append(href)

print(links)
```
This will:
- Use BeautifulSoup to parse HTML and find all anchor tags
- Extract the href attribute from each anchor tag
- Convert relative URLs to absolute based on the base_url
- Filter out external links and keep only example.com links
- Store the final list of internal links
And that's it! With these steps, you can easily find and extract all the internal links from an HTML document in Python using BeautifulSoup.
Storing Scraped Links for Further Processing
For scrapers that gather lots of links across an entire site, you'll need to store them somewhere for further processing. Here are some good storage options:
- MySQL – Relational database that provides ACID transactions and ability to run rich queries for analysis. Can become slow for very high throughput apps.
- Postgres – A more performant relational database, handles concurrency very well. Supports advanced features like JSON columns.
- Redis – Super fast in-memory data store, ideal for queues and caches. Provides atomic operations. Downside is that data can be lost on restart unless persistence is configured.
- SQLite – Simple file-based SQL database, makes it easy to store data locally when scraping. Drawback of no client-server architecture.
- Apache Kafka – Distributed streaming platform, lets you pipe links between microservices and process as stream. Adds complexity.
- Sitemaps – Can output sitemap.xml files representing your scraped links for SEO.
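Of these, sitemaps need the least infrastructure. Here is a minimal sketch that writes a list of scraped links to a sitemap.xml file using only the standard library (the links list and output filename are placeholders):

```python
import xml.etree.ElementTree as ET

links = ["https://example.com/about", "https://example.com/contact"]  # scraped links

# build the <urlset> root with one <url><loc> entry per link
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for link in links:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = link

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```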
For our link scraping use case, Postgres and Redis make a great combination:
- Use Redis to hold the live link queue and processing set. Provides fast operations as you scrape.
- As links are processed, save them to Postgres for more structured queries and analysis.
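As a rough sketch of the Redis side of that split, using the redis-py client (the key names and the final print are placeholders for real processing):

```python
import redis

# decode_responses=True returns str instead of bytes
r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

# push newly discovered links onto the live queue
r.lpush('link_queue', 'https://example.com/about', 'https://example.com/contact')

# pop a link for processing; the set tracks URLs we have already handled
url = r.rpop('link_queue')
if url and r.sadd('seen_links', url):
    print(f"processing {url}")  # stand-in for real processing / the Postgres insert below
```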
Here's an example schema for Postgres:
```sql
CREATE TABLE links (
    id SERIAL PRIMARY KEY,
    url VARCHAR(255) NOT NULL,
    page_title TEXT,
    scraped_at TIMESTAMP NOT NULL
);
```
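And a minimal sketch of saving a processed link into this table with psycopg2 (the connection string values are placeholders):

```python
import datetime

import psycopg2

conn = psycopg2.connect("dbname=scraper user=scraper password=secret host=localhost")

# the connection context manager wraps the insert in a transaction
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO links (url, page_title, scraped_at) VALUES (%s, %s, %s)",
        ("https://example.com/about", "About Us", datetime.datetime.utcnow()),
    )

conn.close()
```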
This gives you a historical record of all links scraped from a site that can be queried very efficiently, even with millions of rows. Choosing the right storage provides flexibility on how links are accessed as you build out the rest of your scraper.
Testing Scraped Links
It's good practice to test scraped links for validity before processing them further. This avoids wasting resources on broken or inaccessible URLs. Some useful techniques:
- HTTP HEAD requests – Use HEAD instead of GET to efficiently check link status without downloading the body:
```python
import requests

response = requests.head(url)
if response.status_code == 200:
    ...  # link is valid, scrape it
```
- Check for redirects – Resolve any 3xx redirects to find the canonical URL:
```python
import requests

response = requests.get(url, allow_redirects=False)
if 300 <= response.status_code < 400:
    url = response.headers['Location']  # the redirect target
```
- Assert correct domain – Double check the domain matches what you expect:
```python
from urllib.parse import urlparse

domain = urlparse(url).netloc
assert domain == "example.com"
```
- Set timeouts – Avoid hanging on slow links by setting short timeouts, then retrying later:
```python
requests.get(url, timeout=5)
```
Validating links upfront saves wasted effort downloading invalid or inaccessible content. For large volumes, consider a distributed link checking system that can test thousands of URLs concurrently.
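As a starting point before reaching for a fully distributed system, a thread pool plus HEAD requests goes a long way. A minimal sketch, assuming links holds the URLs gathered earlier:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def check(url):
    # HEAD request keeps the check cheap; network errors count as a broken link
    try:
        return url, requests.head(url, timeout=5, allow_redirects=True).status_code
    except requests.RequestException:
        return url, None

links = ["https://example.com/about", "https://example.com/contact"]

with ThreadPoolExecutor(max_workers=20) as pool:
    for url, status in pool.map(check, links):
        if status != 200:
            print(f"problem with {url}: {status}")
```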
Tracking Link Changes Over Time
The web is constantly changing, so links you scrape today may go stale over time. To monitor this, track link metrics over your scraping history:
- Count of dead 404 links
- Redirected links and where they point
- New links added
- Links removed
- Links that changed section or URL path
You can run periodic jobs to re-scrape and compare the results against your link store; the changes can then be analyzed (a simple comparison is sketched after the list below). This lets you detect shifts like re-architected websites, removed content, changed navigation, and more. Common link changes to expect:
- Domain migrations – Site moved to new domain, links now broken
- Content removed – Old blog posts/products deprecated, links dead
- Site redesign – Pages and content reorganized, links moved
- New content – Additional blog/help/category sections added
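The comparison itself can be as simple as set arithmetic between two crawls. A minimal sketch, assuming each crawl's links are stored as a set of URL strings (the URLs here are placeholders):

```python
# links from the previous crawl and the latest crawl
previous_links = {"https://example.com/about", "https://example.com/old-post"}
current_links = {"https://example.com/about", "https://example.com/new-post"}

added = current_links - previous_links
removed = previous_links - current_links

print(f"{len(added)} new link(s): {sorted(added)}")
print(f"{len(removed)} removed link(s): {sorted(removed)}")
```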
Alerting on big deltas and changes gives you signal to update any scrapers or downstream consumers relying on those links. For content-heavy sites like news and ecommerce that update daily, automating link change monitoring is highly recommended.
Additional Tips for Working with Links
Here are some additional tips when extracting links with BeautifulSoup:
- Handle Broken Links: Some <a> tags may not have a valid href attribute. Use defensive coding to avoid errors:

```python
href = a_tag.get('href')
if href:
    ...  # extract the link
```
- Extract Link Text: To get the text inside the <a> tag, use a_tag.text:

```python
link_text = a_tag.text
```
- Remove Duplicate Links: Use a Python set to store unique links only:

```python
links = set()  # instead of a list
```
- Limit to Certain Sections: You can call find_all() on a specific element to narrow your search area. For example, to only search within <nav>:

```python
nav_links = soup.nav.find_all('a')
```
- Find by Class/ID: Use a CSS selector like a.some-class or a#some-id with select() to limit matches.
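For example, a quick sketch with select() and select_one() (the class and id names are placeholders):

```python
from bs4 import BeautifulSoup

html = '<a class="some-class" href="/a">A</a> <a id="some-id" href="/b">B</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('a.some-class'))   # all <a> tags with class="some-class"
print(soup.select_one('a#some-id'))  # the first <a> tag with id="some-id"
```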
There are many possibilities! Refer to the BeautifulSoup documentation for advanced queries.
Summary
Link extraction serves as the foundation for most robust web scraping projects. BeautifulSoup makes it very easy to parse, navigate, and extract data from HTML and XML in Python. I hope this guide gave you a good understanding of how to use it to extract all the links from a page.