BeautifulSoup is a very popular Python library used for parsing and extracting data from HTML and XML documents. It provides a simple way to navigate, search, and modify the parse tree of HTML documents that you encounter while web scraping.
One of the most common use cases of BeautifulSoup is to find and extract all the links from a web page. Links are defined using anchor (<a>) tags in HTML, so we just need to find all <a> tags, get the href attribute, and we have our list of links!
In this guide, I'll explain how to:
- Use BeautifulSoup to find all anchor tags
- Extract the href attribute to get the links
- Handle relative vs absolute links
- Make relative links absolute
- Filter out external links
By the end, you'll have a good understanding of how to find and extract all links from an HTML document using the power of BeautifulSoup and Python.
Brief Intro to BeautifulSoup
Before we dive into the code, let's briefly go over BeautifulSoup itself. BeautifulSoup is a Python library that makes it easy to parse and navigate HTML and XML documents. It creates a parse tree from the document that allows you to treat it as nested Python objects which you can interact with.
You pass the HTML document to the BeautifulSoup constructor to create a BeautifulSoup object. You can then treat this object like a tree and use methods like find(), find_all(), and select() to navigate through it and extract data.
For example:
```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<h1>Hello World</h1>
<p>This is a paragraph</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

soup.find('h1').text  # 'Hello World'
soup.find('p').text   # 'This is a paragraph'
```
This quick example shows how we can use BeautifulSoup to parse an HTML document stored in a string variable, and then interact with it like a Python object. With this basic understanding of BeautifulSoup, let's now see how we can use it to find all the links on a web page.
Finding All Anchor Tags with find_all()
The first step is to find all the <a> anchor tags in the document. This will get us all the links. BeautifulSoup provides a find_all() method for this exact purpose. It allows us to search through the parse tree and find all elements that match our criteria.
To find all <a> tags, we simply pass 'a' as the parameter:
```python
links = soup.find_all('a')
```
This will return a list of all the <a> tags present on the page. Each tag is a BeautifulSoup Tag object that we can further interact with. For example:
```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<h1>My Website</h1>
<p>Welcome to my site!</p>
<a href="/about">About</a>
<a href="/contact">Contact</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
print(links)
# [<a href="/about">About</a>, <a href="/contact">Contact</a>]
```
This finds both of the <a> tags present in our example HTML. With the anchor tags found, next up is extracting the link URLs.
Extracting the HREF Attribute to Get Link URLs
Each <a> tag uses the href attribute to specify the link URL. We can access tag attributes through the BeautifulSoup interface. To extract the link URLs, we loop through each tag and use get() to access the href attribute:
```python
for link in links:
    print(link.get('href'))

# /about
# /contact
```
This prints out the href URL for each link tag. Putting it together:
```python
from bs4 import BeautifulSoup

html = "..."  # HTML content of the page

soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a')
for link in links:
    print(link.get('href'))
```
And we have extracted all the link URLs from the page! One thing to note: the URLs come out exactly as they appear in the HTML, so most of the links here are relative to the current page, while others may be absolute. Let's discuss this distinction next.
Understanding Relative vs Absolute Links
There are two types of links you will encounter on web pages:
- Relative links – These are links relative to the current page, and start with /, ./ or no slash at all. For example:

```html
<a href="/about">About</a>
<a href="./contact">Contact</a>
```
- Absolute links – These are complete URLs that contain the protocol and domain. For example:

```html
<a href="https://www.example.com/blog">Blog</a>
<a href="http://wikipedia.org">Wikipedia</a>
```
When scraping a website, we usually want to convert relative links to absolute so we have the full URL paths to follow. BeautifulSoup extracts links exactly as they are defined in the HTML, so we will get a mix of relative and absolute URLs. To handle this, we can use Python's urllib.parse module to convert relative URLs to absolute ones.
Converting Relative Links to Absolute
We'll use urllib.parse.urljoin to convert relative URLs to absolute ones based on a provided base URL. For example:
```python
from urllib.parse import urljoin

base_url = 'https://example.com'
relative = '/about'

absolute = urljoin(base_url, relative)
print(absolute)
# https://example.com/about
```
To make all links absolute, we set the site's base URL, then apply urljoin to each relative link:
```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = "..."  # HTML content of the page
base_url = "https://example.com"

soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.find_all('a'):
    url = link.get('href')
    if url is None:
        continue  # skip anchors without an href
    if url.startswith('/'):
        url = urljoin(base_url, url)
    links.append(url)

print(links)
```
Now links will contain the absolute version of all relative URLs. This ensures we have complete URLs rather than partial paths.
When to Choose Relative vs Absolute Links?
For your own sites, relative links make pages more portable and flexible if you reorganize the site structure. They also work well during local development and testing.
For linking externally, absolute URLs are better since they contain the full destination. Relative paths won't work across different domains. Get in the habit of inspecting for both when scraping sites. Converting relatives to absolutes will ensure you have the complete URL path for further processing and extraction.
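As a quick illustration of why converting is safe, urljoin leaves absolute URLs untouched, so every extracted href can be run through it. A small sketch (the base URL and paths are just examples):

```python
from urllib.parse import urljoin

base_url = "https://example.com/blog/"

print(urljoin(base_url, "/about"))                      # https://example.com/about
print(urljoin(base_url, "./contact"))                   # https://example.com/blog/contact
print(urljoin(base_url, "https://wikipedia.org/wiki"))  # https://wikipedia.org/wiki
```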
Up next, we'll see how to filter out unwanted external links when scraping.
Filtering Out External Links
Sometimes you only want to extract links pointing to a certain domain, and ignore external links. For example, when scraping example.com, you may want to ignore links to other sites like Facebook or Twitter.
We can filter these out by checking the domain of each absolute URL and excluding the ones that don't match. The tldextract library makes this easy in Python: it can extract and compare registered domains for us.
First, install tldextract:

```
pip install tldextract
```

Then we can use it like this:

```python
import tldextract

allowed_domain = 'example.com'

for link in links:
    ext = tldextract.extract(link)
    domain = ext.registered_domain
    if domain != allowed_domain:
        continue  # external link, ignore it
    print(link)  # internal link, process it
```

This prints out only links to example.com, ignoring any external ones. The same technique works for allowing multiple domains:

```python
allowed_domains = ['wikipedia.org', 'wikimedia.org']

for link in links:
    ext = tldextract.extract(link)
    if ext.registered_domain not in allowed_domains:
        continue  # external link, ignore it
    print(link)  # internal link
```

Extend the list with domains like wiktionary.org or wikidata.org and links to other Wikimedia projects will also be included. Filtering by domain helps focus your scraper on the target site and avoids wasting resources downloading external content.
In addition to tldextract, you can use Python's built-in urllib.parse module to handle URLs and domains. With the basics covered, let's now dive into some more advanced link extraction and processing tactics.
Putting It All Together
To recap, here is the full script to extract all internal links from a page:
```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import tldextract

html = "..."  # page HTML
base_url = "https://example.com"
allowed_domain = "example.com"

soup = BeautifulSoup(html, 'html.parser')

links = []
for a_tag in soup.find_all('a'):
    href = a_tag.get('href')
    if not href:
        continue  # skip anchors without an href
    if href.startswith('/'):
        href = urljoin(base_url, href)
    ext = tldextract.extract(href)
    if ext.registered_domain != allowed_domain:
        continue  # external link
    links.append(href)

print(links)
```
This will:
- Use BeautifulSoup to parse HTML and find all anchor tags
- Extract the href attribute from each anchor tag
- Convert relative URLs to absolute based on the base_url
- Filter out external links and keep only example.com links
- Store the final list of internal links
And that's it! With these steps, you can easily find and extract all the internal links from an HTML document in Python using BeautifulSoup.
Storing Scraped Links for Further Processing
For scrapers that gather lots of links across an entire site, you'll need to store them somewhere for further processing. Here are some good storage options:
- MySQL – Relational database that provides ACID transactions and ability to run rich queries for analysis. Can become slow for very high throughput apps.
- Postgres – A more performant relational database, handles concurrency very well. Supports advanced features like JSON columns.
- Redis – Super fast in-memory data store, ideal for queues and caches. Provides atomic operations. Downside is that data can be lost on restart unless persistence is configured.
- SQLite – Simple file-based SQL database, makes it easy to store data locally when scraping. Drawback of no client-server architecture.
- Apache Kafka – Distributed streaming platform, lets you pipe links between microservices and process as stream. Adds complexity.
- Sitemaps – Can output sitemap.xml files representing your scraped links for SEO.
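Of these, sitemaps need the least infrastructure. Here is a minimal sketch that writes a list of scraped links to a sitemap.xml file using only the standard library (the links list and output filename are placeholders):

```python
import xml.etree.ElementTree as ET

links = ["https://example.com/about", "https://example.com/contact"]  # scraped links

# build the <urlset> root with one <url><loc> entry per link
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for link in links:
    url_el = ET.SubElement(urlset, "url")
    ET.SubElement(url_el, "loc").text = link

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```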
For our link scraping use case, Postgres and Redis make a great combination:
- Use Redis to hold the live link queue and processing set. Provides fast operations as you scrape.
- As links are processed, save them to Postgres for more structured queries and analysis.
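As a rough sketch of the Redis side of that split, using the redis-py client (the key names and the final print are placeholders for real processing):

```python
import redis

# decode_responses=True returns str instead of bytes
r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

# push newly discovered links onto the live queue
r.lpush('link_queue', 'https://example.com/about', 'https://example.com/contact')

# pop a link for processing; the set tracks URLs we have already handled
url = r.rpop('link_queue')
if url and r.sadd('seen_links', url):
    print(f"processing {url}")  # stand-in for real processing / the Postgres insert below
```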
Here's an example schema for Postgres:
```sql
CREATE TABLE links (
    id SERIAL PRIMARY KEY,
    url VARCHAR(255) NOT NULL,
    page_title TEXT,
    scraped_at TIMESTAMP NOT NULL
);
```
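And a minimal sketch of saving a processed link into this table with psycopg2 (the connection string values are placeholders):

```python
import datetime

import psycopg2

conn = psycopg2.connect("dbname=scraper user=scraper password=secret host=localhost")

# the connection context manager wraps the insert in a transaction
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO links (url, page_title, scraped_at) VALUES (%s, %s, %s)",
        ("https://example.com/about", "About Us", datetime.datetime.utcnow()),
    )

conn.close()
```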
This gives you a historical record of all links scraped from a site that can be queried very efficiently, even with millions of rows. Choosing the right storage provides flexibility on how links are accessed as you build out the rest of your scraper.
Testing Scraped Links
It's good practice to test scraped links for validity before processing them further. This avoids wasting resources on broken or inaccessible URLs. Some useful techniques:
- HTTP HEAD requests – Use HEAD instead of GET to efficiently check link status without downloading the body:
```python
import requests

response = requests.head(url)
if response.status_code == 200:
    ...  # link is valid, scrape it
```
- Check for redirects – Resolve any 3xx redirects to find the canonical URL:
```python
import requests

response = requests.get(url, allow_redirects=False)
if 300 <= response.status_code < 400:
    url = response.headers['Location']  # the redirect target
```
- Assert correct domain – Double check the domain matches what you expect:
```python
from urllib.parse import urlparse

domain = urlparse(url).netloc
assert domain == "example.com"
```
- Set timeouts – Avoid hanging on slow links by setting short timeouts, then retrying later:
```python
requests.get(url, timeout=5)
```
Validating links upfront saves wasted effort downloading invalid or inaccessible content. For large volumes, consider a distributed link checking system that can test thousands of URLs concurrently.
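As a starting point before reaching for a fully distributed system, a thread pool plus HEAD requests goes a long way. A minimal sketch, assuming links holds the URLs gathered earlier:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def check(url):
    # HEAD request keeps the check cheap; network errors count as a broken link
    try:
        return url, requests.head(url, timeout=5, allow_redirects=True).status_code
    except requests.RequestException:
        return url, None

links = ["https://example.com/about", "https://example.com/contact"]

with ThreadPoolExecutor(max_workers=20) as pool:
    for url, status in pool.map(check, links):
        if status != 200:
            print(f"problem with {url}: {status}")
```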
Tracking Link Changes Over Time
The web is constantly changing, so links you scrape today may go stale over time. To monitor this, track link metrics over your scraping history:
- Count of dead 404 links
- Redirected links and where they point
- New links added
- Links removed
- Links that changed section or URL path
You can run periodic jobs to re-scrape and compare the results against your link store; the changes can then be analyzed (a simple comparison is sketched after the list below). This lets you detect shifts like re-architected websites, removed content, changed navigation, and more. Common link changes to expect:
- Domain migrations – Site moved to new domain, links now broken
- Content removed – Old blog posts/products deprecated, links dead
- Site redesign – Pages and content reorganized, links moved
- New content – Additional blog/help/category sections added
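The comparison itself can be as simple as set arithmetic between two crawls. A minimal sketch, assuming each crawl's links are stored as a set of URL strings (the URLs here are placeholders):

```python
# links from the previous crawl and the latest crawl
previous_links = {"https://example.com/about", "https://example.com/old-post"}
current_links = {"https://example.com/about", "https://example.com/new-post"}

added = current_links - previous_links
removed = previous_links - current_links

print(f"{len(added)} new link(s): {sorted(added)}")
print(f"{len(removed)} removed link(s): {sorted(removed)}")
```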
Alerting on big deltas and changes gives you signal to update any scrapers or downstream consumers relying on those links. For content-heavy sites like news and ecommerce that update daily, automating link change monitoring is highly recommended.
Additional Tips for Working with Links
Here are some additional tips when extracting links with BeautifulSoup:
- Handle Broken Links: Some <a> tags may not have a valid href attribute. Use defensive coding to avoid errors:

```python
href = a_tag.get('href')
if href:
    ...  # extract the link
```
- Extract Link Text: To get the text inside the <a> tag, use a_tag.text:

```python
link_text = a_tag.text
```
- Remove Duplicate Links: Use a Python set to store unique links only:

```python
links = set()  # instead of a list
```
- Limit to Certain Sections: You can call find_all() on a specific element to narrow your search area. For example, to only search within <nav>:

```python
nav_links = soup.nav.find_all('a')
```
- Find by Class/ID: Use a CSS selector like a.some-class or a#some-id with select() to limit matches.
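For example, a quick sketch with select() and select_one() (the class and id names are placeholders):

```python
from bs4 import BeautifulSoup

html = '<a class="some-class" href="/a">A</a> <a id="some-id" href="/b">B</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('a.some-class'))   # all <a> tags with class="some-class"
print(soup.select_one('a#some-id'))  # the first <a> tag with id="some-id"
```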
There are many possibilities! Refer to the BeautifulSoup documentation for advanced queries.
Summary
Link extraction serves as the foundation for most robust web scraping projects. BeautifulSoup makes it very easy to parse, navigate, and extract data from HTML and XML in Python. I hope this guide gave you a good understanding of how to use it to extract all the links from a page.