Hey there! Have you ever come across amazing images online that you wanted to download? What if you need to gather thousands of assets for a computer vision app? Using Python and BeautifulSoup, you can easily extract image URLs from almost any website.
In this comprehensive guide, you'll learn how to programmatically find and scrape image sources with just a few lines of code. Whether you're collecting data, downloading assets, or training machine learning models – this tutorial has you covered!
Why Scrape Images from Websites?
Here are some great reasons to extract images from HTML pages:
- Download Images – Find photos online and bulk download them for personal or commercial use (with the proper rights).
- Data Collection – Gather image datasets from the web for machine learning training.
- Computer Vision – Build AI models by scraping large volumes of tagged images for self-driving cars, facial recognition, and more.
- Web Scraping – Harvest image assets from sites by programmatically extracting their sources.
- Content Marketing – Legally use public images for blogs, social media, and advertising.
- Research – Study images and their metadata to detect trends and patterns on the web.
Instead of manually saving assets, automate the process with some simple Python scripts!
Overview
We'll cover the entire process step-by-step:
- Make requests to get the page HTML
- Parse the HTML with BeautifulSoup
- Find <img> tags in the parsed HTML
- Extract the src attribute from each image
- Download the images by URL to your system
I'll explain each part and show you the code. We'll use real websites for examples you can learn from.
Let's dive in!
Step 1 – Import Required Python Modules
To extract image URLs, we need two Python libraries:
- BeautifulSoup – Parses HTML and XML so we can traverse and search documents.
- Requests – Sends HTTP requests to fetch web page content.
Add these imports to your code:
from bs4 import BeautifulSoup
import requests
This imports BeautifulSoup classes and the requests module.
Step 2 – Make a Request for the Page HTML
Now we need to get the HTML content of the target webpage that contains the images we want to scrape.
The requests module makes this a breeze. We use the requests.get() method to send a GET request to a URL:
response = requests.get("https://example.com")
This grabs the HTML from example.com and stores it in the response variable.
For this guide, we'll use books.toscrape.com – a sample bookstore site:
response = requests.get("https://books.toscrape.com/")
Now response contains the HTML source of the homepage.
Handling HTTP Errors
Sometimes the request may fail with a 404, 500 or other error. We can check the status code to handle issues:
if response.status_code != 200:
    print("Error - got status code", response.status_code)
else:
    # Proceed with scraping
    ...
This displays an error message if the status isn't 200 OK.
To be robust, always check for errors when making HTTP requests!
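For example, requests can raise the error for you. Here's a sketch using response.raise_for_status() together with a timeout, which also catches network-level failures:

import requests

try:
    response = requests.get("https://books.toscrape.com/", timeout=10)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
except requests.RequestException as e:
    print("Request failed:", e)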
Step 3 – Parse the HTML with BeautifulSoup
Now that we have the page HTML, we need to parse it into a structure we can traverse.
BeautifulSoup lets us parse HTML/XML documents. We initialize it with the page content and a parser:
soup = BeautifulSoup(response.content, 'html.parser')
This creates a BeautifulSoup object (soup) representing the parsed document.
We can search and navigate the DOM tree using tags, IDs, classes etc.
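For instance, here are a few common lookups on the soup object (the CSS selector assumes the product_pod class used on books.toscrape.com):

# Common ways to navigate and search the parsed document
print(soup.title.string)                       # text of the <title> tag
first_link = soup.find('a')                    # first <a> tag in the document
products = soup.select('article.product_pod')  # search with a CSS selector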
Why Parse HTML?
Instead of struggling with raw HTML strings, BeautifulSoup gives us easy ways to access, search, and modify the document.
The key advantages are:
- Find elements by name, id, class, CSS selectors
- Extract text, attributes, links from tags
- Iterate and search the parse tree
- Modify the HTML to delete, edit, or add content
- Integrate with web scrapers like Scrapy
- More reliable than brittle regexes for HTML manipulation
For our image scraper, it lets us easily locate image tags.
Step 4 – Find All Images in the Parsed HTML
Now we can use BeautifulSoup to extract information from the parsed HTML.
Looking at the books.toscrape.com homepage, we see thumbnail images like:
<img class="thumbnail" src="image1.jpg">
Each thumbnail image has a class of thumbnail.
BeautifulSoup's find_all() method searches for tags by name and attributes. We pass it the tag name img and the class thumbnail:
thumbnail_images = soup.find_all('img', class_='thumbnail')
This returns a list of all <img> tags that match the class. The list gets stored in the thumbnail_images variable.
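find_all() is flexible. Here are a few variations of the same search:

# Other ways to locate the image tags
all_images = soup.find_all('img')                               # every <img> on the page
css_matches = soup.select('img.thumbnail')                      # same search via CSS selector
first_five = soup.find_all('img', class_='thumbnail', limit=5)  # cap the number of results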
Step 5 – Extract the Src URL from Each Image
The img elements are now in a list. To get the actual image URLs, we loop through them:
for img in thumbnail_images:
    print(img['src'])
This prints the src attribute of each <img>, which contains the path to the image file.
For example, here are some extracted sources:
image1.jpg
image2.png
image3.gif
We now have the path to each image file!
Understanding Image Tag Attributes
The <img> tag contains several standard attributes:
- src – Path to the image file
- alt – Alternate text describing the image
- width – Width of the image in pixels
- height – Height of the image in pixels
- title – Tooltip text
For downloading images, we just need the src URL. BeautifulSoup lets us easily extract attributes from any tag.
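One practical tip: indexing with img['src'] raises a KeyError if the attribute is missing, so the .get() method is safer on messy pages:

for img in thumbnail_images:
    src = img.get('src')      # returns None instead of raising KeyError
    alt = img.get('alt', '')  # supply a default for missing alt text
    if src:
        print(src, '-', alt)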
Step 6 – Build Absolute URLs for Downloading
We have the image filenames, but need complete URLs to download them.
We can construct absolute URLs by joining the base site URL with each src value:
for img in thumbnail_images:
    # Extract just the src from the <img> tag
    src = img['src']
    # Build the full URL
    url = 'https://books.toscrape.com/' + src
    print(url)
Now we have the direct image URLs:
https://books.toscrape.com/image1.jpg
https://books.toscrape.com/image2.png
https://books.toscrape.com/image3.gif
We can pass these URLs into a downloader to grab each image file.
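Plain string concatenation works here because the site's src values are simple relative paths. For messier sites, the standard library's urljoin() handles edge cases like leading slashes and ../ segments:

from urllib.parse import urljoin

base = 'https://books.toscrape.com/'
for img in thumbnail_images:
    url = urljoin(base, img['src'])  # correctly resolves relative paths
    print(url)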
Step 7 – Download Images with Python
To actually download images, we can use Python's urllib.request module.
It allows downloading files from URLs:
import urllib.request

# Download an image from a URL to a local file
urllib.request.urlretrieve(url, "local-filename.jpg")
We pass it the image URL and local filename to save as.
Putting this together with our scraped URLs:
import os
import urllib.request

for img in thumbnail_images:
    # Get the full image URL
    src = img['src']
    url = 'https://books.toscrape.com/' + src
    # Save using just the filename, stripping any directory path in src
    urllib.request.urlretrieve(url, os.path.basename(src))
This downloads all images into the current directory.
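If you'd rather stick with requests for downloads too, here's an equivalent sketch that fetches each file and writes it to disk:

import os
import requests

for img in thumbnail_images:
    url = 'https://books.toscrape.com/' + img['src']
    filename = os.path.basename(img['src'])  # strip any directory components
    resp = requests.get(url, timeout=10)
    if resp.ok:
        with open(filename, 'wb') as f:
            f.write(resp.content)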
We can also increment a counter to save each one as a numbered file:
count = 1
for img in thumbnail_images:
    # Construct the image URL
    src = img['src']
    url = 'https://books.toscrape.com/' + src
    # Download the image as a numbered file
    urllib.request.urlretrieve(url, str(count) + ".jpg")
    count += 1
And that's it – we've built a complete image scraper with Python!
The full script is available on GitHub.
Next let's look at handling dynamic websites…
Scraping Images from JavaScript Sites
Modern sites often use JavaScript to load content. The raw HTML frequently contains only placeholder markup that gets filled in by JS execution in the browser.
To scrape these pages, we need tools like Selenium that can render JavaScript.
Here's a simple Selenium scraper to get image URLs:
import time

from bs4 import BeautifulSoup
from selenium import webdriver

# Initialize the Chrome browser
driver = webdriver.Chrome()

# Load the page
driver.get("https://dynamicpage.com")

# Wait for the JavaScript to fully load
time.sleep(5)

# Parse the rendered HTML with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find images as usual
img_tags = soup.find_all('img')
for img in img_tags:
    # Extract and download each image as before
    print(img.get('src'))
This uses Selenium to load the page and wait for the JavaScript to run before parsing with BeautifulSoup.
You can integrate it with the image scraping script to handle any site!
Some key points about dynamic pages:
- Use browser automation tools like Selenium or Playwright
- Wait for JavaScript to fully render content (see the sketch after this list)
- Grab rendered HTML from browser and parse as normal
- May need to handle infinite scroll, React apps etc.
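As promised in the list above, here's a sketch using one of Selenium's explicit waits in place of a fixed time.sleep(). It blocks until at least one <img> element appears (same placeholder URL as before):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://dynamicpage.com")

# Block until at least one <img> is present, waiting up to 10 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "img"))
)

html = driver.page_source
driver.quit()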
With a little modification, our basic technique works on any website.
Storing Scraped Images in the Cloud
Once you've scraped a large number of images, you'll want to store them somewhere. Here are some good options:
- S3 Buckets – Amazon S3 for cheap, scalable cloud storage
- Web Host – Upload to a shared or VPS web host
- Google Drive – Easy cloud storage with Python API
- MongoDB GridFS – Store files in MongoDB database
- Dropbox – Cloud sync & share with API access
For example, to upload images to S3:
import urllib.request

import boto3

s3 = boto3.client('s3')

for img in thumbnail_images:
    # Build the URL and download the image to a temp file
    src = img['src']
    url = 'https://books.toscrape.com/' + src
    urllib.request.urlretrieve(url, 'temp.jpg')

    # Upload to an S3 bucket
    s3.upload_file('temp.jpg', 'my-bucket', src)
This downloads each image and then pushes it into an S3 bucket.
You can also build a batch process with Celery or Airflow to regularly scrape and upload new images.
The key is automating pipelines to maintain up-to-date datasets.
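As a minimal sketch of that idea, a scheduled Celery task might look like this (the broker URL and task body are placeholders):

from celery import Celery

app = Celery('scraper', broker='redis://localhost:6379/0')  # placeholder broker

@app.task
def scrape_and_upload():
    # Re-run the scrape-and-upload steps from the snippet above
    ...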
Scraping Image Metadata and Insights
Beyond downloading raw images, we can also extract metadata for analysis:
import requests
from PIL import Image, ExifTags

for img in thumbnail_images:
    # Build the full URL and open the image from the response stream
    img_url = 'https://books.toscrape.com/' + img['src']
    i = Image.open(requests.get(img_url, stream=True).raw)

    # Extract basic metadata
    print(i.format, i.size, i.mode)

    # Get EXIF data (often empty for web thumbnails)
    exif = {
        ExifTags.TAGS[k]: v
        for k, v in i.getexif().items()
        if k in ExifTags.TAGS
    }
Using the Pillow library (the maintained fork of the Python Imaging Library, PIL), we can open images and read data like:
- Format (JPEG, PNG, GIF)
- Resolution
- Color mode
- Filesize
- EXIF metadata like camera model, GPS, date etc.
This enables powerful image analysis:
- Identify broken images
- Detect NSFW/offensive pictures
- Find images without alt text
- Filter low resolution images
- OCR text from images
- Index scene categories and objects
Scraped visual data can drive computer vision models and unique insights!
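As one concrete example from the list above, filtering low-resolution images is just a size check (the thresholds here are arbitrary):

from PIL import Image

MIN_WIDTH, MIN_HEIGHT = 300, 300  # arbitrary cutoffs for this sketch

img = Image.open('1.jpg')
if img.size[0] < MIN_WIDTH or img.size[1] < MIN_HEIGHT:
    print('Too small to keep:', img.size)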
Legal Considerations for Image Scraping
It's important to ensure your image scraper respects copyright laws and site terms:
- Only download images you have the rights to use. Public domain and creative commons media is safest.
- Check the website's robots.txt to see if scraping is allowed. It disallows some user agents and paths.
- Limit request rate and resource usage to avoid overloading sites. Slow down crawls.
- Don't steal proprietary images or content that requires login/payment to access.
- Consult an attorney if building commercial products with scraped data.
- Provide attribution for any images used and link back to source.
- Be careful of child exploitation material, offensive content and illegal images.
- Consider alternatives like purchasing stock photos to avoid legal risks.
In general, limit collection to what you can legally justify – don't over-collect data!
Scraping Best Practices
Here are some best practices to ensure successful, robust scraping:
- Handle errors – Check status codes, catch exceptions, retry failed requests.
- Scale workers – Use tools like Scrapy, Selenium Grid, and queues with multiple threads/processes.
- Rotate proxies – Switch up IPs to avoid blocks from aggressive sites.
- Randomize delays – Vary wait times instead of hitting continuously.
- Limit resources – Don't overload sites with huge volumes. Monitor usage.
- Cache frequently – Store copies of already-scraped content locally.
- Use sitemaps – Harvest listed pages instead of blind crawling.
- Obey robots.txt – Respect site owner crawl wishes.
- Break into segments – Parallelize scraper by domain, content type etc.
With some care, you can build robust crawlers that provide value instead of causing issues!
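To make a couple of these practices concrete, here's a sketch that checks robots.txt with the standard library and adds a randomized delay before each request:

import random
import time

import requests
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://books.toscrape.com/robots.txt')
rp.read()

page = 'https://books.toscrape.com/'
if rp.can_fetch('*', page):
    time.sleep(random.uniform(1, 3))  # randomized polite delay
    response = requests.get(page, timeout=10)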
Advanced Topic: Training Computer Vision Models with Scraped Images
A key application of web scraping is building machine learning datasets. By scraping tagged images from the web, you can train convolutional neural networks for computer vision.
Some examples:
- Face Recognition – Build facial recognition models by scraping social media profile photos matched to identities.
- Self-Driving Cars – Crawl street view sites for diverse road images to train steering prediction neural networks.
- Medical Imaging – Harvest disease imaging studies to help train diagnostic models.
- Satellite Imaging – Web maps and aerial imagery help identify objects for geospatial analytics.
- Retail – Scrape product images to classify categories and detect attributes.
- Ad Targeting – Detect objects and scenes in stock photos to generate contextual advertising.
The key steps are:
- Use BeautifulSoup/Selenium to scrape target image categories across websites.
- Clean and process images, storing in sorted folders by class.
- Label the data. This can be automated through filenames, surrounding text etc.
- Feed images and labels into TensorFlow, PyTorch or other libraries to train deep neural networks.
- Evaluate model accuracy on test data. Repeat tuning model architecture and parameters.
- Deploy final model to make predictions on new images!
With a large corpus of quality labeled images, you can build machine learning applications to process visual data at scale.
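For example, once images are sorted into one folder per class (as in the steps above), PyTorch's torchvision can load them directly. A minimal sketch, assuming a scraped_images/ directory with one sub-folder per label:

from torchvision import datasets, transforms

# ImageFolder infers labels from sub-folder names, e.g. scraped_images/fiction/
dataset = datasets.ImageFolder(
    'scraped_images/',
    transform=transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]),
)
print(len(dataset), dataset.classes)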
Next Steps and Resources
Congratulations – now you know how to extract image URLs from any website with BeautifulSoup!
Some next steps:
- Start scraping your own data and downloading unique assets
- Build a crawler to maintain image datasets
- Contribute to open source libraries like BeautifulSoup
- Learn advanced topics like neural networks and computer vision
Here are additional resources for even more skills:
- Scrapy – Powerful web scraping framework in Python
- Selenium – Browser automation for dynamic pages
- Puppeteer – Headless Chrome scraping
- Google Images Download – Script for downloading Google Images search results
- Downloading Files with Python – Tutorial on urllib and requests modules
- Python Image Library (PIL) – Processing images including formats, pixels and metadata
- TensorFlow Image Recognition – Train neural networks on image datasets
I hope you enjoyed this guide to scraping images with Python and BeautifulSoup! Let me know if you have any other questions.
Happy coding!