How to Scrape Instagram?

Instagram is one of the fastest-growing social media platforms, with over 2 billion monthly active users worldwide as of 2022. With a user base this large generating massive amounts of data every day, it's no wonder that extracting and analyzing Instagram data is becoming increasingly popular.

In this comprehensive guide, you'll learn how to scrape various types of public data from Instagram using Python, including profiles, posts, comments, hashtags, locations, followers, and more.

Why Scrape Instagram Data?

Before we dig into the how-to, let's briefly go over why you may want to scrape Instagram data in the first place:

  • Market Research – Scrape hashtag and location pages to analyze demand for products, trends, and consumer interests.
  • Influencer Marketing – Identify influencers and collect analytics on their content and audience engagement.
  • Social Listening – Monitor brand mentions, analyze user-generated content, and understand public perception.
  • Competitor Research – Track competitors' growth, content strategies, and audience demographics.
  • Analytics – Collect engagement metrics on campaigns and organic content to optimize performance.

Of course, make sure you scrape ethically and follow Instagram's guidelines. Now let's get into the various techniques and tools to extract data.

Parsing Instagram Profile Data

An Instagram user's profile contains a wealth of data – username, follower count, bio info, posts, etc. To scrape it, we'll make a request to the profile URL and parse the HTML:

import requests
from bs4 import BeautifulSoup

url = 'https://www.instagram.com/natgeo/'
response = requests.get(url) 
soup = BeautifulSoup(response.text, 'html.parser')

Now we can extract the data:

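# og:title usually holds the display name and handle together,
# so take the handle from the profile URL instead
username = url.rstrip('/').split('/')[-1]
# Class names such as 'g47SY' are generated by Instagram's build and change often;
# update these selectors to match the current markup before relying on them
posts_count = soup.find('span', {'class': 'g47SY'}).text
bio = soup.find('div', {'class': '-vDIg'}).text.strip()
profile_img = soup.find('img', {'class': 'FFVAD'})['src']
followers = soup.find('a', {'href': f'/{username}/followers/'}).find('span').text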

And print it:

print({
  'Username': username,
  'Posts': posts_count,
  'Bio': bio,
  'Profile Image': profile_img, 
  'Followers': followers
})
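
Counts scraped from the page arrive as display strings rather than numbers, which makes them awkward to sort or compare. Here is a minimal sketch of a converter, assuming the strings look like '1,234' or use k/m/b suffixes such as '283m'. The parse_count helper and the suffix table are our own illustration, not part of any library:

```python
import re

# Multipliers for abbreviated counts (assumed format; adjust to what the page shows).
_SUFFIXES = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_count(text: str) -> int:
    """Convert a display string such as '1,234' or '10.5k' to an integer."""
    cleaned = text.strip().lower().replace(",", "")
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([kmb]?)", cleaned)
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    number, suffix = match.groups()
    # round() absorbs tiny floating-point errors from values like 1.2 * 1_000_000
    return round(float(number) * _SUFFIXES.get(suffix, 1))
```

Calling parse_count(followers) before printing or storing the value keeps the follower and post counts numeric.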

More than 25 other fields can be extracted, such as full name, category, contact info, and external URLs. Now let's look at scraping data from individual posts.

Scraping Instagram Posts

Posts contain images, videos, captions, comments, likes, and other valuable data. To scrape posts, we'll use Instagram's GraphQL API. First, we need the shortcode of the post from the URL:

post_url = 'https://www.instagram.com/p/BkQ0CcbBiLh/'
shortcode = post_url.split('/')[-2]
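
Splitting on '/' works for the URL above, but it picks the wrong segment if the trailing slash is missing. A slightly more defensive sketch parses the URL path instead. The extract_shortcode helper and the set of accepted prefixes ('p', 'reel', 'tv') are assumptions for illustration:

```python
from urllib.parse import urlparse

# Path prefixes assumed to be followed by a shortcode; adjust as needed.
POST_PREFIXES = {"p", "reel", "tv"}


def extract_shortcode(post_url: str) -> str:
    """Return the shortcode from a post URL, ignoring query strings and slashes."""
    parts = [part for part in urlparse(post_url).path.split("/") if part]
    if len(parts) >= 2 and parts[0] in POST_PREFIXES:
        return parts[1]
    raise ValueError(f"no shortcode found in {post_url!r}")
```

This returns the same value as the one-liner for the example URL, and also handles links copied with tracking parameters attached.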

Then we can send a GET request to the GraphQL endpoint, passing the variables as a JSON-encoded query parameter:

import json
import requests

url = 'https://www.instagram.com/graphql/query/'
headers = {'User-Agent': 'Mozilla/5.0'}

variables = {
  'shortcode': shortcode,
  'child_comment_count': 3,
  'fetch_comment_count': 40,
  'parent_comment_count': 24,
  'has_threaded_comments': True
}

# query_hash identifies a stored query; these hashes change over time, so this one may be stale
params = {'query_hash': '477b65a610463740ccdb83135b2014db', 'variables': json.dumps(variables)}
response = requests.get(url, params=params, headers=headers, timeout=10)

This returns a JSON response containing the post data that we can parse:

data = response.json()  # requests decodes the JSON body for us
media = data['data']['shortcode_media']

img_url = media['display_url']
# A post without a caption may return an empty edge list
caption_edges = media['edge_media_to_caption']['edges']
caption = caption_edges[0]['node']['text'] if caption_edges else ''
likes = media['edge_media_preview_like']['count']
comments = media['edge_media_to_parent_comment']['count']

print({
  'Image URL': img_url,
  'Caption': caption,
  'Likes': likes,
  'Comments': comments  
})
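
Deeply nested lookups like these raise an exception as soon as one key is missing, which happens whenever the response shape shifts. One way to soften that is a small path-walking helper. The dig function below is a hypothetical utility, and the sample payload is made-up data that only mirrors the key names used above:

```python
from typing import Any


def dig(obj: Any, *path: Any, default: Any = None) -> Any:
    """Walk nested dicts/lists along path, returning default if any step fails."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Made-up payload shaped like the fields parsed above (not real Instagram data).
sample = {
    "data": {
        "shortcode_media": {
            "display_url": "https://example.com/photo.jpg",
            "edge_media_to_caption": {"edges": []},
            "edge_media_preview_like": {"count": 12},
        }
    }
}

media = dig(sample, "data", "shortcode_media", default={})
caption = dig(media, "edge_media_to_caption", "edges", 0, "node", "text", default="")
likes = dig(media, "edge_media_preview_like", "count", default=0)
```

With this approach a missing caption or renamed key yields the default instead of stopping the whole scrape.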

More than 70 fields are available, including video URLs, hashtags, location, and tagged users. Now let's see how to scrape posts from hashtags and locations.

Scraping Hashtags and Locations

Hashtags and locations have public pages displaying top posts. Here's how to scrape hashtag posts:

import requests
from bs4 import BeautifulSoup

base = 'https://www.instagram.com/explore/tags/' 
hashtag = 'nature'
response = requests.get(base + hashtag)
soup = BeautifulSoup(response.text, 'html.parser')

posts = soup.find_all('a', class_='_9AhH0')
post_urls = [post['href'] for post in posts[:12]]

We can loop through the post URLs and extract info using the GraphQL endpoint. To scrape location pages, replace the base URL:

base = 'https://www.instagram.com/explore/locations/'
loc_id = 234249451 # ID of Los Angeles 
response = requests.get(f'{base}{loc_id}/')  # loc_id is an int, so format it into the URL
soup = BeautifulSoup(response.text, 'html.parser')

# Get post URLs
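
The href values collected from hashtag and location pages are typically relative paths, and the same post can appear more than once. A minimal sketch for turning them into a clean list of absolute URLs follows. The absolute_post_urls helper is our own illustration, and the hrefs in the usage note are placeholders:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.instagram.com/"


def absolute_post_urls(hrefs, limit=12):
    """Resolve relative hrefs against BASE_URL, dropping duplicates in order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(BASE_URL, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls[:limit]
```

For example, absolute_post_urls(['/p/abc/', '/p/abc/', '/p/def/']) returns two full URLs, ready to be passed to the post scraper one at a time.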

Now let's discuss how to scrape comments.

Scraping Instagram Comments

To scrape comments, we'll access the post page directly:

post_url = 'https://www.instagram.com/p/BkQ0CcbBiLh/'  

response = requests.get(post_url)
soup = BeautifulSoup(response.text, 'html.parser')  
comments = soup.find_all('div', {'class': 'C4VMK'})

Then we can loop through the comments:

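for comment in comments:
  span = comment.find('span')
  button = comment.find('button')
  text = span.text if span else ''
  # aria-label is an accessibility label, so it may need further parsing to yield a number
  likes = button.get('aria-label', '') if button else ''
  print(f"{text} - {likes}")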

This will print the text and like count for each comment on the page. Pagination is required to get all comments.
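
Pagination usually follows a cursor pattern: each page reports whether more results exist and gives a token for fetching the next batch. Here is a sketch of that loop, assuming each page is a dict with 'edges' and a 'page_info' holding 'has_next_page' and 'end_cursor'. Those field names, the iter_comments helper, and the fake pages are assumptions for illustration, so a real fetch_page would wrap whichever endpoint you use:

```python
from typing import Callable, Iterator, Optional


def iter_comments(
    fetch_page: Callable[[Optional[str]], dict], max_pages: int = 50
) -> Iterator[dict]:
    """Yield comments page by page until no next page is reported."""
    cursor = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        page = fetch_page(cursor)
        yield from page.get("edges", [])
        info = page.get("page_info", {})
        if not info.get("has_next_page"):
            break
        cursor = info.get("end_cursor")


# Fake two-page data source standing in for real requests.
_FAKE_PAGES = {
    None: {
        "edges": [{"text": "first"}, {"text": "second"}],
        "page_info": {"has_next_page": True, "end_cursor": "c1"},
    },
    "c1": {
        "edges": [{"text": "third"}],
        "page_info": {"has_next_page": False, "end_cursor": None},
    },
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return _FAKE_PAGES[cursor]


all_comments = [comment["text"] for comment in iter_comments(fake_fetch)]
```

Keeping the fetching logic in a separate callable makes it easy to add delays or retries without touching the pagination loop.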

Next let's look at scraping followers and who a user is following.

Scraping Followers and Following

To scrape a user's followers or following, we'll iterate through the pages:

import time

import requests
from bs4 import BeautifulSoup

base = 'https://www.instagram.com/'
user = 'natgeo'

followers = []
next_data = f"{user}/followers/"

while next_data:
  response = requests.get(base + next_data.lstrip('/'))
  soup = BeautifulSoup(response.text, 'html.parser')

  # Collect the usernames listed on this page
  for link in soup.find_all('a', class_='FPmhX'):
    followers.append(link.text)

  # Follow the "next" link if there is one, otherwise stop
  next_link = soup.find('a', class_='_5f5mN')
  next_data = next_link['href'] if next_link else None
  time.sleep(2)

print(followers)

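We loop through each page, extracting follower usernames and following the next_data link until none remains. The sleep between requests avoids overwhelming Instagram. To get the accounts the user is following, change followers to following in the initial URL. Note that Instagram generally shows follower lists only to logged-in users, so an unauthenticated request may return a login page instead. There are a few additional tips worth mentioning when scraping Instagram.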

Additional Tips for Scraping Instagram

Here are some useful tips to bear in mind when scraping Instagram:

  • Use random delays between 3-10 seconds per request to mimic human behavior and avoid detection. Scraping too fast can get your IP banned.
  • Rotate proxies and residential IPs to make requests appear from different geographic locations and bypass IP blocks. Providers include Bright Data, Smartproxy, Soax, and Proxy-Sellers.
  • Handle captchas that occur by automatically solving reCAPTCHA challenges or manually completing them.
  • Check for blocks – monitor for 403 and 503 errors, redirect loops, or captcha pages that signal blocks.
  • Render JavaScript using Selenium to load dynamic content that may not be in static HTML.
  • Parse JSON data as most content is returned as JSON rather than HTML through Instagram's APIs.
  • Follow robots.txt guidelines to avoid aggressively scraping and getting banned.
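
Two of these tips, random delays and block detection, can be sketched in a few lines. The random_delay and looks_blocked helpers are hypothetical. The 403 and 503 codes come from the list above, while the 429 code and the '/accounts/login' and '/challenge/' URL markers are assumptions you should adjust to what you actually observe:

```python
import random

# 403 and 503 are mentioned above; 429 (Too Many Requests) is added as an assumption.
BLOCK_STATUS_CODES = {403, 429, 503}
# URL fragments assumed to indicate a redirect to a login or challenge page.
BLOCK_URL_MARKERS = ("/accounts/login", "/challenge/")


def random_delay(low: float = 3.0, high: float = 10.0, rng=random) -> float:
    """Pick a wait time in seconds, uniformly between low and high."""
    return rng.uniform(low, high)


def looks_blocked(status_code: int, final_url: str, body: str = "") -> bool:
    """Heuristically decide whether a response signals a block."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    if any(marker in final_url for marker in BLOCK_URL_MARKERS):
        return True
    return "captcha" in body.lower()
```

In a scraper, you would call time.sleep(random_delay()) between requests and pass response.status_code, response.url, and response.text to looks_blocked before parsing.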

Now let's go over some best practices for creating effective and resilient Instagram scrapers.

Scraping Best Practices

Here are some tips for avoiding bans and building sustainable Instagram scrapers:

  • Limit request rate to a few hundred requests per day and use random delays of 3-10+ seconds. Going too fast is an easy way to get blocked.
  • Spread requests over multiple days and hours rather than bombarding all at once. This helps avoid sudden spikes in traffic.
  • Use a random user agent on each request so it appears to come from different devices and browsers.
  • Rotate proxies and IPs to prevent blocks on individual endpoints. Residential IPs work better than data center IPs.
  • Implement captcha solving to detect and bypass captcha challenges. Completing them manually is another option.
  • Check for blocks and employ evasion methods like rotating endpoints, solving captchas, or using proxy services.
  • Obfuscate scraping through plugins like Burner or methods like web session simulation to appear more human.
  • Make requests from different geographic regions, which looks more natural than thousands from a single area.
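
The first and third practices, a daily request cap and a random user agent, might look like the sketch below. The DailyBudget class and random_headers function are hypothetical names, the user agent strings are examples that will go stale, and the default of 300 requests is just one reading of "a few hundred":

```python
import random
import time

# Example user agent strings; keep this list current in real use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_headers(rng=random) -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


class DailyBudget:
    """Allow at most max_per_day requests in each 24-hour window."""

    def __init__(self, max_per_day: int = 300, clock=time.time):
        self.max_per_day = max_per_day
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def try_acquire(self) -> bool:
        now = self.clock()
        if now - self.window_start >= 86_400:  # a day has passed: start a new window
            self.window_start = now
            self.used = 0
        if self.used >= self.max_per_day:
            return False
        self.used += 1
        return True
```

A scraper would check budget.try_acquire() before each request and pass headers=random_headers() to requests.get, stopping for the day once the budget returns False.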

Following these precautions will help reduce the risk of issues. Next let's look at some real-world examples of people leveraging Instagram scrapers.

Real-World Use Cases

Here are just a few examples of companies and individuals extracting value from Instagram data:

  • Social media analytics companies like Socialinsider use scrapers to collect engagement metrics for their clients' Instagram accounts. This helps optimize content strategy.
  • Hypeauditor uses Instagram scrapers as part of their influencer marketing platform to analyze audience demographics, engagement rates, and campaign performance.
  • Social media monitoring companies like Mention scrape brand mentions across social networks like Instagram to generate reports for their customers.
  • eCommerce brands scrape their competitors on Instagram to benchmark performance and analyze what content resonates most with their target audience.
  • Researchers and academics utilize Instagram scrapers to study trends around usage, growth, user behavior, misinformation, and more across different demographics.
  • The travel industry scrapes location and hashtag pages to identify trending destinations and optimize marketing campaigns.

There are countless possibilities for collecting and leveraging public Instagram data ethically.

Ethical Considerations for Scraping Instagram

While most data on Instagram is considered public, it's important to keep ethics in mind:

  • Only scrape data you actually plan to use. Mass collection for no purpose wastes resources.
  • Be transparent in your privacy policy if collecting any personally identifiable data.
  • Enable users to opt out if you'll be storing their data long term.
  • Use rate limiting, delays, and respectful collection to avoid impacting Instagram's infrastructure.
  • Always follow Instagram's Terms of Service and restrict scraping to solely public data sources.
  • Consider anonymizing collected data by removing usernames if sharing publicly.
  • Analyze if your use case could risk user privacy or cause harm before collecting data.
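
When usernames cannot simply be dropped, for instance because you still need to group posts by author, one alternative is to replace them with keyed hashes. Note that this is pseudonymization rather than the full removal suggested above, and it is only as strong as the secrecy of the key. The pseudonymize and anonymize_records helpers are illustrative names:

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Map a username to a stable, non-reversible token using HMAC-SHA256."""
    digest = hmac.new(secret_key, username.lower().encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


def anonymize_records(records, secret_key: bytes):
    """Return copies of the records with the 'username' field replaced."""
    return [
        {**record, "username": pseudonymize(record["username"], secret_key)}
        for record in records
    ]
```

The same username always maps to the same token under one key, so aggregate analysis still works, while the original handle no longer appears in the shared dataset.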

Adhering to responsible data practices helps you stay on the right side of Instagram's guidelines and avoid unethical use cases.

Powerful Python Libraries for Scraping Instagram

There are a number of excellent Python libraries that are useful for scraping Instagram data:

  • Requests – Simplifies making HTTP requests to APIs and pages. Used to fetch profile data, posts, etc.
  • BeautifulSoup – Parses HTML and helps extract specific elements from response content. Great for scraping profile info.
  • Selenium – Launches an automated browser which can render JavaScript. Helpful for content loaded dynamically.
  • Pyppeteer – Asynchronous web scraping library for interacting with pages in a headless Chrome browser.
  • Scrapy – Fast web scraping framework with built-in tools like asynchronous requests, proxies, and caching.
  • json / jmespath – Help parse and extract fields from Instagram's JSON responses.
  • Ratelimit – Useful library to implement delays between requests and avoid overwhelming sites.

Each has different strengths depending on your use case. Combining libraries like Requests and BeautifulSoup is powerful for most scrapers.

Troubleshooting Common Scraping Issues

Here are a few common issues that may arise when scraping Instagram, along with fixes:

  • Blocks – Use proxies, residential IPs, random delays, and user agents. Services like Scrapfly can also help bypass some blocks.
  • CAPTCHAs – Automatically solve reCAPTCHA challenges with integrations or manually complete them.
  • Bans – Rotate IPs and reduce scraping frequency/intensity to get unbanned. Avoid spammy behavior.
  • SSL errors – May need to update SSL certificates on your end or try a new IP if blocked.
  • Blank responses – Check for blocks or try rerouting traffic through a proxy.
  • Missing data – Render JS using Selenium or double-check that the endpoint still exists.
  • Rate limits – Implement delays between requests and spread them over longer durations.
  • Changes – Review documentation for deprecation notices or check for new tokens. Adapt scrapers accordingly.

Adopting good habits like scraping respectfully and using proxies minimizes the likelihood of major issues occurring.
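
For transient blocks and rate limits, a common remedy is to retry with increasing waits. Below is a sketch of exponential backoff with jitter, assuming a fetch callable that returns a (status_code, body) pair. The fetch_with_backoff helper, the retryable status codes, and the default delays are all assumptions to tune for your setup:

```python
import random
import time
from typing import Callable, Tuple

# Status codes treated as "try again later" (an assumption; adjust as needed).
RETRY_STATUS_CODES = {403, 429, 503}


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    rng=random,
) -> Tuple[int, str]:
    """Call fetch until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRY_STATUS_CODES:
            return status, body
        if attempt < max_attempts - 1:
            # Wait base_delay * 2^attempt seconds plus up to 1 second of jitter
            sleep(base_delay * (2 ** attempt) + rng.uniform(0, 1))
    return status, body  # still failing after the last attempt
```

Passing sleep as a parameter keeps the function easy to test, since a fake sleep can record the delays instead of actually waiting.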

Conclusion

I hope this guide provided you with a comprehensive overview of the many possibilities for extracting value from public Instagram data. The techniques covered, from profile parsing to hashtag scraping, demonstrate how much can be accomplished with just Python. Of course, be mindful of Instagram's guidelines and scrape ethically. Implementing best practices around proxies, delays, and randomization helps avoid issues.

Leon Petrou