Twitter's treasure trove of public data can provide invaluable insights for research, investing, marketing, and more. But with recent API restrictions limiting access, scraping Twitter through the front-end has become more challenging.
In this comprehensive guide, you'll learn how to use Python and headless browsers to extract Tweets, profiles, topics, and other data from Twitter at scale.
Why Scrape Twitter Data?
Before jumping into the technical details, let's briefly discuss why you may want to scrape Twitter data in the first place.
- Sentiment analysis – Analyze tweet sentiment to gauge public reactions to brands, events, or news. Critical for PR.
- Algorithmic trading – Build trading signals by analyzing trends and narratives in finance related tweets.
- Market research – Understand consumer interests, behaviors, and pain points through tweeted conversations.
- Brand monitoring – Identify mentions, trends, and criticisms related to your brand across Twitter.
- Sociological research – Study how narratives spread and groups interact through Twitter discourse.
- Meme tracking – Analyze the propagation of viral memes and jokes on Twitter.
These are just a few examples of high-value use cases for Twitter data analytics. The platform sees over 500 million tweets per day providing an endless wealth of public information to those who can access it programmatically.
The Challenge of Scraping Twitter
Twitter provides a developer API which allows approved applications to programmatically query tweets and user data. However, over the years Twitter has introduced increasing restrictions limiting API access:
- Academic research access requires manual approval.
- Free access limited to just 1-2% of total tweet volume.
- Paid access starts at $99/mo for 10% volume.
This makes it impractical for individuals and startups to gain sufficient firehose access. Additionally, Twitter aggressively blocks scrapers extracting data at scale from their website.
As an alternative, scraping Twitter by reverse engineering their website API calls provides a way to extract Tweet data without needing an official API key. It comes with challenges however:
- Heavy use of complex JavaScript and AJAX calls.
- Advanced anti-bot and anti-scraping mechanisms.
- Risk of IP blocks if making too many requests.
In this guide, you'll learn strategies and tools to overcome these obstacles and efficiently scrape Twitter data at scale.
Is it Legal to Scrape Twitter?
Before diving into the technical details, a brief disclaimer on the legality and ethics of scraping Twitter data.
- Twitter's terms prohibit unauthorized automation and scraping. But they rarely pursue legal action against scrapers focused on public data.
- It's best to avoid scraping private profiles, direct messages, or deleted content. Stick to what's publicly visible.
- Don't overload their systems with an unreasonable number of requests. Be a good web citizen.
- Derived analysis and datasets can have enormous value. But be thoughtful about potential harmful uses.
With those caveats in mind, let's look at how to leverage Python and headless browsers to access Twitter's data.
Scraping Twitter Tweets in Python
Individual tweet pages contain a wealth of data on the tweet itself, attached media, engagement stats, user info, replies, and more. Let's walk through a full code example for scraping tweet data in Python.
Launching a Headless Browser
Twitter pages make heavy use of JavaScript to dynamically load content. To fully render each page, we'll use Playwright to launch a headless Chromium browser:
```python
from playwright.sync_api import sync_playwright

playwright = sync_playwright().start()
browser = playwright.chromium.launch()
```
This gives us a browser instance to programmatically navigate and interact with web pages.
Next we'll open a new page:

```python
page = browser.new_page()
```

Before navigating to the tweet URL, we need to capture the API calls Twitter makes in the background while the page loads.
Capturing API Calls with Playwright
Modern websites like Twitter offload much of the work to asynchronous JavaScript calls. Key data is loaded behind the scenes by calling APIs.
To scrape this data, we need to intercept the network calls and parse the responses. Playwright enables this by letting us attach a callback to each response:
```python
responses = []

def capture_response(response):
    # Save every network response so we can filter for the tweet data later
    responses.append(response)

page.on('response', capture_response)

# Attach the handler before navigating so no responses are missed
page.goto('https://twitter.com/elonmusk/status/123456788')
```
Now as the page loads, all responses will be saved to our list. We can search for the ones relevant to this tweet:
```python
for response in responses:
    if 'TweetDetail' in response.url:
        tweet_data = response.json()
```
The `TweetDetail` GraphQL call contains all the data on the tweet in JSON format. Next we'll parse this into a clean Python object.
Parsing Tweet Data with JMESPath
Twitter returns large complex nested JSON – we want to extract and flatten the most relevant fields. JMESPath makes this easy:
```python
from jmespath import search

query = """
{
    id: legacy.id_str,
    text: legacy.full_text,
    user: user.legacy.screen_name,
    likes: legacy.favorite_count
}
"""

tweet = search(query, tweet_data)
print(tweet)
```
Running this query against the JSON gives us a nice cleaned tweet object!
We can further enrich by doing more advanced parsing, such as extracting all URLs, decoding emojis, analyzing tweet semantics and more. The data is all there in the raw JSON.
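For example, here is a minimal enrichment sketch that pulls out URLs, hashtags, and mentions. It assumes the `legacy` tweet object keeps the classic v1.1 `entities` layout (urls, hashtags, user_mentions); adjust the paths if the payload you capture differs.

```python
from jmespath import search

# Assumed layout: legacy.entities holds urls, hashtags, and user_mentions
enrichment = search("""
{
    urls: legacy.entities.urls[*].expanded_url,
    hashtags: legacy.entities.hashtags[*].text,
    mentions: legacy.entities.user_mentions[*].screen_name
}
""", tweet_data)

# Merge the extra fields into the tweet object built above
tweet.update(enrichment or {})
print(tweet)
```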
Scraping Asynchronous Tweets for Speed
The above example uses a synchronous approach for simplicity. But we can greatly speed up scraping by making requests asynchronous.
Playwright's async API (async_playwright) lets us drive many pages concurrently with asyncio:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_tweet(browser, url):
    page = await browser.new_page()
    responses = []

    # Collect every response so we can pick out the TweetDetail call
    page.on('response', lambda response: responses.append(response))

    await page.goto(url)
    await page.wait_for_selector('.tweet')  # Wait for page load

    for response in responses:
        if 'TweetDetail' in response.url:
            data = await response.json()
            await page.close()
            return data

    await page.close()

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        urls = ['list', 'of', 'tweets']
        coroutines = [scrape_tweet(browser, url) for url in urls]
        results = await asyncio.gather(*coroutines)
        await browser.close()
        return results

print(asyncio.run(main()))
```
By awaiting multiple `scrape_tweet()` calls concurrently, we can extract data from many pages in parallel rather than one at a time. For even higher throughput, we could run multiple instances of this script in separate processes.
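Unbounded concurrency can exhaust memory and trip rate limits, so it's worth capping the number of pages in flight. A minimal sketch using `asyncio.Semaphore`; the limit of 10 is an arbitrary assumption to tune for your setup.

```python
import asyncio

# Cap how many tweet pages are open at once; 10 is an arbitrary starting point
semaphore = asyncio.Semaphore(10)

async def scrape_tweet_limited(browser, url):
    async with semaphore:
        return await scrape_tweet(browser, url)  # scrape_tweet() defined above

# Usage inside main():
# coroutines = [scrape_tweet_limited(browser, url) for url in urls]
# results = await asyncio.gather(*coroutines)
```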
Scraping Entire Tweet Threads
So far we've focused on single tweets. To extract entire back-and-forth conversations, we'll have to recursively scrape pages until reaching the end of a thread. Twitter's UI provides “Show this thread” buttons we can programmatically click through to navigate a thread.
Here's some pseudo-code demonstrating one approach:
```python
def scrape_thread(url):
    browser = launch_browser()  # helper that starts a Playwright browser
    page = browser.new_page()
    page.goto(url)

    while True:
        # parse_tweets() stands in for the TweetDetail parsing shown earlier
        for tweet in parse_tweets(page):
            yield tweet

        # Placeholder selector for the "Show this thread" / next-page control
        next_button = page.query_selector('.next-tweet')
        if not next_button:
            break
        next_button.click()

    browser.close()

for tweet in scrape_thread('https://twitter.com/example/1234'):
    print(tweet)
```
We iteratively click through all pages, extracting tweets, until the next button disappears indicating the end.
Storing Scraped Tweets
Once we're scraping thousands of tweets, we need an efficient storage strategy. Some options:
- JSON – Directly dump JSON responses to disk for simple caching.
- PostgreSQL – Schema with TweetID, text, timestamps. Fast and powerful querying.
- ElasticSearch – Provides fast text search and analytics. Syncs well with Kibana.
- Redis – Ultra fast for simple appending of new tweets. Requires further ingestion.
- MongoDB – Flexible schema-less storage ideal for unstructured tweet data.
Choosing a database ultimately depends on your access patterns and analytics needs. But all are preferable to ad-hoc CSV/JSON files at scale.
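As an illustration, here is a minimal PostgreSQL sketch using psycopg2 to persist the parsed tweet objects from earlier. The connection string, table name, and columns are assumptions to adapt to your own schema.

```python
import psycopg2

# Assumed connection string; point this at your own database
conn = psycopg2.connect("dbname=twitter user=scraper")
cur = conn.cursor()

# Minimal example schema: one row per tweet, keyed on the tweet ID
cur.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        id TEXT PRIMARY KEY,
        username TEXT,
        text TEXT,
        likes INTEGER,
        scraped_at TIMESTAMP DEFAULT NOW()
    )
""")

cur.execute(
    "INSERT INTO tweets (id, username, text, likes) VALUES (%s, %s, %s, %s) "
    "ON CONFLICT (id) DO NOTHING",
    (tweet['id'], tweet['user'], tweet['text'], tweet['likes']),
)
conn.commit()
```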
Scraping Tweets Wrap Up
That covers the foundations of using Python and Playwright to scrape data from individual tweets, threads, and across pages:
- Intercept network calls made by page to access API data.
- Parse GraphQL responses with libraries like JMESPath.
- Scrape asynchronously for high throughput.
- Store data efficiently in PostgreSQL, Elasticsearch etc.
Next let's examine how to scrape in-depth data from Twitter user profiles.
Scraping Twitter User Profiles
Beyond individual tweets, Twitter user profiles provide a wealth of data from biographic info, to tweets, to follower stats, and more. Scraping profiles follows the same principles but requires capturing different API responses.
Loading Target Profile
Again we'll start by launching a Playwright browser and navigating to the profile URL, attaching the response handler before navigation:
```python
browser = playwright.chromium.launch()
page = browser.new_page()

responses = []
page.on('response', lambda response: responses.append(response))

page.goto('https://twitter.com/elonmusk')
```
Finding User Data APIs
As the page loads, we'll intercept all responses, looking for calls to the `UserBy` and `UserTweets` endpoints:
```python
for response in responses:
    if 'UserBy' in response.url:
        user_data = response.json()
    elif 'UserTweets' in response.url:
        tweets_data = response.json()
```
`UserBy` provides core user metadata, while `UserTweets` gives recent tweet history.
Parsing Profile Information
Next, we can use JMESPath to query this data and extract fields of interest:
```python
from jmespath import search

user = search("""
{
    name: legacy.name,
    handle: legacy.screen_name,
    followers: legacy.followers_count,
    tweets: tweet_results[*].id
}
""", user_data)

# parse_tweet() reuses the tweet parsing logic shown earlier
recent_tweets = [parse_tweet(t) for t in tweets_data['tweets']]
```
We now have the user profile information along with their most recent tweets in a structured format.
Enriching Profile Data
Beyond basics like name and followers, we could do more advanced parsing to enrich our user profiles:
- Extract location data from scraped profile metadata.
- Classify users by persona or organization.
- Detect suspicious bot-like activity through analysis of tweet patterns.
- Identify influencers by analyzing follower sentiment, engagements etc.
By combining data across many accounts, we can build detailed profiles of users and groups on Twitter that are valuable for marketing and research. As with tweets, asynchronous requests let us gather profile data for thousands of users concurrently.
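For example, here is a rough sketch of a bot-likeness heuristic computed from fields we already scraped. The thresholds and weights are illustrative assumptions, not a validated detector.

```python
def bot_score(user, recent_tweets):
    """Crude, illustrative bot-likeness heuristic; thresholds are assumptions."""
    score = 0
    # Accounts with few followers but heavy posting are a weak bot signal
    if user['followers'] < 50 and len(recent_tweets) >= 50:
        score += 1
    # Auto-generated-looking handles (trailing digit runs) are another weak signal
    if user['handle'] and user['handle'][-4:].isdigit():
        score += 1
    return score

print(bot_score(user, recent_tweets))
```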
Storing Twitter Profile Data
For storage, we have many of the same options as with tweets:
- PostgreSQL – Relational schema with one table for users, one for tweets.
- MongoDB – Flexible JSON storage for handling diverse user metadata.
- Redis – Rapid appending of new users. Requires additional pipeline for analytics.
- Neo4j – Graph database ideal for analyzing relationships between entities.
Graph databases like Neo4j shine for linked social data. Properties on nodes represent profile details, while edges capture connections between accounts. Cypher query language provides powerful analytics over the resulting graph.
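As an illustration, a minimal sketch using the official neo4j Python driver; the connection details and the `FOLLOWS` relationship are assumptions about how you model your own data.

```python
from neo4j import GraphDatabase

# Connection details are placeholders for your own Neo4j instance
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def save_follow(tx, follower_handle, followed_handle):
    # MERGE keeps the graph idempotent when the same accounts are scraped twice
    tx.run(
        """
        MERGE (a:User {handle: $follower})
        MERGE (b:User {handle: $followed})
        MERGE (a)-[:FOLLOWS]->(b)
        """,
        follower=follower_handle,
        followed=followed_handle,
    )

with driver.session() as session:
    session.execute_write(save_follow, "some_account", "elonmusk")

driver.close()
```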
User Profile Scraping Wrap Up
In summary, scraping Twitter user profiles follows a similar methodology to tweets:
- Intercept `UserBy` and `UserTweets` API calls to collect profile and tweet data.
- Parse and enrich raw JSON into clean user profiles.
- Scale up with asynchronous scraping.
- Graph databases like Neo4j excel for storing connected social data.
Next, let's explore how to discover Twitter accounts and content to scrape.
Finding Tweets and Profiles to Scrape
We now have the tools to extract data from Twitter pages. But how do we find tweets and profiles to scrape in the first place?
Avoid Scraping Twitter Search Results
The obvious approach is to search Twitter for keywords of interest and scrape those results. However, this carries significant risks:
- Need to authenticate to access search results, violating ToS.
- Can easily get account banned for automated search activity.
- Twitter aggressively blocks scrapers accessing search pages.
For all these reasons, directly scraping Twitter search results is not recommended.
Scraping Public Twitter Topic Pages
A better option is to scrape tweets from Twitter's public topic pages:
```
https://twitter.com/i/topics/853980498816679937  # Topic ID for Dogs
```
These provide recent tweets on a given subject, without needing to authenticate.
We can scrape these pages using the same Playwright techniques covered earlier. Just look for the `TopicTimeline` API call.
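For instance, a short sketch reusing the response-capture pattern from earlier; treat `TopicTimeline` as the substring to match on rather than a guaranteed URL path, since the exact endpoint may change over time.

```python
# Reuse the earlier capture pattern: attach the handler, then load the topic page
page = browser.new_page()
responses = []
page.on('response', lambda response: responses.append(response))
page.goto('https://twitter.com/i/topics/853980498816679937')

topic_tweets = []
for response in responses:
    # Match on the endpoint name; the surrounding URL structure may vary
    if 'TopicTimeline' in response.url:
        topic_tweets.append(response.json())
```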
Curating a list of topic IDs to scrape based on your interests provides a constant source of relevant tweets. As topics evolve over time, you can adjust the list accordingly.
Here are some example topic categories and associated ID ranges:
- Technology (#1146 – #1199)
- Business (#1475 – #1524)
- Politics (#1325 – #1399)
- Sports (#4467 – #4599)
See the full topic ID list for hundreds more categories.
Finding Users by Industry and Interest
Scraping focused topic pages provides tweets. To find users to scrape, some options:
- Extract @-mentions from existing tweets to find associated accounts.
- Search sites like TwitInfo by interest and location.
- Analyze follower graphs to expand from an initial user set.
- Check Twitter Lists for curated account groups.
Topic pages provide a natural feed of tweets matching an interest. To scale user profile scraping, you'll need an approach to programmatically expand your account list. The above approaches can help provide that seed set.
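For example, a small sketch that pulls @-mentions out of tweets we've already scraped to build a seed list of accounts. It assumes the `legacy.entities` layout mentioned earlier and falls back to a regex on the tweet text; `scraped_tweets` is a hypothetical list of raw tweet JSON objects you have collected.

```python
import re

def extract_mentions(tweet_json, tweet_text=""):
    """Collect candidate handles from one scraped tweet (layout assumptions noted above)."""
    handles = set()
    entities = tweet_json.get('legacy', {}).get('entities', {})
    for mention in entities.get('user_mentions', []):
        handles.add(mention.get('screen_name'))
    # Fallback: scan the raw text for @handles
    handles.update(re.findall(r'@(\w{1,15})', tweet_text))
    handles.discard(None)
    return handles

seed_accounts = set()
for tweet_json in scraped_tweets:  # scraped_tweets: raw tweet JSON collected earlier (hypothetical)
    text = tweet_json.get('legacy', {}).get('full_text', '')
    seed_accounts |= extract_mentions(tweet_json, text)
```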
Scraping Emerging and Specialized Communities
While mainstream Twitter provides a large data source, emerging and specialized social platforms can provide unique additional data.
For example, Mastodon instances focusing on tech, science, or art communities. Scraping these niche sites in addition to Twitter provides:
- Access to early emerging slang, memes, and topics.
- Specialized vocabularies and language.
- Early insight into fast-growing niche communities.
Stay aware of newer communities and scrape them in conjunction with Twitter for a more rounded dataset.
Avoiding Twitter Blocks and Captchas
If aggressively scraping Twitter from a single IP, you will quickly run into anti-bot protections and blocks. Here are proven strategies to scrape safely at scale:
Understand Twitter Blocking Mechanisms
Twitter actively employs a myriad of tactics to detect and block scrapers and bots. These include:
- Browser fingerprinting – Identifying Playwright/Selenium via navigator properties.
- IP blocks – Banning scraping infrastructure IP ranges.
- Captchas – Manual human verification prompts.
- Rate limiting – Restricting API call frequency.
- Web hints – Anti-scraper headers injected into responses.
By diversifying IP space and mimicking human behavior, we can mitigate or bypass many of these protections.
Use Proxies to Hide Scraper Identity
The most straightforward approach is routing traffic through a large pool of residential proxy servers, hiding the scraper's true IP.
Here is an example using the BrightData proxy API:
```python
import time
from brightdata.api import Client

api = Client(YOUR_KEY)

while True:
    proxy = api.get_proxy()
    # launch_browser() is a helper that starts Playwright with the given proxy
    browser = launch_browser(proxy['host'])
    # ... scraping logic ...
    browser.close()
    time.sleep(5)  # Don't overload the IP!
```
BrightData provides over 72M constantly changing residential IPs worldwide. By rotating proxies, we can scrape extensively without tripping Twitter's defenses. Stick to residential proxies, as Twitter aggressively blocks datacenter IPs. When scaling proxy usage, take care to add throttling, max retries, and error handling to avoid overwhelming proxies.
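If you drive Playwright directly, proxies can be passed at launch time. Below is a sketch that combines proxy configuration with simple retry handling; the proxy dictionary, timeout, and retry count are placeholders to adapt to your provider.

```python
import time
from playwright.sync_api import sync_playwright

def scrape_with_proxy(url, proxy, max_retries=3):
    """Fetch a page through a proxy with basic retries; values below are placeholders."""
    proxy_settings = {"server": proxy["host"]}  # e.g. "http://res.example-proxy.net:22225"
    if proxy.get("username"):
        proxy_settings["username"] = proxy["username"]
        proxy_settings["password"] = proxy["password"]

    with sync_playwright() as p:
        for attempt in range(max_retries):
            try:
                browser = p.chromium.launch(proxy=proxy_settings)
                page = browser.new_page()
                page.goto(url, timeout=30_000)
                html = page.content()
                browser.close()
                return html
            except Exception:
                time.sleep(2 ** attempt)  # back off before retrying
    return None
```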
Scrape “Humanly” to Avoid Detection
Beyond proxies, we can make our scrapers appear more human by:
- Adding random delays between page loads.
- Filling out webpage forms before submitting.
- Scrolling pages and hovering elements before clicking.
- Rotating browser user agents and languages.
- Solving captchas manually or using services like AntiCaptcha.
The more your scraping replicates human web browsing patterns, the lower your chance of Twitter blocking it.
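Here is a hedged sketch of a couple of these techniques with Playwright, assuming a `browser` instance launched as in the earlier snippets; the user-agent list, scroll distances, and delay ranges are arbitrary examples.

```python
import random
import time

# A couple of example user agents; in practice you'd rotate a much larger pool
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36",
]

context = browser.new_context(user_agent=random.choice(USER_AGENTS))
page = context.new_page()
page.goto('https://twitter.com/elonmusk')

# Scroll in small, irregular steps instead of jumping straight to the data
for _ in range(random.randint(3, 6)):
    page.mouse.wheel(0, random.randint(300, 800))
    time.sleep(random.uniform(0.5, 2.0))
```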
Twitter Blocking Mitigation Summary
In summary, rotating residential proxies combined with human-like scraping patterns is key to avoiding Twitter blocks at scale:
- Leverage proxy APIs like BrightData to hide scraper IP.
- Add delays, mouse movements, and other human elements.
- Start small and ramp up over time to avoid sudden spikes.
With the right precautions, you can extract thousands of tweets per day without major issues.
Scraping Twitter with Python – Wrap Up
The techniques covered here provide a structured approach to gathering Twitter data for research, analytics, and more. Just be sure to carefully manage scraping rates and volume to avoid disruptions for both Twitter and your scrapers!