Hey there! Are you looking to extract data from AliExpress for business intelligence, machine learning, or research? If so, you’re in the right place.
AliExpress is one of the biggest global e-commerce platforms, with millions of products and reviews that can provide invaluable data for market research, business intelligence, and dropshipping. In this comprehensive guide, we'll walk through how to build a web scraper to extract AliExpress product data and reviews using Python.
Why Scrape AliExpress?
Here are some of the key reasons you may want to scrape data from AliExpress:
- Market Research: Analyze top products, pricing trends, and demand signals for your own market research.
- Competitive Intelligence: Keep tabs on competitors' pricing, inventory levels, new products, etc.
- Dropshipping: Curate product catalogs for dropshipping businesses.
- Machine Learning: Source data to train AI models, e.g. for visual search or demand forecasting.
- Reseller Arbitrage: Identify profitable products to source and resell from China.
So whether you need product data for business intelligence, machine learning training, or building a dropshipping store, web scraping provides access to AliExpress's rich data.
Tools You'll Need
We'll use Python for this tutorial, as it's one of the most popular languages for web scraping. Here are the key Python packages we'll utilize:
- BeautifulSoup – A very popular HTML parsing library to extract data from the downloaded web pages.
- requests – Provides simple APIs to send HTTP requests to fetch page content.
- pandas – Helpful for data analysis and storage after extracting data.
- Selenium – For rendering JavaScript heavy pages that can't be parsed statically.
- API Clients – For added scalability, caching, and proxy support (e.g. ScrapingBee, ScrapeHero, Scrapfly).
I'd recommend using a Python virtualenv to install these:
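For example, a quick way to create and activate one on macOS/Linux (the environment name is up to you):
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate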
pip install beautifulsoup4 requests pandas selenium scrapingbee
Jupyter Notebooks are also great for experimenting and iterating on scrapers interactively. Alright, with our tools installed, let's start scraping!
Scraping AliExpress Product Search
Our first goal is to build a scraper that can extract all the products matching a search query across multiple pages.
This involves:
- Sending a search request to AliExpress
- Parsing total matches and page size
- Paginating through each results page
- Extracting product listings
Here is a Python function to scrape a search query:
from bs4 import BeautifulSoup
import requests

def scrape_search(query, pages=5):
    print(f'Scraping {query} for {pages} pages')
    products = []

    for page in range(1, pages+1):
        print(f'Scraping page {page}')
        params = {'SearchText': query, 'page': page}
        r = requests.get('https://www.aliexpress.com/wholesale', params=params)
        soup = BeautifulSoup(r.text, 'lxml')

        # Total matches and page size (useful for deciding how far to paginate)
        total = soup.find('div', {'class': 'total-result'}).text.split(' ')[0].replace(',', '')
        per_page = int(soup.select_one('.ui-pagination-pagesize').text)

        # Each .list-item div is one product listing
        for item in soup.select('.list-item'):
            title = item.select_one('.infos').text
            url = item.select_one('.detail-link')['href']
            image = item.select_one('.image img')['src']
            price = float(item.select_one('.values .value').text[1:])

            product = {
                'title': title,
                'url': url,
                'image': image,
                'price': price
            }
            products.append(product)

    print(f'Found {len(products)} products for {query}')
    return products
Here's how it works step-by-step:
- Send request to AliExpress search URL with the query and pagination.
- Parse the HTML response using BeautifulSoup.
- Use CSS selectors to extract total results and page size.
- Loop through each product div, extracting details into a Python dict.
- Return the list of extracted products.
Now let's test it out on a sample search:
products = scrape_search('usb flash drive', pages=2)

print(len(products))  # 80
print(products[0])
'''
{
    'title': 'Original 32GB 64GB 128GB 256GB U Disk 2.0 USB Flash Drive...',
    'url': 'https://www.aliexpress.com/item/1005004623778029.html',
    'image': 'https://ae01.alicdn...',
    'price': 7.52
}
'''
It extracts 80 products across 2 pages! This paginated scraper can be extended to extract thousands of products for any query.
Scraping Results at Scale
When scraping large volumes, here are some tips:
- Set a higher pages limit, but keep the request rate in check
- Use multiple threads or async requests to scrape faster (see the sketch after this list)
- Add random delays between requests
- Rotate user agents and proxies to avoid blocks
- Leverage data APIs like ScrapingBee or ScrapeHero for caching and proxies
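Here is a rough sketch of what threading, random delays, and user agent/proxy rotation can look like together. The USER_AGENTS and PROXIES values are placeholders you'd supply yourself:
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder pools - swap in real, current values
USER_AGENTS = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...']
PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']

def fetch_page(page):
    # Random delay keeps the request rate in check
    time.sleep(random.uniform(1, 3))
    proxy = random.choice(PROXIES)
    r = requests.get(
        'https://www.aliexpress.com/wholesale',
        params={'SearchText': 'usb flash drive', 'page': page},
        headers={'User-Agent': random.choice(USER_AGENTS)},
        proxies={'http': proxy, 'https': proxy},
    )
    return r.text

# Scrape five pages concurrently with a small thread pool
with ThreadPoolExecutor(max_workers=4) as pool:
    pages_html = list(pool.map(fetch_page, range(1, 6)))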
Let's look at how we can integrate APIs like ScrapingBee to scale up:
import scrapingbee

api = scrapingbee.ScrapingBeeClient(api_key)

for page in range(1, pages+1):
    url = f'https://aliexpress.com/wholesale?SearchText={query}&page={page}'
    html = api.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    # Rest of scraping logic..
APIs provide pre-scraped HTML responses out of the box along with proxies and regional servers to prevent blocks.
This makes it easy to scale up AliExpress scraping to large workloads.
Scraping Product Details
Once we’ve found relevant products, the next step is extracting details from the product page like:
- Title / Description
- Specifications
- Pricing information
- Images
- Seller info
- Shipping options
This structured data is useful for market analysis, machine learning training, and dropshipping store creation. Here is how we can scrape product details:
from bs4 import BeautifulSoup
import requests

def scrape_product(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    product = {
        'title': soup.find('h1', id='product-name').text.strip(),
        'description': soup.find('div', id='j-product-desc').text.strip(),
        'images': [],
        'options': [],
        'skus': []
    }

    # Extract image urls
    for img in soup.find('ul', id='j-image-list').findAll('img'):
        product['images'].append(img['src'])

    # Get product options
    for li in soup.select('#j-product-info .attributes-list li'):
        label = li.select_one('.attribute-label').text
        values = [x.text for x in li.select('.attribute-values .value')]
        product['options'].append({label: values})

    # Extract pricing tables
    for row in soup.find('table', class_='price-table').findAll('tr'):
        cells = row.findAll('td')
        sku = cells[1].text
        price = float(cells[2].text[1:])
        product['skus'].append({
            'sku': sku,
            'price': price
        })

    return product
The key steps are:
- Send GET request to product URL
- Parse HTML using BeautifulSoup
- Use IDs and CSS selectors to extract product data
- Handle nested data like images and pricing tables
- Return product dict containing all details
Let's try it on a sample product:
url = 'https://www.aliexpress.com/item/1005004567746558.html'
product = scrape_product(url)

print(product['title'])        # USB C HUB
print(len(product['images']))  # 5
print(product['skus'][0])      # {'sku': '1005004567746558', 'price': 9.99}
This scraper cleanly extracts even tricky nested data, like multi-variant pricing, into a structured format. With a little modification, we can scrape thousands of AliExpress products by feeding in the product links we extracted earlier, as the loop below shows.
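For instance, here's a minimal loop that feeds the search results into the product scraper (with a polite delay between requests):
import time

search_results = scrape_search('usb flash drive', pages=2)

details = []
for item in search_results:
    details.append(scrape_product(item['url']))
    time.sleep(1)  # be polite between product page requests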
Helpful Tips for Scraping Products
Here are some tips for handling product pages effectively based on my experience:
- Product IDs often stay unchanged even if the product URL changes. So save the ID to scrape again later.
- New products are added very frequently. So regularly re-scrape search results to find new items.
- Use Selenium instead of requests when encountering highly dynamic JavaScript rendering.
- Extracting product specs can be tricky – look for hidden JSON data embedded in the page for structured data.
- Saving HTML locally allows inspecting pages more thoroughly when debugging scrapers. The sketch below combines both of these ideas.
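Here's a sketch of those last two tips in practice. The window.runParams variable name is an assumption based on pages I've seen; verify it against your saved HTML, since AliExpress changes its page structure often:
import json
import re

import requests

def extract_hidden_json(url):
    html = requests.get(url).text

    # Save the raw HTML locally for inspection while debugging
    with open('product_page.html', 'w', encoding='utf-8') as f:
        f.write(html)

    # Look for product data embedded in an inline script.
    # 'window.runParams' is an assumed variable name - confirm it
    # in the saved HTML before relying on this pattern.
    match = re.search(r'window\.runParams\s*=\s*({.*?});', html, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The regex is rough; fall back to parsing the HTML instead
        return None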
Scraping Reviews
Product reviews provide great insight into product fit, sizing, shipping, and more, based on actual buyer experiences. However, reviews are loaded via JavaScript, so we can't scrape them with plain requests. Here's how to render them using Python:
from requests_html import HTMLSession

session = HTMLSession()

def scrape_reviews(url):
    response = session.get(url)
    # Execute the page's JavaScript so the review markup actually exists
    response.html.render()

    reviews = []
    for item in response.html.find('div.feedback-list', first=False):
        stars = len(item.find('.star-view', first=True).attrs['style'].split(' '))
        date = item.find('.date', first=True).text
        country = item.find('.user-country b', first=True).text
        text = item.find('div.buyer-feedback', first=True).text

        reviews.append({
            'stars': stars,
            'date': date,
            'country': country,
            'text': text
        })

    return reviews
Instead of requests, we use requests-html to render JavaScript generated content. The key steps are:
- Use HTMLSession to fetch the page and render() to execute its JavaScript
- Find all the .feedback-list divs containing reviews
- Extract details like stars, date, and country into a dict
- Return the list of review dicts
Now let's try it on a sample product:
url = 'https://www.aliexpress.com/item/1005004567746558.html'
reviews = scrape_reviews(url)

print(len(reviews))  # 12
print(reviews[0])
'''
{
    'stars': 5,
    'date': '01-20-2023',
    'country': 'Poland',
    'text': 'These are great, fit the description'
}
'''
This approach extracts all reviews nicely despite them being loaded dynamically via JS.
Scraping Large Volumes of Reviews
When collecting reviews at scale, here are some tips:
- Use a pool of proxies to distribute requests and avoid blocks
- Add random delays between review pagination requests
- Sort reviews newest first to collect the latest feedback
- Filter reviews programmatically by star rating, country, etc. (see the sketch after this list)
- Save HTML locally when encountering captcha checks
- Use a JS rendering service like ScrapingBee or ScrapeHero to scale up
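For example, the review dicts returned by scrape_reviews are easy to filter with a list comprehension:
def filter_reviews(reviews, min_stars=4, country=None):
    # Keep reviews meeting a minimum rating, optionally from one country
    return [
        r for r in reviews
        if r['stars'] >= min_stars and (country is None or r['country'] == country)
    ]

recent_good = filter_reviews(reviews, min_stars=4, country='Poland')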
Web scrapers can extract thousands of AliExpress reviews, but always throttle requests and mimic organic browsing behavior.
Storing Extracted Data
Once we've built scrapers to extract products, details and reviews, we need to store the scraped data. For small datasets, JSON is an easy option:
import json

# Search results
with open('search_results.json', 'w') as f:
    json.dump(products, f)

# Product details (the list of scrape_product dicts)
with open('products.json', 'w') as f:
    json.dump(details, f)

# Reviews
with open('reviews.json', 'w') as f:
    json.dump(reviews, f)
For larger datasets, a database like PostgreSQL or MongoDB is better suited. For relational storage, a toolkit like SQLAlchemy works well:
import sqlalchemy

# DATABASE_URL is your connection string, e.g. for PostgreSQL
engine = sqlalchemy.create_engine(DATABASE_URL)
connection = engine.connect()
metadata = sqlalchemy.MetaData()

# Save to SQL table
products_table = sqlalchemy.Table(
    'products', metadata,
    sqlalchemy.Column('title', sqlalchemy.String),
    # Rest of columns..
)

query = sqlalchemy.insert(products_table)
connection.execute(query, products)
This allows saving directly to relational databases for more robust storage. Additionally, Pandas provides handy data analysis features:
import pandas as pd

# Save CSV
df = pd.DataFrame(products)
df.to_csv('products.csv')

# Calculate stats
print(df['price'].mean())
Pandas enables statistical analysis, visualization and machine learning on extracted datasets.
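Building on the DataFrame above, a few one-liners go a long way during exploration:
print(df['price'].describe())    # count, mean, std, quartiles
print(df.nsmallest(5, 'price'))  # five cheapest listings
df['price'].hist(bins=30)        # price histogram (needs matplotlib)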
Avoiding Blocks
When scraping heavily, you may encounter bot protection measures like:
- Access denied errors
- CAPTCHAs
- IP blocks
Here are some tips to avoid blocks based on my experience:
- Slow down – Keep requests to under 1 per second. Randomize delays between requests.
- Rotate proxies – Use a pool of residential proxies, such as Bright Data, Smartproxy, Proxy-Seller, or Soax, to distribute requests.
- Randomize headers – Change the user agent, referer, etc. with each request (see the sketch after this list).
- Use mobile user agents – Mobile scrapers tend to get blocked less often.
- Browser automation – Selenium with proxies can mimic organic browsing.
- Cloud scraping APIs – Services like ScraperApi, ScrapingBee, ScrapeHero, etc. help prevent blocks.
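As an illustration, randomizing headers per request might look like this (the user agent strings are truncated placeholders):
import random

import requests

# Truncated placeholder user agents - use real, current strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) ...',
]

def polite_get(url):
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Referer': 'https://www.aliexpress.com/',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    return requests.get(url, headers=headers)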
The key is to mimic human browsing patterns and not overload servers with too many rapid requests. With some care, you can extract data from AliExpress at scale without issues!
Is Web Scraping AliExpress Legal?
Web scraping AliExpress is generally considered legal as long as you follow some basic guidelines:
- Only extract publicly available data
- Don’t try to circumvent their access restrictions
- Scrape at reasonable volumes that don’t overload servers
- Don’t copy large verbatim portions of their content
- Comply with their Terms of Service
The data we'll extract – product details, pricing, reviews, etc. – is all public information intended to be viewed by site visitors. As long as you scrape respectfully and don't overload their servers, extracting data for research or business intelligence purposes is generally permissible.
Of course, always consult a lawyer for legal advice on your specific use case. But generally, non-intrusive web scraping is considered comparable to viewing pages manually.
Let's Recap
Scraping tools like Scrapy and Puppeteer can also help, but the techniques we covered give you a solid foundation for scraping AliExpress. Now you have everything you need to extract tons of useful data. Happy scraping!