Hey there! Are you looking to extract data from AliExpress for business intelligence, machine learning, or research? If so, you’re in the right place.
AliExpress is one of the biggest global e-commerce platforms, with millions of products and reviews that can provide invaluable data for market research, business intelligence, and dropshipping. In this comprehensive guide, we'll walk through how to build a web scraper to extract AliExpress product data and reviews using Python.
Why Scrape AliExpress?
Here are some of the key reasons you may want to scrape data from AliExpress:
- Market Research: Analyze top products, pricing trends, and demand signals for your own market research.
- Competitive Intelligence: Keep tabs on competitors' pricing, inventory levels, new products, etc.
- Dropshipping: Curate product catalogs for dropshipping businesses.
- Machine Learning: Source data to train AI models, e.g. for visual search or demand forecasting.
- Reseller Arbitrage: Identify profitable products to source and resell from China.
So whether you need product data for business intelligence, machine learning training, or building a dropshipping store, web scraping provides access to AliExpress's rich data.
Tools You'll Need
We'll use Python for this tutorial, as it's one of the most popular languages for web scraping. Here are the key Python packages we'll utilize:
- BeautifulSoup – A very popular HTML parsing library to extract data from the downloaded web pages.
- requests – Provides simple APIs to send HTTP requests to fetch page content.
- pandas – Helpful for data analysis and storage after extracting data.
- Selenium – For rendering JavaScript heavy pages that can't be parsed statically.
- API Clients – For added scalability, caching, and proxy support (e.g. ScrapingBee, ScrapeHero, Scrapfly).
I'd recommend using a Python virtualenv to install these:
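For example, a quick way to create and activate one on macOS/Linux (the environment name is up to you):
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate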
pip install beautifulsoup4 requests pandas selenium scrapingbee
Jupyter Notebooks are also great for experimenting and iterating on scrapers interactively. Alright, with our tools installed, let's start scraping!
Scraping AliExpress Product Search
Our first goal is to build a scraper that can extract all the products matching a search query across multiple pages.
This involves:
- Sending a search request to AliExpress
- Parsing total matches and page size
- Paginating through each results page
- Extracting product listings
Here is a Python function to scrape a search query:
from bs4 import BeautifulSoup
import requests

def scrape_search(query, pages=5):
    print(f'Scraping {query} for {pages} pages')
    products = []

    for page in range(1, pages+1):
        print(f'Scraping page {page}')
        params = {'SearchText': query, 'page': page}
        r = requests.get('https://www.aliexpress.com/wholesale', params=params)
        soup = BeautifulSoup(r.text, 'lxml')

        # Total matches and page size (useful for deciding how far to paginate)
        total = soup.find('div', {'class': 'total-result'}).text.split(' ')[0].replace(',', '')
        per_page = int(soup.select_one('.ui-pagination-pagesize').text)

        # Each .list-item div is one product listing
        for item in soup.select('.list-item'):
            title = item.select_one('.infos').text
            url = item.select_one('.detail-link')['href']
            image = item.select_one('.image img')['src']
            price = float(item.select_one('.values .value').text[1:])

            product = {
                'title': title,
                'url': url,
                'image': image,
                'price': price
            }
            products.append(product)

    print(f'Found {len(products)} products for {query}')
    return products
Here's how it works step-by-step:
- Send request to AliExpress search URL with the query and pagination.
- Parse the HTML response using BeautifulSoup.
- Use CSS selectors to extract total results and page size.
- Loop through each product div, extracting details into a Python dict.
- Return the list of extracted products.
Now let's test it out on a sample search:
products = scrape_search('usb flash drive', pages=2)

print(len(products))  # 80
print(products[0])
'''
{
    'title': 'Original 32GB 64GB 128GB 256GB U Disk 2.0 USB Flash Drive...',
    'url': 'https://www.aliexpress.com/item/1005004623778029.html',
    'image': 'https://ae01.alicdn...',
    'price': 7.52
}
'''
It extracts 80 products across 2 pages! This paginated scraper can be extended to extract thousands of products for any query.
Scraping Results at Scale
When scraping large volumes, here are some tips:
- Set a higher pages limit, but keep the request rate in check
- Use multiple threads or async requests to scrape faster (see the sketch after this list)
- Add random delays between requests
- Rotate user agents and proxies to avoid blocks
- Leverage data APIs like ScrapingBee or ScrapeHero for caching and proxies
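Here is a rough sketch of what threading, random delays, and user agent/proxy rotation can look like together. The USER_AGENTS and PROXIES values are placeholders you'd supply yourself:
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder pools - swap in real, current values
USER_AGENTS = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...']
PROXIES = ['http://proxy1:8080', 'http://proxy2:8080']

def fetch_page(page):
    # Random delay keeps the request rate in check
    time.sleep(random.uniform(1, 3))
    proxy = random.choice(PROXIES)
    r = requests.get(
        'https://www.aliexpress.com/wholesale',
        params={'SearchText': 'usb flash drive', 'page': page},
        headers={'User-Agent': random.choice(USER_AGENTS)},
        proxies={'http': proxy, 'https': proxy},
    )
    return r.text

# Scrape five pages concurrently with a small thread pool
with ThreadPoolExecutor(max_workers=4) as pool:
    pages_html = list(pool.map(fetch_page, range(1, 6)))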
Let's look at how we can integrate APIs like ScrapingBee to scale up:
import scrapingbee

api = scrapingbee.ScrapingBeeClient(api_key)

for page in range(1, pages+1):
    url = f'https://aliexpress.com/wholesale?SearchText={query}&page={page}'
    html = api.get(url).content
    soup = BeautifulSoup(html, 'lxml')
    # Rest of scraping logic..
APIs provide pre-scraped HTML responses out of the box along with proxies and regional servers to prevent blocks.
This makes it easy to scale up AliExpress scraping to large workloads.
Scraping Product Details
Once we’ve found relevant products, the next step is extracting details from the product page like:
- Title / Description
- Specifications
- Pricing information
- Images
- Seller info
- Shipping options
This structured data is useful for market analysis, machine learning training, and dropshipping store creation. Here is how we can scrape product details:
from bs4 import BeautifulSoup
import requests

def scrape_product(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    product = {
        'title': soup.find('h1', id='product-name').text.strip(),
        'description': soup.find('div', id='j-product-desc').text.strip(),
        'images': [],
        'options': [],
        'skus': []
    }

    # Extract image urls
    for img in soup.find('ul', id='j-image-list').findAll('img'):
        product['images'].append(img['src'])

    # Get product options
    for li in soup.select('#j-product-info .attributes-list li'):
        label = li.select_one('.attribute-label').text
        values = [x.text for x in li.select('.attribute-values .value')]
        product['options'].append({label: values})

    # Extract pricing tables
    for row in soup.find('table', class_='price-table').findAll('tr'):
        cells = row.findAll('td')
        sku = cells[1].text
        price = float(cells[2].text[1:])
        product['skus'].append({
            'sku': sku,
            'price': price
        })

    return product
The key steps are:
- Send GET request to product URL
- Parse HTML using BeautifulSoup
- Use IDs and CSS selectors to extract product data
- Handle nested data like images and pricing tables
- Return product dict containing all details
Let's try it on a sample product:
url = 'https://www.aliexpress.com/item/1005004567746558.html'
product = scrape_product(url)

print(product['title'])        # USB C HUB
print(len(product['images']))  # 5
print(product['skus'][0])      # {'sku': '1005004567746558', 'price': 9.99}
This scraper cleanly extracts even tricky nested data, like multi-variant pricing, into a structured format. With a little modification, we can scrape thousands of AliExpress products by feeding in the product links we extracted earlier, as the loop below shows.
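For instance, here's a minimal loop that feeds the search results into the product scraper (with a polite delay between requests):
import time

search_results = scrape_search('usb flash drive', pages=2)

details = []
for item in search_results:
    details.append(scrape_product(item['url']))
    time.sleep(1)  # be polite between product page requests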
Helpful Tips for Scraping Products
Here are some tips for handling product pages effectively based on my experience:
- Product IDs often stay unchanged even if the product URL changes. So save the ID to scrape again later.
- New products are added very frequently. So regularly re-scrape search results to find new items.
- Use Selenium instead of requests when encountering highly dynamic JavaScript rendering.
- Extracting product specs can be tricky – look for hidden JSON data embedded in the page for structured data.
- Saving HTML locally allows inspecting pages more thoroughly when debugging scrapers. The sketch below combines both of these ideas.
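Here's a sketch of those last two tips in practice. The window.runParams variable name is an assumption based on pages I've seen; verify it against your saved HTML, since AliExpress changes its page structure often:
import json
import re

import requests

def extract_hidden_json(url):
    html = requests.get(url).text

    # Save the raw HTML locally for inspection while debugging
    with open('product_page.html', 'w', encoding='utf-8') as f:
        f.write(html)

    # Look for product data embedded in an inline script.
    # 'window.runParams' is an assumed variable name - confirm it
    # in the saved HTML before relying on this pattern.
    match = re.search(r'window\.runParams\s*=\s*({.*?});', html, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        # The regex is rough; fall back to parsing the HTML instead
        return None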
Scraping Reviews
Product reviews provide great insight into product fit, sizing, shipping, and more, based on actual buyer experiences. However, reviews are loaded via JavaScript, so we can't scrape them with plain requests. Here's how to render them using Python:
from requests_html import HTMLSession

session = HTMLSession()

def scrape_reviews(url):
    response = session.get(url)
    # Execute the page's JavaScript so the review markup actually exists
    response.html.render()

    reviews = []
    for item in response.html.find('div.feedback-list', first=False):
        stars = len(item.find('.star-view', first=True).attrs['style'].split(' '))
        date = item.find('.date', first=True).text
        country = item.find('.user-country b', first=True).text
        text = item.find('div.buyer-feedback', first=True).text

        reviews.append({
            'stars': stars,
            'date': date,
            'country': country,
            'text': text
        })

    return reviews
Instead of requests, we use requests-html to render JavaScript generated content. The key steps are:
- Use HTMLSession to fetch the page and render() to execute its JavaScript
- Find all the .feedback-list divs containing reviews
- Extract details like stars, date, and country into a dict
- Return the list of review dicts
Now let's try it on a sample product:
url = 'https://www.aliexpress.com/item/1005004567746558.html'
reviews = scrape_reviews(url)

print(len(reviews))  # 12
print(reviews[0])
'''
{
    'stars': 5,
    'date': '01-20-2023',
    'country': 'Poland',
    'text': 'These are great, fit the description'
}
'''
This approach extracts all reviews nicely despite them being loaded dynamically via JS.
Scraping Large Volumes of Reviews
When collecting reviews at scale, here are some tips:
- Use a pool of proxies to distribute requests and avoid blocks
- Add random delays between review pagination requests
- Sort reviews newest first to collect the latest feedback
- Filter reviews programmatically by star rating, country, etc. (see the sketch after this list)
- Save HTML locally when encountering captcha checks
- Use a JS rendering service like ScrapingBee or ScrapeHero to scale up
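For example, the review dicts returned by scrape_reviews are easy to filter with a list comprehension:
def filter_reviews(reviews, min_stars=4, country=None):
    # Keep reviews meeting a minimum rating, optionally from one country
    return [
        r for r in reviews
        if r['stars'] >= min_stars and (country is None or r['country'] == country)
    ]

recent_good = filter_reviews(reviews, min_stars=4, country='Poland')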
Web scrapers can extract thousands of AliExpress reviews, but always throttle requests and mimic organic browsing behavior.
Storing Extracted Data
Once we've built scrapers to extract products, details and reviews, we need to store the scraped data. For small datasets, JSON is an easy option:
import json

# Search results
with open('search_results.json', 'w') as f:
    json.dump(products, f)

# Product details (the list of scrape_product dicts)
with open('products.json', 'w') as f:
    json.dump(details, f)

# Reviews
with open('reviews.json', 'w') as f:
    json.dump(reviews, f)
For larger datasets, a database like PostgreSQL or MongoDB is better suited. For relational storage, a toolkit like SQLAlchemy works well:
import sqlalchemy

# DATABASE_URL is your connection string, e.g. for PostgreSQL
engine = sqlalchemy.create_engine(DATABASE_URL)
connection = engine.connect()
metadata = sqlalchemy.MetaData()

# Save to SQL table
products_table = sqlalchemy.Table(
    'products', metadata,
    sqlalchemy.Column('title', sqlalchemy.String),
    # Rest of columns..
)

query = sqlalchemy.insert(products_table)
connection.execute(query, products)
This allows saving directly to relational databases for more robust storage. Additionally, Pandas provides handy data analysis features:
import pandas as pd

# Save CSV
df = pd.DataFrame(products)
df.to_csv('products.csv')

# Calculate stats
print(df['price'].mean())
Pandas enables statistical analysis, visualization and machine learning on extracted datasets.
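Building on the DataFrame above, a few one-liners go a long way during exploration:
print(df['price'].describe())    # count, mean, std, quartiles
print(df.nsmallest(5, 'price'))  # five cheapest listings
df['price'].hist(bins=30)        # price histogram (needs matplotlib)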
Avoiding Blocks
When scraping heavily, you may encounter bot protection measures like:
- Access denied errors
- CAPTCHAs
- IP blocks
Here are some tips to avoid blocks based on my experience:
- Slow down – Keep requests to under 1 per second. Randomize delays between requests.
- Rotate proxies – Use a pool of residential proxies, such as Bright Data, Smartproxy, Proxy-Seller, or Soax, to distribute requests.
- Randomize headers – Change the user agent, referer, etc. with each request (see the sketch after this list).
- Use mobile user agents – Mobile scrapers tend to get blocked less often.
- Browser automation – Selenium with proxies can mimic organic browsing.
- Cloud scraping APIs – Services like ScraperApi, ScrapingBee, ScrapeHero, etc. help prevent blocks.
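As an illustration, randomizing headers per request might look like this (the user agent strings are truncated placeholders):
import random

import requests

# Truncated placeholder user agents - use real, current strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
    'Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) ...',
]

def polite_get(url):
    headers = {
        'User-Agent': random.choice(USER_AGENTS),
        'Referer': 'https://www.aliexpress.com/',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    return requests.get(url, headers=headers)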
The key is to mimic human browsing patterns and not overload servers with too many rapid requests. With some care, you can extract data from AliExpress at scale without issues!
Is Web Scraping AliExpress Legal?
Web scraping AliExpress is generally considered legal as long as you follow some basic guidelines:
- Only extract publicly available data
- Don’t try to circumvent their access restrictions
- Scrape at reasonable volumes that don’t overload servers
- Don’t copy large verbatim portions of their content
- Comply with their Terms of Service
The data we'll extract – product details, pricing, reviews, etc. – is all public information intended to be viewed by site visitors. As long as you scrape respectfully and don't overload their servers, extracting data for research or business intelligence purposes is generally permissible.
Of course, always consult a lawyer for legal advice on your specific use case. But generally, non-intrusive web scraping is considered comparable to viewing pages manually.
Let's Recap
Scraping tools like Scrapy and Puppeteer can also help, but the techniques we covered give you a solid foundation for scraping AliExpress. Now you have everything you need to extract tons of useful data. Happy scraping!