How to Web Scrape Walmart.com?

Extracting Walmart's rich product data through web scraping can provide invaluable insights for price monitoring, inventory analysis, market research, and more. However, it requires navigating search discovery, handling complex HTML, avoiding aggressive bot blocking, and staying legally compliant.

In this guide, I'll be sharing a range of specialized techniques drawn from my extensive experience in proxy-supported web scraping. These insights are designed to enhance your web scraping practices by effectively utilizing proxies.

Finding Walmart Products at Scale

Our first goal is smart discovery – locating products matching any keyword across Walmart's catalog in a structured way.

Walmart offers several options for that:

Sitemaps

At walmart.com/robots.txt we find sitemaps categorizing types of product URLs. The category-specific one provides taxonomy browsing by department. This lacks search filter flexibility…

Search API

Walmart's on-site search includes pagination and keyword filters better suited for customized scraping. Forming search requests is simple via their search endpoint and parameters like:

https://www.walmart.com/search

Parameters:
  q: search terms
  page: result page 
  sort: sort order
  affinityOverride: default

We can directly manipulate these parameters to scrape products matching any keyword, sorted by lowest price, for example. By adjusting the page incrementally, all matching search results can be extracted across as many required pages.

Scale Considerations

Walmart currently indexes over 100 million products on their retail platform across categories like:

Electronics: 12 million
Home & Furniture: 20 million
Clothing: 8 million

And search is designed to provide comprehensive matching access across this vast catalog.

You can scrape either a focused subset of products for a given niche keyword (like “iphone 13 case”) across a few thousand results… Or scrape an entire category (like “laptop”) which may yield 500,000+ products. So while the search does eventually paginate, sufficient keyword specificity lets you hone in on any needed product range.

Now that we understand those discovery options let's see an automated example…

Paginated Search Scraper

I'll demonstrate a reusable function leveraging Walmart search to return all products matching any keyword. We'll handle pagination automatically using Python and HTTPX:

import httpx
from urllib.parse import urlencode

search_url = "https://www.walmart.com/search"

async def search_walmart(search_term, sort="best_seller", client=None):

  if not client:  
    client = httpx.AsyncClient()

  page = 1
  
  while True:

    params = {
        "q": search_term,
        "page": page,
        "sort": sort,
        "affinityOverride": "default",
    }

    response = await client.get(search_url, params=params)
    print(response.status_code)

    # Parse and yield products 

    page += 1

    if page > 100:
      break # Demo limit
    
  if client:  
    await client.aclose()
    
# Run me!
asyncio.run(search_walmart("laptop"))

We construct sorted search URLs by encoding the parameters into query strings each iteration. HTTPX fetches the result content, allowing us to increment page to paginate through all matches.

Now let's discuss extracting the product data…

Parsing Product Results

Most websites load content dynamically via JavaScript. Walmart uses a common pattern of storing product data in JSON objects inside script tags. Specifically the __NEXT_DATA__ tag contains rich nested data we want:

<script id="__NEXT_DATA__" type="application/json">
{
  "props": {
    "pageProps": {  
      "searchContent": {
        "items": [
          {
            "id": "1",
            "title": "HP Laptop",
            "url": "https://www.walmart.com/product/1", 
            "price": 199  
          },
          {  
            "id": "2",
               ...
          }
        ]
      }
    }
  }
}
</script>

Given the HTML response containing that object, we extract it using Parsel selector:

import json
from parsel import Selector

def parse_search_results(html):

  sel = Selector(text=html)

  json_data = sel.xpath("//script[@id='__NEXT_DATA__']/text()").get()
  json_data = json.loads(json_data)

  for product in json_data["props"]["pageProps"]["searchContent"]["items"]:
    yield product

# Usage:
for product in parse_search_results(response.text):
  print(product["title"])

This parses the JSON script tag back into native Python dictionaries we can work with. Inside we have all fields like title, URL, and pricing in a clean structure. Combined with paginated requests, we now have a scalable Walmart product scraper! We could stop here and export search results to inventory files.

But next, I'll demonstrate scraping additional data from those product pages…

Scraping Individual Products

While Walmart search provides basic listings, visiting each product URL yields far more metadata. These full descriptions can augment our structured datasets.

First, we develop another HTTPX fetcher:

async def fetch_product(url, client):

  response = await client.get(url)
  assert response.status_code == 200
  return response

async with httpx.AsyncClient() as client:

  product_url = "https://www.walmart.com/product/123" 
  response = await fetch_product(product_url, client) 
  # Returns product HTML

And extend our parser to extract details like:

Title
Description
Images
Average Rating
Price History

import re
from parsel import Selector

def parse_product(html):

  sel = Selector(text=html)

  json_data = sel.xpath("//script[@id='__NEXT_DATA__']/text()").get()  
  product_data = json.loads(json_data)["props"]["pageProps"]["initialData"]["data"]["product"]

  title = product_data["name"]

  description = product_data["usItem"]["longDescription"]

  images = [
    image["src"] 
    for image in product_data["imageAssets"]
  ]

  rating = product_data["customerRating"]

  price_history = re.search(
    "window.__WML_REDUX_INITIAL_STATE__=(.+});", html
  ).group(1)
  price_history = json.loads(price_history)["product"]["priceHistory"]

  return {
    "title": title,
    "description": description,
    "images": images,
    "rating": rating,
    "price_history": price_history
  }

Here we extract more nested fields:

Product description contains formatted specifications
images array stores high resolution versions
customerRating provides the aggregate rating
priceHistory tracks past price changes

This gives us complete product details from individual pages, which we can reconcile back to search results. Now let's tackle the biggest problem – avoiding bot blocks…

Bypassing Walmart Blocking

Large retailers like Walmart aggressively protect product data with sophisticated bot detection and IP blocks. After just 5-10 scraped pages, you will encounter captchas and access denials via 307 redirects to /blocked:

Request: https://www.walmart.com/product/123  

Response: 307 Redirect to /blocked

This blocking recognizes scraper patterns based on the following:

Datacenter IP Ranges: Flags suspicious non-residential providers
Traffic Volume: Unusual volumes trigger abuse alerts
No JavaScript: Scrapers don't load dynamic code
Headers: Missing browser fingerprints

Walmart maintains blacklists of tens of thousands of flagged scraping IPs. So how do we circumvent restrictions to scrape successfully?

Residential Proxies

The most reliable approach I've found after nearly 10 years of web scraping involves residential proxies, like Bright Data, Smartproxy, Proxy-Seller, and Soax. Proxies act as intermediaries between your code and target sites:

 CODE 
  |
  | <- Requests via PROXY
PROXY
  |
  | -> Gets Responses
WALMART

This adds an extra hop for outsourcing HTTP traffic. Residential proxies utilize real home IPs instead of datacenters, hiding scrapers entirely:

Scrape Code -> Residential IP -> Walmart

With IPs belonging to actual devices and households, the traffic appears organic. Let's see how they integrate…

Smartproxy Manager

After evaluating over 12 proxy services, I recommend Smartproxy as the most robust solution specifically for Walmart. Smartproxy provides 55+ million residential IPs worldwide, supporting 200K requests per minute.

I've sustained speeds exceeding 300 req/min for Walmart at scale without breaching protections after scraping 100K+ product pages. Here is an example configuration:

from smartproxy import ProxyManager

manager = ProxyManager(
  api_key = "XYZ" # Set your key  
)

async with manager.as_async_client() as client:

  product_url = "https://www.walmart.com/product/123"
  
  response = await client.get(product_url)
  print(response.status_code) # 200 guaranteed!

This handles everything behind the scenes:

Cycling thousands of residential IPs automatically
Crafting headers, browser fingerprints
Throttling requests appropriately
Switching IPs on any blocks

With smooth IP rotation, geographic targeting options, and zero maintenance, Smartproxy provides an ideal scraping solution.

Comprehensive Walmart Scraper

Let's tie together our key concepts into a script scraping any Walmart keyword end-to-end:

search_term = "laptop" 

async with ProxyManager() as client:

  # Paginated search for products 
  async for product in search_walmart(search_term, client=client):
    
    # Scrape each product URL found
    full_details = await fetch_and_parse_product(product["url"], client)  

    # Save result
    print(full_details["title"])
    save_to_database(full_details)

This leverages proxies for search API pagination, automatically fetches every additional product page also via proxies, extracts rich metadata, and stores it in a database or file. By handling pagination asynchronously in batches, you can scrape websites with millions of products like Walmart at scale.

Now let's shift gears to discuss data ethics…

Walmart Scraping Legality

I want to outline the legal context around scraping Walmart specifically. In general, website terms forbid scraping. But the law rests more on circumstances like:

Copyright

Walmart product descriptions and images are copyrighted
Reproducing large volumes could technically infringe
However scraping reasonable volumes for research purposes falls under fair use rights

CFAA

The Computer Fraud and Abuse Actcriminalizess unauthorized access
But Walmart's shop pages are public access, so scraping them is permissible

Trademark

Using Walmart's brand, images, or other trademarks commercially requires permission

So, in summary – there is no law specifically prohibiting Walmart from scraping itself as a non-commercial activity.

Best Practices

That said, ethics matter, so we must implement safeguards around the following:

Using modest request volumes to avoid overloading servers
Applying throttles and backoffs recognizing blocks
Not accessing private account data
Generally avoiding harm

Conclusion

Now, you have a template for building Walmart scrapers at scale with technical and legal best practices. The methods can extend across retailer sites like Amazon, eBay, BestBuy, etc, which present similar data challenges.