When doing web scraping or testing web pages using Selenium, one common task is to retrieve the full page source code (the full HTML) of the website you are interacting with. The page source allows you to see and parse the full structure and content of the page using tools like BeautifulSoup in Python.
In this comprehensive guide, I'll explain multiple methods and best practices for how to get the page source in Selenium using Python, Java, C#, and other languages.
Overview of Getting Page Source in Selenium
The high level steps to get page source in Selenium are:
- Initialize a Selenium WebDriver instance and navigate to a URL
- Call driver.page_source to get the raw HTML source code
- Parse or process the HTML as needed for your application
However, there are some important caveats:
- The page source may not be fully loaded when you retrieve it – you often need to wait for elements or JavaScript to load
- The source won't include dynamically generated content after page load
- There are differences in how page source works across WebDriver languages and browsers
I'll cover all these topics in detail below, including code examples and best practices.
Getting Page Source in Python Selenium
Python is one of the most popular languages for using Selenium WebDriver for web scraping and automation. Here is an example Python script that initializes a ChromeDriver instance, navigates to a URL, and prints out the page source:
```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.example.com")
print(driver.page_source)
```
The key line is driver.page_source, which returns the raw HTML source code as a string. However, this example has some major limitations:
- It will print the source immediately after navigating before the page is fully loaded
- It won't include any JavaScript-generated content
- It won't wait for any XHR/fetch requests that may contain additional data
Here is an improved version that waits for the page to fully load before getting the source:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.example.com")

# Wait for page to fully load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

# Now safe to get page source
print(driver.page_source)
```
This waits for a specific element with ID myDynamicElement to appear on the page before getting the source. You can also write more generic wait logic that waits for document ready state, the number of pending XHR requests, etc.
Some other tips for Python:
- You can pass the source to BeautifulSoup to parse and scrape data
- Use driver.execute_script("return document.documentElement.outerHTML") to get the fully rendered HTML, including any DOM changes
- On timeouts or AJAX-heavy sites, you may need to get the source repeatedly as the page changes
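To illustrate the first tip, here is a minimal sketch of feeding the page source to BeautifulSoup. The HTML string below is a hypothetical stand-in for what driver.page_source would return; in a real script you would pass the driver's output instead:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for driver.page_source
html = """
<html><body>
  <h1>Example Domain</h1>
  <a href="https://www.iana.org/domains/example">More information</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Query the parsed tree instead of string-matching raw HTML
title = soup.find("h1").get_text()
links = [a["href"] for a in soup.find_all("a")]

print(title)  # Example Domain
print(links)
```

The same two lines of querying work unchanged whether the HTML came from a string, a file, or driver.page_source.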
Getting the timing right and waiting for all elements to properly load before scraping the page source is key for successful web scraping in Selenium Python.
Getting Page Source in Java Selenium
For Java Selenium tests and scraping, the process is similar. Initialize a WebDriver instance, navigate to a URL, and call getPageSource():
```java
// Import Selenium WebDriver classes
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class GetPageSource {
    public static void main(String[] args) {
        // Initialize ChromeDriver
        WebDriver driver = new ChromeDriver();

        // Navigate to url
        driver.get("http://www.example.com");

        // Get page source code
        String pageSource = driver.getPageSource();

        // Print source to console
        System.out.println(pageSource);
    }
}
```
Again, you'll want to add waits and checks for page load before getting the source code. Here's an example:
```java
// Import Selenium and wait classes
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;

public class GetPageSource {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("http://www.example.com");

        // Wait for specific element to load
        WebDriverWait wait = new WebDriverWait(driver, 10);
        wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("myElement")));

        // Get page source after wait
        String pageSource = driver.getPageSource();
    }
}
```
The approach of waiting for elements or DOM ready state works well in Java, just like Python. Some other tips for Java:
- Parse the source with the JSoup HTML parser library
- Use ((JavascriptExecutor) driver).executeScript("return document.body.innerHTML") to get the updated DOM HTML
- Handle timeouts and retry getting the source as needed
Getting Page Source in C# Selenium
For C# tests and web scraping with Selenium, the WebDriver API is similar:
```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace SeleniumTests
{
    class GetPageSource
    {
        static void Main(string[] args)
        {
            // Initialize Chrome Driver
            IWebDriver driver = new ChromeDriver();

            // Go to url
            driver.Navigate().GoToUrl("http://www.example.com");

            // Get page source
            string pageSource = driver.PageSource;

            // Print source
            Console.WriteLine(pageSource);
        }
    }
}
```
The driver.PageSource property returns the HTML source code as a string. And again, you'll want to add waits before getting the page source:
```csharp
// Wait for DOM ready (WebDriverWait is in OpenQA.Selenium.Support.UI)
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));

string pageSource = driver.PageSource;
```
Tips for C#:
- Use HtmlAgilityPack to parse and query HTML content
- Execute JS to get updated DOM with
IJavaScriptExecutor
- Add waits and retries for dynamic pages
Waiting for Page Load in Selenium
As these examples demonstrate, one of the key aspects of successfully getting full and accurate page source in Selenium is waiting for the page to fully load and render all its content before getting the source code.
Here are some best practices for page load waits in Selenium:
Wait for document readyState
```python
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return document.readyState') == 'complete'
)
```
Wait for jQuery AJAX requests to complete
```python
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return jQuery.active') == 0
)
```
Wait for number of network requests to stay unchanged
```python
import time

# Sample the request count twice, a second apart, and consider the
# page settled once the count stops changing between samples
def requests_settled(driver, interval=1):
    before = driver.execute_script("return window.performance.getEntries().length")
    time.sleep(interval)
    after = driver.execute_script("return window.performance.getEntries().length")
    return before == after

WebDriverWait(driver, 10).until(lambda d: requests_settled(d))
```
Wait for specific elements on page
```python
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'myElement'))
)
```
Combining multiple waits
```python
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return document.readyState') == 'complete'
    and d.execute_script('return jQuery.active') == 0
    and d.find_element(By.ID, 'myElement')
)
```
I recommend combining multiple waits like readyState, jQuery active, and key elements on the page to ensure everything fully loads.
You can wrap these waits in functions and re-attempt getting the source if timeouts occur on more dynamic sites.
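One way to package this up is a small helper that polls a condition and retries the fetch on timeout. This is a sketch, not part of the Selenium API; the function names and signatures are my own, and with a real driver you would pass lambdas wrapping driver calls:

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    # Poll a zero-argument condition until it returns truthy, or time out
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout}s")

def get_source_with_retries(fetch_source, page_ready, retries=3, timeout=10):
    # Wait for the page to be ready, then fetch the source;
    # on timeout, re-attempt up to `retries` times
    for attempt in range(retries):
        try:
            wait_until(page_ready, timeout=timeout)
            return fetch_source()
        except TimeoutError:
            if attempt == retries - 1:
                raise
```

With Selenium this might be called as get_source_with_retries(lambda: driver.page_source, lambda: driver.execute_script('return document.readyState') == 'complete').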
Getting Updated DOM HTML in Selenium
One limitation of driver.page_source is that it may only return the initial HTML and miss DOM changes made by JavaScript after page load. To get the fully rendered DOM HTML, including JavaScript changes, we can use:
Python
dom_html = driver.execute_script("return document.documentElement.outerHTML")
Java
String domHtml = (String) ((JavascriptExecutor) driver).executeScript("return document.documentElement.outerHTML");
C#
string domHtml = (string) ((IJavaScriptExecutor) driver).ExecuteScript("return document.documentElement.outerHTML");
This will return the fully updated DOM HTML after JavaScript executes.
Handling Dynamic Content and AJAX Sites
For websites that dynamically load content or update the page frequently without full page reloads, you may need to get the page source multiple times and handle it changing throughout the session. Some examples:
Repeatedly get source on a schedule
```python
import time

# Get initial source
source = driver.page_source

while True:
    # Get source every 5 seconds
    time.sleep(5)
    new_source = driver.page_source

    # Check if source changed
    if new_source != source:
        source = new_source
        # Do something with updated source
        ...
```
Get source on element changes
```python
import time
from selenium.webdriver.common.by import By

# XPath for dynamic content container
xpath = '//div[@id="content"]'
prev_html = None

while True:
    curr_html = driver.find_element(By.XPATH, xpath).get_attribute('outerHTML')
    if curr_html != prev_html:
        print("Content updated!")
        # Process new HTML
        prev_html = curr_html
    time.sleep(1)
```
Retry on timeouts
```python
import time
from selenium.common.exceptions import TimeoutException

for i in range(3):
    try:
        source = driver.page_source
        break
    except TimeoutException:
        if i == 2:
            raise
        time.sleep(3)
```
For these types of sites, you'll have to experiment with the right timing and triggers to grab the page source as it updates.
Browser Differences in Page Source
It's important to note that the page source returned may differ slightly across browser vendors and WebDriver implementations. For example:
- Firefox and Chrome can return slightly different HTML for the same page because their parsers normalize the markup differently.
- Safari and Internet Explorer likewise return the browser's processed version of the HTML rather than the raw bytes served over the network.
So you may have to account for browser quirks in some cases. Testing across browsers and inspecting the source strings they return is helpful.
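To make that inspection concrete, one option is to diff the captured source strings with Python's standard difflib module. The two snippets below are hypothetical stand-ins for what two browsers might return:

```python
import difflib

# Hypothetical sources captured from two different browsers
chrome_source = "<html><head></head><body><p>Hello</p></body></html>"
firefox_source = "<html><head></head><body>\n<p>Hello</p>\n</body></html>"

# Compare line by line; unified_diff yields only the differing regions
diff = list(difflib.unified_diff(
    chrome_source.splitlines(),
    firefox_source.splitlines(),
    fromfile="chrome",
    tofile="firefox",
    lineterm="",
))

for line in diff:
    print(line)
```

An empty diff means the two browsers agree; otherwise the output pinpoints exactly where they diverge, which is far easier to act on than eyeballing two long strings.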
Page Source vs HTMLUnitDriver
One alternative to WebDriver is HtmlUnitDriver, a "headless" browser implemented in Java. Its underlying HtmlUnit library can return a parsed HTML DOM document directly:

```java
// Using the underlying HtmlUnit WebClient API
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://example.com");
HtmlElement body = page.getBody();
String html = body.asXml();
```

The advantage is that it gives you a parsed DOM you can directly query instead of raw source strings. However, HtmlUnit's JavaScript support is limited, so the DOM may not reflect dynamic JS-generated content. It's best suited to static content.
Other Tips and Tricks
Here are some other useful tips for getting and handling page source in Selenium:
- Use BeautifulSoup in Python or JSoup in Java to parse and query HTML
- Save screenshots and page source to help debug scraping issues
- Enable browser logs and network request capturing to monitor requests
- Handle missing elements and stale element exceptions when scraping dynamic UIs
- Use headless browsers and cloud-based Selenium services to scale scraping
And some final best practices:
- Always wait for page load, and verify your wait logic, before getting the source
- Scroll to load lazy-loaded content before getting source
- Retry and handle timeouts, especially on high-traffic sites
- Inspect source strings across browsers for differences
- Consider combining with HTML parsing libraries for best results
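For the lazy-loading point above, a common pattern is to scroll until the page height stops growing before reading the source. This is a sketch under that assumption; the function name is my own, and it relies only on the standard driver.execute_script call:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    # Scroll down repeatedly until document height stops growing,
    # giving lazy-loaded content time to appear after each scroll
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```

Calling scroll_to_bottom(driver) just before reading driver.page_source makes it more likely that lazily loaded sections are present in the returned HTML.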
Conclusion
Getting the full, accurate HTML page source is important for many test automation and web scraping use cases with Selenium WebDriver. As we covered, there are several techniques and best practices to ensure you wait for proper page load and account for JavaScript differences across languages and browsers.
Following the examples and guidelines in this guide will help you successfully get reliable page source for your Selenium scripts and integrate it into your scraping and testing workflows.