When doing web scraping or testing web pages using Selenium, one common task is to retrieve the full page source code (the full HTML) of the website you are interacting with. The page source allows you to see and parse the full structure and content of the page using tools like BeautifulSoup in Python.
In this comprehensive guide, I'll explain multiple methods and best practices for how to get the page source in Selenium using Python, Java, C#, and other languages.
Overview of Getting Page Source in Selenium
The high level steps to get page source in Selenium are:
- Initialize a Selenium WebDriver instance and navigate to a URL
- Call driver.page_source to get the raw HTML source code
- Parse or process the HTML as needed for your application
However, there are some important caveats:
- The page source may not be fully loaded when you retrieve it – you often need to wait for elements or JavaScript to load
- The source won't include dynamically generated content after page load
- There are differences in how page source works across WebDriver languages and browsers
I'll cover all these topics in detail below, including code examples and best practices.
Getting Page Source in Python Selenium
Python is one of the most popular languages for using Selenium WebDriver for web scraping and automation. Here is an example Python script that initializes a ChromeDriver instance, navigates to a URL, and prints out the page source:
```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.example.com")
print(driver.page_source)
```
The key line is driver.page_source, which returns the raw HTML source code as a string. However, this example has some major limitations:
- It will print the source immediately after navigating before the page is fully loaded
- It won't include any JavaScript-generated content
- It won't wait for any XHR/fetch requests that may contain additional data
Here is an improved version that waits for the page to fully load before getting the source:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.example.com")

# Wait for page to fully load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

# Now safe to get page source
print(driver.page_source)
```
This waits for a specific element with ID myDynamicElement to appear on the page before getting the source. You can also write more generic wait logic that waits for document ready state, the number of pending XHR requests, etc.
Some other tips for Python:
- You can pass the source to BeautifulSoup to parse and scrape data
- Use driver.execute_script("return document.documentElement.outerHTML") to get the fully rendered HTML, including any DOM changes
- On timeouts or AJAX-heavy sites, you may need to get the source repeatedly as the page changes
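To illustrate the first tip, here is a minimal sketch of feeding the page source to BeautifulSoup. The HTML string below is a hypothetical stand-in for what driver.page_source would return; in a real script you would pass the driver's output instead:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for driver.page_source
html = """
<html><body>
  <h1>Example Domain</h1>
  <a href="https://www.iana.org/domains/example">More information</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Query the parsed tree instead of string-matching raw HTML
title = soup.find("h1").get_text()
links = [a["href"] for a in soup.find_all("a")]

print(title)  # Example Domain
print(links)
```

The same two lines of querying work unchanged whether the HTML came from a string, a file, or driver.page_source.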
Getting the timing right and waiting for all elements to properly load before scraping the page source is key for successful web scraping in Selenium Python.
Getting Page Source in Java Selenium
For Java Selenium tests and scraping, the process is similar. Initialize a WebDriver instance, navigate to a URL, and call getPageSource():
```java
// Import Selenium WebDriver classes
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class GetPageSource {
    public static void main(String[] args) {
        // Initialize ChromeDriver
        WebDriver driver = new ChromeDriver();

        // Navigate to url
        driver.get("http://www.example.com");

        // Get page source code
        String pageSource = driver.getPageSource();

        // Print source to console
        System.out.println(pageSource);
    }
}
```
Again, you'll want to add waits and checks for page load before getting the source code. Here's an example:
```java
// Import Selenium and wait classes
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.openqa.selenium.support.ui.ExpectedConditions;

public class GetPageSource {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("http://www.example.com");

        // Wait for specific element to load
        WebDriverWait wait = new WebDriverWait(driver, 10);
        wait.until(ExpectedConditions.visibilityOfElementLocated(By.id("myElement")));

        // Get page source after wait
        String pageSource = driver.getPageSource();
    }
}
```
The approach of waiting for elements or DOM ready state works well in Java, just like Python. Some other tips for Java:
- Parse the source with the JSoup HTML parser library
- Use ((JavascriptExecutor) driver).executeScript("return document.body.innerHTML") to get the updated DOM HTML
- Handle timeouts and retry getting the source as needed
Getting Page Source in C# Selenium
For C# tests and web scraping with Selenium, the WebDriver API is similar:
```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

namespace SeleniumTests
{
    class GetPageSource
    {
        static void Main(string[] args)
        {
            // Initialize Chrome Driver
            IWebDriver driver = new ChromeDriver();

            // Go to url
            driver.Navigate().GoToUrl("http://www.example.com");

            // Get page source
            string pageSource = driver.PageSource;

            // Print source
            Console.WriteLine(pageSource);
        }
    }
}
```
The driver.PageSource property returns the HTML source code as a string. And again, you'll want to add waits before getting the page source:
```csharp
// Wait for DOM ready (WebDriverWait is in OpenQA.Selenium.Support.UI)
WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState").Equals("complete"));

string pageSource = driver.PageSource;
```
Tips for C#:
- Use HtmlAgilityPack to parse and query HTML content
- Execute JS to get updated DOM with
IJavaScriptExecutor
- Add waits and retries for dynamic pages
Waiting for Page Load in Selenium
As these examples demonstrate, one of the key aspects of successfully getting full and accurate page source in Selenium is waiting for the page to fully load and render all its content before getting the source code.
Here are some best practices for page load waits in Selenium:
Wait for document readyState
```python
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return document.readyState') == 'complete'
)
```
Wait for jQuery AJAX requests to complete
```python
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return jQuery.active') == 0
)
```
Wait for number of network requests to stay unchanged
```python
import time

# Sample the request count twice, a second apart, and consider the
# page settled once the count stops changing between samples
def requests_settled(driver, interval=1):
    before = driver.execute_script("return window.performance.getEntries().length")
    time.sleep(interval)
    after = driver.execute_script("return window.performance.getEntries().length")
    return before == after

WebDriverWait(driver, 10).until(lambda d: requests_settled(d))
```
Wait for specific elements on page
```python
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'myElement'))
)
```
Combining multiple waits
```python
WebDriverWait(driver, 10).until(
    lambda d: d.execute_script('return document.readyState') == 'complete'
    and d.execute_script('return jQuery.active') == 0
    and d.find_element(By.ID, 'myElement')
)
```
I recommend combining multiple waits like readyState, jQuery active, and key elements on the page to ensure everything fully loads.
You can wrap these waits in functions and re-attempt getting the source if timeouts occur on more dynamic sites.
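One way to package this up is a small helper that polls a condition and retries the fetch on timeout. This is a sketch, not part of the Selenium API; the function names and signatures are my own, and with a real driver you would pass lambdas wrapping driver calls:

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    # Poll a zero-argument condition until it returns truthy, or time out
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll)
    raise TimeoutError(f"condition not met within {timeout}s")

def get_source_with_retries(fetch_source, page_ready, retries=3, timeout=10):
    # Wait for the page to be ready, then fetch the source;
    # on timeout, re-attempt up to `retries` times
    for attempt in range(retries):
        try:
            wait_until(page_ready, timeout=timeout)
            return fetch_source()
        except TimeoutError:
            if attempt == retries - 1:
                raise
```

With Selenium this might be called as get_source_with_retries(lambda: driver.page_source, lambda: driver.execute_script('return document.readyState') == 'complete').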
Getting Updated DOM HTML in Selenium
One limitation of driver.page_source is that it may only return the initial HTML and miss DOM changes made by JavaScript after page load. To get the fully rendered DOM HTML, including JavaScript changes, we can use:
Python
dom_html = driver.execute_script("return document.documentElement.outerHTML")
Java
String domHtml = (String) ((JavascriptExecutor) driver).executeScript("return document.documentElement.outerHTML");
C#
string domHtml = (string) ((IJavaScriptExecutor) driver).ExecuteScript("return document.documentElement.outerHTML");
This will return the fully updated DOM HTML after JavaScript executes.
Handling Dynamic Content and AJAX Sites
For websites that dynamically load content or update the page frequently without full page reloads, you may need to get the page source multiple times and handle it changing throughout the session. Some examples:
Repeatedly get source on a schedule
```python
import time

# Get initial source
source = driver.page_source

while True:
    # Get source every 5 seconds
    time.sleep(5)
    new_source = driver.page_source

    # Check if source changed
    if new_source != source:
        source = new_source
        # Do something with updated source
        ...
```
Get source on element changes
```python
import time
from selenium.webdriver.common.by import By

# XPath for dynamic content container
xpath = '//div[@id="content"]'
prev_html = None

while True:
    curr_html = driver.find_element(By.XPATH, xpath).get_attribute('outerHTML')
    if curr_html != prev_html:
        print("Content updated!")
        # Process new HTML
        prev_html = curr_html
    time.sleep(1)
```
Retry on timeouts
```python
import time
from selenium.common.exceptions import TimeoutException

for i in range(3):
    try:
        source = driver.page_source
        break
    except TimeoutException:
        if i == 2:
            raise
        time.sleep(3)
```
For these types of sites, you'll have to experiment with the right timing and triggers to grab the page source as it updates.
Browser Differences in Page Source
It's important to note that the page source returned may differ slightly across browser vendors and WebDriver implementations. For example:
- Firefox and Chrome can return slightly different HTML for the same page because their parsers normalize the markup differently.
- Safari and Internet Explorer likewise return the browser's processed version of the HTML rather than the raw bytes served over the network.
So you may have to account for browser quirks in some cases. Testing across browsers and inspecting the source strings they return is helpful.
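To make that inspection concrete, one option is to diff the captured source strings with Python's standard difflib module. The two snippets below are hypothetical stand-ins for what two browsers might return:

```python
import difflib

# Hypothetical sources captured from two different browsers
chrome_source = "<html><head></head><body><p>Hello</p></body></html>"
firefox_source = "<html><head></head><body>\n<p>Hello</p>\n</body></html>"

# Compare line by line; unified_diff yields only the differing regions
diff = list(difflib.unified_diff(
    chrome_source.splitlines(),
    firefox_source.splitlines(),
    fromfile="chrome",
    tofile="firefox",
    lineterm="",
))

for line in diff:
    print(line)
```

An empty diff means the two browsers agree; otherwise the output pinpoints exactly where they diverge, which is far easier to act on than eyeballing two long strings.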
Page Source vs HTMLUnitDriver
One alternative to WebDriver is HtmlUnitDriver, a "headless" browser implemented in Java. Its underlying HtmlUnit library can return a parsed HTML DOM document directly:

```java
// Using the underlying HtmlUnit WebClient API
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage("http://example.com");
HtmlElement body = page.getBody();
String html = body.asXml();
```

The advantage is that it gives you a parsed DOM you can directly query instead of raw source strings. However, HtmlUnit's JavaScript support is limited, so the DOM may not reflect dynamic JS-generated content. It's best suited to static content.
Other Tips and Tricks
Here are some other useful tips for getting and handling page source in Selenium:
- Use BeautifulSoup in Python or JSoup in Java to parse and query HTML
- Save screenshots and page source to help debug scraping issues
- Enable browser logs and network request capturing to monitor requests
- Handle missing elements and stale element exceptions when scraping dynamic UIs
- Use headless browsers and cloud-based Selenium services to scale scraping
And some final best practices:
- Always wait for page load, and verify your wait logic, before getting the source
- Scroll to load lazy-loaded content before getting source
- Retry and handle timeouts, especially on high-traffic sites
- Inspect source strings across browsers for differences
- Consider combining with HTML parsing libraries for best results
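For the lazy-loading point above, a common pattern is to scroll until the page height stops growing before reading the source. This is a sketch under that assumption; the function name is my own, and it relies only on the standard driver.execute_script call:

```python
import time

def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    # Scroll down repeatedly until document height stops growing,
    # giving lazy-loaded content time to appear after each scroll
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    return last_height
```

Calling scroll_to_bottom(driver) just before reading driver.page_source makes it more likely that lazily loaded sections are present in the returned HTML.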
Conclusion
Getting the full, accurate HTML page source is important for many test automation and web scraping use cases with Selenium WebDriver. As we covered, there are several techniques and best practices to ensure you wait for proper page load and account for JavaScript differences across languages and browsers.
Following the examples and guidelines in this guide will help you successfully get reliable page source for your Selenium scripts and integrate it into your scraping and testing workflows.