Selenium Tips & Tricks in Python

Selenium is a great tool and can be used for a variety of different purposes. It can sometimes however be a bit tricky to make Selenium behave exactly how you want. This article shows you how you can make the most of the libraries advanced features, to make your life easier and help you extract data from websites.

Running Chrome Headless

Provided you have one of the latest versions of Chromdriver, it is now very easy to run selenium headless. This allows you to run the browser in the background without a visible window. We can simply add a couple of lines code to our browser on start-up and accessing webpages with selenium running quietly in the background. It should be noted that some sites can detect whether you are running Chrome headless and may block you from accessing content.

Using A Proxy With Selenium

There are occasions that you may want to use a proxy with Selenium. To use a proxy with Selenium we simply add an argument to Chrome Options when initialing our Selenium instance. Unfortunately, there is no way to change the proxy used once set. This means to rotate proxies while using Selenium, you have to either restart the Selenium browser or use a rotating proxy service which can come with it’s own set of issues.

Accessing Content Within An Iframe

Sometimes the content we want to extract from a website may be buried within an iframe. By default when you ask Selenium to return you the html content of a page, you will miss out on all the information contained within any iframes on the page. You can however access content contained within the iframe.

To switch to the iframe we want to extract data from, we first use Selenium’s find_element method. I would recommend using find_element_by_css_selector method which tends to be more reliable than trying to extract content by using an xpath selector. We then pass our target to a method which allows us to switch the browsers context to our target iframe. We can then access the HTML content and interact with content within the iframe. If we want to revert back to our original context, we simply call the revert to default content by switching to the default content.

Accessing Slow Sites

The modern web is overloaded with JavaScript, and this can cause Selenium to throw a lot of timeout errors, with Selenium timing out if a page takes more than 20 seconds to load. The simplest way to deal with this to increase Selenium’s default timeout. This is particularly useful when trying to access sites via a proxy, which slow down your connection speed.

Scrolling

Selenium by default does not allow users to scroll down pages. The browser automation framework does however allow users to execute JavaScript. This makes it very easy to scroll down pages, this is particularly useful when trying to scrape content from a page which continues to load content as the user scrolls down.

For some reason Selenium can be funny with executing window scroll commands, and it is sometimes necessary to call the command in a loop in order to scroll down the entirety of a page.

Executing JavaScript & Returning The Result

While many users of Selenium know that is is possible to run JavaScript allowing for more complicated interactions with the page, fewer know that it is also possible to return the result of executed JavaScript. This allows your browser to execute functions defined in the pages DOM and return the results to your Python script. This can be great for extracting data from tough to scrape websites.

Leave a Reply

Your email address will not be published. Required fields are marked *