Selenium Based Crawler in Python

Today, we are going to walk through creating a basic crawler making use of Selenium.

Why Build A Selenium Web Crawler?

First, we should probably address why you might want to build a web crawler using Selenium. The modern web increasingly uses front-end frameworks such as AngularJS and React, which means much of the data you might want to extract will not be readily available without rendering the page’s JavaScript. In instances like this you should first look into whether the site has an underlying private API that you can easily make use of.

Additionally, you may find some sites which run checks to ensure that users are running JavaScript. While there are other ways to get around this, running Selenium will typically make your crawler look like a real browser instance. This is just one way you can work around scraping detection methods.

While Selenium is really a package designed to test web pages, we can easily build our web crawler on top of the package.

Imports & Class Initialisation

To begin, we import the libraries we are going to need. Only two of the libraries we are using here aren’t contained within Python’s standard library: BeautifulSoup (bs4) and Selenium can both be installed with pip, and installing them should be relatively pain-free.
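The full listing lives in the repository linked at the end of the post, but a minimal set of imports along these lines covers everything discussed below (treat the exact selection as a sketch rather than the definitive list):

```python
from collections import deque                   # crawl queue
import csv                                      # writing results to file
import logging                                  # logging failed page loads
from urllib.parse import urljoin, urldefrag     # resolving relative links

from bs4 import BeautifulSoup                   # pip install beautifulsoup4
from selenium import webdriver                  # pip install selenium
```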

We then begin by creating and initialising our SeleniumCrawler class. We pass a number of arguments to __init__.

Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass ‘http://edmundmartin.com’ to the base_url argument.

We then take a list of any URLs or URL paths we may want to exclude. If we wanted to exclude any dynamic or sign-in pages, we would pass something like ['?', 'signin'] to the exclusion list argument. URLs matching these patterns would then never be added to our crawl queue.

We also have an output file argument, already defined with a default value, which is simply the file to which we will write our crawl results. And then finally, we have a start URL argument, which allows you to start a crawl from a different URL than the site’s base URL.
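A rough sketch of the class and its __init__ might look something like the following (the argument names and the choice of Chrome as the driver are my own placeholders, not necessarily the exact code from the repository):

```python
class SeleniumCrawler:

    def __init__(self, base_url, exclusion_list, output_file='results.csv', start_url=None):
        self.browser = webdriver.Chrome()                 # any Selenium-supported browser would do
        self.base_url = base_url
        self.exclusion_list = exclusion_list              # e.g. ['?', 'signin']
        self.output_file = output_file
        self.crawled_urls = set()                         # URLs we have already visited
        self.url_queue = deque([start_url or base_url])   # start from start_url if given
```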

Getting Pages With Selenium

We then create a get_page method. This simply grabs a URL which is passed as an argument and then returns the page’s HTML. If we have any issues with a particular page, we simply log the exception and then return nothing.
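A minimal sketch of such a method, assuming the driver created in __init__ above:

```python
    def get_page(self, url):
        try:
            self.browser.get(url)             # render the page, JavaScript included
            return self.browser.page_source
        except Exception as e:
            logging.exception(e)              # log the problem and move on
            return None
```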

Creating a Soup

This is again a very simple method, which simply checks that we have some HTML and creates a BeautifulSoup object from it. We are then going to use this soup to extract URLs to crawl and the information we are collecting.
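Something along these lines would do the job (the parser choice here is simply the standard-library html.parser):

```python
    def get_soup(self, html):
        if html is not None:
            return BeautifulSoup(html, 'html.parser')
        return None
```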

Getting Links

Our get_links method takes our soup and finds all the links which we haven’t previously found. First, we find all the ‘a’ tags which have an ‘href’ attribute. We then check whether these links contain anything within our exclusion list; if a URL should be excluded, we move on to the next ‘href’. We use urljoin together with urldefrag to resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue, and finally, if the URL matches our base domain, we add it to our queue.
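A sketch of that logic, assuming the crawled_urls set and url_queue deque defined in __init__:

```python
    def get_links(self, soup):
        for link in soup.find_all('a', href=True):
            href = link['href']
            # skip anything matching one of our exclusion patterns
            if any(pattern in href for pattern in self.exclusion_list):
                continue
            # resolve relative URLs and strip any fragment
            url = urljoin(self.base_url, urldefrag(href)[0])
            if url in self.crawled_urls or url in self.url_queue:
                continue
            # only queue URLs that sit under our base domain/sub-domain
            if url.startswith(self.base_url):
                self.url_queue.append(url)
```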

Getting Data

We then use our soup again to get the title of the article in question. If we come across any issues with getting the title, we simply return the string ‘None’. This method could be expanded to collect any other data you require from the page in question.
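As a sketch, pulling the title tag out of the soup might look like this:

```python
    def get_data(self, soup):
        try:
            return soup.find('title').get_text().strip()
        except AttributeError:
            return 'None'                     # no title found on the page
```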

Writing to CSV

We simply pass our URL and title to this method and then use the standard library’s csv module to output the data to our target file.
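For example, appending one row per crawled page:

```python
    def csv_output(self, url, title):
        with open(self.output_file, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow([url, title])
```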

Run Crawler Method

The run crawler method really just brings together all of our already defined methods. While we have unseen URLs, we continue to crawl, taking an element from the left of our queue. We then add this to our crawled list and request the page.

Should the final URL be different from the URL we originally requested, this URL is also added to the crawled list. This means we don’t visit URLs twice when a redirect has been put in place.

We then grab the soup from the HTML and, provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
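Putting it all together, the method might look roughly like this, followed by a small usage example (the method names are the placeholders used in the sketches above):

```python
    def run_crawler(self):
        while self.url_queue:                           # while we still have unseen URLs
            current_url = self.url_queue.popleft()      # take the next URL from the left of the queue
            self.crawled_urls.add(current_url)
            html = self.get_page(current_url)
            # if a redirect took us somewhere else, mark the final URL as crawled too
            if self.browser.current_url != current_url:
                self.crawled_urls.add(self.browser.current_url)
            soup = self.get_soup(html)
            if soup is not None:
                self.get_links(soup)
                title = self.get_data(soup)
                self.csv_output(current_url, title)


if __name__ == '__main__':
    crawler = SeleniumCrawler('http://edmundmartin.com', ['?', 'signin'])
    crawler.run_crawler()
```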

What Can Be Improved?

There are a number of things that can be improved on in this example Selenium crawler.

While I ran a test across over 1,000 URLs, the get_page method may be liable to break. To counter this, it would be recommended to use more sophisticated error handling and to import Selenium’s common exceptions module. Additionally, this method could get stuck waiting forever if JavaScript fails to fully load. It would therefore be recommended to add a timeout on the rendering of JavaScript, which is relatively easy with the Selenium library.
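As an illustration, Selenium lets you set a page-load timeout and catch its specific exceptions rather than a bare Exception. The subclass name below is my own invention, not part of the post’s code:

```python
from selenium.common.exceptions import TimeoutException, WebDriverException


class SaferCrawler(SeleniumCrawler):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.browser.set_page_load_timeout(30)    # abandon pages that take over 30 seconds to load

    def get_page(self, url):
        try:
            self.browser.get(url)
            return self.browser.page_source
        except (TimeoutException, WebDriverException) as e:
            logging.exception(e)
            return None
```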

Additionally, this crawler is going to be relatively slow. It’s single-threaded and uses bs4 to parse pages, which is relatively slow compared with using lxml. Both of the methods using bs4 could quite easily be changed to use lxml, as sketched below.
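For instance, the link and title extraction could be rewritten with lxml roughly as follows (a sketch; the rest of the class would stay the same):

```python
import lxml.html                                  # pip install lxml


def get_hrefs(html):
    # equivalent of the BeautifulSoup link extraction, using lxml directly
    tree = lxml.html.fromstring(html)
    return tree.xpath('//a/@href')


def get_title(html):
    tree = lxml.html.fromstring(html)
    return tree.findtext('.//title') or 'None'
```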

The full code for this post can be found on my GitHub; feel free to fork it, make pull requests and see what you can do with this basic recipe.