Selenium-Based Crawler in Python

Today, we are going to walk through creating a basic crawler making use of Selenium.

Why Build A Selenium Web Crawler?

First, we should probably address why you might want to build a web crawler using Selenium. The modern web increasingly uses front-end frameworks such as AngularJS and React, which means much of the data you might want to extract will not be readily available without rendering the page's JavaScript. In instances like this you should first look into whether the site has an underlying private API that you can easily make use of.

Additionally, you may find some sites which run checks to ensure that users are running JavaScript. While there are other ways to get around this, running Selenium will typically make your crawler look like a real browser instance. This is just one way you can work around scraping detection methods.

While Selenium is really a package designed for testing web pages, we can easily build our web crawler on top of it.

Imports & Class Initialisation

To begin, we import the libraries we are going to need. Only two of the libraries we are using here aren't contained within Python's standard library: bs4 and Selenium can both be installed with pip, and installing them should be relatively pain free.
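
The original post embeds the code as a gist; the import section looks roughly like the following sketch (the exact names used throughout this walkthrough are my own reconstruction rather than the original code):

```python
import csv
import logging
from collections import deque
from urllib.parse import urljoin, urldefrag

from bs4 import BeautifulSoup      # third-party: pip install beautifulsoup4
from selenium import webdriver     # third-party: pip install selenium
```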

We then create and initialise our SeleniumCrawler class, passing a number of arguments to __init__.

Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass 'http://edmundmartin.com' as the base_url argument.

We then take a list of any URLs or URL paths we want to exclude. If we wanted to exclude any dynamic and sign-in pages, we would pass something like ['?', 'signin'] as the exclusion list argument. URLs matching these patterns would then never be added to our crawl queue.

We also have an output file argument, which is simply the file into which we will write our crawl results. And then finally, we have a start URL argument, which allows you to start a crawl from a URL other than the site's base URL.
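
Putting those arguments together, a minimal sketch of the class initialisation might look like this (the attribute names and default values are my own assumptions, as the original gist is not reproduced here):

```python
class SeleniumCrawler:

    def __init__(self, base_url, exclusion_list, output_file='results.csv', start_url=None):
        self.base_url = base_url                  # domain/sub-domain we restrict the crawl to
        self.exclusion_list = exclusion_list      # URL patterns we never want to queue
        self.output_file = output_file            # CSV file we write our results into
        self.start_url = start_url or base_url    # fall back to the base URL if no start URL is given
        self.crawled_urls = set()                 # URLs we have already visited
        self.url_queue = deque([self.start_url])  # URLs waiting to be crawled
        self.browser = webdriver.Chrome()         # assumes Chrome and chromedriver are installed
```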

Getting Pages With Selenium

We then create a get_page method. This simply grabs the URL passed as an argument and returns the page's HTML. If we have any issues with a particular page, we simply log the exception and return nothing.
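
Continuing the class sketched above, a hedged sketch of such a get_page method:

```python
    def get_page(self, url):
        try:
            self.browser.get(url)              # Selenium loads the page and executes its JavaScript
            return self.browser.page_source    # HTML after rendering
        except Exception as e:
            logging.exception(e)               # log the problem and return nothing
            return None
```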

Creating a Soup

This is again a very simple method which checks that we have some HTML and creates a BeautifulSoup object from it. We then use this soup to extract URLs to crawl and the information we are collecting.
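
A sketch of that soup-building method, continuing the class above:

```python
    def get_soup(self, html):
        if html is not None:
            return BeautifulSoup(html, 'html.parser')  # any installed parser would work here
        return None
```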

Getting Links

Our get_links method takes our soup and finds all the links we haven't previously seen. First, we find all the 'a' tags which have an 'href' attribute. We then check whether these links contain anything within our exclusion list; if a URL should be excluded, we move on to the next 'href'. We use urljoin together with urldefrag to resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue. If the URL matches our base domain, we finally add it to our queue.
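
A sketch of get_links following that logic:

```python
    def get_links(self, soup):
        for link in soup.find_all('a', href=True):
            # skip any href matching our exclusion patterns, e.g. '?' or 'signin'
            if any(pattern in link['href'] for pattern in self.exclusion_list):
                continue
            # resolve relative URLs and strip any '#fragment'
            url = urljoin(self.base_url, urldefrag(link['href'])[0])
            if url in self.crawled_urls or url in self.url_queue:
                continue
            if url.startswith(self.base_url):  # stay within the same domain/sub-domain
                self.url_queue.append(url)
```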

Getting Data

We then use our soup again to get the title of the page in question. If we come across any issues getting the title, we simply return the string 'None'. This method could be expanded to collect any other data you require from the page.
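
A sketch of that title-grabbing method:

```python
    def get_data(self, soup):
        try:
            return soup.find('title').get_text().strip()
        except AttributeError:
            return 'None'   # fall back to the string 'None' if no title is found
```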

Writing to CSV

We simply pass our URL and title to this method and use the standard library's csv module to output the data to our target file.
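
A sketch of the CSV output method:

```python
    def csv_output(self, url, title):
        with open(self.output_file, 'a', newline='', encoding='utf-8') as outputfile:
            writer = csv.writer(outputfile)
            writer.writerow([url, title])
```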

Run Crawler Method

The run_crawler method really just brings together all of our already defined methods. While we have unseen URLs, we continue to crawl, taking an element from the left of our queue. We then add this to our crawled list and request the page.

Should the final URL be different from the URL we originally requested, this URL is also added to the crawled list. This means we don't visit URLs twice when a redirect has been put in place.

We then build the soup from the HTML and, provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
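
Pulling the pieces together, run_crawler might look roughly like the following, with an example of kicking off a crawl at the end (again, a sketch based on the description above rather than the original gist):

```python
    def run_crawler(self):
        while self.url_queue:                          # keep crawling while we have unseen URLs
            current_url = self.url_queue.popleft()     # take an element from the left of our queue
            self.crawled_urls.add(current_url)
            html = self.get_page(current_url)
            # if a redirect took us somewhere else, mark the final URL as crawled too
            if self.browser.current_url != current_url:
                self.crawled_urls.add(self.browser.current_url)
            soup = self.get_soup(html)
            if soup is not None:
                self.get_links(soup)
                title = self.get_data(soup)
                self.csv_output(current_url, title)


if __name__ == '__main__':
    crawler = SeleniumCrawler('http://edmundmartin.com', ['?', 'signin'])
    crawler.run_crawler()
```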

What Can Be Improved?

There are a number of things that can be improved on in this example Selenium crawler.

While I ran a test across over 1,000 URLs, the get_page method may be liable to break. To counter this, it would be recommended to use more sophisticated error handling and to import Selenium's common exceptions module. Additionally, this method could get stuck waiting forever if JavaScript fails to fully load. It would therefore be recommended to add a timeout on the rendering of JavaScript, which is relatively easy with the Selenium library.
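
For example, you could set a page-load timeout on the driver and catch Selenium's own exception types; a hedged sketch (the 30-second value is arbitrary):

```python
from selenium.common.exceptions import TimeoutException, WebDriverException


class SeleniumCrawler:
    # ... __init__ as before, plus a limit on page-load time:
    #     self.browser.set_page_load_timeout(30)   # stop waiting for the page after 30 seconds

    def get_page(self, url):
        try:
            self.browser.get(url)
            return self.browser.page_source
        except (TimeoutException, WebDriverException) as e:
            logging.exception(e)
            return None
```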

Additionally, this crawler is going to be relatively slow. It's single threaded and uses bs4 to parse pages, which is relatively slow compared with using lxml. Both of the methods using bs4 could quite easily be changed to use lxml.

The full code for this post can be found on my Github. Feel free to fork it, make pull requests and see what you can do with this basic recipe.

Beautiful Soup vs. lxml – Speed

When comparing Python parsing frameworks, you often hear people complaining that Beautiful Soup is considerably slower than lxml. Thus, some people conclude that lxml should be used in any performance-critical project. Having used Beautiful Soup in a large number of web scraping projects and never having had any real trouble with its performance, I wanted to properly measure the performance of the popular parsing library.

The Test

To test the two libraries, I wrote a simple single threaded crawler which crawls a total of 100 URLs and then simply extracts links and the page title from each page. I implemented two different parser methods, one using lxml and one using Beautiful Soup. I also tested the speed of Beautiful Soup with various non-default parsers.
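
The timing harness isn't reproduced here, but the two parser methods might look roughly like this (the function names are my own):

```python
from bs4 import BeautifulSoup
from lxml import html as lxml_html


def parse_with_bs4(page_html, parser='html.parser'):
    # parser can be 'html.parser', 'html5lib' or 'lxml'
    soup = BeautifulSoup(page_html, parser)
    title_tag = soup.find('title')
    title = title_tag.get_text() if title_tag else ''
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return title, links


def parse_with_lxml(page_html):
    tree = lxml_html.fromstring(page_html)
    title = tree.findtext('.//title') or ''
    links = tree.xpath('//a/@href')
    return title, links
```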

Each of the various setups was tested a total of five times to account for varying internet and server response times, with the results below outlining the different performance based on library and underlying parser.

The Results

| Parser | Run #1 | Run #2 | Run #3 | Run #4 | Run #5 | Avg. Speed (Seconds) | Overhead Per Page (Seconds) |
|---|---|---|---|---|---|---|---|
| lxml | 38.49 | 36.82 | 39.63 | 39.02 | 39.84 | 38.76 | N/A |
| Bs4 (html.parser) | 49.17 | 52.10 | 45.35 | 47.38 | 47.07 | 48.21 | 0.09 |
| Bs4 (html5lib) | 54.61 | 53.17 | 53.37 | 56.35 | 54.12 | 54.32 | 0.16 |
| Bs4 (lxml) | 42.97 | 43.71 | 46.65 | 44.51 | 47.90 | 45.15 | 0.06 |

As you can see, lxml is significantly faster than Beautiful Soup. A pure lxml solution is several seconds faster than using Beautiful Soup with lxml as the underlying parser. The built-in Python parser (html.parser) is around 10 seconds slower, while the extremely lenient html5lib is slower still. The overhead per page parsed is nevertheless relatively small, with both Bs4 (html.parser) and Bs4 (lxml) adding less than 0.1 seconds per page parsed.

| Parser | Overhead Per Page (Seconds) | 100,000 URLs (Extra Hours) | 500,000 URLs (Extra Hours) |
|---|---|---|---|
| Bs4 (html.parser) | 0.09454 | 2.6 | 13.1 |
| Bs4 (html5lib) | 0.15564 | 4.3 | 21.6 |
| Bs4 (lxml) | 0.06388 | 1.8 | 8.9 |

While the overhead seems very low, when you try to scale a crawler, using Beautiful Soup will add a significant overhead. Even using Beautiful Soup with lxml adds significant overhead when you are trying to scale to hundreds of thousands of URLs. It should be noted that the above table assumes a crawler running a single thread. Anyone looking to crawl more than 100,000 URLs would be highly recommended to build a concurrent crawler making use of a library such as Twisted, asyncio, or concurrent.futures.

So, whether Beautiful Soup is suitable for your project really depends on the scale and nature of the project. Replacing Beautiful Soup with lxml is likely to give you a small (but considerable at scale) performance improvement. This does however come at the cost of losing the Beautiful Soup API, which makes selecting on-page elements a breeze.

Web Scraping: Avoiding Detection


This post avoids the legal and ethical questions surrounding web scraping and simply focuses on the technical aspect of avoiding detection. We are going to look at some of the most effective ways to avoid being detected while crawling/scraping the modern web.

Switching User Agents

Switching or randomly selecting user agents is one of the most effective tactics for avoiding detection. Many sys admins and IT managers monitor the number of requests made by different user agents. If they see an abnormally large number of requests from one IP & user agent, the decision is a very simple one: block the offending user agent/IP.

You see this in effect when scraping using the standard headers provided by common HTTP libraries. Try requesting an Amazon page with the default headers of Python's requests library and you will instantly be served a 503 error.

This makes it key to change up the user agents used by your crawler/scraper. Typically, I would recommend randomly selecting a user agent from a list of commonly used user agents. I have written a short post on how to do this using Python's requests library.
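
A minimal sketch of that approach with the requests library (the user-agent strings below are only illustrative; in practice you would maintain a longer, up-to-date list):

```python
import random

import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 '
    '(KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
]


def get_with_random_agent(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}  # pick a different agent per request
    return requests.get(url, headers=headers, timeout=10)
```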

Other Request Headers

Even when you make an effort to switch up user agents, it may still be obvious that you are running a crawler/scraper. Sometimes other elements of your headers give you away: HTTP libraries tend to send different Accept and Accept-Encoding headers from those sent by real browsers. It can be worth modifying these headers to ensure that you look as much like a real browser as possible.
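
A hedged example of sending a more browser-like header set with requests (the exact values real browsers send vary by browser and version, so treat these as illustrative):

```python
import requests

BROWSER_LIKE_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
}

response = requests.get('http://edmundmartin.com', headers=BROWSER_LIKE_HEADERS, timeout=10)
```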

Going Slow

Many times when scraping is detected, it's a matter of having made too many requests in too little time. It's abnormal for a very large number of requests to be made from one IP in a short space of time, making any scraper or crawler trying to go too fast a prime target. Simply waiting a few seconds between each request will likely mean that you fly under the radar of anyone trying to stop you. In some instances going slower may mean you are not able to collect the data you need quickly enough. If this is the case, you probably need to be using proxies.
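
A simple sketch of a randomised delay between requests (the delay bounds and URLs are arbitrary examples):

```python
import random
import time

import requests


def polite_get(session, url, min_delay=2.0, max_delay=6.0):
    # a randomised pause looks less mechanical than a fixed interval
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, timeout=10)


session = requests.Session()
for url in ['http://example.com/page-1', 'http://example.com/page-2']:  # example URLs
    polite_get(session, url)
```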

Proxies

In some situations, proxies are going to be a must. When a site is actively discouraging scraping, proxies make it appear that your requests are coming from multiple sources. This typically allows you to make a larger number of requests than you would otherwise be allowed to make. There are a large number of SaaS companies providing SEOs and digital marketing firms with Google ranking data; these firms frequently rotate and monitor the health of their proxies in order to extract huge amounts of data from Google.
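
With the requests library, routing through a proxy is a matter of passing a proxies dictionary; a sketch using placeholder proxy addresses:

```python
import random

import requests

# placeholder addresses: substitute proxies from your own pool or provider
PROXIES = [
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
]


def get_via_proxy(url):
    proxy = random.choice(PROXIES)    # rotate through the pool
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
```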

Rendering JavaScript

JavaScript is used pretty much everywhere, and the proportion of humans browsing without JavaScript enabled is less than 1%. This means that some sites have looked to block IPs making large numbers of requests without rendering JavaScript. The simple solution is just to render the JavaScript, using a headless browser and a browser automation suite such as Selenium.

Increasingly, companies such as Cloudflare check whether users making requests to a site are rendering JavaScript. By using this technique, they hope to block bots making requests to the site in question. However, several libraries now exist which help you get around this kind of protection. Python's cloudflare-scrape library is a wrapper around the requests library which simply runs Cloudflare's JavaScript test within a Node environment should it detect that such a protection has been put in place.

Alternatively, you can use a lightweight headless browser such as Splash to do the scraping for you. This specialist headless browser even lets you implement AdBlock Plus rules, allowing you to render pages faster, and it can be used alongside the popular Scrapy framework.

Backing Off

What many crawlers and scrapers fail to do is back off when they start getting served 403 and 503 errors. By simply ploughing on and requesting more pages after coming across a batch of error pages, it becomes pretty clear that you are in fact a bot. Slowing down and backing off when you get a bunch of forbidden errors can help you avoid a permanent ban.
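
A sketch of a simple exponential back-off around requests (the initial delay and retry count are arbitrary choices):

```python
import time

import requests


def get_with_backoff(session, url, max_retries=5, initial_delay=5):
    delay = initial_delay
    for _ in range(max_retries):
        response = session.get(url, timeout=10)
        if response.status_code not in (403, 503):
            return response
        time.sleep(delay)   # back off before trying again
        delay *= 2          # double the wait after every forbidden/unavailable response
    return None
```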

Avoiding Honeypots/Bot Traps

Some webmasters implement honeypot traps which seek to capture bots by directing them to pages whose sole purpose is to determine that the visitor is a bot. There is a very popular WordPress plugin which simply creates an empty '/blackhole/' directory on your site. The link to this directory is then hidden in the site's footer, invisible to those using browsers. When designing a scraper or crawler for a particular site, it is worth checking whether any links are hidden from users loading the page with a standard browser.

Obeying Robots.txt

Simply obeying robots.txt while crawling can save you a lot of hassle. While the robots.txt file itself provides no protection against scrapers/crawlers, some webmasters will simply block any IP which makes many requests to pages disallowed by the robots.txt file. The proportion of webmasters who actively do this is relatively small, but obeying robots.txt can definitely save you some significant trouble. If the content you need to reach is blocked off by the robots.txt file, you may just have to ignore it.
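
The standard library's urllib.robotparser makes checking robots.txt straightforward; in this sketch the user-agent string and page path are hypothetical:

```python
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url('http://edmundmartin.com/robots.txt')
parser.read()

if parser.can_fetch('MyCrawler/1.0', 'http://edmundmartin.com/some-page/'):
    print('Allowed to crawl this page')
```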

Cookies

In some circumstances, it may be worth collecting and holding onto cookies. When scraping services such as Google, results returned by the search engine can be influenced by cookies. The majority of people scraping Google search results do not send any cookie information with their requests, which is abnormal from a behaviour perspective. Provided that you do not mind receiving personalised results, it may be a good idea for some scrapers to send cookies along with their requests.
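
With requests, the simplest way to hold onto cookies is a Session object, which stores any cookies it receives and sends them with subsequent requests; a small illustrative sketch:

```python
import requests

session = requests.Session()   # cookies set by responses are kept on the session
session.get('https://www.google.com', timeout=10)
response = session.get('https://www.google.com/search',
                       params={'q': 'web scraping'}, timeout=10)
```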

Captchas

Captchas are one of the more difficult anti-scraping measures to crack. Fortunately, captchas are also incredibly annoying to real users, which means not many sites use them, and when they are used they are normally limited to forms. Breaking captchas can either be done via computer vision tools such as tesseract-ocr, or solutions can be purchased from a number of API services which use humans to solve the underlying captchas. These services are available even for the latest Google image reCAPTCHAs and simply impose an additional cost on the person scraping.

By combining some of the advice above you should be able to scrape the vast majority of sites without ever coming across any issues.

Concurrent Crawling in Python

Python would seem the perfect language for writing web scrapers and crawlers. Libraries such as BeautifulSoup, Requests and lxml give programmers solid APIs for making requests and parsing the data given back by web pages.

The only issue is that, by default, Python web scrapers and crawlers are relatively slow. This is due to the issues Python has with concurrency owing to the language's GIL (Global Interpreter Lock). Compared with languages such as Golang and runtimes such as NodeJS, building truly concurrent crawlers in Python is more challenging.

This lack of concurrency slows down crawlers, as your scripts simply idle while they await the response from the web server in question. This is particularly frustrating if some of the pages discovered are particularly slow.

In this post we are going to look at three different versions of the same script. The first version lacks any concurrency and simply requests each of the websites one after the other. The second version makes use of concurrent.futures' ThreadPoolExecutor, allowing us to send concurrent requests by making use of threads. Finally, we are going to take a look at a version of the script using asyncio and aiohttp, allowing us to make concurrent requests by means of an event loop.

Non-Concurrent Scraper

A standard crawler/scraper using requests and BeautifulSoup is single threaded. This makes it very slow, as with every request we have to wait for the server to respond before we can carry on with processing the results and moving on to the next URL.

A non-concurrent scraper is the simplest to code and involves the least effort. For many jobs, such a crawler/scraper is more than enough for the task at hand.

The below code is an example of a very basic non-concurrent scraper which simply requests each page and grabs its title. It is this code that we will be expanding on during the post.
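
The original gist isn't reproduced here, but a minimal sketch of such a scraper looks something like this (the URL list is illustrative):

```python
import requests
from bs4 import BeautifulSoup

URLS = ['http://edmundmartin.com', 'https://www.python.org']  # example URLs


def get_title(url):
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup.find('title').get_text().strip()
    except Exception:
        return None


if __name__ == '__main__':
    results = {url: get_title(url) for url in URLS}
    print(results)
```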


Concurrent Futures

Concurrent.futures is available as part of Python’s standard library and gives Python users a way to make concurrent requests by means of a ThreadPoolExecutor.
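
The original example is embedded as a gist; the sketch below follows the same structure as the description that follows, though the class and method names are my own:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup


class ThreadedTitleScraper:

    def __init__(self, urls, max_threads=5):
        self.urls = urls
        self.max_threads = max_threads
        self.results = {}

    def _get_page(self, url):
        response = requests.get(url, timeout=10)
        return response.text

    def _parse_title(self, url, html):
        soup = BeautifulSoup(html, 'html.parser')
        title_tag = soup.find('title')
        self.results[url] = title_tag.get_text().strip() if title_tag else None

    def _scrape(self, url):
        # wrapper combining the two steps so it can be submitted as a single job
        try:
            self._parse_title(url, self._get_page(url))
        except Exception:
            self.results[url] = None

    def run_script(self):
        workers = min(self.max_threads, len(self.urls))   # never start more threads than URLs
        with ThreadPoolExecutor(max_workers=workers) as executor:
            jobs = [executor.submit(self._scrape, url) for url in self.urls]
            for job in as_completed(jobs):
                job.result()                               # surface any unexpected exceptions
        print(self.results)


if __name__ == '__main__':
    scraper = ThreadedTitleScraper(['http://edmundmartin.com', 'https://www.python.org'])
    scraper.run_script()
```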

In the above example we initialise a class which takes a list of URLs and a maximum number of threads as its initial arguments. The class then has two internal methods which handle making requests to the provided URLs and parsing the titles from the HTML, returning the results to a dictionary.

These two methods are then placed in a wrapper which is called in our run_script method. This is where we get the ThreadPoolExecutor involved, creating a list of jobs from the URLs passed to the crawler on initialisation. We ensure that we are not starting more threads than we have URLs by using Python's built-in min function. A list comprehension is then used to submit the function and its argument (a URL) to the executor. We then print the results of our simple crawl, which have been collected in a dictionary.

Asyncio & Aiohttp

Asyncio was introduced to the Python standard library in version 3.4. Its introduction seriously improves Python's concurrency credentials, and there are already a number of community maintained packages expanding on its functionality. Using asyncio and aiohttp is a little more complicated, but offers increased power and even better performance.
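
Again, the original gist isn't shown here; the sketch below follows the structure the post describes (the function names are my own, and on newer Python versions asyncio.run would replace the explicit event loop at the bottom):

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

URLS = ['http://edmundmartin.com', 'https://www.python.org']  # example URLs


async def get_page(session, url):
    async with session.get(url) as response:
        return await response.text()          # await the body being read


def parse_title(html):
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('title')
    return title_tag.get_text().strip() if title_tag else None


async def handle_task(queue, session, results):
    while True:
        url = await queue.get()               # getting an item from the queue must be awaited
        if url is None:                       # sentinel value: no more work for this worker
            return
        try:
            html = await get_page(session, url)
            results[url] = parse_title(html)
        except Exception:
            results[url] = None


async def eventloop(urls, workers=5):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)                 # en-queue the URL list
    for _ in range(workers):
        queue.put_nowait(None)                # one sentinel per worker so every task can exit
    results = {}
    async with aiohttp.ClientSession() as session:
        tasks = [handle_task(queue, session, results) for _ in range(workers)]
        await asyncio.gather(*tasks)
    return results


if __name__ == '__main__':
    loop = asyncio.get_event_loop()           # asyncio.run(eventloop(URLS)) on Python 3.7+
    print(loop.run_until_complete(eventloop(URLS)))
```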

What you will probably immediately notice about the above code is that we have written a number of function definitions prefixed with 'async'. Python 3.5 introduced this async def syntax as syntactic sugar for the older @asyncio.coroutine decorator that the library previously relied on.

Every time we want to write a function we intend to run asynchronously, we need to either use the asyncio.coroutine decorator or prefix our function definition with async.

The other noticeable difference is the 'await' keyword. When calling an asynchronous function we must 'await' the result. This allows other functions to run at the same time without blocking one another. Once we have made the HTTP request, we await the response being read by our client, which allows the event loop to make other outgoing requests.

Our handle_task function simply gets a URL from the asyncio queue and then calls our other functions, which make the request and deal with parsing the page. You will notice that when getting an item from the queue we have to await, just as with the calling of all other asyncio coroutines.

While it looks more complicated, the event loop function begins by creating a queue and enqueuing our URL list. We then establish an event loop and use a list comprehension to pass items from the queue to our main function. We then simply pass this to the event loop, which handles the execution of our code until there are no more URLs to handle.

Speed Comparisons


| URLs Crawled | No Concurrency | Concurrent Futures | Asyncio & Aiohttp |
|---|---|---|---|
| 5 URLs | 4.021 seconds | 1.098 seconds | 1.3197 seconds |
| 50 URLs | 79.2116 seconds | 28.82 seconds | 31.5012 seconds |
| 100 URLs | 157.5677 seconds | 60.1970 seconds | 45.4405 seconds |

Running the above scripts using five threads where applicable, we can see that both of the concurrent scripts are far faster than our GIL-blocked example, and that at any larger scale you would be recommended to go with a concurrent script.