Concurrent Crawling in Python

Python would seem the perfect language for writing web scrapers and crawlers. Libraries such as BeautifulSoup, Requests and lxml give programmers solid APIs for making requests and parsing the data returned by web pages.

The only issue is that, by default, Python web scrapers and crawlers are relatively slow. This is due to the issues Python has with concurrency because of the language's GIL (Global Interpreter Lock). Compared with languages such as Golang, or runtimes such as NodeJS, building truly concurrent crawlers in Python is more challenging.

This lack of concurrency slows crawlers down because your script simply sits idle while it waits for the web server in question to respond. This is particularly frustrating if some of the pages discovered are especially slow.

In this post we are going to look at three different versions of the same script. The first version lacks any concurrency and simply requests each of the websites one after the other. The second version makes use of concurrent.futures' ThreadPoolExecutor, allowing us to send concurrent requests by making use of threads. Finally, we will look at a version of the script using asyncio and aiohttp, allowing us to make concurrent requests by means of an event loop.

Non-Concurrent Scraper

A standard crawler/scraper using Requests and BeautifulSoup is single-threaded. This makes it very slow, as with every request we have to wait for the server to respond before we can carry on processing the results and moving on to the next URL.

A non-concurrent scraper is the simplest to code and involves the least effort. For many jobs, such a crawler/scraper is more than enough for the task at hand.

The code below is an example of a very basic non-concurrent scraper which simply requests each page and grabs its title. It is this code that we will be expanding on throughout the post.
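A minimal sketch of such a scraper might look something like this; the list of URLs is purely illustrative:

import requests
from bs4 import BeautifulSoup

URLS = ['http://example.com', 'http://example.org']  # illustrative list of URLs


def get_title(url):
    # Request the page and return the text of its <title> tag
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.find('title')
    return title.get_text(strip=True) if title else None


if __name__ == '__main__':
    results = {}
    for url in URLS:
        # Each request blocks until the server responds before we move on
        try:
            results[url] = get_title(url)
        except requests.RequestException:
            results[url] = None
    print(results)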

 

Concurrent Futures

The concurrent.futures module is available as part of Python's standard library and gives Python users a way to make concurrent requests by means of a ThreadPoolExecutor.
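A sketch along these lines, with illustrative class and method names, might look like the following:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed


class TitleCrawler:

    def __init__(self, urls, max_threads=5):
        self.urls = urls
        self.max_threads = max_threads
        self.results = {}

    def _make_request(self, url):
        # "Hidden" method which fetches the raw HTML for a URL
        response = requests.get(url, timeout=10)
        return response.text

    def _parse_title(self, url, html):
        # "Hidden" method which pulls the <title> out of the HTML
        soup = BeautifulSoup(html, 'html.parser')
        title = soup.find('title')
        self.results[url] = title.get_text(strip=True) if title else None

    def _handle_url(self, url):
        # Wrapper combining the request and parse steps for a single URL
        try:
            html = self._make_request(url)
            self._parse_title(url, html)
        except requests.RequestException:
            self.results[url] = None

    def run_script(self):
        # Never start more threads than we have URLs to crawl
        workers = min(self.max_threads, len(self.urls))
        with ThreadPoolExecutor(max_workers=workers) as executor:
            jobs = [executor.submit(self._handle_url, url) for url in self.urls]
            for job in as_completed(jobs):
                job.result()
        print(self.results)


if __name__ == '__main__':
    crawler = TitleCrawler(['http://example.com', 'http://example.org'], max_threads=5)
    crawler.run_script()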

In the above example we initialise a class which takes a list of URLs and a maximum number of threads as its initial arguments. The class then has two hidden methods which handle making requests to the provided URLs and parsing the titles from the returned HTML into a dictionary of results.

These two methods are then placed in a wrapper which is called in our run_script method. This is where we get the ThreadPoolExecutor involved, creating a list of jobs from the URLs passed to the crawler on initialisation. We ensure that we are not starting up more threads than there are URLs in our list by using Python's built-in min function. A list comprehension is then used to submit the function and its argument (a URL) to the executor. We then print the results of our simple crawl, which have been collected in a dictionary.

Asyncio & Aiohttp

Asyncio was introduced to the Python standard library in version 3.4, and it seriously improves Python's concurrency credentials. There are already a number of community-maintained packages expanding on its functionality. Using asyncio and aiohttp is a little more complicated, but offers increased power and, at scale, even better performance.
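A sketch of this approach, with illustrative function names and an asyncio.Queue shared between a handful of worker coroutines, might look like this:

import asyncio

import aiohttp
from bs4 import BeautifulSoup

URLS = ['http://example.com', 'http://example.org']  # illustrative list of URLs
RESULTS = {}


async def fetch_page(session, url):
    # Make the request and await the body being read, freeing the event
    # loop to work on other outgoing requests in the meantime
    async with session.get(url) as response:
        return await response.text()


def parse_title(url, html):
    # Parsing is CPU-bound, so it stays an ordinary synchronous function
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('title')
    RESULTS[url] = title.get_text(strip=True) if title else None


async def handle_task(queue, session):
    # Pull URLs off the queue until it is empty, fetching and parsing each one
    while not queue.empty():
        url = await queue.get()
        try:
            html = await fetch_page(session, url)
            parse_title(url, html)
        except aiohttp.ClientError:
            RESULTS[url] = None


async def crawl(urls, max_tasks=5):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        # Spin up a handful of worker coroutines which share the queue
        tasks = [handle_task(queue, session) for _ in range(max_tasks)]
        await asyncio.gather(*tasks)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(crawl(URLS))
    print(RESULTS)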

What you will probably notice immediately about the above code is that a number of the function definitions are prefixed with 'async'. Python 3.5 introduced the async def syntax, which is essentially syntactic sugar for the asyncio.coroutine decorator that the library previously relied on.

Every time we write a function that we intend to run asynchronously, we need to either apply the asyncio.coroutine decorator or prefix the function definition with async.

The other noticeable difference is the 'await' keyword. When calling an asynchronous function we must 'await' the result. This allows other coroutines to run at the same time without blocking one another. Once we have made the HTTP request, we await the response being read by our client, which allows the event loop to get on with other outgoing requests.

Our handle task function simply gets a URL from the asyncio queue and then calls our other functions, which make the request and deal with parsing the page. You will notice that when getting an item from the queue we have to await the result, just as when calling any other asyncio coroutine.

While it looks more complicated, the event loop function begins by creating a queue and enqueuing our URL list. We then establish an event loop and use a list comprehension to pass items from the queue to our main function. Finally, we hand this to the event loop, which handles the execution of our code until there are no more URLs to handle.

Speed Comparisons

 

            No-Concurrency       Concurrent Futures    Asyncio & Aiohttp
5 URLs      4.021 seconds        1.098 seconds         1.3197 seconds
50 URLs     79.2116 seconds      28.82 seconds         31.5012 seconds
100 URLs    157.5677 seconds     60.1970 seconds       45.4405 seconds

Running the above scripts with five threads where applicable, we can see that both of the concurrent scripts are far faster than our GIL-bound example, and that at any significant scale you would be well advised to go with a concurrent script.

Log File Parsing for SEOs

What is a log file?

Every time a request is made to a server, that request is logged in the site's log files. This means that log files record every single request, whether it is made by a bot or by a real visitor to your site.

These logs record the following useful information:

  • IP Address of the visitor
  • The Date & Time of the request
  • The resource requested
  • The status code of this resource
  • The number of bytes downloaded
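For example, a single entry in a typical combined-format access log looks something like this (the values here are made up):

66.249.66.1 - - [14/Sep/2017:06:25:13 +0100] "GET /category/product-page/ HTTP/1.1" 200 5243 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"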

Why does this matter to SEOs?

SEOs need to know how search engine spiders are interacting with their sites. Google Search Console provides some limited data, including the number of pages visited and the amount of information downloaded. But this limited information doesn't give us any real insight into how Googlebot is interacting with different templates and different directories.

Looking at this log file data often turns up some interesting insights. When looking at one client's site, we discovered that Googlebot was spending a large portion of its time crawling pages which drove only 3% of overall organic traffic. These pages were essentially eating massively into the site's overall crawl budget, leading to a decrease in overall organic traffic. This data point simply would not be available to someone who only looked at the top-level crawl data contained in Google's Search Console.

What Can We Discover in Log Files

Log files allow us to discover what resources are being requested by Google. This can help us achieve the following:

  • See how frequently important URLs are crawled
  • See the full list of errors discovered by Googlebot during crawls
  • See whether Google is fully rendering pages using front-end frameworks such as React and AngularJS
  • See whether site structure can be improved to ensure commercially valuable URLs are crawled
  • Check the implementation of robots.txt rules, ensuring blocked pages are not being crawled and that certain page types are not being unexpectedly blocked.

Options Available 

There are a number of different options available for SEOs who want to dive into their log file data.

Analysing Data in Excel

As log files are essentially space-separated text files, it's possible to open them up in Excel and analyse them there. This can be fine when you are working with a very small site, but with bigger sites it isn't a viable option. Just verifying whether traffic is really from Googlebot can be quite a pain, and Excel is not really designed for this type of analysis.

Commercial Options

There are a number of commercial options available for those who want to undertake analysis of their log files. Botify offer a log file analysis service to users depending on their subscription package. However, the Botify subscription is very pricey and is not going to be a viable option for many SEOs.

Screaming Frog also offer a log file parsing tool, which is available for £99 a year. As with Screaming Frog's SEO Spider, the mileage you get from this tool really depends on how much RAM you have available and how large the site you are dealing with is.

Parsing Log Files with Python

Log files are plain text files and can easily be parsed with Python. The valuable information contained on each line of a log file is delimited by spaces. This means that we are able to run through the file line by line and pull out the relevant data.

There are a couple of challenges when parsing a log file for SEO purposes:

  • We need to pull out Googlebot results and verify that they really come from a Google IP address.
  • We need to normalise date formats to make our analysis of our log files easier.

I have written a Python script which deals with both of these issues and outputs the underlying data to a SQLite database. Once we have the data in the database, we can either export it to a CSV file for further analysis or query the database to pull out specific information.

Our Script
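A simplified sketch of this kind of script is shown below. It assumes combined-format Apache/Nginx logs, verifies Googlebot hits with a reverse DNS lookup, normalises the dates and writes everything to a local SQLite database. The 'log_data.db' file name, the regular expression and the use of the domain to rebuild absolute URLs are all assumptions rather than a definitive implementation.

import glob
import re
import socket
import sqlite3
from datetime import datetime

# Pattern for a combined-format log line; real log formats vary, so this
# regular expression may need adjusting for your server's configuration
LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)


def is_googlebot(ip):
    # Verify the visitor really is Googlebot by doing a reverse DNS lookup
    # and checking the hostname belongs to Google
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.error:
        return False
    return host.endswith('.googlebot.com') or host.endswith('.google.com')


def normalise_date(raw):
    # Convert e.g. "14/Sep/2017:06:25:13 +0100" into "2017-09-14 06:25:13"
    return datetime.strptime(raw.split(' ')[0], '%d/%b/%Y:%H:%M:%S').strftime('%Y-%m-%d %H:%M:%S')


def parse_logs(extension, domain):
    conn = sqlite3.connect('log_data.db')  # assumed output file name
    conn.execute('CREATE TABLE IF NOT EXISTS hits '
                 '(ip TEXT, date TEXT, url TEXT, status TEXT, bytes TEXT, agent TEXT)')
    for filename in glob.glob('*.{}'.format(extension)):
        with open(filename, errors='ignore') as log_file:
            for line in log_file:
                match = LINE_PATTERN.search(line)
                if not match:
                    continue
                data = match.groupdict()
                # Keep only verified Googlebot requests
                if 'Googlebot' not in data['agent'] or not is_googlebot(data['ip']):
                    continue
                conn.execute('INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)',
                             (data['ip'], normalise_date(data['datetime']),
                              'https://{}{}'.format(domain, data['path']),
                              data['status'], data['bytes'], data['agent']))
    conn.commit()
    conn.close()


parse_logs('example-extension', 'exampledomain.com')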

The above script can be used without any Python knowledge, simply by moving it to the directory where you have downloaded your log files. On the last line of the script, the user swaps out 'example-extension' for the extension their log files have been saved with and replaces 'exampledomain.com' with the domain in question.

As the above script is not reliant on libraries outside of Python’s standard library, anyone with Python should be able to save the script and simply run it.

Scraping difficult sites using private APIs

The Problem With The Modern Web

The widespread use of front-end JavaScript frameworks such as AngularJS and React is making the web more difficult to scrape using traditional techniques. When the content we want to access is rendered after the initial request, simply making an old-fashioned HTTP request and then parsing the resulting content is not going to do us much good.

Browser Automation

Typically, those who are struggling to scrape data from 'difficult to scrape' sites resort to browser automation. There are a myriad of different tools which allow developers to automate a browser environment. iMacros and Selenium are among the most popular tools used to extract data from these 'difficult' sites, and Selenium drivers are available for a significant number of popular programming languages. In the Python community, the standard response is that users should simply use Selenium to automate the browser of their choice and collect data that way.

Automating the browser presents its own challenges. Setting appropriate timeouts and ensuring that an unexpected error doesn't stop your crawl in its tracks can be quite a challenge. In a lot of cases we can avoid the tricky task of browser automation altogether and simply extract our data by leveraging the underlying APIs.

Making Use of Private APIs


Ryan Mitchell, the author of 'Web Scraping with Python', gave a very good talk on this very subject at DEF CON 24. She talked extensively about how the task of scraping 'difficult' websites can be avoided by simply looking to leverage the underlying APIs which power these modern web applications. She provided one specific example of a site that a client had asked her to scrape, which had its own underlying API.

The site used by Ryan Mitchell as an example was the ‘Official Crossfit Affiliate’s Map’ site, which provides visitors to the Crossfit site with a way to navigate around a map containing the location of every single Crossfit affiliate. Anyone wanting to extract this information by automating the browser would likely have a torrid time trying to zoom and click on the various points on the map.

The site is in fact powered by a very simple, unprotected API. Every time an individual location is clicked, this triggers a request to the underlying API, which returns a JSON response containing all of the required information. In order to discover this API endpoint, all one needs to do is have the developer console open while playing around with the map.

This makes our job particularly easy, and we can collect the data from the map with a simple script. All we need to do is iterate through each of the IDs and then extract the required data from the JSON. We can then store this data as a CSV, or in a simple SQLite database. This is just one specific example, but there are many other sites where it is possible to pull useful information via a private API.
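As a rough illustration of the pattern, such a script might look something like the following; the endpoint URL, the range of IDs, the JSON field names and the output file are placeholders for whatever you observe in your browser's developer console:

import csv

import requests

# Placeholder endpoint - substitute the real URL observed in the developer console
API_ENDPOINT = 'https://example.com/api/locations/{}'

with open('locations.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['id', 'name', 'address'])
    for location_id in range(1, 101):  # iterate through an illustrative range of IDs
        response = requests.get(API_ENDPOINT.format(location_id), timeout=10)
        if response.status_code != 200:
            continue
        data = response.json()
        # The keys below are placeholders for whatever the real JSON contains
        writer.writerow([location_id, data.get('name'), data.get('address')])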

Instagram’s Private API

The Instagram web application also has a very easy-to-access API. Again, discovering these API endpoints is not particularly difficult: you simply have to browse around the Instagram site with your console open to see the calls made to the private Instagram API.

Once you are logged into Instagram, you are able to make requests to this API and receive significant amounts of data in return. This is particularly useful, as the official Instagram API requires a great deal of approval and vetting before it can be used beyond a limited number of trial accounts.

When using the browser with the developer console active, you will notice a number of XHR requests pop up in your console. Many of these can be manipulated to pull out useful and interesting data points.

https://www.instagram.com/graphql/query/?query_id=17851374694183129&id=18428658&first=4000

The above URL contains the user's unique Instagram ID and allows us to pull out the user's most recent 4,000 followers. This can be used to discover and target individuals who follow a particular Instagram user. Should you be logged into Instagram's web platform, the above URL should return Kim Kardashian's 4,000 most recent followers.
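As a rough sketch, you could replay that request with the Requests library by passing along the session cookie from a logged-in browser session; the sessionid value and User-Agent below are placeholders, and printing the raw JSON is the easiest way to inspect the structure that comes back:

import json

import requests

URL = ('https://www.instagram.com/graphql/query/'
       '?query_id=17851374694183129&id=18428658&first=4000')

# Placeholder value - copy the sessionid cookie from a logged-in browser session
cookies = {'sessionid': 'YOUR_SESSION_ID'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

response = requests.get(URL, cookies=cookies, headers=headers, timeout=10)
print(json.dumps(response.json(), indent=2))  # inspect the returned follower data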

There are a number of other useful APIs easily accessible from the web application, but I will let the reader explore these for themselves.

Takeaways

Should you be looking to extract data from 'difficult to scrape' sites, it's definitely worth looking for underlying private APIs. Often the data you need can be accessed from these API endpoints. Grabbing this well-formatted and consistent data will save you a lot of time and avoids some of the headaches associated with browser automation.

Random User-Agent in Requests (Python)

When using the Python Requests library to extract data from websites, you may want to minimise the chances of your scraping activity being detected.

Setting a Custom User-Agent

To lower the chances of detection, it is often recommended that users set a custom User-Agent header. The Requests library makes it very easy to do this. Often this is enough to avoid detection, with system administrators only looking for default user-agents when adding server-side blocking rules.
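For example, passing a headers dictionary to requests.get is all that is needed; the user-agent string below is just an example:

import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36')
}

response = requests.get('http://example.com', headers=headers, timeout=10)
print(response.status_code)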

Setting a Random User-Agent

If you are engaged in commercially sensitive scraping, you may want to take additional precautions and randomise the User-Agent sent with each request.
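A sketch of such a snippet is shown below; the individual user-agent strings are illustrative examples of common desktop browsers rather than an authoritative, up-to-date list:

import random

import requests

# Illustrative user-agents covering the most common desktop browsers;
# refresh these strings periodically so they stay current
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 Edge/15.15063',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 OPR/47.0.2631.80',
]


def random_headers():
    # Pair a randomly chosen user-agent with Chrome's default Accept header
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    }


response = requests.get('http://example.com', headers=random_headers(), timeout=10)
print(response.status_code)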

The above snippet of code returns a random user-agent alongside Chrome's default 'Accept' header. When writing this code snippet I took care to include the ten most commonly used desktop browsers. It would be worth updating the browser list from time to time, to ensure that the user agents included in the list are up to date.

I have seen others loading a large list of hundreds and hundreds of user-agents. But I think this approach is misguided as it may see your crawlers make thousands of requests from very rarely used user agents.

Anyone looking closely at the ‘Accept’ headers will quickly realise that all of the different user agents are using the same ‘Accept’ header. Thankfully, the majority of system administrators completely ignore the intricacies of the sent ‘Accept’ headers and simply check if browsers are sending something plausible. Should it really be necessary, it would also be possible to send accurate ‘Accept’ headers with each request. I have never personally had to resort to this extreme measure.