Concurrent Crawling in Python

Python, would seem the perfect language for writing web scrapers and crawlers. Libraries such as BeautifulSoup, Requests and lxml give programmers solid API’s to make requests and parse the data given back by web pages.

The only issue is that by default Python web scrapers and crawlers are relatively slow. This due to the issues that Python has with concurrency due to the languages GIL (Global Interpreter Lock). Compared with languages such as Golang and implementations of languages such as NodeJS building truly concurrent crawlers in Python is more challenging.

This lack of concurrency slows down crawlers due to your scripts simply idling while they await the response from the web server in question. This is particularly frustrating if some of the pages discovered are particularly slow.

In this post we are going to look at three different versions of the same script. The first version is going to lack any concurrency and simply request each of the websites one after the other. The second version makes use of concurrent futures’ thread pool executor allowing us to send concurrent requests by making use of threads. Finally, we are going to take a look at a version of the script using asyncio and aiohttp, allowing us to make concurrent requests by means of an event loop.

Non-Concurrent Scraper

A standard crawler/scraper using requests and BeautifulSoup is single threaded. This makes it very slow, as with every request have to wait for the server to respond before we can carry on with processing the results and moving onto the next URL.

A non-concurrent scraper is the simplest to code and involves the least effort. For many tasks such a crawler/scraper is more than enough for the task at hand.

The below code is an example of a very basic non-concurrent scraper which simply requests the page and grabs the title. It is this code that we will be expanding on during the post.


Concurrent Futures

Concurrent.futures is available as part of Python’s standard library and gives Python users a way to make concurrent requests by means of a ThreadPoolExecutor.

In the above example we initialise a class which takes a list of URLs and maximum number of threads as the initial argument. The class then has two hidden methods which handle making requests to the provided URLs and then simply parsing the titles from the HTML and returning the results to a dictionary.

These two methods are then placed in a wrapper which is then called in our run_script method. This where we get ThreadPoolExecutor involved creating a list of jobs from the URLs passed to the crawler on initialisation. We ensure that we are not starting up more threads than URLs in our list by using Python’s inbuilt min function. Python list comprehension is then used to submit the function and it’s arguments (a URL) to the executor. We then print the results of our simple crawl which have been collected in a dictionary format.

Asyncio & Aiohttp

Asyncio was introduced to the Python standard library in version 3.4. The introduction of Asyncio into Python’s standard library seriously improves Python’s concurrent credentials and there are already a number of community maintained packages expanding on the functionality of Asyncio. Using Asyncio & Aiohttp, is a little more complicated but offers increased power and even better performance.

What you will probably immediately notice about the above code is that we have written a number of  function definitions with ‘async’ prefaced to them. In Python 3.5, the asyncio library introduced these async def’s and they are just syntactic sugar for the older co-routine decorator that the library previously used.

Every time we want to write a function we intend to run asynchronously we need to either bring in the asyncio.coroutine decorator or append the async to our function definition.

The other noticeable difference is the ‘await’ keyword.  When calling an asynchronous function we must ‘await’ the result. This allows other functions to run at the same without blocking one another. Once we have made the HTTP request we await the response being read by our client which allows the event loop to make other outgoing requests.

Our handle task function simply gets a URL from the asnycio queue and then calls our other functions which make the request and deal with parsing the page. You will notice that when getting an item from the queue we have to await, just as with the calling of all other asnycio functions.

While looking more complicated the eventloop function begins by creating a queue and en-queues our URL list. We then establish a event loop and do a list comprehension passing items from the queue to our main function. We then simply pass this to the eventloop which then handles the execution of our code until there are no other URLs to handle.

Speed Comparisons


No-Concurrency Concurrent Futures Asyncio & Aiohttp
5 URLs 4.021 seconds 1.098 seconds 1.3197 seconds
50 URLs 79.2116 seconds 28.82 seconds 31.5012 seconds
100 URLs 157.5677 seconds 60.1970 seconds 45.4405 seconds

Running the above scripts using five threads where applicable. We can see that both of the concurrent scripts are far faster than our GIL blocking example and that at any large scale you would be recommended to go with a concurrent script.

5 thoughts to “Concurrent Crawling in Python”

  1. If you spawned 50 threads to test 50 urls, and 100 threads to test 100 URLs with the async libraries, do you see a major difference in the speed since IO is non-blocking?

    1. The IO is non-blocking, however in standard Python (CPython) there is something called the GIL. This essentially means that only one piece of Python code can run at a time even when using threads. That means the performance of threads will slowly decrease as you add more and more threads. This is essentially due to the fact that the threads are interrupting one another. I wrote a piece looking at what the optimal number of threads for this kind of task was and it seems that you reach optimal performance around 140-150 thread mark.

  2. In my experience, the Aiohttp example will be even more complicated in production, since you need to do proper error handling, which will bring another complexity (the current version crashes if there is wrong url). If there is no special reason for aiohttp, the ThreadPoolExecutor solution (or also gevent solution) is much less complicated.

  3. Do you use __parse_results in threads? I do not understand where to use my function which processes requested pages.

    I have two functions: 1st makes requests to each URL, the second function makes a large search in scraped html using BeautifulSoup and saves results to SQLite.
    Do I need to aggregate all URLs request`s results to one dictionary using threads and after this process this dictionary (page analyze + saving results to a database) without using threads?

    1. Hello Aleksey,

      It’s quite hard to tell without seeing your code. Though I would probably retrieve all of your URLs using threads and then work through the results one by one. Unfortunately, SQLite is single threaded when writing to the database. One way to avoid aggregating the results before working through them would be to add a done callback to each task you to the submit to the pool, this would see the callback function being run with the result of the task.

      Would be happy to take a look at your code if you want to reach out with it.

Leave a Reply

Your email address will not be published. Required fields are marked *