Writing a web crawler in Python 3.5+ using asyncio

The asyncio library was introduced to Python in version 3.4. However, the async/await syntax was not added to the language until Python 3.5. This functionality allows us to write asynchronous web crawlers without having to use threads. Getting used to asynchronous programming can take a while, so in this tutorial we are going to build a fully functional web crawler using asyncio and aiohttp.

Fan In & Fan Out Concurrency Pattern


We are going to write a web crawler which will continue to crawl a particular site until we reach a defined maximum depth. We are going to make use of a fan-in/fan-out concurrency pattern. Essentially, this involves gathering a set of tasks and distributing them across a bunch of threads, or across co-routines in our case. We then gather all the results together again before processing them and fanning out a new group of tasks. I would highly recommend Brett Slatkin’s 2014 talk, which inspired this particular post.

Initializing Our Crawler

We begin by importing the libraries required for our asyncio crawler. We are using a couple of libraries which are not included in Python’s standard library. These required libraries can be installed using the following pip command:
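Assuming we only need aiohttp for the asynchronous HTTP requests and lxml for the HTML parsing, the install command would look something like this:

```
pip install aiohttp lxml
```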

We can then start defining our class. Our crawler takes two positional arguments and one optional keyword argument. We pass in the start URL, which is the URL we begin our crawl with, and we also set the maximum depth of the crawl. The optional maximum concurrency level prevents our crawler from making more than 200 concurrent requests at any one time.

The start URL is then parsed to give us the base URL for the site in question. We also create a set of URLs which we have already seen, to ensure that we don’t end up crawling the same URL more than once. We create a session using aiohttp.ClientSession so that we can avoid having to create a new session every time we scrape a URL. Doing this will throw a warning, but the creation of a client session is synchronous, so it can safely be done outside of a co-routine. Finally, we set up an asyncio BoundedSemaphore using our max concurrency variable; we will use this to prevent our crawler from making too many concurrent requests at one time.
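A minimal sketch of what this initialisation might look like (the attribute names seen_urls and bounded_semaphore are my own choices, not necessarily those of the original code):

```python
import asyncio
from urllib.parse import urlparse

import aiohttp


class AsyncCrawler:
    def __init__(self, start_url, max_depth, max_concurrency=200):
        self.start_url = start_url
        self.max_depth = max_depth
        # Base URL of the site, used to keep the crawl on the same domain.
        parsed = urlparse(start_url)
        self.base_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        # URLs we have already seen, so we never crawl the same page twice.
        self.seen_urls = set()
        # Creating the session here raises a warning, but session creation
        # itself is synchronous, so it is safe outside of a co-routine.
        self.session = aiohttp.ClientSession()
        # Caps the number of requests that can be in flight at any one time.
        self.bounded_semaphore = asyncio.BoundedSemaphore(max_concurrency)
```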

Making An Async HTTP Request

We can then write a function to make an asynchronous HTTP request. Making a single asynchronous request is pretty similar to making a standard HTTP request. As you can see, we write “async” before the function definition. We begin with an async context manager, using the bounded semaphore created when we initialized our class. This limits concurrent requests to whatever value we passed in when creating an instance of the AsyncCrawler class.

We then use another async context manager within a try/except block to make a request to the URL and await the response, before finally returning the HTML.
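A sketch of such a request method, assuming the attribute names from the __init__ sketch above (the method name http_request is mine):

```python
# Method of the AsyncCrawler class sketched above.
async def http_request(self, url):
    # The bounded semaphore limits how many requests run concurrently.
    async with self.bounded_semaphore:
        try:
            async with self.session.get(url) as response:
                html = await response.text()
                return html
        except Exception as e:
            print('Error fetching {}: {}'.format(url, e))
```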

Extracting URLs

We can then write a standard function to extract all the URLs from an HTML response. We create a DOM (Document Object Model) object from our HTML using lxml’s html sub-module. Once we have our document model, we are able to query it using either XPath or CSS selectors. Here we use a simple XPath selector to pull out the ‘href’ attribute of every link found on the page in question.

We can then use urllib.parse’s urljoin function with our base URL and each found href. This gives an absolute URL, automatically resolving any relative URLs that we may have found on the page. If we haven’t already crawled this URL and it belongs to the site we are crawling, we add it to our list of found URLs.
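A possible version of this URL-finding function, again assuming the attributes defined earlier (find_urls is an assumed name):

```python
from urllib.parse import urljoin

from lxml import html as lxml_html


# Method of the AsyncCrawler class sketched above.
def find_urls(self, raw_html):
    found_urls = []
    # Build a DOM from the raw HTML so we can query it with XPath.
    dom = lxml_html.fromstring(raw_html)
    for href in dom.xpath('//a/@href'):
        # Resolve relative links against the base URL.
        url = urljoin(self.base_url, href)
        if url not in self.seen_urls and url.startswith(self.base_url):
            found_urls.append(url)
    return found_urls
```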

The extract_async function is a simple wrapper around our HTTP request and URL-finding functions. Should we encounter any error, we simply ignore it; otherwise we use the HTML to build a list of the URLs found on that page.
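The wrapper might look something like this, returning the URL itself, the raw HTML and any newly discovered URLs:

```python
# Method of the AsyncCrawler class sketched above.
async def extract_async(self, url):
    data = await self.http_request(url)
    found_urls = []
    if data:
        try:
            found_urls = self.find_urls(data)
        except Exception:
            # Any parsing error is simply ignored.
            pass
    return url, data, found_urls
```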

Fanning In/Out

Our extract_multi_async function is where we fan out. The function takes a list of URLs to be crawled. We begin by creating two empty lists: the first will hold the futures which refer to jobs to be done, while the second holds the results of those completed futures. We then add a call to our self.extract_async function for each URL passed into the function. These are futures in the sense that they are tasks which will be completed at some point in the future.

To gather the results from these futures, we use asyncio’s as_completed function, which lets us iterate over the futures as they complete and gather the results into our results list. The loop only finishes once all of the futures are done, meaning that we end up returning a list of completed results.
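A sketch of the fan out and fan in steps, using asyncio.as_completed as described:

```python
# Method of the AsyncCrawler class sketched above.
async def extract_multi_async(self, to_fetch):
    futures, results = [], []
    for url in to_fetch:
        if url in self.seen_urls:
            continue
        self.seen_urls.add(url)
        # Fan out: one extract_async job per URL.
        futures.append(self.extract_async(url))

    # Fan in: gather each result as the co-routines complete.
    for future in asyncio.as_completed(futures):
        results.append(await future)
    return results
```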

Running Our Crawler

We have a parse function defined here which will, by default, raise a NotImplementedError. So in order to use our crawler, we will have to sub-class it and write our own parsing function, which we will do in a minute.
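Assuming the method is simply called parse, the default implementation is just a stub:

```python
# Method of the AsyncCrawler class sketched above.
def parse(self, data):
    # Sub-classes must override this to pull whatever they need
    # out of the raw HTML of each crawled page.
    raise NotImplementedError
```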

Our main crawl_async function kicks everything off. We start by scraping our start URL and returning a batch of results. We then iterate over these results, pulling out the URL, data, and new URLs from each one. The HTML is sent off to be parsed before we append the relevant data to our list of results, while the new URLs are added to our to_fetch variable. We continue this process until we have reached our max crawl depth, and then return all the results collected during the crawl.
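Putting the pieces together, the main crawl co-routine might look roughly like this (closing the client session at the end is my addition):

```python
# Method of the AsyncCrawler class sketched above.
async def crawl_async(self):
    to_fetch = [self.start_url]
    results = []
    # Depth 0 is the start URL itself.
    for depth in range(self.max_depth + 1):
        batch = await self.extract_multi_async(to_fetch)
        to_fetch = []
        for url, data, found_urls in batch:
            if data:
                # Hand the raw HTML to the (sub-classed) parse method.
                results.append((depth, url, self.parse(data)))
            # The URLs found on this page form the next batch to crawl.
            to_fetch.extend(found_urls)
    # Tidy up the client session once the crawl is finished.
    await self.session.close()
    return results
```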

Sub Classing & Running the Crawler

Sub-classing the crawler is very simple, as we are able to write whatever function we wish to handle the HTML data returned by our crawler. The parse function below simply tries to extract the title from each page found by our crawler.
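A minimal sketch of such a sub-class (TitleCrawler is a made-up name):

```python
from lxml import html as lxml_html


class TitleCrawler(AsyncCrawler):
    def parse(self, data):
        # Grab the contents of the <title> tag from each crawled page.
        dom = lxml_html.fromstring(data)
        titles = dom.xpath('//title/text()')
        return titles[0].strip() if titles else None
```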

We can then call the crawler in a similar way to how we would call an individual asyncio function. We first initialize our class, before creating a future with the asyncio.Task function, passing in our crawl_async co-routine. We then need an event loop to run this function in, which we create and run until the function has completed. Finally, we close the loop and grab the results by calling .result() on our completed future.
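A sketch of that start-up code, with a placeholder start URL and crawl depth:

```python
import asyncio

# The start URL and depth here are placeholders.
crawler = TitleCrawler('https://example.com', 2, max_concurrency=100)
future = asyncio.Task(crawler.crawl_async())

loop = asyncio.get_event_loop()
loop.run_until_complete(future)
loop.close()

# The completed future holds the list of results from the crawl.
print(future.result())
```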

Full Code

 

Aiohttp – Background Tasks

Python gets a lot of flak for its performance story. However, the introduction of asyncio into the standard library goes some way to resolving some of those performance problems. There is now a wide choice of libraries which make use of the new async/await syntax, including a number of server implementations.

The aiohttp library comes with both a client and a server. However, today I want to focus on the server and one of my favourite features: background tasks. Typically, when building a Python-based micro-service with Flask, you might have a background task running in something like Celery. While aiohttp's background tasks are more limited than Celery tasks, they allow you to run work in the background while the server continues to receive requests.

A Simple Example

I have written code which provides a simple example of how you can use such a background task. We are going to write a server that has one endpoint. This endpoint allows a user to post a JSON dictionary containing a URL. The URL is then sent to a thread pool where it is immediately scraped without blocking the user's request. Once we have the data we need from the URL, it is placed in a queue which is then processed by our background task, which simply posts the data to another endpoint.

Get & Post Requests

For this we are going to need to implement both a GET and a POST request function. Our GET request is going to be run in a thread pool, so we can use the ever-popular requests library to grab our page. However, our POST request is going to be made inside an async background task, so it must itself be asynchronous; otherwise we would end up blocking the event loop.

Both of these small functions are pretty basic and don't do much in the way of error handling or logging, but they are enough to demonstrate the workings of a background task.
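They might look something like this (get_request and post_request are assumed names; the blocking GET uses requests, while the async POST uses aiohttp):

```python
import aiohttp
import requests


def get_request(url):
    # Runs inside the thread pool, so the blocking requests library is fine.
    response = requests.get(url)
    return {'url': url, 'status': response.status_code, 'data': response.text}


async def post_request(url, data):
    # Runs inside the background task, so it must not block the event loop.
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as response:
            return response.status
```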

Initialising the Server and Setting Up Our Route

We begin by initialising our server class, passing in a host and port. We also define a thread pool, which we will use to run our synchronous GET requests. If you had a long-running CPU-bound task, you could instead use a process pool in much the same way. We then create a queue using deque from the collections module, allowing us to easily append and pop data. It is this queue that our background task will process. Finally, we have the example endpoint which we will post our data off to.
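A sketch of that initialisation (the class name WebServer, the pool size and the dummy endpoint URL are all placeholders):

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

from aiohttp import web


class WebServer:
    def __init__(self, host='0.0.0.0', port=8080):
        self.host = host
        self.port = port
        # Thread pool for the blocking GET requests; a ProcessPoolExecutor
        # could be swapped in here for CPU-bound work.
        self.executor = ThreadPoolExecutor(max_workers=10)
        # Queue of scraped results for the background task to process.
        self.queue = deque()
        # Dummy endpoint that the background task will post results to.
        self.endpoint = 'http://localhost:9090/example'
```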

We then move on to defining our async view. This particular view is very simple: we await the JSON from the incoming request, then attempt to grab the URL from the provided JSON. If the JSON contains a URL, we send it to our get_request function, which is executed within the thread pool. This allows us to return a response to the person making the request without blocking. We add a callback which will be executed once the request is completed; the callback simply puts the data in our queue, ready to be processed by our background task.
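A possible version of the view (handle_post is an assumed name), using run_in_executor to push the scrape onto the thread pool and a callback to queue the result:

```python
import asyncio

from aiohttp import web


# Method of the WebServer class sketched above.
async def handle_post(self, request):
    payload = await request.json()
    url = payload.get('url')
    if not url:
        return web.json_response({'error': 'no url provided'}, status=400)

    loop = asyncio.get_event_loop()
    # Run the blocking scrape in the thread pool so we can respond immediately.
    future = loop.run_in_executor(self.executor, get_request, url)
    # Once the scrape finishes, the callback drops the result onto the queue.
    future.add_done_callback(lambda f: self.queue.append(f.result()))
    return web.json_response({'status': 'scrape scheduled', 'url': url})
```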

Creating & Registering The Background Task

Our background task is very simple: it is an async function containing a while True loop. Inside this loop we check whether there are any items in the queue to be posted to our dummy server. If there are, we pop them and make an async POST request. If there are no items, we await asyncio.sleep. This is very important: without the await statement here, we could end up in a situation where our background task never gives up the event loop to incoming server requests.
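The loop itself might look like this (process_queue is an assumed name, as is the half-second sleep):

```python
# Method of the WebServer class sketched above.
async def process_queue(self):
    while True:
        if self.queue:
            data = self.queue.popleft()
            # Forward the scraped data to the dummy endpoint.
            await post_request(self.endpoint, data)
        else:
            # Crucially, give the event loop back to the server
            # whenever there is nothing to do.
            await asyncio.sleep(0.5)
```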
We then define two async functions which simply take in our yet-to-be-created app and add a task to the event loop. This allows the background task to be run in the same event loop as the server and cancelled when the server itself is shut down.
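Assuming the server instance is stored on the app (as in the next sketch), the two functions could look like this:

```python
import asyncio
import contextlib


async def start_background_tasks(app):
    # Schedule the queue processor on the same event loop as the server.
    app['queue_worker'] = asyncio.get_event_loop().create_task(
        app['server'].process_queue())


async def cleanup_background_tasks(app):
    # Cancel the background task when the server shuts down.
    app['queue_worker'].cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await app['queue_worker']
```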

The Most Complicated Bit: Creating Our App

This part of the code is the most confusing. Our create_app function simply returns a web app with our route added to the server's homepage. In the run_app method we then append the tasks which are to be run on start-up and shut-down of the server, before finally passing our app to the web.run_app function, which runs it forever in the event loop on our specified host and port.
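A sketch of both methods, wiring the route and the start-up/clean-up hooks together:

```python
# Methods of the WebServer class sketched above.
def create_app(self):
    app = web.Application()
    # Make the server instance available to the start-up/clean-up hooks.
    app['server'] = self
    # Our single route: POST a JSON body containing a URL to the homepage.
    app.router.add_post('/', self.handle_post)
    return app


def run_app(self):
    app = self.create_app()
    # Register the background task to start and stop with the server.
    app.on_startup.append(start_background_tasks)
    app.on_cleanup.append(cleanup_background_tasks)
    # web.run_app runs the application forever in the event loop.
    web.run_app(app, host=self.host, port=self.port)
```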

Complete Code

We now have a simple server which takes requests, deals with them, and then processes them in the background. This pattern can be very powerful and, used in conjunction with thread and process pools, lets you build servers which handle long-running tasks without blocking.