Writing a web crawler in Python 3.5+ using asyncio

The asyncio library was introduced to Python in version 3.4. However, the async/await syntax was not added to the language until Python 3.5. The introduction of this functionality allows us to write asynchronous web crawlers without having to use threads. Getting used to asynchronous programming can take a while, and in this tutorial we are going to build a fully functional web crawler using asyncio and aiohttp.

Fan In & Fan Out Concurrency Pattern


We are going to write a web crawler which will continue to crawl a particular site until we reach a defined maximum depth. We are going to make use of a fan-in/fan-out concurrency pattern. Essentially, this involves gathering together a set of tasks and then distributing them across a bunch of threads, or across co-routines in our case. We then gather all the results together again before processing them, and fanning out a new group of tasks. I would highly recommend Brett Slatkin’s 2014 talk, which inspired this particular post.

Initializing Our Crawler

We begin by importing the libraries required for our asyncio crawler. We are using a couple of libraries which are not included in Python’s standard library. These required libraries can be installed using the following pip command:
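The exact command isn’t reproduced here, but given that the tutorial relies on aiohttp for the asynchronous requests and lxml for the HTML parsing, it would look something like this:

```
pip install aiohttp lxml
```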

We can then start defining our class. Our crawler takes two positional arguments and one optional keyword argument. We pass in the start URL, which is the URL we begin our crawl with, and we also set the maximum depth of the crawl. Finally, we can pass in a maximum concurrency level, which prevents our crawler from making more than 200 concurrent requests at a single time.

The start URL is then parsed to give us the base URL for the site in question. We also create a set of URLs which we have already seen, to ensure that we don’t end up crawling the same URL more than once. We create a session using aiohttp.ClientSession so that we can avoid having to create a new session every time we scrape a URL. Doing this will throw a warning, but the creation of a client session is synchronous, so it can be safely done outside of a co-routine. Finally, we set up an asyncio BoundedSemaphore using our max concurrency variable; we will use this to prevent our crawler from making too many concurrent requests at one time.
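The original listing isn’t shown here, so the following is a minimal sketch of the imports and the class initialization; attribute names such as seen_urls and bounded_semaphore are assumptions based on the description above.

```python
import asyncio
from urllib.parse import urljoin, urlparse

import aiohttp
from lxml import html as lxml_html


class AsyncCrawler:

    def __init__(self, start_url, max_depth, max_concurrency=200):
        self.start_url = start_url
        self.max_depth = max_depth
        # Base URL of the site, used to keep the crawl on the same domain
        parsed = urlparse(start_url)
        self.base_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        # URLs we have already seen, so we never crawl the same page twice
        self.seen_urls = set()
        # Creating the session here throws a warning, but is safe outside a co-routine
        self.session = aiohttp.ClientSession()
        # Bounded semaphore used to cap the number of concurrent requests
        self.bounded_semaphore = asyncio.BoundedSemaphore(max_concurrency)
```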

Making An Async HTTP Request

We can then write a function to make an asynchronous HTTP request. Making a single asynchronous request is pretty similar to making a standard HTTP request. As you can see, we write “async” prior to the function definition. We begin by using an async context manager, using the bounded semaphore created when we initialized our class. This will limit asynchronous requests to whatever we passed in when creating an instance of the AsyncCrawler class.

We then use another async context manager within a try/except block to make a request to the URL and await the response, before finally returning the HTML.
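As a sketch, with the method name http_request assumed (it isn’t given in the text):

```python
    async def http_request(self, url):
        # The bounded semaphore caps how many requests are in flight at once
        async with self.bounded_semaphore:
            try:
                async with self.session.get(url) as response:
                    html = await response.text()
                    return html
            except Exception:
                # A failed request simply yields no HTML
                return None
```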

Extracting URLs

We can then write a standard function to extract all the URLs from an HTML response. We create a DOM (Document Object Model) object from our HTML using lxml’s HTML sub-module. Once we have our document model, we are able to query it using either XPath or CSS selectors. Here we use a simple XPath selector to pull out the ‘href’ attribute of every link found on the page in question.

We can then use urllib.parse’s urljoin function with our base URL and the found href. This gives us an absolute URL, automatically resolving any relative URLs that we may have found on the page. If we haven’t already crawled this URL and it belongs to the site we are crawling, we add it to our list of found URLs.
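Putting these two paragraphs together, a sketch of the URL extraction function might look like this (the name find_urls is an assumption):

```python
    def find_urls(self, html):
        found_urls = []
        # Build a DOM object from the raw HTML using lxml's HTML sub-module
        dom = lxml_html.fromstring(html)
        # Simple XPath selector pulling the href attribute of every link on the page
        for href in dom.xpath('//a/@href'):
            # Resolve relative links against the base URL
            url = urljoin(self.base_url, href)
            if url not in self.seen_urls and url.startswith(self.base_url):
                found_urls.append(url)
        return found_urls
```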

The extract_async function is a simple wrapper around our HTTP request and URL-finding functions. Should we encounter any error, we simply ignore it. Otherwise we use the HTML to create a list of URLs found on that page.
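A sketch of that wrapper, with the (url, data, found_urls) return shape inferred from how the results are unpacked later on:

```python
    async def extract_async(self, url):
        data = await self.http_request(url)
        found_urls = []
        if data is not None:
            # Pages that failed to download are simply skipped
            found_urls = self.find_urls(data)
        return url, data, found_urls
```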

Fanning In/Out

Our extract_multi_async function is where we fan out. The function takes a list of URLs to be crawled. We begin by creating two empty lists: the first will hold the futures which refer to jobs to be done, while the second holds the results of these completed futures. We then add a call to our self.extract_async function for each URL passed into the function. These are futures, in the sense that they are tasks which will be completed in the future.

To fan back in and gather the results from these futures, we use asyncio’s as_completed function, which iterates over the futures as they complete and lets us gather the results into our results list. This essentially blocks until all of the futures are finished, meaning that we end up returning a list of completed results.
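A sketch of the fan-out/fan-in step; marking URLs as seen at the point they are scheduled is an assumption about where the de-duplication happens:

```python
    async def extract_multi_async(self, to_fetch):
        futures, results = [], []
        # Fan out: schedule a job for every URL we haven't crawled yet
        for url in to_fetch:
            if url in self.seen_urls:
                continue
            self.seen_urls.add(url)
            futures.append(self.extract_async(url))

        # Fan in: gather the results as each future completes
        for future in asyncio.as_completed(futures):
            try:
                results.append(await future)
            except Exception:
                pass
        return results
```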

Running Our Crawler

We have a parser function defined here which will by default raise a NotImplementedError. So in order to use our crawler, we will have to subclass it and write our own parsing function, which we will do in a minute.
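The stub itself is only a couple of lines; calling it parse is an assumption:

```python
    def parse(self, data):
        # Sub-classes must override this to handle the HTML themselves
        raise NotImplementedError
```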

Our main function kicks everything off. We start off by scraping our start URL and returning a batch of results. We then iterate over these results, pulling out the URL, data, and new URLs from each one. We send the HTML off to be parsed before appending the relevant data to our list of results, while adding the new URLs to our to_fetch variable. We continue this process until we have reached our maximum crawl depth, and return all the results collected during the crawl.
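A sketch of that main co-routine, here called crawl_async to match the call shown later on; how exactly the depth is counted, and the final tidy-up of the session, are assumptions:

```python
    async def crawl_async(self):
        to_fetch = [self.start_url]
        results = []
        for _ in range(self.max_depth):
            # Fan out over the current batch of URLs
            batch = await self.extract_multi_async(to_fetch)
            to_fetch = []
            for url, data, found_urls in batch:
                if data is not None:
                    # Hand the HTML to the user-defined parser
                    results.append((url, self.parse(data)))
                # The newly found URLs become the next batch to crawl
                to_fetch.extend(found_urls)
        # Tidy up the client session once the crawl has finished
        await self.session.close()
        return results
```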

Sub Classing & Running the Crawler

Sub-classing the crawler is very simple, as we are able to write any function we wish to handle the HTML data returned by our crawler. The above function simply tries to extract the title from each page found by our crawler.

We can then call the crawler in a similar way to how we would call an individual asyncio function. We first initialize our class, before creating a future with the asyncio.Task function, passing in our crawl_async function. We then need an event loop to run this function in, which we create and run until the function has completed. We then close the loop and grab the results from our future by calling .result() on the completed future.
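A sketch of that boilerplate, using a placeholder start URL and crawl depth:

```python
if __name__ == '__main__':
    crawler = TitleCrawler('https://example.com', 2)
    # Wrap the crawl co-routine in a Task so we can pull its result out later
    future = asyncio.Task(crawler.crawl_async())
    loop = asyncio.get_event_loop()
    loop.run_until_complete(future)
    loop.close()
    print(future.result())
```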

Full Code

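For reference, here are the sketches from above gathered into a single listing (the method names, the subclass, and the example URL remain assumptions where the text does not spell them out):

```python
import asyncio
from urllib.parse import urljoin, urlparse

import aiohttp
from lxml import html as lxml_html


class AsyncCrawler:

    def __init__(self, start_url, max_depth, max_concurrency=200):
        self.start_url = start_url
        self.max_depth = max_depth
        parsed = urlparse(start_url)
        self.base_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        self.seen_urls = set()
        self.session = aiohttp.ClientSession()
        self.bounded_semaphore = asyncio.BoundedSemaphore(max_concurrency)

    async def http_request(self, url):
        async with self.bounded_semaphore:
            try:
                async with self.session.get(url) as response:
                    return await response.text()
            except Exception:
                return None

    def find_urls(self, html):
        found_urls = []
        dom = lxml_html.fromstring(html)
        for href in dom.xpath('//a/@href'):
            url = urljoin(self.base_url, href)
            if url not in self.seen_urls and url.startswith(self.base_url):
                found_urls.append(url)
        return found_urls

    async def extract_async(self, url):
        data = await self.http_request(url)
        found_urls = []
        if data is not None:
            found_urls = self.find_urls(data)
        return url, data, found_urls

    async def extract_multi_async(self, to_fetch):
        futures, results = [], []
        for url in to_fetch:
            if url in self.seen_urls:
                continue
            self.seen_urls.add(url)
            futures.append(self.extract_async(url))
        for future in asyncio.as_completed(futures):
            try:
                results.append(await future)
            except Exception:
                pass
        return results

    def parse(self, data):
        raise NotImplementedError

    async def crawl_async(self):
        to_fetch = [self.start_url]
        results = []
        for _ in range(self.max_depth):
            batch = await self.extract_multi_async(to_fetch)
            to_fetch = []
            for url, data, found_urls in batch:
                if data is not None:
                    results.append((url, self.parse(data)))
                to_fetch.extend(found_urls)
        await self.session.close()
        return results


class TitleCrawler(AsyncCrawler):

    def parse(self, data):
        dom = lxml_html.fromstring(data)
        titles = dom.xpath('//title/text()')
        return titles[0].strip() if titles else None


if __name__ == '__main__':
    crawler = TitleCrawler('https://example.com', 2)
    future = asyncio.Task(crawler.crawl_async())
    loop = asyncio.get_event_loop()
    loop.run_until_complete(future)
    loop.close()
    print(future.result())
```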
