Multi-Threaded Crawler in Python

Python is a great language for writing web scrapers and web crawlers. Libraries such as BeautifulSoup, requests and lxml make grabbing and parsing a web page very simple. By default, however, Python programs are single-threaded, which can make scraping an entire site with a Python crawler painfully slow: we must wait for each page to load before moving on to the next one. Thankfully, Python supports threads which, while not appropriate for every task, can help us increase the performance of our web crawler.

In this post we are going to outline how you can build a simple multi-threaded crawler which will crawl an entire site, using requests, BeautifulSoup and the standard library’s concurrent.futures module.

Imports

We are going to begin by importing all the libraries we need for scraping. Neither requests nor BeautifulSoup is included in the Python standard library, so you will have to install them if you haven’t already. The other libraries should already be available to you if you are using Python 3.
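Given what the rest of the post uses, a minimal set of imports might look like this (requests and BeautifulSoup are the two third-party packages; everything else ships with Python 3):

```python
import requests
from bs4 import BeautifulSoup

from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
```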

Setting Up Our Class

We start by initialising the class we are going to use to create our web crawler. Our initialisation statement takes only one argument: we pass in our start URL, and from this we use urlparse from the urllib.parse library to pull out the site’s root URL (its homepage). This root URL is going to be used later to ensure that our scraper doesn’t end up on other sites.

We also initialise a thread pool. We are later going to submit ‘tasks’ to this thread pool, allowing us to use a callback function to collect our results. This lets us continue with the execution of our main program while we await a response from the website.

We also initialise a set which is going to contain all the URLs we have crawled. We will use this to store URLs which have already been crawled, preventing the crawler from visiting the same URL twice.

Finally, we create a Queue which will hold the URLs we wish to crawl; we will continue to grab URLs from this queue until it is empty. To kick things off, we place our base URL at the front of the queue.
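A sketch of how this initialiser might look (the class name MultiThreadedCrawler and the attribute names are my own choices; the 20-worker pool size matches the thread count used in the performance test later on):

```python
class MultiThreadedCrawler:

    def __init__(self, base_url):
        self.base_url = base_url
        # Rebuild the site's root URL (scheme + domain) so we can filter out external links later
        parts = urlparse(base_url)
        self.root_url = '{}://{}'.format(parts.scheme, parts.netloc)
        # Thread pool we will submit scraping 'tasks' to
        self.pool = ThreadPoolExecutor(max_workers=20)
        # URLs we have already crawled, so we never visit the same page twice
        self.scraped_pages = set()
        # Queue of URLs still waiting to be crawled, seeded with our start URL
        self.to_crawl = Queue()
        self.to_crawl.put(base_url)
```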

Parsing Links and Scraping

Next, we write a basic link parser. Our goal here is to extract all of a site’s internal links and not to pull out any external links. Additionally, we want to resolve relative URLs (those starting with ‘/’) and ensure that we don’t crawl the same URLs twice.

To do this we generate a soup object using BeautifulSoup. We then use the find_all method to return every ‘a’ element which has an ‘href’ attribute. By doing this we ensure that we only return ‘a’ elements which actually contain a link. The returned object is a list of Tag objects, which can be accessed like dictionaries, and which we then iterate through. First, we pull out the actual href content. We then check whether this link is relative (starting with a ‘/’) or starts with our root URL. If so, we use urljoin to generate a crawlable absolute URL and put it in our queue, provided we haven’t already crawled it.

I have also included an empty scrape_info method which can be overridden so you can extract the data you want from the site you are crawling.
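Inside the class, these two methods might look roughly like this (the ‘html.parser’ backend is an assumption; BeautifulSoup can also be configured to use lxml):

```python
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # Only return 'a' elements that actually carry an href attribute
        for link in soup.find_all('a', href=True):
            url = link['href']
            # Keep relative links and links that already start with our root URL
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        # Override this method to pull out whatever data you need from each page
        return
```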

Defining Our Callback

The easiest and often most performant way to use a thread pool executor is to add a callback to the function we submit to the thread pool. This callback will execute after the submitted function has completed and will be passed the completed future as an argument.

By calling .result() on the passed-in future we are able to get at the returned value, which in our case will be either ‘None’ or a requests response object. We then check that we have a result and that it has a 200 status code. If both of these are true, we send the HTML to the parse_links and currently empty scrape_info functions.
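The callback method (the name post_scrape_callback is my own) might be sketched as:

```python
    def post_scrape_callback(self, res):
        # res is the future returned by the thread pool; .result() gives us
        # whatever scrape_page returned (a response object or None)
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)
```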

Scraping Pages

We then define the function which will be used to scrape the page. This function simply takes a URL and returns a response object if the request was successful; otherwise it returns ‘None’. By limiting the amount of CPU-bound work we do in this function, we can increase the overall speed of our crawler. Threads are not recommended for CPU-bound work and can actually turn out to be slower than using a single thread.
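A minimal sketch of this method (the timeout values are assumptions):

```python
    def scrape_page(self, url):
        try:
            # Keep this method I/O bound: just fetch the page and hand it back
            return requests.get(url, timeout=(3, 30))
        except requests.RequestException:
            return None
```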

Run Scraper Function

The run scraper function brings all of our previous work together and manages our thread pool. The run scraper will continue to run while there are still URLs to crawl. We do this by creating a while True loop and ignoring any exceptions except Empty, which will be thrown if our queue has been empty for more than 60 seconds.
We keep pulling URLs from our queue and submitting them to our thread pool for execution. We then add a callback which will run once the function has returned; this callback in turn calls our parse link and scrape info functions. This continues until we run out of URLs.
We simply add a main block at the bottom of our script to run the function.
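Under the assumptions above, the run scraper function and the main block might look like this (the start URL is a placeholder):

```python
    def run_scraper(self):
        while True:
            try:
                # Wait up to 60 seconds for a URL; an Empty exception ends the crawl
                target_url = self.to_crawl.get(timeout=60)
                if target_url not in self.scraped_pages:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                # Ignore any other exception and keep crawling
                print(e)
                continue


if __name__ == '__main__':
    crawler = MultiThreadedCrawler('http://example.com')
    crawler.run_scraper()
```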

Performance

When testing this script on several sites with performant servers, I was able to crawl several thousand URLs a minute with only 20 threads. Ideally, you would use a lower number of threads to avoid potentially overloading the site you are scraping.

Performance could be further improved by using XPath and ‘lxml’ to extract links from the site. This is due to ‘lxml’ being written in Cython, making it considerably faster than BeautifulSoup’s default pure-Python parser.
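For illustration, a standalone link extractor using lxml and XPath (not part of the crawler above; the function name is mine) could look like this:

```python
from lxml import html as lxml_html

def extract_links(page_html):
    tree = lxml_html.fromstring(page_html)
    # A single XPath call pulls every href attribute from every 'a' element
    return tree.xpath('//a/@href')
```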

Full Code
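Putting the pieces above together, the complete sketch reads as follows (class, method and attribute names, the timeout values and the example start URL are all assumptions rather than a definitive implementation):

```python
import requests
from bs4 import BeautifulSoup

from urllib.parse import urljoin, urlparse
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty


class MultiThreadedCrawler:

    def __init__(self, base_url):
        self.base_url = base_url
        parts = urlparse(base_url)
        self.root_url = '{}://{}'.format(parts.scheme, parts.netloc)
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.scraped_pages = set()
        self.to_crawl = Queue()
        self.to_crawl.put(base_url)

    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        # Override to extract the data you want from each page
        return

    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        try:
            return requests.get(url, timeout=(3, 30))
        except requests.RequestException:
            return None

    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=60)
                if target_url not in self.scraped_pages:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                print(e)
                continue


if __name__ == '__main__':
    crawler = MultiThreadedCrawler('http://example.com')
    crawler.run_scraper()
```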

 
