Optimal Number of Threads in Python

One of the most infamous features of Python is the GIL (Global Interpreter Lock), which significantly limits thread performance. The GIL is responsible for protecting access to Python objects, because CPython’s memory management is not thread safe. Threads in Python essentially interrupt one another, meaning that only one thread can work with Python objects at any one time. In many situations this can cause a program to run slower when using threads, particularly if those threads are doing CPU bound work. Despite their limitations, Python threads can be very performant for I/O bound work such as making a request to a web server. However, unlike highly concurrent languages such as Go and Erlang, we cannot launch thousands of goroutines or Erlang ‘processes’.

This makes it hard to determine the correct number of threads to use to boost performance. At some point adding more threads is likely to degrade overall performance. I wanted to take a look at the optimal number of threads for an I/O bound task, namely making an HTTP request to a web server.

The Code

I wrote a short script using Python 3.6, requests and the concurrent.futures library which makes a GET request to each of the top 1,000 sites according to Amazon’s Alexa Web rankings. I then re-ran the script with different numbers of threads, to see where performance would begin to drop off. To account for uncontrollable variables, I ran each thread count 5 times and took an average.
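
A minimal sketch of such a benchmarking script is shown below. This is not the original code: the thread counts tested, the request timeout and the ‘alexa_top_1000.txt’ file of domains are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url):
    """Make a single GET request, returning the status code or None on failure."""
    try:
        return requests.get(url, timeout=10).status_code
    except requests.RequestException:
        return None


def benchmark(urls, thread_count):
    """Fetch every URL with the given number of threads and return requests per second."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        list(executor.map(fetch, urls))
    return len(urls) / (time.time() - start)


if __name__ == '__main__':
    # One domain per line, e.g. the Alexa top 1,000 sites.
    with open('alexa_top_1000.txt') as f:
        urls = ['http://{}'.format(line.strip()) for line in f if line.strip()]
    for thread_count in (10, 20, 40, 50, 60, 100, 200):
        print('{} threads: {:.1f} requests/second'.format(thread_count, benchmark(urls, thread_count)))
```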

The Results

As you can see, as we first start increasing the number of threads used by our demo program, the number of HTTP requests we can make per second increases quite rapidly. The increase in performance starts dropping off once we reach around 50 threads. Once we use a total of 60 threads we actually start to see our HTTP request rate decrease, before it starts to get faster again as we approach a total of 200 threads.

This is presumably due to the GIL. Once we add a significant number of threads, the threads essentially interfere with one another, slowing down our program. To increase the performance of our code further we would have to look into ways of releasing the GIL, such as writing a C extension or releasing the GIL within Cython code. I highly recommend watching this talk from David Beazley for those looking to get a better understanding of the GIL.

Despite the GIL, our best result saw us make a total of 1,000 HTTP GET requests in just nine seconds. That works out at around 111 HTTP requests per second, which isn’t too bad for what is meant to be a slow language.

Caveats

The results from this experiment suggest that those writing threaded Python applications should certainly take some time running tests to determine the optimum number of threads. The example used to run this test was I/O bound code with little CPU overhead. Those running code with a greater amount of CPU bound work may find that they get less benefit from upping the number of threads. Despite this, I hope that this post encourages people to look into using threads within their applications. The increase in performance achievable will depend highly on what exactly is being done within the threads.

There is also reason to believe that the optimal number of threads may differ from machine to machine, which is another reason why it is certainly worth taking the time to test out a varying number of threads when you need to achieve maximum performance.

Ultimate Introduction to Web Scraping in Python: From Novice to Expert

Python is one of the most accessible fully featured programming languages, which makes it a perfect language for those looking to learn to program. This post aims to introduce the reader to web scraping, allowing them to build their own scrapers and crawlers to collect data from the internet.

Contents

  1. Introduction to Web Scraping
  2. Making HTTP Requests with Python
  3. Handling HTTP Errors
  4. Parsing HTML with BeautifulSoup

MORE TO COME

Introduction to Web Scraping

Web scraping, sometimes referred to as screen scraping, is the practice of using programs to visit websites and extract information from them. This allows users to collect information from the web in a programmatic manner, as opposed to having to manually visit a page and copy the information into some sort of data store. At their core, major search engines such as Google and Bing make use of web scraping to extract information from millions of pages every day.
Web scraping has a wide range of uses, including but not limited to fighting copyright infringement, collecting business intelligence, collecting data for data science, and for use within the fintech industry. This mega post is aimed at teaching you how to build scrapers and crawlers which will allow you to extract data from a wide range of sites.

This post assumes that you have Python 3.5+ installed and you have learnt how to install libraries via Pip. If not, it would be a good time to Google ‘how to install python’ and ‘how to use pip’. Those familiar with the requests library may want to skip ahead several parts.

Making HTTP Requests with Python

When accessing a website our browser makes a number of HTTP requests in the background. The majority of internet users aren’t aware of the number of HTTP requests required to access a web page. These requests load the page itself, and additional requests are made for resources which are loaded by the page such as images, videos and style sheets. You can see a breakdown of the requests made by opening up your browser’s developer tools and navigating to the ‘Network’ tab.

The majority of requests made to a website are made using a ‘GET’ request. As the name suggests a ‘GET’ request attempts to retrieve the content available at the specified address. HTTP supports a variety of other methods such as ‘POST’, ‘DELETE’, ‘PUT’ and ‘OPTIONS’. These methods are sometimes referred to as HTTP verbs. We will discuss these methods later.

Python’s standard library contains a module which allows us to make HTTP requests. While this module is perfectly functional, its interface is not particularly friendly. In this mega post we are going to make use of the requests library, which provides us with a much friendlier interface and can be installed using the command below.
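
```
pip install requests
```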

Making an HTTP request with Python can be done in just a couple of lines. Below we are going to demonstrate how to make a request and walk through the code line by line.
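
The snippet walked through below looks something like this:

```python
import requests

response_object = requests.get('https://edmundmartin.com')
print(response_object.text)
```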

First, we import the requests library, which gives us access to the functions contained within it. We then make an HTTP request to ‘https://edmundmartin.com’ using the ‘GET’ verb, by calling the get method contained within the requests library. We store the result of this request in a variable named ‘response_object’. The response object contains a number of pieces of information that are useful when scraping the web. Here, we access the text (HTML) of the response, which we print to the screen. Provided the site is up and available, users running this script should be greeted with a wall of HTML.

Handling HTTP Errors

When making HTTP requests there is significant room for things to go wrong. Your internet connection may be down, or the site in question may not be reachable. When scraping the internet we typically want to handle these errors and continue on without crashing our program. For this we are going to write a function which allows us to make an HTTP request and deal with any errors. Additionally, by encapsulating this logic within a function we can reuse our code with greater ease, by simply calling the function every time we want to make an HTTP request.

The below code is an example of a function which makes a request and deals with a number of common errors we are likely to encounter. The function is explained in more detail below.
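
A sketch of such a function, using the get_request name referenced in the explanation below (the exact log messages are illustrative):

```python
import logging

import requests


def get_request(url):
    try:
        response = requests.get(url)
        # Force an exception if the server replied with a bad (non-2xx) status code.
        response.raise_for_status()
        return response
    except requests.HTTPError:
        logging.error('Received a bad status code when requesting: %s', url)
    except requests.ConnectionError:
        logging.error('Could not connect to: %s', url)
    except requests.RequestException:
        logging.error('Something else went wrong when requesting: %s', url)


if __name__ == '__main__':
    result = get_request('https://edmundmartin.com')
    if result:
        print(result.text)
```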

Our basic get_request function takes one argument, the string of the URL we want to retrieve. We then make the request just as before. This time, however, our request is wrapped in a try and except block which allows us to catch any errors should something go wrong. After the request we then check the status code of our response. Every time you make a request, the server in question will respond with a code indicating whether the request has been a success or not. If everything went fine you will receive a 200 status code, otherwise you are likely to receive a 404 (‘Page Not Found’) or 503 (‘Service Unavailable’). By default, the requests library does not throw an error should a web server respond with a bad status code, but rather continues silently. By using raise_for_status we force an error should we receive a bad status code. Should no error be thrown, we then return our response object.

If all did not go so well, we then handle our errors. Firstly, we check whether the page responded with a non-200 status code, by catching the requests.HTTPError. We then check whether the request failed due to a bad connection by checking for the requests.ConnectionError exception. Finally, we use the generic requests.RequestException to catch all other exceptions that can be thrown by the requests library. The ordering of our exceptions is important: requests.RequestException is the most generic and would catch either of the other exceptions. Should it have been the first exception handled, the other handlers would never run, regardless of the reason for the exception.

When handling each exception, we use the standard library’s logging module to print out a message of what went wrong when making the request. This is very handy and is a good habit to get into, as it makes debugging programs much easier. If an exception is thrown we return nothing from our function, which we can then check for later. Otherwise we return the response.

At the bottom of the script, I have provided a simple example of how this function could be used to print out a page’s HTML response.

Parsing HTML with BeautifulSoup

So far everything we have done has been rather boring and not particularly useful. This is due to the fact that we have just been making requests and then printing the HTML. We can however do much more interesting things with our responses.

This is where BeautifulSoup comes in. BeautifulSoup is a library for parsing HTML, allowing us to easily extract the elements of the page that we are most interested in. While BeautifulSoup is not the fastest way to parse a page, it has a very beginner friendly API. The BeautifulSoup library can be installed by using the following:
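
```
pip install beautifulsoup4
```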

The code below expands on the code we wrote in the previous section and actually uses our response for something.  A full explanation can be found after the code snippet.
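
A sketch of the snippet in question, reusing the get_request function from the previous section:

```python
from bs4 import BeautifulSoup

response = get_request('https://edmundmartin.com')

if response:
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.find_all('h2', {'class': 'entry-title'})
    for title in titles:
        print(title.get_text())
```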

The code snippet above uses the same get_request function as before, which I have removed for the sake of brevity. Firstly, we must import the BeautifulSoup library; we do this by adding the line ‘from bs4 import BeautifulSoup’. Doing this gives us access to the BeautifulSoup class which is used for parsing HTML responses. We then generate a ‘soup’ by passing our HTML to BeautifulSoup; here we also pass a string signifying the underlying HTML parsing library to be used. This is not required, but BeautifulSoup will print a rather long winded warning should you omit it.

Once the soup has been created we can then use the ‘find_all’ method to discover all of the elements matching our search. The soup object also has a ‘find’ method which will only return the first element matching our search. In this example, we first pass in the name of the HTML element we want to select. In this case it’s the heading 2 element, represented in HTML by ‘h2’. We then pass a dictionary containing additional information. On my blog all article titles are ‘h2’ elements with the class of ‘entry-title’. This class attribute is what is used by CSS to make the titles stand out from the rest of the page, but it can also help us in selecting the elements of the page which we want.

Should our selector find anything, we will be returned a list of title elements. We can then write a for loop which goes through each of these titles and prints the text of the title, by calling the get_text() method. A note of caution: should your selector not find anything, calling the get_text() method on the result will throw an exception. Should everything run without any errors, the code snippet above should return the titles of the ten most recent articles from my website. This is all that is really required to get started with extracting information from websites, though picking the correct selector can take a little bit of work.

In the next section we are going to write a scraper which will extract information from Google, using what we have learnt so far.

Selenium Tips & Tricks in Python

Selenium is a great tool and can be used for a variety of different purposes. It can sometimes however be a bit tricky to make Selenium behave exactly how you want. This article shows you how you can make the most of the library’s advanced features, to make your life easier and help you extract data from websites.

Running Chrome Headless

Provided you have one of the latest versions of ChromeDriver, it is now very easy to run Selenium headless. This allows you to run the browser in the background without a visible window. We can simply add a couple of lines of code when starting up our browser and then access webpages with Selenium running quietly in the background. It should be noted that some sites can detect whether you are running Chrome headless and may block you from accessing content.
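
Something along these lines should do it (on older Selenium 3 releases the keyword argument was chrome_options rather than options):

```python
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# Some headless builds also behave better with an explicit window size.
chrome_options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome(options=chrome_options)
driver.get('https://edmundmartin.com')
print(driver.title)
driver.quit()
```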

Using A Proxy With Selenium

There are occasions when you may want to use a proxy with Selenium. To use a proxy with Selenium we simply add an argument to Chrome Options when initialising our Selenium instance. Unfortunately, there is no way to change the proxy once it has been set. This means that to rotate proxies while using Selenium, you have to either restart the Selenium browser or use a rotating proxy service, which can come with its own set of issues.
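
A minimal sketch, with a placeholder proxy address you would need to replace with your own:

```python
from selenium import webdriver

PROXY = '11.22.33.44:8080'  # placeholder - substitute a real proxy host:port

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://{}'.format(PROXY))

driver = webdriver.Chrome(options=chrome_options)
driver.get('https://edmundmartin.com')
```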

Accessing Content Within An Iframe

Sometimes the content we want to extract from a website may be buried within an iframe. By default when you ask Selenium to return you the html content of a page, you will miss out on all the information contained within any iframes on the page. You can however access content contained within the iframe.

To switch to the iframe we want to extract data from, we first use Selenium’s find_element method. I would recommend the find_element_by_css_selector method, which tends to be more reliable than trying to find the element using an XPath selector. We then pass our target to a method which switches the browser’s context to the target iframe. We can then access the HTML content and interact with elements within the iframe. If we want to revert back to our original context, we simply switch back to the default content.
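
For example, assuming an existing driver instance and a hypothetical iframe selector:

```python
# The CSS selector here is purely illustrative - inspect the page to find the real one.
iframe = driver.find_element_by_css_selector('iframe#content-frame')
driver.switch_to.frame(iframe)

# We now see the document inside the iframe.
iframe_html = driver.page_source

# Revert back to the original context once we are done.
driver.switch_to.default_content()
```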

Accessing Slow Sites

The modern web is overloaded with JavaScript, and this can cause Selenium to throw a lot of timeout errors, with Selenium timing out if a page takes more than 20 seconds to load. The simplest way to deal with this is to increase Selenium’s default timeout. This is particularly useful when trying to access sites via a proxy, which will slow down your connection speed.
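
Increasing the page load timeout looks something like this:

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
# Give slow, JavaScript-heavy pages up to 60 seconds to finish loading.
driver.set_page_load_timeout(60)

try:
    driver.get('https://edmundmartin.com')
except TimeoutException:
    print('The page took too long to load')
```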

Scrolling

Selenium does not provide users with a built-in way to scroll down pages. The browser automation framework does however allow users to execute JavaScript. This makes it very easy to scroll down pages, which is particularly useful when trying to scrape content from a page which continues to load content as the user scrolls down.

For some reason Selenium can be funny with executing window scroll commands, and it is sometimes necessary to call the command in a loop in order to scroll down the entirety of a page.
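
A simple sketch of that looping approach, assuming an existing driver instance:

```python
import time

# Scroll in several steps, pausing so that lazily loaded content has time to appear.
for _ in range(10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
```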

Executing JavaScript & Returning The Result

While many users of Selenium know that it is possible to run JavaScript, allowing for more complicated interactions with the page, fewer know that it is also possible to return the result of executed JavaScript. This allows your browser to execute functions defined in the page’s DOM and return the results to your Python script. This can be great for extracting data from tough to scrape websites.
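
For example, execute_script hands back whatever the JavaScript snippet returns:

```python
# Both calls assume an existing driver instance with a page already loaded.
page_title = driver.execute_script('return document.title;')
link_count = driver.execute_script("return document.querySelectorAll('a').length;")
print(page_title, link_count)
```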

Multi-Threaded Crawler in Python

Python is a great language for writing web scrapers and web crawlers. Libraries such as BeautifulSoup, requests and lxml make grabbing and parsing a web page very simple. By default, Python programs are single threaded. This can make scraping an entire site using a Python crawler painfully slow: we must wait for each page to load before moving onto the next one. Thankfully, Python supports threads which, while not appropriate for all tasks, can help us increase the performance of our web crawler.

In this post we are going to outline how you can build a simple multi-threaded crawler which will crawl an entire site using requests, BeautifulSoup and the standard library’s concurrent.futures module.

Imports

We are going to begin by importing all the libraries we need for scraping. Both requests and BeautifulSoup are not included within the Python standard library, so you are going to have to install them if you haven’t already. The other libraries should already be available to you if you are using Python 3.
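
The imports for the crawler sketched throughout this post look something like this (the class and method names in the snippets that follow are my own reconstruction rather than the original gist):

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
```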

Setting Up Our Class

We start by initialising the class we are going to use to create our web crawler. Our initialisation statement only takes one argument. We pass our start URL as an argument, and from this we use urlparse from the urllib.parse library to pull out the site’s homepage. This root URL is going to be used later to ensure that our scraper doesn’t end up on other sites.

We also initialise a thread pool. We are later going to submit ‘tasks’ to this thread pool, allowing us to use a callback function to collect our results. This will allow us to continue with execution of our main program, while we await the response from the website.

We also initialise a set which is going to contain all the URLs which we have crawled. We will use this to store URLs which have already been crawled, to prevent the crawler from visiting the same URL twice.

We then create a Queue which will contain the URLs we wish to crawl; we will continue to grab URLs from this queue until it is empty. Finally, we place our base URL at the start of the queue.
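
A sketch of the initialiser described above:

```python
class MultiThreadedCrawler:

    def __init__(self, base_url):
        self.base_url = base_url
        # Pull out the scheme and domain, e.g. 'https://edmundmartin.com'.
        parsed = urlparse(base_url)
        self.root_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.scraped_pages = set()
        self.to_crawl = Queue()
        self.to_crawl.put(base_url)
```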

Parsing Links and Scraping

Next, we write a basic link parser. Our goal here is to extract all of a site’s internal links and not to pull out any external links. Additionally, we want to resolve relative URLs (those starting with ‘/’) and ensure that we don’t crawl the same URLs twice.

To do this we generate a soup object using BeautifulSoup. We then use the find_all method to return every ‘a’ element which has an ‘href’ attribute; by doing this we ensure that we only return ‘a’ elements which actually contain a link. The returned object is a list of link elements which we then iterate through. First, we pull out the actual href value. We can then check whether this link is relative (starting with a ‘/’) or starts with our root URL. If so, we use urljoin to generate an absolute, crawlable URL and put it in our queue, provided we haven’t already crawled it.

I have also included an empty scrape_info method which can be overridden so you can extract the data you want from the site you are crawling.
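
The two methods, continuing the class sketched above:

```python
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        # Override this method to pull out whatever data you need from each page.
        return
```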

Defining Our Callback

The easiest and often most performant way to use a thread pool executor is to add a callback to the future returned when we submit a function to the thread pool. This callback will execute after the submitted function has completed and will be passed the completed future as an argument.

By calling .result() on the passed in argument we are able to get at the contents of our returned value. In our case this will be either ‘None’ or a requests response object. We then check whether we have a result and whether the result has a 200 status code. If both of these turn out to be true, we send the HTML to the parse_links and the currently empty scrape_info functions.
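
The callback, again continuing the class above:

```python
    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)
```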

Scraping Pages

We then define the function which will be used to scrape the page. This function simply takes a URL and returns a response object if the request was successful; otherwise we return ‘None’. By limiting the amount of CPU bound work we do in this function, we can increase the overall speed of our crawler. Threads are not recommended for CPU bound work and can actually turn out to be slower than using a single thread.
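
A sketch of that function (the timeout values are illustrative):

```python
    def scrape_page(self, url):
        try:
            return requests.get(url, timeout=(3, 30))
        except requests.RequestException:
            return None
```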

Run Scraper Function

The run scraper function brings all of our previous work together and manages our thread pool. The run scraper will continue to run while there are still URLs to crawl. We do this by creating a while True loop and ignoring any exceptions except Empty, which will be thrown if our queue has been empty for more than 60 seconds.
We keep pulling URLs from our queue and submitting them to our thread pool for execution. We then add a callback which will run once the function has returned. This callback in turn calls our parse_links and scrape_info functions. This will continue to work until we run out of URLs.
We simply add a main block at the bottom of our script to run the function.
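
A sketch of run_scraper and the main block, completing the class:

```python
    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=60)
                if target_url not in self.scraped_pages:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception as e:
                logging.exception(e)
                continue


if __name__ == '__main__':
    crawler = MultiThreadedCrawler('https://edmundmartin.com')
    crawler.run_scraper()
```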

Performance

When testing this script on several sites with performant servers, I was able to crawl several thousand URLs a minute with only 20 threads. Ideally, you would use a lower number of threads to avoid potentially overloading the site you are scraping.

Performance could be further improved by using XPath and lxml to extract links from the site. This is due to lxml being written in Cython and being considerably faster than BeautifulSoup, which by default uses a pure Python parser.


Scraping JavaScript Heavy Pages with Python & Splash

Scraping the modern web can be particularly challenging. These days many websites use JavaScript frameworks to serve much of a page’s important content. This breaks traditional scrapers, as they are unable to extract the information they need from the initial HTTP request.

So what should we do when we come across a site that makes extensive use of JavaScript? One option is to use Selenium. Selenium provides us with an easy to use API with which we can automate a web browser. This is great for tasks where we need to interact with the page, whether that be to scroll or to click certain elements. It is however a bit over the top when you simply want to render JavaScript.

Introducing Splash

Splash is a JavaScript rendering service from the creators of the popular Scrapy framework. Splash can be run as a server on your local machine. The server, built using Python and Twisted, allows us to scrape pages using its HTTP API. This means we can render JavaScript pages without the need for a full browser. Because the service is built on Twisted it is also asynchronous, meaning it can render multiple pages at the same time.

Installing Splash

Full instructions for installing Splash can be found in the Splash docs. That being said, it is highly recommended that you run Splash with Docker, which makes starting and stopping the server very easy.
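
With Docker installed, pulling and running the Splash image looks like this (port 8050 is Splash’s default):

```
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
```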

Building A Custom Python Crawler With Splash

Splash was designed to be used with Scrapy and Scrapinghub, but it can just as easily be used with Python. In this example we are going to build a multi-threaded crawler using requests and Beautiful Soup. We are going to scrape an e-commerce website which uses a popular JavaScript library to load product information on category pages.

Imports & Class Initialisation

To write this scraper we are only going to use two libraries outside of the standard library. If you have ever done any web scraping before, you are likely to have both Requests and BeautifulSoup installed. Otherwise go ahead and grab them using pip.

We then create a SplashScraper class. Our crawler only takes one argument, namely the URL we want to begin our crawl from. We then use urlparse from urllib.parse to create a string holding the site’s root URL; we use this URL to prevent our crawler from scraping pages outside of our base domain.

One of the main selling points of Splash is the fact that it is asynchronous. This means that we can render multiple pages at a time, making our crawler significantly more performant than using a standalone instance of Selenium. To make the most of this we are going to use a ThreadPool to scrape pages, allowing us to make up to twenty simultaneous requests.

We create a queue which we are going to use to hold URLs to be sent to our thread pool. We then create a set to hold the pages we have already queued. Finally, we put the base URL into our queue, ensuring we start crawling from the base URL.
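
A sketch of the imports and initialiser described above (the SplashScraper name and the choice of twenty workers follow the description rather than the original gist):

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


class SplashScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        parsed = urlparse(base_url)
        self.root_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        # Up to twenty pages can be rendered simultaneously by the local Splash server.
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.crawled_pages = set()
        self.to_crawl = Queue()
        self.to_crawl.put(base_url)
```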

Extracting Links & Parsing Page Data

Next we define two methods to use with our scraped HTML. Firstly, we take the HTML and extract all the links which contain an href attribute. We iterate over our list of links, pulling out the href value. If the URL starts with a slash or starts with the site’s root URL, we call urljoin, which creates an absolute link out of the two strings. If we haven’t already crawled this page, we then add the URL to the queue.

Our scrape_info method simply takes the HTML and scrapes certain information from the rendered page. We then use some relatively rough logic to pull out name and price information before writing this information to a CSV file. This method can be overwritten with custom logic to pull out the particular information you need.
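
A sketch of both methods, continuing the class above. The product selectors in scrape_info are hypothetical placeholders; you would replace them with whatever markup your target site actually uses:

```python
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.crawled_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        # Hypothetical selectors - inspect your target site's rendered HTML for the real ones.
        for product in soup.find_all('div', {'class': 'product-item'}):
            name = product.find('h3')
            price = product.find('span', {'class': 'price'})
            if name and price:
                with open('products.csv', 'a', newline='') as output:
                    writer = csv.writer(output)
                    writer.writerow([name.get_text(strip=True), price.get_text(strip=True)])
```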

Grabbing A Page & Defining Our Callback

When using a thread pool executor, one of the best ways of getting the result out of a function which will be run in a thread is to use a callback. The callback will be run once the function submitted to the thread has completed. We define a super simple callback that unpacks our result and then checks whether the page gave us a 200 status code. If the page responded with a 200, we run both our parse_links and scrape_info methods using the page’s HTML.
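
The callback, continuing the class above:

```python
    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)
```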

Our scrape_page function is very simple. As we are only making a request to a server running locally, we don’t need any error handling. We simply pass in a URL, which is then formatted into the request, and return the response object which will then be used in our callback function defined above.
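
A sketch of scrape_page using Splash’s render.html endpoint (the two second wait is an illustrative value giving the page’s JavaScript time to run):

```python
    def scrape_page(self, url):
        # Ask the local Splash instance to render the page and hand back the resulting HTML.
        return requests.get('http://localhost:8050/render.html',
                            params={'url': url, 'wait': 2})
```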

Our Crawling Method

Our run_scraper method is basically our main thread. We continue to try and get links from our queue. In this particular example we have set a timeout of 120 seconds. This means that if we are unable to grab a new URL from the queue for that long, we will raise an Empty error and quit the program. Once we have our URL, we check that it is not in our set of already scraped pages before adding it to that set. We then send off the URL for scraping and set our callback method to run once we have completed our scrape. We ignore any other exceptions and continue with our scraping until we have run out of pages we haven’t seen before.
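
A sketch of run_scraper and a main block to kick things off (the starting URL is a placeholder):

```python
    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=120)
                if target_url not in self.crawled_pages:
                    self.crawled_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception:
                continue


if __name__ == '__main__':
    scraper = SplashScraper('http://example-store.com')  # placeholder e-commerce site
    scraper.run_scraper()
```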

The script in its entirety can be found here on Github.

Scraping Google with Python

In this post we are going to look at scraping Google search results using Python. There are a number of reasons why you might want to scrape Google’s search results. Some people scrape these results to determine how their sites are performing in Google’s organic rankings, while others use the data to look for security weaknesses, and there are plenty of other things you can do with the data available to you.

Scraping Google

Google allows users to pass a number of parameters when accessing their search service. This allows us to customise the results we receive back from the search engine. In this tutorial, we are going to write a script that allows us to pass a search term, a number of results and a language filter.

Requirements

There are a couple of requirements for building our Google scraper. Firstly, you are going to need Python 3. In addition, we are going to need to install a couple of popular libraries, namely requests and Bs4. If you are already a Python user, you are likely to have both of these libraries installed.

Grabbing Results From Google

First, we are going to write a function that grabs the HTML from a Google.com search results page. The function will take three arguments: a search term, the number of results to be displayed and a language code.
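
A sketch of the fetch_results function explained below. The Google query parameters and the User-Agent string are illustrative; any recent real browser User-Agent will do:

```python
import requests
from bs4 import BeautifulSoup

USER_AGENT = {'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                             'AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/61.0.3163.100 Safari/537.36')}


def fetch_results(search_term, number_results, language_code):
    assert isinstance(search_term, str), 'Search term must be a string'
    assert isinstance(number_results, int), 'Number of results must be an integer'

    escaped_search_term = search_term.replace(' ', '+')
    google_url = 'https://www.google.com/search?q={}&num={}&hl={}'.format(
        escaped_search_term, number_results, language_code)

    response = requests.get(google_url, headers=USER_AGENT)
    response.raise_for_status()

    return search_term, response.text
```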

The first two lines of our fetch_results function assert whether the provided search term is a string and whether the number of results argument is an integer. This will see our function throw an AssertionError should the function be called with arguments of the wrong type.

We then escape our search term, as Google requires that search phrases containing spaces be escaped with an addition (‘+’) character. We then use string formatting to build up a URL containing all the parameters originally passed into the function.

Using the requests library, we make a get request to the URL in question. We also pass in a User-Agent to the request to avoid being blocked by Google for making automated requests. Without passing a User-Agent to a request, you are likely to be blocked after only a few requests.

Once we get a response back from the server, we call raise_for_status on the response. If all went well the status code returned should be 200 OK. If however Google has realised we are making automated requests, we will be greeted by a captcha and a 503 status code, in which case an exception will be raised. Finally, our function returns the search term passed in and the HTML of the results page.

Parsing the HTML

Now that we have grabbed the HTML we need to parse it. Parsing the HTML will allow us to extract the elements we want from the Google results page. For this we are using BeautifulSoup; this library makes it very easy to extract the data we want from a webpage.

All the organic search results on the Google search results page are contained within ‘div’ tags with the class of ‘g’. This makes it very easy for us to pick out all of the organic results on a particular search page.
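
A sketch of the parse_results function walked through below. The class names (‘g’, ‘st’) reflect Google’s markup around the time this post was written and change regularly, so treat them as placeholders:

```python
def parse_results(html, keyword):
    soup = BeautifulSoup(html, 'html.parser')

    found_results = []
    result_blocks = soup.find_all('div', attrs={'class': 'g'})
    for result in result_blocks:
        link = result.find('a', href=True)
        title = result.find('h3')
        description = result.find('span', attrs={'class': 'st'})
        if link and title:
            link = link['href']
            title = title.get_text()
            description = description.get_text() if description else ''
            if link != '#':
                found_results.append({'keyword': keyword, 'link': link,
                                      'title': title, 'description': description})
    return found_results
```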

Our parse_results function begins by making a ‘soup’ out of the HTML we pass to it. This essentially just creates a DOM object out of an HTML string, allowing us to select and navigate through different page elements. We then initialise our results variable, which is going to be a list of dictionary elements. By making the results a list of dictionaries we make it very easy to use the data in a variety of different ways.

We then pick out the results blocks using the selector already mentioned. Once we have these results blocks we iterate through the list, trying to pick out the link, title and description for each of our blocks. If we find both a link and a title, we know that we have an organic search block. We then grab the href of the link and the text of the description. Provided our found link is not equal to ‘#’, we simply add a dictionary element to our found results list.

Error Handling

We are now going to add error handling. There are a number of different errors that could be thrown and we look to catch all of these possible exceptions. Firstly, if you pass data of the wrong type to the fetch_results function, an assertion error will be thrown. The function can also throw two more errors: should we get banned we will be presented with an HTTP error, and should we have some sort of connection issue we will catch it using the generic requests exception.
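
A sketch of a wrapper tying the two functions together with that error handling (the error messages and example search are illustrative):

```python
def scrape_google(search_term, number_results, language_code):
    try:
        keyword, html = fetch_results(search_term, number_results, language_code)
        return parse_results(html, keyword)
    except AssertionError:
        raise Exception('Arguments of the wrong type were passed to the function')
    except requests.HTTPError:
        raise Exception('You appear to have been blocked by Google')
    except requests.RequestException:
        raise Exception('There appears to be an issue with your connection')


if __name__ == '__main__':
    for result in scrape_google('web scraping with python', 10, 'en'):
        print(result['title'], result['link'])
```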

We can then use this script in a number of different situations to scrape results from Google. The fact that our results data is a list of dictionary items makes it very easy to write the data to CSV or to a database. The full script can be found here.

Aiohttp – Background Tasks

Python gets a lot of flak for its performance story. However, the introduction of Asyncio into the standard library goes some way to resolving some of those performance problems. There is now a wide choice of libraries which make use of the new async/await syntax, including a number of server implementations.

The Aiohttp library comes with both a client and a server. However, today I want to focus on the server and one of my favourite features – background tasks. Typically, when building a Python based micro-service with Flask, you might run background work in something like Celery. Aiohttp’s background tasks are more limited than Celery tasks, but they allow you to run work in the background while the server continues to receive requests.

A Simple Example

I have written some code which provides a simple example of how you can use such a background task. We are going to write a server that has one endpoint. This endpoint allows a user to post a JSON dictionary containing a URL. The URL is then sent to a thread pool where it is immediately scraped, without blocking the user’s request. Once we have the data we need from the URL, it is placed in a queue which is then processed by our background task, which simply posts the data to another endpoint.

Get & Post Requests

For this we are going to need to implement a POST and a GET request method. Our GET request is going to be run in a thread pool, so we can use the ever-popular requests library to grab our page. However, our POST request is going to be made inside an async background task, so it must itself be asynchronous; otherwise we would end up blocking the event loop.

Both of these small functions are pretty basic and don’t do much in the way of error handling or logging, but they are enough to demonstrate the workings of a background task.
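
A sketch of the two functions, along with the imports used throughout the rest of this post (the payload shape is illustrative):

```python
import asyncio
import logging
from collections import deque
from concurrent.futures import ThreadPoolExecutor

import aiohttp
import requests
from aiohttp import web


def get_request(url):
    # Runs inside the thread pool, so the blocking requests library is fine here.
    response = requests.get(url)
    return {'url': url, 'status': response.status_code, 'html': response.text}


async def post_results(result, url):
    # Runs inside the event loop, so the post must be made with an async client.
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=result) as resp:
            return resp.status
```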

Initialising the Server and Setting Up Our Route

We begin by initialising our server class, passing in a port and host. We also define a thread pool, which we will use to run our synchronous GET requests. If you had a long running CPU bound task, you could instead use a process pool in much the same way. We then create a queue using deque from the collections module, allowing us to easily append and pop data; it is this queue that our background task will process. Finally, we have the example endpoint which we will post our data off to.
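
A sketch of that initialiser (the class name, pool size and dummy endpoint are my own illustrative choices):

```python
class BackgroundTaskServer:

    def __init__(self, host='0.0.0.0', port=8080):
        self.host = host
        self.port = port
        self.executor = ThreadPoolExecutor(max_workers=10)
        self.data_queue = deque()
        # Dummy endpoint that the background task will post scraped data to.
        self.endpoint = 'http://localhost:9090/example'
```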

We then move onto defining our async view. This particular view is very simple: we await the JSON from the incoming request, then attempt to grab the URL from the provided JSON. If the JSON contains a URL, we send the URL to our get_request function, which is then executed within the thread pool. This allows us to return a response to the person making the request without blocking. We add a callback which will be executed once the request is completed; the callback simply puts the data in our queue, to be processed by our background task.
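
A sketch of the view and its callback, continuing the class above:

```python
    async def url_handler(self, request):
        data = await request.json()
        url = data.get('url')
        if not url:
            return web.json_response({'error': 'no url provided'}, status=400)
        # Run the blocking get_request in the thread pool and respond immediately.
        loop = asyncio.get_event_loop()
        future = loop.run_in_executor(self.executor, get_request, url)
        future.add_done_callback(self.queue_result)
        return web.json_response({'status': 'scrape queued', 'url': url})

    def queue_result(self, future):
        try:
            self.data_queue.append(future.result())
        except Exception as exc:
            logging.warning('Scrape failed: %s', exc)
```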

Creating & Registering The Background Task

Our background task is very simple. It is simply an async function which contains a while True loop. Inside this loop we check if there are any items in the queue to be posted to our dummy server. If there are any items, we pop them and make an async post request. If there are no items we await asyncio.sleep. This is very important: without the await statement here, we could end up in a situation where our background task never gives up the event loop to incoming server requests.
We then define two async functions which simply take in our yet to be created app and add the task to the event loop. This allows the background task to be run in the same event loop as the server and to be cancelled when the server itself is shut down.
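
A sketch of the background task and the start-up/shut-down hooks, here written as methods on the server class:

```python
    async def background_task(self):
        while True:
            if self.data_queue:
                result = self.data_queue.popleft()
                await post_results(result, self.endpoint)
            else:
                # Hand control back to the event loop so incoming requests keep being served.
                await asyncio.sleep(1)

    async def start_background_task(self, app):
        app['data_poster'] = asyncio.ensure_future(self.background_task())

    async def stop_background_task(self, app):
        app['data_poster'].cancel()
```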

The Most Complicated Bit: Creating Our App

This part of the code is the most confusing. Our create app function simply returns a web app with our route added to the server’s homepage. In the run app method we then run this application forever within our event loop, appending the tasks which are to be run on start up and shut down of the server. We finally pass our app to the web.run_app function to be run on our specified host and port.
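
A sketch of those last two methods and the main block:

```python
    def create_app(self):
        app = web.Application()
        app.router.add_post('/', self.url_handler)
        return app

    def run_app(self):
        app = self.create_app()
        app.on_startup.append(self.start_background_task)
        app.on_cleanup.append(self.stop_background_task)
        web.run_app(app, host=self.host, port=self.port)


if __name__ == '__main__':
    BackgroundTaskServer().run_app()
```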

Complete Code

We now have a simple server which takes requests, deals with them and then processes them in the background. This can be very powerful and can be used to create servers which can process long running tasks by using these tasks in conjunction with thread and process pools.

Selenium Based Crawler in Python

Today, we are going to walk through creating a basic crawler making use of Selenium.

Why Build A Selenium Web Crawler?

First, we should probably address why you might want to build a web crawler using Selenium. The modern web increasingly uses front-end frameworks such as AngularJS and React, which means much of the data you might want to extract will not be readily available without rendering the page’s JavaScript. In instances like this you should first look into whether the site has an underlying private API that you can easily make use of.

Additionally, you may find some sites which run checks to ensure that users are running JavaScript. While there are other ways to get around this, running Selenium will typically make your crawler look like a real browser instance. This is just one way you can work around scraping detection methods.

While Selenium is really a package designed for testing web pages, we can easily build our web crawler on top of it.

Imports & Class Initialisation

To begin we import the libraries we are going to need. Only two of the libraries we are using here aren’t contained within Python’s standard library. Bs4 and Selenium can both be installed by using the pip command and installing these libraries should be relatively pain free.

We then begin with creating and initialising our SeleniumCrawler class. We pass a number of arguments to __init__.

Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass ‘https://edmundmartin.com’ as the base_url argument.

We then take a list of any URLs or URL paths we may want to exclude. If we wanted to exclude any dynamic and sign in pages, we would pass something like [‘?’,’signin’] to the exclusion list argument. URLs matching these patterns would then never be added to our crawl queue.

We also have an output file argument, which is simply the file we will write our crawl results to. Finally, we have a start URL argument which allows you to start a crawl from a URL other than the site’s base URL.
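
A sketch of the imports and initialiser described above (the SeleniumCrawler name is my own, and Chrome with a matching ChromeDriver is assumed):

```python
import csv
import logging
from collections import deque
from urllib.parse import urldefrag, urljoin

from bs4 import BeautifulSoup
from selenium import webdriver


class SeleniumCrawler:

    def __init__(self, base_url, exclusion_list, output_file='results.csv', start_url=None):
        self.browser = webdriver.Chrome()  # assumes ChromeDriver is on your PATH
        self.base_url = base_url
        self.exclusion_list = exclusion_list
        self.output_file = output_file
        self.start_url = start_url if start_url else base_url
        self.crawled_urls = []
        self.url_queue = deque([self.start_url])
```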

Getting Pages With Selenium

We then create a get_page method. This simply grabs the URL which is passed as an argument and returns the page’s HTML. If we have any issues with a particular page we simply log the exception and return nothing.
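
Continuing the class above:

```python
    def get_page(self, url):
        try:
            self.browser.get(url)
            return self.browser.page_source
        except Exception as e:
            logging.exception(e)
            return None
```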

Creating a Soup

This is again a very simple method, which simply checks that we have some HTML and creates a BeautifulSoup object from it. We are then going to use this soup to extract the URLs to crawl and the information we are collecting.
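
Again continuing the class:

```python
    def get_soup(self, html):
        if html is not None:
            return BeautifulSoup(html, 'html.parser')
        return None
```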

Getting Links

Our get_links method takes our soup and finds all the links which we haven’t previously found. First, we find all the ‘a’ elements which have an ‘href’ attribute. We then check whether these links contain anything within our exclusion list; if a URL should be excluded we move on to the next ‘href’. We use urljoin with urldefrag to resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue. If the URL matches our base domain we then finally add it to our queue.
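
A sketch of get_links as described:

```python
    def get_links(self, soup):
        for link in soup.find_all('a', href=True):
            href = link['href']
            # Skip anything matching our exclusion patterns.
            if any(pattern in href for pattern in self.exclusion_list):
                continue
            # Resolve relative URLs and strip any fragment ('#...') from the link.
            url = urljoin(self.base_url, urldefrag(href)[0])
            if url not in self.url_queue and url not in self.crawled_urls:
                if url.startswith(self.base_url):
                    self.url_queue.append(url)
```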

Getting Data

We then use our soup again to get the title of the article in question. If we come across any issues with getting the title we simply return the string ‘None’. This method could be expanded to collect any of the data you require from the page in question.
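
A minimal version that just pulls the page title:

```python
    def get_data(self, soup):
        try:
            return soup.find('title').get_text().strip()
        except AttributeError:
            return 'None'
```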

Writing to CSV

We simply pass our URL and title to this method and then use the standard library’s csv module to output the data to our target file.
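
Continuing the class:

```python
    def csv_output(self, url, title):
        with open(self.output_file, 'a', newline='', encoding='utf-8') as outputfile:
            writer = csv.writer(outputfile)
            writer.writerow([url, title])
```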

Run Crawler Method

The run crawler method really just brings together all of our already defined methods. While we have unseen URLs, we continue to crawl and take an element from the left of our queue. We then add this to our crawled list and request the page.

Should the end URL be different from the URL we originally requested, this URL is also added to the crawled list. This means we don’t visit URLs twice when a redirect has been put in place.

We then grab the soup from the html, and provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
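
A sketch of run_crawler and a main block, using the example exclusion list mentioned earlier:

```python
    def run_crawler(self):
        while len(self.url_queue):
            current_url = self.url_queue.popleft()
            self.crawled_urls.append(current_url)
            html = self.get_page(current_url)
            # If a redirect took us somewhere new, record the end URL as crawled too.
            if self.browser.current_url != current_url:
                self.crawled_urls.append(self.browser.current_url)
            soup = self.get_soup(html)
            if soup is not None:
                self.get_links(soup)
                title = self.get_data(soup)
                self.csv_output(current_url, title)


if __name__ == '__main__':
    crawler = SeleniumCrawler('https://edmundmartin.com', ['?', 'signin'])
    crawler.run_crawler()
```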

What Can Be Improved?

There are a number of things that can be improved on in this example Selenium crawler.

While I ran a test across over 1,000 URLs, the get_page method may be liable to break. To counter this, it would be recommended to use more sophisticated error handling, importing Selenium’s common exceptions module. Additionally, this method could get stuck waiting forever if JavaScript fails to fully load. It would therefore be recommended to add some timeouts on the rendering of JavaScript, which is relatively easy with the Selenium library.

Additionally, this crawler is going to be relatively slow. It’s single threaded and uses Bs4 to parse pages, which is relatively slow compared with using lxml. Both of the methods using Bs4 could quite easily be changed to use lxml.

The full code for this post can be found on my Github; feel free to fork it, make pull requests and see what you can do with this underlying basic recipe.

Beautiful Soup vs. lxml – Speed

When comparing Python parsing frameworks, you often hear people complaining that Beautiful Soup is considerably slower than using lxml. Thus, some people conclude that lxml should be used in any performance critical project. Having used Beautiful Soup in a large number of web scraping projects and never having had any real trouble with its performance, I wanted to properly measure the performance of the popular parsing library.

The Test

To test the two libraries, I wrote a simple single threaded crawler which crawls a total of 100 URLs and then simply extracts links and the page title from each page in question, implementing two different parser methods: one using lxml and one using Beautiful Soup. I also tested the speed of Beautiful Soup with various non-default parsers.
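
The two parser methods at the heart of the test looked roughly like this (a simplified sketch rather than the full test crawler):

```python
import requests
from bs4 import BeautifulSoup
from lxml import html as lxml_html


def parse_with_bs4(page_html, parser='html.parser'):
    soup = BeautifulSoup(page_html, parser)
    title = soup.title.get_text() if soup.title else ''
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return title, links


def parse_with_lxml(page_html):
    tree = lxml_html.fromstring(page_html)
    title = tree.findtext('.//title') or ''
    links = tree.xpath('//a/@href')
    return title, links


if __name__ == '__main__':
    page = requests.get('https://edmundmartin.com').text
    print(parse_with_bs4(page)[0])
    print(parse_with_lxml(page)[0])
```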

Each of the various setups was tested a total of five times to account for varying internet and server response times, with the results below outlining the performance of each library and underlying parser.

The Results

| Parser | Run #1 | Run #2 | Run #3 | Run #4 | Run #5 | Avg. Speed (Seconds) | Overhead Per Page (Seconds) |
|---|---|---|---|---|---|---|---|
| lxml | 38.49 | 36.82 | 39.63 | 39.02 | 39.84 | 38.76 | N/A |
| Bs4 (html.parser) | 49.17 | 52.1 | 45.35 | 47.38 | 47.07 | 48.21 | 0.09 |
| Bs4 (html5lib) | 54.61 | 53.17 | 53.37 | 56.35 | 54.12 | 54.32 | 0.16 |
| Bs4 (lxml) | 42.97 | 43.71 | 46.65 | 44.51 | 47.9 | 45.15 | 0.06 |

As you can see, lxml is significantly faster than Beautiful Soup. A pure lxml solution is several seconds faster than using Beautiful Soup with lxml as the underlying parser. The built-in Python parser (html.parser) is around 10 seconds slower, whereas the extremely liberal html5lib is slower still. The overhead per page parsed is still relatively small, with both Bs4 (html.parser) and Bs4 (lxml) adding less than 0.1 seconds per page parsed.

| Parser | Overhead Per Page (Seconds) | 100,000 URLs (Extra Hours) | 500,000 URLs (Extra Hours) |
|---|---|---|---|
| Bs4 (html.parser) | 0.09454 | 2.6 | 13.1 |
| Bs4 (html5lib) | 0.15564 | 4.3 | 21.6 |
| Bs4 (lxml) | 0.06388 | 1.8 | 8.9 |

While the overhead seems very low, when you try to scale up a crawler, using Beautiful Soup adds a significant overhead. Even using Beautiful Soup with lxml adds significant overhead when you are trying to scale to hundreds of thousands of URLs. It should be noted that the above table assumes a crawler running a single thread. Anyone looking to crawl more than 100,000 URLs would be highly recommended to build a concurrent crawler making use of a library such as Twisted, Asyncio, or concurrent.futures.

So, the question of whether Beautiful Soup is suitable for your project really depends on the scale and nature of the project. Replacing Beautiful Soup with lxml is likely to see you achieve a small (but considerable at scale) performance improvement. This does however come at the cost of losing the Beautiful Soup API, which makes selecting on-page elements a breeze.

Web Scraping: Avoiding Detection


This post avoids the legal and ethical questions surrounding web scraping and simply focuses on the technical aspect of avoiding detection. We are going to look at some of the most effective ways to avoid being detected while crawling/scraping the modern web.

Switching User Agents

Switching or randomly selecting user agents is one of the most effective tactics for avoiding detection. Many sys admins and IT managers monitor the number of requests made by different user agents. If they see an abnormally large number of requests from one IP & user agent, it makes the decision a very simple one – block the offending user agent/IP.

You see this in effect when scraping using the standard headers provided by common HTTP libraries. Try and request an Amazon page using Python’s requests standard headers and you will instantly be served a 503 error.

This makes it key to change up the user agents used by your crawler/scraper. Typically, I would recommend randomly selecting a user agent from a list of commonly used user agents. I have written a short post on how to do this using Python’s requests library.
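
A minimal sketch of the approach; the user agent strings below are examples and should be replaced with an up-to-date list:

```python
import random

import requests

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0',
]


def get_with_random_ua(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)
```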

Other Request Headers

Even when you make the effort to switch up user agents, it may still be obvious that you are running a crawler/scraper. Sometimes other elements of your headers give you away. HTTP libraries tend to send different Accept and Accept-Encoding headers to those sent by real browsers. It can be worth modifying these headers to ensure that you look as much like a real browser as possible.
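
Building on the previous snippet, a fuller set of browser-like headers might look something like this (the exact values vary between browsers and versions):

```python
headers = {
    'User-Agent': random.choice(USER_AGENTS),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
}
response = requests.get('https://edmundmartin.com', headers=headers)
```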

Going Slow

Many times, when scraping is detected it’s a matter of having made too many requests in too little time. It’s abnormal for a very large number of requests to be made from one IP in a short space of time, making any scraper or crawler trying to go too fast a prime target. Simply waiting a few seconds between each request will likely mean that you fly under the radar of anyone trying to stop you. In some instances going slower may mean you are not able to collect the data you need quickly enough. If this is the case you probably need to be using proxies.

Proxies

In some situations, proxies are going to be a must. When a site is actively discouraging scraping, proxies make it appear that your requests are coming from multiple sources. This typically allows you to make a larger number of requests than you otherwise would be allowed to make. There are a large number of SaaS companies providing SEOs and digital marketing firms with Google ranking data; these firms frequently rotate and monitor the health of their proxies in order to extract huge amounts of data from Google.

Rendering JavaScript

JavaScript is used pretty much everywhere, and the proportion of users browsing without JavaScript enabled is less than 1%. This means that some sites have looked to block IPs making large numbers of requests without rendering JavaScript. The simple solution is just to render the JavaScript, using a headless browser and a browser automation suite such as Selenium.

Increasingly, companies such as Cloudflare check whether users making requests to a site are rendering JavaScript. By using this technique, they hope to block bots making requests to the site in question. However, several libraries now exist which help you get around this kind of protection. Python’s cloudflare-scrape library is a wrapper around the requests library which simply runs Cloudflare’s JavaScript test within a Node environment should it detect that such a protection has been put in place.

Alternatively, you can use a lightweight headless browser such as Splash to do the scraping for you. The specialist headless browser even lets you implement AdBlock Plus rules allowing you to render pages faster and can be used alongside the popular Scrapy framework.

Backing Off

What many crawlers and scrapers fail to do is back off when they start getting served 403 and 503 errors. By simply plugging on and requesting more pages after coming across a batch of error pages, it becomes pretty clear that you are in fact a bot. Slowing down and backing off when you get a bunch of forbidden errors can help you avoid a permanent ban.

Avoiding Honeypots/Bot Traps

Some webmasters implement honeypot traps which seek to capture bots by directing them to pages whose sole purpose is to determine that they are a bot. There is a very popular WordPress plugin which simply creates an empty ‘/blackhole/’ directory on your site. The link to this directory is then hidden in the site’s footer, not visible to those using browsers. When designing a scraper or crawler for a particular site it is worth checking whether any links are hidden from users loading the page with a standard browser.

Obeying Robots.txt

Simply obeying robots.txt while crawling can save you a lot of hassle. While the robots.txt file itself provides no protection against scrapers/crawlers, some webmasters will simply block any IP which makes many requests to pages blocked within the robots.txt file. The proportion of webmasters who actively do this is relatively small, but obeying robots.txt can definitely save you some significant trouble. If the content you need to reach is blocked off by the robots.txt file, you may just have to ignore the robots.txt file.

Cookies

In some circumstances, it may be worth collecting and holding onto cookies. When scraping services such as Google, results returned by the search engine can be influenced by cookies. The majority of people scraping Google search results do not send any cookie information with their request, which is abnormal from a behaviour perspective. Provided that you do not mind receiving personalised results, it may be a good idea for some scrapers to send cookies along with their requests.

Captchas

Captchas are one of the more difficult anti-scraping measures to crack. Fortunately for scrapers, captchas are also incredibly annoying to real users, which means not many sites use them, and when they are used they are normally limited to forms. Breaking captchas can either be done via computer vision tools such as tesseract-ocr, or solutions can be purchased from a number of API services which use humans to solve the underlying captchas. These services are available even for the latest Google image reCAPTCHAs and simply impose an additional cost on the person scraping.

By combining some of the advice above you should be able to scrape the vast majority of sites without ever coming across any issues.