Detecting Selenium

When looking to extract information from more difficult-to-scrape sites, many programmers turn to browser automation tools such as Selenium and iMacros. At the time of writing, Selenium is by far the most popular option for those looking to leverage browser automation for information retrieval purposes. However, Selenium is very detectable, and site owners would be able to block a large percentage of all Selenium users.

Selenium Detection with Chrome

When using Chrome, the Selenium driver injects a webdriver property into the browser’s navigator object. This means it’s possible to write a couple of lines of JavaScript that check whether navigator.webdriver is set to true and redirect the user should this be the case. I have never seen this technique used in the wild, but I can confirm that it seems to successfully redirect those using Chrome with Selenium.

Selenium Detection with Firefox

Older versions of Firefox used to inject a webdriver attribute into the HTML document. This means that older versions of Firefox could be detected very simply by checking for this attribute. At the time of writing, Firefox no longer adds this attribute to pages when using Selenium.

Additional methods of detecting Selenium when using Firefox have also been suggested. Testing seems to suggest that these do not work with the latest builds of Firefox. However, the webdriver standard suggests that this may eventually be implemented in Firefox again.

Selenium Detection with PhantomJS

All current versions of PhantomJS add attributes to the window element. This allows site owners to simply check whether these specific PhantomJS attributes are set and redirect the user away when it turns out that they are using PhantomJS. It should also be noted that support for the PhantomJS project has been rather inconsistent, and the project makes use of an outdated WebKit version which is also detectable and could present a security risk.

Avoiding Detection

Your best bet for avoiding detection when using Selenium is to use one of the latest builds of Firefox, which don’t appear to give off any obvious sign that you are using Selenium. Additionally, it may be worth experimenting with both Safari and Opera, which are much less commonly used by those scraping the web. It also seems likely that Firefox gives off some less obvious footprint, which would need further investigation to discover.

Scraping & Health Monitoring free proxies with Python

When web scraping, you often need to source a number of proxies in order to avoid being banned or to get around rate limiting imposed by the website in question. This often sees developers purchasing proxies from some sort of commercial provider, which can become quite costly if you only need the proxies for a short period of time. So in this post we are going to look at how you might use proxies from freely available proxy lists to scrape the internet.

Problems With Free Proxies

  • Free Proxies Die Very Quickly
  • Free Proxies Get Blocked By Popular Sites
  • Free Proxies Frequently Timeout

While free proxies are great in the sense that they are free, they tend to be highly unreliable: their up-time is inconsistent and they get blocked quickly by popular sites such as Google. Our solution is also going to build in some monitoring of the current status of the proxy in question, allowing us to avoid using proxies which are currently broken.

Scraping Proxies

We are going to use free-proxy-list.net as our source for this example, but the example could easily be expanded to cover multiple sources of proxies. We simply write a method which visits the page and pulls out all the proxies listed there using our chosen user-agent. We then store the results in a dictionary, with each proxy acting as a key holding the information relating to that particular proxy. We are not doing any error handling here; this will be handled in our ProxyManager class. A sketch of such a method follows.
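
Since the original snippet is not reproduced here, the sketch below shows roughly what such a method could look like. The table layout of free-proxy-list.net changes from time to time, so the selectors and column positions are assumptions that may need adjusting.

    import requests
    from bs4 import BeautifulSoup

    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # example user-agent


    def scrape_free_proxy_list(user_agent=USER_AGENT):
        """Return a dict keyed by 'ip:port' holding info about each proxy."""
        response = requests.get('https://free-proxy-list.net/',
                                headers={'User-Agent': user_agent})
        soup = BeautifulSoup(response.text, 'html.parser')
        proxies = {}
        # Assumes each proxy sits in a table row with the columns:
        # IP, Port, Code, Country, Anonymity, Google, Https, Last Checked
        for row in soup.select('table tbody tr'):
            cells = [td.get_text(strip=True) for td in row.find_all('td')]
            if len(cells) < 8:
                continue
            proxy = '{}:{}'.format(cells[0], cells[1])
            proxies[proxy] = {'country': cells[3], 'anonymity': cells[4],
                              'https': cells[6], 'alive': None, 'last_checked': None}
        return proxies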

Proxy Manager

Our proxy manager is a simple class which allows us to get and manage the proxies we find on free-proxy-list.net. We pass in a test URL which will be used to test whether a proxy is working, and a user agent to be used both for scraping and for testing the proxies in question. We also create a thread pool, so we can more quickly check the status of the proxies we have scraped. We then call our update_proxy_list method, loading the proxies we have found on free-proxy-list.net into our dictionary of proxies.

Checking Proxies

We can now write a couple of methods to test whether a particular proxy works. The first method takes the proxy and the dictionary of information related to the proxy in question. We immediately set the last checked variable to the current time. We then make a request against our test URL, with a relatively short timeout, and check the status of the request, raising an exception should we receive a non-200 status code. Should anything go wrong, we set the status of the proxy to dead; otherwise we set the status to alive.

We then write our refresh proxy status method, which simply calls our proxy checking method for each proxy. We iterate over our dictionary, submitting each proxy and its related info to a thread. If we didn’t use threads to check the status of our proxies, we could be waiting a very long time for our results. We then loop through our results and update the status of the proxy in question. Both pieces are sketched below.
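
A minimal sketch of the ProxyManager along the lines described might look as follows. It reuses the scrape_free_proxy_list function from the earlier snippet, and the exact method and argument names are assumptions rather than the original code.

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests


    class ProxyManager:

        def __init__(self, test_url, user_agent, max_workers=20):
            self.test_url = test_url
            self.user_agent = user_agent
            self.pool = ThreadPoolExecutor(max_workers=max_workers)
            self.proxies = {}
            self.update_proxy_list()

        def update_proxy_list(self):
            try:
                # scrape_free_proxy_list is the scraping function sketched earlier
                self.proxies.update(scrape_free_proxy_list(self.user_agent))
            except requests.RequestException:
                pass  # keep whatever proxies we already have

        def check_proxy(self, proxy, info):
            info['last_checked'] = time.time()
            try:
                response = requests.get(self.test_url,
                                        headers={'User-Agent': self.user_agent},
                                        proxies={'http': 'http://' + proxy,
                                                 'https': 'http://' + proxy},
                                        timeout=5)
                response.raise_for_status()
                info['alive'] = True
            except requests.RequestException:
                info['alive'] = False
            return proxy, info

        def refresh_proxy_status(self):
            futures = [self.pool.submit(self.check_proxy, proxy, info)
                       for proxy, info in self.proxies.items()]
            for future in futures:
                proxy, info = future.result()
                self.proxies[proxy] = info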

Getting A Proxy

We then write two methods for getting ourselves a proxy. Our first method allows us to get a list of proxies by passing in a relevant key and value. This method allows us to get a list of proxies that relate to a particular country or boast a particular level of anonymity. This can be useful should we be interested in particular properties of a proxy.

We also have a simple method that allows us to return a single working proxy. This returns the first working proxy found within our proxy dictionary by looping over all the items in the dictionary, and returning the first proxy where ‘alive’ is equal to true.
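​
The two getter methods described above might look something like this. They are written here as standalone functions taking self, but they belong on the ProxyManager class sketched earlier, and the names are assumptions.

    # Methods belonging to the ProxyManager class sketched above

    def get_proxies(self, key, value):
        """Return all proxies whose info matches a key/value pair, e.g. ('country', 'Germany')."""
        return [(proxy, info) for proxy, info in self.proxies.items()
                if info.get(key) == value]

    def get_proxy(self):
        """Return the first proxy currently marked as alive, or None."""
        for proxy, info in self.proxies.items():
            if info.get('alive'):
                return proxy
        return None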

Example Usage

Using the library is pretty simple. We just create the class passing in our test URL (using Google.com here) and our selected user-agent. We then call refresh_proxy_status, updating the status of the scraped proxies by running them against our test URL. We can then pull out an individual working proxy. We can then update our proxy list with a fresh scrape of our source should we not be satisfied with the proxies we currently have access to.
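
Putting the sketch together, usage could look roughly like this, assuming the getter methods above have been added to the ProxyManager class.

    manager = ProxyManager('https://www.google.com', USER_AGENT)

    manager.refresh_proxy_status()       # test every scraped proxy against the test URL
    working_proxy = manager.get_proxy()  # grab a single proxy currently marked as alive
    print(working_proxy)

    manager.update_proxy_list()          # pull a fresh batch of proxies if needed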

Full Code

Scraping Instagram with Python

In today’s post we are going to look at how you can extract information from a user’s Instagram profile. It’s surprisingly easy to extract profile information, such as the number of followers a user has, and information and image files for a user’s most recent posts. With a bit of effort it would be relatively easy to extract large chunks of data regarding a user. This could then be applied at a very broad scale to extract a large chunk of all public posts featured on Instagram’s site.

Imports & Setup

We begin by making our imports and writing the dunder init method for our class. Our code requires two packages not included in the standard library: requests for making HTTP requests and BeautifulSoup to make HTML parsing more user-friendly. If you do not already have these libraries installed, you can grab them both with ‘pip install requests beautifulsoup4’.

The init method of our class takes two optional keyword arguments, which we simply store in self. This will allow us to override the default user agent list and use a proxy should we wish to avoid detection.

We then write two helper methods. First, we write a very simple method that returns a random user-agent. Switching user agents is often a best practice when web scraping and can help you avoid detection. Should the caller of our class have provided their own list of user agents, we take a random agent from the provided list. Otherwise we return our default user agent.

Our second helper method is simply a wrapper around requests. We pass in a URL and try to make a request using the provided user agent and proxy. If we are unable to make the request, or Instagram responds with a non-200 status code, we simply re-raise the error. If everything goes fine, we return the HTML of the page in question. Both helpers are sketched below.
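
A sketch of the class and its two helpers follows. The class and method names are my own placeholders rather than the original code, but the behaviour matches the description above.

    import random

    import requests

    DEFAULT_USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


    class InstagramScraper:

        def __init__(self, user_agents=None, proxy=None):
            self.user_agents = user_agents
            self.proxy = proxy

        def _random_agent(self):
            # Use a random agent from the caller's list, falling back to our default
            if self.user_agents:
                return random.choice(self.user_agents)
            return DEFAULT_USER_AGENT

        def _request_url(self, url):
            proxies = {'http': self.proxy, 'https': self.proxy} if self.proxy else None
            try:
                response = requests.get(url,
                                        headers={'User-Agent': self._random_agent()},
                                        proxies=proxies)
                response.raise_for_status()
            except requests.RequestException:
                raise  # re-raise so the caller knows the request failed
            return response.text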

Extracting JSON from JavaScript

Instagram serves all of the information regarding a user in the form of a JavaScript object embedded in the profile page. This means that we can extract all of a user’s profile information and their recent posts by just making an HTTP request to their profile page. We simply need to turn this JavaScript object into JSON, which is very easy to do.

We can write a very hacky, but effective, method to extract JSON from a user profile. We apply the static method decorator to this function, as it’s possible to use this method without initialising our class. We simply create a soup from the HTML, select the body of the content and then pull out the first ‘script’ tag. We can then do a couple of text replacements on the script tag to derive a string which can be loaded into a dictionary object using the json.loads method.
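
Instagram has changed its page structure repeatedly since this was written; the sketch below assumes the profile data is embedded in the first script tag of the body as a ‘window._sharedData = {...};’ assignment, which was the layout at the time.

    import json

    from bs4 import BeautifulSoup

    # A static method belonging to the InstagramScraper class sketched above

    @staticmethod
    def extract_json_data(html):
        soup = BeautifulSoup(html, 'html.parser')
        script_tag = soup.find('body').find('script')
        raw_string = script_tag.text.strip()
        # Strip the JavaScript assignment so that only the JSON object remains
        raw_string = raw_string.replace('window._sharedData =', '', 1).rstrip(';')
        return json.loads(raw_string)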

Bringing it all together

We then bring it all together in two functions which we can use to extract information from this very large JSON object. We first make a request to the page, before extracting the JSON result. We then use two different selectors to pull out the relevant bits of information, as the default JSON object has lots of information we don’t really need.

When extracting profile information we extract all attributes from the “user” object, excluding their recent posts. In the “recent posts” function, we use a slightly different selector and pull out all the information about all of the recent posts made by our targeted user.
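
Sketches of the two functions follow. The key names (‘entry_data’, ‘ProfilePage’, ‘graphql’, ‘user’, ‘edge_owner_to_timeline_media’) reflect the JSON layout Instagram used around the time of writing and are assumptions that may well have changed.

    # Methods belonging to the InstagramScraper class sketched above

    def profile_page_metrics(self, profile_url):
        html = self._request_url(profile_url)
        data = self.extract_json_data(html)
        user = data['entry_data']['ProfilePage'][0]['graphql']['user']
        # Keep everything about the user except the recent posts themselves
        return {key: value for key, value in user.items()
                if key != 'edge_owner_to_timeline_media'}

    def profile_page_recent_posts(self, profile_url):
        html = self._request_url(profile_url)
        data = self.extract_json_data(html)
        user = data['entry_data']['ProfilePage'][0]['graphql']['user']
        return [edge['node'] for edge in user['edge_owner_to_timeline_media']['edges']]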

Example Usage

We can then use the Instagram scraper in a very simple fashion to pull out the most recent posts from our favorite users. You could do lots of things with the resulting data, which could be used in an Instagram analytics app for instance, or you could simply programmatically download all the images relating to that user.
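
Usage might look roughly like the following; the field names used in the print statements are assumptions about the post JSON and may differ.

    scraper = InstagramScraper()

    profile = scraper.profile_page_metrics('https://www.instagram.com/instagram/')
    print(profile.get('edge_followed_by'))    # follower counts (field name is an assumption)

    posts = scraper.profile_page_recent_posts('https://www.instagram.com/instagram/')
    for post in posts:
        print(post.get('display_url'))        # image URL of each recent post (assumption)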

There is certainly room for improvement and modification. It would also be possible to use Instagram’s graph API to pull out further posts from a particular user, or to pull out lists of a user’s recent followers etc., allowing you to collect large amounts of data without having to deal with Facebook’s restrictive API limitations and policies.

Full Code

Scraping Baidu with Python

 

What’s Baidu?

Baidu is China’s largest search engine and has been since Google left the market in 2010. As companies look to move into the Chinese market, there has been more and more interest in scraping search results from Baidu.

Scraping Baidu

Scraping Baidu is a relatively simple task. When scraping results from Baidu there is only one minor challenge: the URLs displayed on the Baidu results page are found nowhere in the HTML. Baidu links to the sites displayed on the search results page via their own redirector service. In order to get the full final URL we have to follow these redirects. In this post we are going to walk through how to scrape the Baidu search results page.

Imports & Class Definition

In order to scrape Baidu, we only need to import two libraries outside of the standard library. Bs4 helps us parse HTML, while requests provides us with a nicer interface for making HTTP requests with Python.

As we are going to scrape multiple pages of Baidu in this tutorial, we are going to initialise a class to hold onto the important information for us.

We initialise our BaiduBot class with a search term and the number of pages to scrape. We also give ourselves the ability to pass a number of keyword arguments to our class. This allows us to pass a proxy, a custom connection timeout, a custom user agent and an optional delay between each of the results pages we want to scrape. The keyword arguments may be a lot of help if we end up being blocked by Baidu. When initialising the class we also store our base URL, which we use when scraping the subsequent pages.
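
A sketch of the initialiser along these lines might look as follows; the argument names and the exact shape of the base URL are assumptions.

    import time

    import requests
    from bs4 import BeautifulSoup

    DEFAULT_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


    class BaiduBot:

        def __init__(self, search_term, pages, proxy=None, timeout=10,
                     user_agent=DEFAULT_AGENT, delay=0):
            self.search_term = search_term
            self.pages = pages
            self.proxy = {'http': proxy, 'https': proxy} if proxy else None
            self.timeout = timeout
            self.user_agent = user_agent
            self.delay = delay
            # wd is the search term, pn the index of the first result on the page
            self.base_url = 'https://www.baidu.com/s?wd={}&pn={}'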

Making Requests & Parsing HTML

We first define a function to scrape a page of Baidu. Here we simply try to make a request and check that the response has a 200 status. Should Baidu start serving us with non-200 status codes, this likely means that they have detected unusual behaviour from our IP and we should probably back off for a while. If there is no issue with the request, we simply return the response object.

Now that we have a way to make HTML requests, we need to write a method for parsing the results page. Our parser is going to take in the HTML and return a list of dictionary objects. Each result is handily contained within a ‘div’ called ‘c-container’, which makes it very easy for us to pick out each result. We can then iterate across all of our returned results, using relatively simple BeautifulSoup selectors, before appending each result to our results list.
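
The request and parsing methods might look like the sketch below, written to continue the BaiduBot class above; the selectors inside each ‘c-container’ are assumptions that may need tweaking against the live results page.

    # Methods belonging to the BaiduBot class sketched above

    def baidu_request(self, url):
        response = requests.get(url, headers={'User-Agent': self.user_agent},
                                proxies=self.proxy, timeout=self.timeout)
        response.raise_for_status()   # a non-200 code suggests Baidu has flagged our traffic
        return response

    def parse_results(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        results = []
        # Each organic result sits within a div with the class 'c-container'
        for container in soup.find_all('div', class_='c-container'):
            link = container.find('a')
            if not link or not link.get('href'):
                continue
            results.append({'title': link.get_text(strip=True),
                            'url': link['href']})   # this is Baidu's redirector URL
        return results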

Getting the Underlying URL

As previously mentioned the full underlying URL is not displayed anywhere in Baidu’s search results. This means we must write a couple of functions to extract the full underlying URL. There may be another way to get this URL, but I’m not aware of it. If you know how, please share the method with me in the comments.

Our resolve_urls function is very similar to our Baidu request function. Instead of a response object we are returning the final URL by simply following the chain of redirects. Should we encounter any sort of error we are simply returning the original URL, as found within the search results. But this issue is relatively rare, so it shouldn’t impact our data too much.

We then write another function that allows us to use our resolve_urls function over a set of results, updating the URL within our dictionary with the real underlying URL and the rank of the URL in question.
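
Sketches of the two URL-resolving functions follow, again as methods continuing the BaiduBot class; the name of the batch function is an assumption.

    # Methods belonging to the BaiduBot class sketched above

    def resolve_urls(self, url):
        try:
            # Following Baidu's redirector gives us the real destination URL
            response = requests.get(url, headers={'User-Agent': self.user_agent},
                                    proxies=self.proxy, timeout=self.timeout,
                                    allow_redirects=True)
            return response.url
        except requests.RequestException:
            return url   # fall back to the redirector URL on any error

    def resolve_result_urls(self, results):
        for rank, result in enumerate(results, start=1):
            result['url'] = self.resolve_urls(result['url'])
            result['rank'] = rank
        return results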

Bringing It All Together

We bring this all together in our scrape_baidu function. We range over our page count variable. For each loop we run through, we multiply the variable by 10 to get the correct pn parameter; the pn variable represents the result index, so our logic ensures we start at 0 and continue on in 10 result increments. We then format our URL using both our search term and this variable, and simply make the request and parse the page using the functions we have already written, before appending the results to our final results variable. Should we have passed a delay argument, we will also sleep for a while before scraping the next page. This will help us avoid getting banned should we want to scrape multiple pages and search terms.
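
Bringing the sketch together, a scrape_baidu method in that spirit could look like this:

    # Method belonging to the BaiduBot class sketched above

    def scrape_baidu(self):
        all_results = []
        for page in range(self.pages):
            pn = page * 10    # pn is the index of the first result on the page
            url = self.base_url.format(self.search_term, pn)
            response = self.baidu_request(url)
            results = self.parse_results(response.text)
            all_results.extend(self.resolve_result_urls(results))
            if self.delay:
                time.sleep(self.delay)   # back off between pages to avoid getting banned
        return all_results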

Full Code

 

Ultimate Introduction to Web Scraping in Python: From Novice to Expert

Python is one of the most accessible fully featured programming languages, which makes it a perfect language for those looking to learn to program. This post aims to introduce the reader to web scraping, allowing them to build their own scrapers and crawlers to collect data from the internet.

Contents

  1. Introduction to Web Scraping
  2. Making HTTP Requests with Python
  3. Handling HTTP Errors
  4. Parsing HTML with BeautifulSoup

MORE TO COME

Introduction to Web Scraping

Web scraping, sometimes referred to as screen scraping, is the practice of using programs to visit websites and extract information from them. This allows users to collect information from the web in a programmatic manner, as opposed to having to manually visit a page and extract the information into some sort of data store. At their core, major search engines such as Google and Bing make use of web scraping to extract information from millions of pages every day.
Web scraping has a wide range of uses, including but not limited to fighting copyright infringement, collecting business intelligence, collecting data for data science, and for use within the fintech industry. This mega post is aimed at teaching you how to build scrapers and crawlers which will allow you to extract data from a wide range of sites.

This post assumes that you have Python 3.5+ installed and you have learnt how to install libraries via Pip. If not, it would be a good time to Google ‘how to install python’ and ‘how to use pip’. Those familiar with the requests library may want to skip ahead several parts.

Making HTTP Requests with Python

When accessing a website our browser makes a number of HTTP requests in the background. The majority of internet users aren’t aware of the number of HTTP requests required to access a web page. These requests load the page itself and may make additional requests to resources which are loaded by the page, such as images, videos and style sheets. You can see a breakdown of the requests made by opening up your browser’s development tools and navigating to the ‘Network’ tab.

The majority of requests made to a website are made using a ‘GET’ request. As the name suggests a ‘GET’ request attempts to retrieve the content available at the specified address. HTTP supports a variety of other methods such as ‘POST’, ‘DELETE’, ‘PUT’ and ‘OPTIONS’. These methods are sometimes referred to as HTTP verbs. We will discuss these methods later.

Python’s standard library contains a module which allows us to make HTTP requests. While this library is perfectly functional, the user interface is not particularly friendly. In this mega post we are going to make use of the requests library, which provides us with a much friendlier user interface and can be installed by running ‘pip install requests’.

Making a HTTP request with Python can be done in a couple of lines. Below we are going to demonstrate how to make a request and walk through the code line by line.
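
The original snippet is not reproduced here, but a reconstruction matching the description below would be:

    import requests

    response_object = requests.get('http://edmundmartin.com')
    print(response_object.text)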

First, we import the requests library which gives us access to the functions contained within the library. We then make a HTTP request to ‘http://edmundmartin.com’ using the ‘GET’ verb, by calling the get method contained within the requests library. We store the result of this request in a variable named ‘response_object’. The response object contains a number of pieces of information that are useful when scraping the web. Here, we access the text (HTML) of the response which we print to the screen. Provided the site is up and available, users running this script should be greeted with a wall of HTML.

Handling HTTP Errors

When making HTTP requests there is significant room for things to go wrong. Your internet connection may be down or the site in question may not be reachable. When scraping the internet we typically want to handle these errors and continue on without crashing our program. For this we are going to want to write a function which will allow us to make a HTTP request and deal with any errors. Additionally, by encapsulating this logic within a function we can reuse our code with greater ease, by simply calling the function every time we want to make a HTTP request.

The below code is an example of a function which makes a request and deals with a number of common errors we are likely to encounter. The function is explained in more detail below.
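
A reconstruction matching that description might look like this:

    import logging

    import requests

    logging.basicConfig(level=logging.INFO)


    def get_request(url):
        try:
            response = requests.get(url)
            response.raise_for_status()   # force an error on any non-200 status code
            return response
        except requests.HTTPError:
            logging.error('Received a bad status code from: %s', url)
        except requests.ConnectionError:
            logging.error('Could not connect to: %s', url)
        except requests.RequestException as e:
            logging.error('Request to %s failed: %s', url, e)
        # Falling through returns None, signalling to the caller that the request failed


    if __name__ == '__main__':
        result = get_request('http://edmundmartin.com')
        if result:
            print(result.text)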

Our basic get_request function takes one argument, the string of the URL we want to retrieve. We then make the request just as before. This time however our request is wrapped in a try and except block which allows us to catch any errors should something go wrong. After the request we then check the status code of our response. Every time you make a request the server in question will respond with a code indicating whether the request has been a success or not. If everything went fine then you will receive a 200 status code, otherwise you are likely to receive a 404 (‘Page Not Found’) or 503 (‘Service Unavailable’). By default, the requests library does not throw an error should a web server respond with a bad status code but rather continues silently. By using raise_for_status we force an error should we receive a bad status code. Should there be no error thrown we then return our response object.

If all did not go so well, we then handle all of our errors. Firstly, we check whether the page responded with a non-200 status code, by catching the requests.HTTPError. We then check whether the request failed due to a bad connection by checking for the requests.ConnectionError exception. Finally, we use the generic requests.RequestException to catch all other exceptions that can be thrown by the requests library. The ordering of our exceptions is important: requests.RequestException is the most generic and would catch either of the other exceptions. Should this have been the first exception handled, the other lines of code would never run regardless of the reason for the exception.

When handling each exception, we use the standard library’s logging library to print out a message of what went wrong when making the request. This is very handy and is a good habit to get into, as it makes debugging programs much easier. If an exception is thrown we return nothing from our function, which we can then check later. Otherwise we return the response.

At the bottom of the script, I have provided a simple example of how this function could be used to print out a page’s HTML response.

Parsing HTML with BeautifulSoup

So far everything we have done has been rather boring and not particularly useful. This is due to the fact that we have just been making requests and then printing the HTML. We can however do much more interesting things with our responses.

This is where BeautifulSoup comes in. BeautifulSoup is a library for the parsing of HTML, allowing us to easily extract the elements of the page that we are most interested in. While BeautifulSoup is not the fastest way to parse a page, it has a very beginner-friendly API. The BeautifulSoup library can be installed by running ‘pip install beautifulsoup4’.

The code below expands on the code we wrote in the previous section and actually uses our response for something.  A full explanation can be found after the code snippet.
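
A reconstruction of that snippet, reusing the get_request function from the previous section, might look like this:

    from bs4 import BeautifulSoup

    # get_request is the error-handling function written in the previous section


    def print_titles(url):
        response = get_request(url)
        if not response:
            return
        soup = BeautifulSoup(response.text, 'html.parser')
        # Article titles on the example blog are h2 elements with the class 'entry-title'
        titles = soup.find_all('h2', {'class': 'entry-title'})
        for title in titles:
            print(title.get_text())


    if __name__ == '__main__':
        print_titles('http://edmundmartin.com')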

The code snippet above uses the same get_request function as before which I have removed for the sake of brevity. Firstly, we must import the BeautifulSoup library, we do this by adding the line ‘from bs4 import BeautifulSoup’. Doing this gives us access to the BeautifulSoup class which is used for parsing HTML responses. We then generate a ‘soup’ by passing our HTML to BeautifulSoup, here we also pass a string signifying the underlying html parsing library to be used. This is not required but BeautifulSoup will print a rather long winded warning should you omit this.

Once the soup has been created we can then use the ‘find_all’ method to discover all of the elements matching our search. The soup object, also has a method ‘find’ which will only return the first element matching our search. In this example, we first pass in the name of the HTML element we want to select. In this case it’s the heading 2 element represented in HTML by ‘h2’. We then pass a dictionary containing additional information. On my blog all article titles are ‘h2’ elements, with the ‘class’ of ‘entry-title’. This class attribute is what is used by CSS to make the titles stand out from the rest of the page, but can help us in selecting the elements of the page which we want.

Should our selector find anything we should be returned with a list of title elements. We can then write a for loop, which goes through each of these titles and prints the text of the title, by calling the get_text() method. A note of caution, should your selector not find anything calling the get_text() method on the result will throw an exception.  Should everything run without any errors the code snippet above should return the titles of the ten most recent articles from my website. This is all that is really required to get started with extracting information from websites, though picking the correct selector can take a little bit of work.

In the next section we are going to write a scraper which will extract information from Google, using what we have learnt so far.

Selenium Tips & Tricks in Python

Selenium is a great tool and can be used for a variety of different purposes. It can sometimes however be a bit tricky to make Selenium behave exactly how you want. This article shows you how you can make the most of the library’s advanced features, to make your life easier and help you extract data from websites.

Running Chrome Headless

Provided you have one of the latest versions of ChromeDriver, it is now very easy to run Selenium headless. This allows you to run the browser in the background without a visible window. We can simply add a couple of lines of code to our browser options on start-up and then access webpages with Selenium running quietly in the background. It should be noted that some sites can detect whether you are running Chrome headless and may block you from accessing content.
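
A minimal sketch of running Chrome headless with Selenium; exact option handling varies a little between Selenium versions.

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')              # run without a visible browser window
    options.add_argument('--window-size=1920,1080')

    # Older Selenium releases take chrome_options=options instead of options=options
    driver = webdriver.Chrome(options=options)
    driver.get('http://edmundmartin.com')
    print(driver.title)
    driver.quit()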

Using A Proxy With Selenium

There are occasions when you may want to use a proxy with Selenium. To use a proxy with Selenium we simply add an argument to Chrome Options when initialising our Selenium instance. Unfortunately, there is no way to change the proxy used once set. This means that to rotate proxies while using Selenium, you have to either restart the Selenium browser or use a rotating proxy service, which can come with its own set of issues.
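
Passing a proxy via Chrome Options might look like the following sketch; the proxy address is a placeholder.

    from selenium import webdriver

    PROXY = 'http://12.34.56.78:3128'   # hypothetical proxy address

    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server={}'.format(PROXY))

    driver = webdriver.Chrome(options=options)
    driver.get('https://httpbin.org/ip')   # should now report the proxy's IP
    driver.quit()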

Accessing Content Within An Iframe

Sometimes the content we want to extract from a website may be buried within an iframe. By default when you ask Selenium to return you the html content of a page, you will miss out on all the information contained within any iframes on the page. You can however access content contained within the iframe.

To switch to the iframe we want to extract data from, we first use Selenium’s find_element method. I would recommend using the find_element_by_css_selector method, which tends to be more reliable than trying to extract content by using an xpath selector. We then pass our target to a method which allows us to switch the browser’s context to our target iframe. We can then access the HTML content and interact with content within the iframe. If we want to revert back to our original context, we simply switch back to the default content.
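
A sketch of switching into and out of an iframe; the page URL and the iframe selector are placeholders.

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('http://example.com/page-with-iframe')   # hypothetical page

    # Selenium 3 style selector as mentioned above; Selenium 4+ uses
    # driver.find_element(By.CSS_SELECTOR, 'iframe#content')
    iframe = driver.find_element_by_css_selector('iframe#content')
    driver.switch_to.frame(iframe)        # our context is now inside the iframe
    print(driver.page_source)             # HTML of the iframe's document

    driver.switch_to.default_content()    # revert back to the original context
    driver.quit()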

Accessing Slow Sites

The modern web is overloaded with JavaScript, and this can cause Selenium to throw a lot of timeout errors, with Selenium timing out if a page takes more than 20 seconds to load. The simplest way to deal with this is to increase Selenium’s default timeout. This is particularly useful when trying to access sites via a proxy, which slows down your connection speed.
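
Increasing the default timeouts only takes a couple of lines, for example:

    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException

    driver = webdriver.Chrome()
    driver.set_page_load_timeout(60)   # allow slow pages (and slow proxies) up to 60 seconds
    driver.set_script_timeout(30)      # limit for asynchronous scripts

    try:
        driver.get('http://edmundmartin.com')
    except TimeoutException:
        print('Page took too long to load')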

Scrolling

Selenium by default does not allow users to scroll down pages. The browser automation framework does however allow users to execute JavaScript. This makes it very easy to scroll down pages, which is particularly useful when trying to scrape content from a page which continues to load content as the user scrolls down.

For some reason Selenium can be funny with executing window scroll commands, and it is sometimes necessary to call the command in a loop in order to scroll down the entirety of a page.
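
A common pattern is to scroll in a loop until the page height stops growing, for example:

    import time

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('http://edmundmartin.com')

    last_height = driver.execute_script('return document.body.scrollHeight')
    while True:
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)   # give any lazy-loaded content a chance to appear
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break       # the page has stopped growing, so we are at the bottom
        last_height = new_height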

Executing JavaScript & Returning The Result

While many users of Selenium know that it is possible to run JavaScript, allowing for more complicated interactions with the page, fewer know that it is also possible to return the result of executed JavaScript. This allows your browser to execute functions defined in the page’s DOM and return the results to your Python script. This can be great for extracting data from tough-to-scrape websites.
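
Anything returned by the executed JavaScript is handed back to Python, for example:

    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get('http://edmundmartin.com')

    title = driver.execute_script('return document.title;')
    link_count = driver.execute_script('return document.querySelectorAll("a").length;')
    print(title, link_count)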

Scraping JavaScript Heavy Pages with Python & Splash

Scraping the modern web can be particularly challenging. These days many websites make use of JavaScript frameworks to serve much of a page’s important content. This breaks traditional scrapers, as our scrapers are unable to extract the information we need from our initial HTTP request.

So what should we do when we come across a site that makes extensive use of JavaScript? One option is to use Selenium. Selenium provides us with an easy to use API, with which we can automate a web browser. This is great for tasks where we need to interact with the page, whether that be to scroll or to click certain elements. It is however a bit over the top when you simply want to render JavaScript.

Introducing Splash

Splash is a JavaScript rendering service from the creators of the popular Scrapy framework. Splash can be run as a server on your local machine. The server, built using Twisted and Python, allows us to scrape pages using its HTTP API. This means we can render JavaScript pages without the need for a full browser. The use of Twisted also means we can render multiple pages in parallel.

Installing Splash

Full instructions for installing Splash can be found in the Splash docs. That being said, it is highly recommended that you use Splash with Docker, which makes starting and stopping the server very easy.

Building A Custom Python Crawler With Splash

Splash was designed to be used with Scrapy and Scrapinghub, but it can just as easily be used with Python. In this example we are going to build a multi-threaded crawler using requests and Beautiful Soup. We are going to scrape an e-commerce website which uses a popular JavaScript library to load product information on category pages.

Imports & Class Initialisation

To write this scraper we are only going to use two libraries outside of the standard library. If you have ever done any web scraping before, you are likely to have both Requests and BeautifulSoup installed. Otherwise go ahead and grab them using pip.

We then create a SplashScraper class. Our crawler only takes one argument, namely the URL we want to begin our crawl from. We then use the URL parsing library to create a string holding the site’s root URL; we use this URL to prevent our crawler from scraping pages not on our base domain.

One of the main selling points of Splash, is the fact that it is asynchronous. This means that we can render multiple pages at a time, making our crawler significantly more performant than using a standalone instance of Selenium. To make the most of this we are going to use a ThreadPool to scrape pages, allowing us to make up to twenty simultaneous requests.

We create a queue which we are going to use to grab URLs from and send to be executed in our thread pool. We then create a set to hold a list of all the pages we have already queued. Finally, we put the base URL into our queue, ensuring we start crawling from the base URL.
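
A sketch of the class initialisation along those lines follows; the names are assumptions, and Splash is assumed to be running locally on its default port 8050.

    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue, Empty
    from urllib.parse import urlparse, urljoin

    import requests
    from bs4 import BeautifulSoup


    class SplashScraper:

        def __init__(self, base_url):
            self.base_url = base_url
            parsed = urlparse(base_url)
            self.root_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
            self.pool = ThreadPoolExecutor(max_workers=20)   # up to twenty simultaneous renders
            self.to_crawl = Queue()
            self.seen_urls = set()
            self.to_crawl.put(base_url)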

Extracting Links & Parsing Page Data

Next we define two methods to use with our scraped HTML. Firstly, we take the HTML and extract all the links which contain a href attribute. We iterate over our list of links pulling out the href element. If the URL starts with a slash or starts with the site’s URL, we call urlparse’s urljoin method which creates an absolute link out of the two strings. If we haven’t already crawled this page, we then add the URL to the queue.

Our scrape_info method simply takes the HTML and scrapes certain information from the rendered HTML. We then use some relatively rough logic to pull out name and price information before writing this information to a CSV file. This method can be overwritten with custom logic to pull out the particular information you need.
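
Sketches of the two methods follow, continuing the SplashScraper class above. The product selectors in scrape_info are placeholders to be replaced with ones matching the target site.

    import csv

    # Methods belonging to the SplashScraper class sketched above

    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.startswith('/') or href.startswith(self.root_url):
                url = urljoin(self.root_url, href)
                if url not in self.seen_urls:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        with open('results.csv', 'a', newline='') as output:
            writer = csv.writer(output)
            # Placeholder selectors for name and price information
            for product in soup.find_all('div', class_='product'):
                name = product.find('h3')
                price = product.find('span', class_='price')
                if name and price:
                    writer.writerow([name.get_text(strip=True), price.get_text(strip=True)])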

Grabbing A Page & Defining Our Callback

When using a thread pool executor, one of the best ways of getting the result out of a function which will be run in a thread is to use a callback. The callback will be run once the function run in the thread has completed. We define a super simple callback that unpacks our result, and then checks whether the page gave us a 200 status code. If the page responded with a 200, we then run both our parse_links and scrape_info methods using the page’s HTML.

Our scrape_page function is very simple. As we are simply making a request to a server running locally, we don’t need any error handling. We simply pass in a URL, which is then formatted into the request. We then simply return the response object, which will then be used in our callback function defined above.
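
The callback and the request to the local Splash server might look like this, again continuing the class; render.html is Splash’s standard endpoint for returning the rendered HTML of a page.

    # Methods belonging to the SplashScraper class sketched above

    def post_scrape_callback(self, res):
        result = res.result()                      # unpack the response from the finished future
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        # Ask the local Splash instance to render the page, JavaScript included
        return requests.get('http://localhost:8050/render.html',
                            params={'url': url, 'wait': 2})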

Our Crawling Method

Our run_scraper method is basically our main thread. We continue to try and get links from our queue. In this particular example we have set a timeout of 120 seconds. This means that if we are unable to grab a new URL from the queue, we will raise an Empty error and quit the program. Once we have our URL, we check that it is not in our set of already scraped pages before adding it to the set. We then send off the URL for scraping and set our callback method to run once we have completed our scrape. We ignore any other exception and continue on with our scraping until we have run out of pages we haven’t seen before.
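
Finally, the main loop could be sketched as follows, continuing the same class:

    # Method belonging to the SplashScraper class sketched above

    def run_scraper(self):
        while True:
            try:
                url = self.to_crawl.get(timeout=120)   # give up after two minutes with no new URLs
                if url not in self.seen_urls:
                    self.seen_urls.add(url)
                    job = self.pool.submit(self.scrape_page, url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception:
                continue   # ignore anything else and carry on crawling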

The script in its entirety can be found here on Github.

Selenium Based Crawler in Python

Today, we are going to walk through creating a basic crawler making use of Selenium.

Why Build A Selenium Web Crawler?

First, we should probably address why you might want to build a web crawler using Selenium. The modern web increasingly uses front-end frameworks such as AngularJS and React, which means much of the data you might want to extract will not be readily available without rendering the page’s JavaScript. In instances like this you should first look into whether the site has an underlying private API that you can easily make use of.

Additionally, you may find some sites which run checks to ensure that users are running JavaScript. While there are other ways to get around this, running Selenium will typically make your crawler look like it’s a real browser instance. This is just one way you can work around scraping detection methods.

While Selenium is really a package designed to test web pages, we can easily build our web crawler on top of the package.

Imports & Class Initialisation

To begin we import the libraries we are going to need. Only two of the libraries we are using here aren’t contained within Python’s standard library. Bs4 and Selenium can both be installed by using the pip command and installing these libraries should be relatively pain free.

We then begin with creating and initialising our SeleniumCrawler class. We pass a number of arguments to __init__.

Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass ‘http://edmundmartin.com’ to the base_url argument.

We then take a list of any URLs or URL paths we may want to exclude. If we wanted to exclude any dynamic and sign in pages, we would pass something like [‘?’,’signin’] to the exclusion list argument. URLs matching these patterns would then never be added to our crawl queue.

We have an output file argument already defined, which is just the file into which we will output our crawl results. And then finally, we have a start URL argument which allows you to start a crawl from a different URL than the site’s base URL.

Getting Pages With Selenium

We then create a get_page method. This simply grabs a URL which is passed as an argument and then returns the page’s HTML. If we have any issues with a particular page we simply log the exception and then return nothing.
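
A sketch of the class and its get_page method is shown below; the attribute names follow the description above, but the details are assumptions rather than the original code.

    import csv
    import logging
    from collections import deque
    from urllib.parse import urljoin, urldefrag

    from bs4 import BeautifulSoup
    from selenium import webdriver


    class SeleniumCrawler:

        def __init__(self, base_url, exclusion_list, output_file='results.csv', start_url=None):
            self.base_url = base_url
            self.exclusion_list = exclusion_list
            self.output_file = output_file
            self.driver = webdriver.Chrome()
            self.crawled_urls = set()
            self.url_queue = deque([start_url or base_url])

        def get_page(self, url):
            try:
                self.driver.get(url)
                return self.driver.page_source
            except Exception as e:
                logging.exception(e)     # log the problem and return nothing
                return None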

Creating a Soup

This is again a very simple method which simply checks that we have some HTML and creates a BeautifulSoup object from it. We are then going to use this soup to extract URLs to crawl and the information we are collecting.

Getting Links

Our get_links method takes our soup and finds all the links which we haven’t previously found. First, we find all the ‘a’ items which have a ‘href’ attribute. We then check whether these links contain anything within our exclusion list. If the URL should be excluded, we move onto the next ‘href’. We use urljoin with urldefrag to resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue. If the URL matches our base domain we then finally add it to our queue.
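
A get_links method in that spirit might look like this, continuing the SeleniumCrawler class:

    # Method belonging to the SeleniumCrawler class sketched above

    def get_links(self, soup):
        for link in soup.find_all('a', href=True):
            href = link['href']
            if any(excluded in href for excluded in self.exclusion_list):
                continue                               # skip anything on the exclusion list
            url, _fragment = urldefrag(urljoin(self.base_url, href))
            if url in self.crawled_urls or url in self.url_queue:
                continue                               # already crawled or already queued
            if url.startswith(self.base_url):
                self.url_queue.append(url)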

Getting Data

We then use our soup again to get the title of the article in question. If we come across any issues with getting the title we simply return the string ‘None’. This method could be expanded to collect any of the data you require from the page in question.

Writing to CSV

We simply pass our URL and title to this method and then use the standard library’s csv module to output the data to our target file.

Run Crawler Method

The run crawler method really just brings together all of our already defined methods. While we have unseen URLs, we continue to crawl and take an element from the left of our queue. We then add this to our crawled list and request the page.

Should the end URL be different from the URL we originally requested this URL is also added to the crawled list. This means we don’t visit URLs twice when a redirect has been put in place.

We then grab the soup from the html, and provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
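
The remaining helper methods and the main loop could be sketched as follows, continuing the same class; get_soup, get_data and csv_output correspond to the small methods described above, and the title selector is a placeholder.

    # Methods belonging to the SeleniumCrawler class sketched above

    def get_soup(self, html):
        if html:
            return BeautifulSoup(html, 'html.parser')
        return None

    def get_data(self, soup):
        try:
            return soup.find('h1').get_text(strip=True)   # placeholder extraction logic
        except AttributeError:
            return 'None'

    def csv_output(self, url, title):
        with open(self.output_file, 'a', newline='') as outputfile:
            writer = csv.writer(outputfile)
            writer.writerow([url, title])

    def run_crawler(self):
        while self.url_queue:
            current_url = self.url_queue.popleft()          # take the next URL from the left
            self.crawled_urls.add(current_url)
            html = self.get_page(current_url)
            if self.driver.current_url != current_url:
                self.crawled_urls.add(self.driver.current_url)   # record redirects too
            soup = self.get_soup(html)
            if soup is not None:
                self.get_links(soup)
                title = self.get_data(soup)
                self.csv_output(current_url, title)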

What Can Be Improved?

There are a number of things that can be improved on in this example Selenium crawler.

While I ran a test across over 1,000 URLs, the get_page method may be liable to break. To counter this, it would be recommended to use more sophisticated error handling, importing Selenium’s common errors module. Additionally, this method could just get stuck waiting forever if JavaScript fails to fully load. It would therefore be recommended to add some timeouts on the rendering of JavaScript, which is relatively easy with the Selenium library.

Additionally, this crawler is going to be relatively slow. It’s single threaded and uses Bs4 to parse pages, which is relatively slow compared with using lxml. Both the methods using Bs4 could be quite easily changed to use lxml.

The full code for this post can be found on my Github, feel free to fork, and make pull requests and see what you can do with this underlying basic recipe.

Beautiful Soup vs. lxml – Speed

When comparing Python parsing frameworks, you often hear people complaining that Beautiful Soup is considerably slower than using lxml. Thus, some people conclude that lxml should be used in any performance critical project. Having used Beautiful Soup in a large number of web scraping projects and never having had any real trouble with its performance, I wanted to properly measure the performance of the popular parsing library.

The Test

To test the two libraries, I wrote a simple single threaded crawler which crawls a total of 100 URLs and then simply extracts links and the page title from the page in question, implementing two different parser methods: one using lxml and one using Beautiful Soup. I also tested the speed of Beautiful Soup with various non-default parsers.

Each of the various setups were tested a total of five times to account for varying internet and server response times, with the below results outlining the different performance based on library and underlying parser.

The Results

Parser              Run #1   Run #2   Run #3   Run #4   Run #5   Avg. Speed   Overhead Per Page (Seconds)
lxml                38.49    36.82    39.63    39.02    39.84    38.76        N/A
Bs4 (html.parser)   49.17    52.1     45.35    47.38    47.07    48.21        0.09
Bs4 (html5lib)      54.61    53.17    53.37    56.35    54.12    54.32        0.16
Bs4 (lxml)          42.97    43.71    46.65    44.51    47.9     45.15        0.06

As you can see, lxml is significantly faster than Beautiful Soup. A pure lxml solution is several seconds faster than using Beautiful Soup with lxml as the underlying parser. The built-in Python parser (html.parser) is around 10 seconds slower, whereas the extremely liberal html5lib is even slower. The overhead per page parsed is still relatively small, with both Bs4 (html.parser) and Bs4 (lxml) adding less than 0.1 seconds per page parsed.

Parser              Overhead Per Page (Seconds)   100,000 URLs (Extra Hours)   500,000 URLs (Extra Hours)
Bs4 (html.parser)   0.09454                       2.6                          13.1
Bs4 (html5lib)      0.15564                       4.3                          21.6
Bs4 (lxml)          0.06388                       1.8                          8.9

While the overhead seems very low, when you try to scale a crawler, using Beautiful Soup will add a significant overhead. Even using Beautiful Soup with lxml adds significant overhead when you are trying to scale to hundreds of thousands of URLs. It should be noted that the above table assumes a crawler running a single thread. Anyone looking to crawl more than 100,000 URLs would be highly recommended to build a concurrent crawler making use of a library such as Twisted, Asyncio, or concurrent.futures.

So, the question of whether Beautiful Soup is suitable for your project really depends on the scale and nature of the project. Replacing Beautiful Soup with lxml is likely to see you achieve a small (but considerable at scale) performance improvement. This does however come at the cost of losing the Beautiful Soup API, which makes selecting on-page elements a breeze.

Web Scraping: Avoiding Detection

 

This post avoids the legal and ethical questions surrounding web scraping and simply focuses on the technical aspect of avoiding detection. We are going to look at some of the most effective ways to avoid being detected while crawling/scraping the modern web.

Switching User Agents

Switching or randomly selecting user agents is one of the most effective tactics in avoiding detection. Many sys admins and IT managers monitor the number of requests made by different user agents. If they see an abnormally large number of requests from one IP & user agent, it makes the decision a very simple one – block the offending user agent/IP.

You see this in effect when scraping using the standard headers provided by common HTTP libraries. Try and request an Amazon page using Python’s requests standard headers and you will instantly be served a 503 error.

This makes it key to change up the user agents used by your crawler/scraper. Typically, I would recommend randomly selecting a user-agent from a list of commonly used user-agents. I have written a short post on how to do this using Python’s requests library.

Other Request Headers

Even when you make the effort to switch up user agents, it may be obvious that you are running a crawler/scraper. Sometimes it can be that other elements of your header are giving you away. HTTP libraries tend to send different accept and accept-encoding headers to those sent by real browsers. It can be worth modifying these headers to ensure that you look as much like a real browser as possible.

Going Slow

Many times, when scraping is detected it’s a matter of having made too many requests in too little time. It’s abnormal for a very large number of requests to be made from one IP in a short space of time, making any scraper or crawler trying to go too fast a prime target. Simply waiting a few seconds between each request will likely mean that you will fly under the radar of anyone trying to stop you. In some instances going slower may mean you are not able to collect the data you need quickly enough. If this is the case you probably need to be using proxies.

Proxies

In some situations, proxies are going to be a must. When a site is actively discouraging scraping, proxies make it appear that your requests are coming from multiple sources. This typically allows you to make a larger number of requests than you otherwise would be allowed to make. There are a large number of SaaS companies providing SEOs and digital marketing firms with Google ranking data, and these firms frequently rotate and monitor the health of their proxies in order to extract huge amounts of data from Google.

Rendering JavaScript

JavaScript is pretty much used everywhere, and the number of humans not enabling JavaScript in their browsers is less than 1%. This means that some sites have looked to block IPs making large numbers of requests without rendering JavaScript. The simple solution is just to render JavaScript, using a headless browser and a browser automation suite such as Selenium.

Increasingly, companies such as Cloudflare are checking whether users making requests to the site are rendering JavaScript. By using this technique, they hope to block bots making requests to the site in question. However, several libraries now exist which help you get around the kind of protection implemented by Cloudflare. Python’s cloudflare-scrape library is a wrapper around the requests library which simply runs Cloudflare’s JavaScript test within a Node environment should it detect that such a protection has been put in place.

Alternatively, you can use a lightweight headless browser such as Splash to do the scraping for you. The specialist headless browser even lets you implement AdBlock Plus rules allowing you to render pages faster and can be used alongside the popular Scrapy framework.

Backing Off

What many crawlers and scrapers fail to do is back-off when they start getting served with 403 & 503 errors. By simply plugging on and requesting more pages after coming across a batch of error pages, it becomes pretty clear that you are in fact a bot. Slowing down and backing off when you get a bunch of forbidden errors can help you avoid a permanent ban.

Avoiding Honeypots/Bot Traps

Some webmasters implement honeypot traps which seek to catch bots by directing them to pages whose sole purpose is to determine that they are a bot. There is a very popular WordPress plugin which simply creates an empty ‘/blackhole/’ directory on your site. The link to this directory is then hidden in the site’s footer, not visible to those using browsers. When designing a scraper or crawler for a particular site it is worth looking to determine whether any links are hidden from users loading the page with a standard browser.

Obeying Robots.txt

Simply obeying robots.txt while crawling can save you a lot of hassle. While the robots.txt file itself provides no protection against scrapers/crawlers, some webmasters will simply block any IP which makes many requests to pages blocked within the robots.txt file. The proportion of webmasters who actively do this is relatively small, but obeying robots.txt can definitely save you some significant trouble. If the content you need to reach is blocked off by the robots.txt file, you may just have to ignore the robots.txt file.

Cookies

In some circumstances, it may be worth collecting and holding onto cookies. When scraping services such as Google, results returned by the search engine can be influenced by cookies. The majority of people scraping Google search results are not sending any cookie information with their request which is abnormal from a behaviour perspective. Provided that you do not mind about receiving personalised results it may be a good idea for some scrapers to send cookies along with their request.

Captchas

Captchas are one of the more difficult to crack anti-scraping measures; fortunately, captchas are incredibly annoying to real users. This means not many sites use them, and when used they are normally limited to forms. Breaking captchas can either be done via computer vision tools such as tesseract-ocr, or solutions can be purchased from a number of API services which use humans to solve the underlying captchas. These services are available for even the latest Google image reCAPTCHAs and simply impose an additional cost on the person scraping.

By combining some of the advice above you should be able to scrape the vast majority of sites without ever coming across any issues.