Writing a Web Crawler with Golang and Colly

This blog features multiple posts on building web crawlers in Python, but the subject of building a crawler in Golang has never been touched upon. There are a couple of frameworks for building web crawlers in Golang, but today we are going to look at building one using Colly. When I first started playing with the framework, I was shocked by how quickly and easily I could build a highly functional crawler with very few lines of Go code.

In this post we are going to build a crawler which crawls this site and extracts the URL, title and code snippets from every Python post on the site. Such a crawler only requires around 60 lines of code in total! Colly requires an understanding of CSS selectors, which is beyond the scope of this post, but I recommend you take a look at this cheat sheet.

Setting Up A Crawler

To begin with, we are going to set up our crawler and create the data structure to store our results in. First of all, we need to install Colly using the go get command. Once this is done, we create a new struct which will represent an article and contains all the fields we are going to be collecting with our simple example crawler.
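A minimal sketch of that setup; the exact field names below are my own guesses rather than the post's original code:

```go
// go get -u github.com/gocolly/colly

// Article holds the data we collect for each Python post.
type Article struct {
	URL          string
	Title        string
	CodeSnippets []string
}
```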

With this done, we can begin writing our main function. To create a new crawler we call NewCollector, which returns a Collector instance. The NewCollector function takes a list of option functions which are used to initialise our crawler. In our case we pass only one option, which limits our crawler to pages found on “edmundmartin.com”.

Having done this, we then place some limits on our crawler. As Go is very performant and many websites run on relatively slow servers, we probably want to limit the speed of our crawler. Here, we set up a limit rule which matches every URL containing “edmundmartin”. By setting the parallelism to 1 and adding a delay of a second, we ensure that we only crawl one URL per second.
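Something along these lines, using Colly's AllowedDomains option and LimitRule (a sketch rather than the post's exact code):

```go
c := colly.NewCollector(
	colly.AllowedDomains("edmundmartin.com"),
)

c.Limit(&colly.LimitRule{
	DomainGlob:  "*edmundmartin*",
	Parallelism: 1,
	Delay:       1 * time.Second,
})
```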

Basic Crawling Logic

To collect data from our target site, we need to create a clone of our Colly collector. We also create a slice of our ‘Article’ struct to store the results we will be collecting. Additionally, we add a callback to our crawler which fires every time we make a new request; this callback simply prints the URL which our crawler will be visiting.
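A rough version of that step:

```go
detailCollector := c.Clone()
articles := make([]Article, 0)

c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting", r.URL.String())
})
```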

We then add another “OnHTML” callback which is fired every time HTML is returned to us. This is attached to our original Colly collector instance and not the clone. Here we pass in a CSS selector which pulls out all of the hrefs on the page. We can also use some logic contained within the Colly framework to resolve the URL in question. If the URL contains ‘python’, we submit it to our cloned collector, while if ‘python’ is absent from the URL we simply visit the page in question. Cloning our collector allows us to define different OnHTML parsers for each copy of the original crawler.
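In code, that logic looks roughly like this:

```go
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	link := e.Request.AbsoluteURL(e.Attr("href"))
	if strings.Contains(link, "python") {
		detailCollector.Visit(link)
	} else {
		c.Visit(link)
	}
})
```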

Extracting Details From A Post

We can now add an ‘OnHTML’ callback to our ‘detailCollector’ clone. Again we use a CSS selector to pull out the content of each post contained on the page. From this we extract the text contained within the post’s “H1” tag. Finally, we pick out all of the ‘div’ elements with the class ‘crayon-main’ and iterate over them, pulling out our code snippets. We then add our collected data to our slice of Articles.
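A rough version of that callback; the “post-content” selector is my guess at the container used on the site, while ‘crayon-main’ comes from the post itself:

```go
detailCollector.OnHTML("div.post-content", func(e *colly.HTMLElement) {
	article := Article{URL: e.Request.URL.String()}
	article.Title = e.ChildText("h1")
	e.ForEach("div.crayon-main", func(_ int, el *colly.HTMLElement) {
		article.CodeSnippets = append(article.CodeSnippets, el.Text)
	})
	articles = append(articles, article)
})
```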

All that is left to do is start the crawler by calling our original collector’s ‘Visit’ function with our start URL. The example crawler should finish within around 20 seconds. Colly makes it very easy to write powerful crawlers with relatively little code. It does, however, take a little while to get used to the callback style of programming.
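Kicking things off is then a single call, using the site's homepage as the start URL:

```go
c.Visit("https://edmundmartin.com")
```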

Full Code

 

Image Classification with TFLearn and Python

In today’s post we are going to walk through how to build a flexible image classifier using TFLearn and Python. For those not familiar with TFLearn, it is a wrapper around the very popular TensorFlow library from Google. Building an image classifier with TFLearn is relatively simple, and today we are going to walk through how to build your own.

Imports

We are going to need to import a number of different libraries in order to build our classifier. For users on Windows, the easiest way to install the SciPy library is to use a pre-compiled wheel, which can be found here. Once you have installed all the required libraries, we can start building our ImageClassify class.
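The imports below are my guess at what the class needs, based on the walkthrough that follows:

```python
import os

import numpy as np
from skimage import io, transform
from sklearn.model_selection import train_test_split

import tflearn
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.estimator import regression
```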

Initializing Our Class

When initializing our class, we are going to need to know a few pieces of information. We are going to need a list of class names. These are the names of the different objects that our classifier is going to classify. We also need to pass in an image size, the classifier will automatically resize our images into a square image of the specified size. So, if we pass in a value of 100, our classifier will end up resizing our images to be 100×100 pixels in size.

Generally, the larger the image size the better the classification we will end up with. This is provided that your images are larger than the specified value. It should be warned that using larger images will increase the time taken to train the algorithm. We store this value in self.image_size.

We also pass in default values for our learning rate and test split. The learning rate dictates how aggressively the model updates its weights during training; as a default, 0.001 tends to work well. The test split defines what percentage of samples we will use to validate our model. Again, using around ten percent of samples for your test set works pretty well.

We also create empty lists which will end up holding our image data and their respective labels.
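Put together, the initialiser might look something like this (a sketch, not the exact original):

```python
class ImageClassify:
    def __init__(self, class_names, image_size=100, learning_rate=0.001, test_split=0.1):
        self.class_names = class_names
        self.image_size = image_size
        self.learning_rate = learning_rate
        self.test_split = test_split
        self.image_data = []
        self.labels = []
        self.model = None
```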

Labeling An Image

Our extract label function takes an image and extracts the label from it. We begin by creating an array of zeros, with one zero for each class to be trained. We then split the file name of the image. Our extract label function expects images in the following format: “class.number.png”. Using this format allows us to extract the class name directly from the file name. We then look up the index of the class label and set that value in our array of zeros to 1, before returning the array itself.
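A sketch of that helper (the method name is my own):

```python
    def _extract_label(self, image_path):
        # One-hot array: a zero for every class, with a 1 at the image's class index.
        label = np.zeros(len(self.class_names))
        class_name = os.path.basename(image_path).split('.')[0]  # "cat.12.png" -> "cat"
        label[self.class_names.index(class_name)] = 1
        return label
```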

Processing Images

Our process image function first calls our label function. We then read the image using skimage’s io.imread function. We then resize this image to the size specified when we initialized the class. We then append the image data and the labels to self.image_data and self.labels respectively.

Processing our images simply involves running our process image function on every single image we provide to our image classification class.
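For example, something along these lines:

```python
    def _process_image(self, image_path):
        label = self._extract_label(image_path)
        image = io.imread(image_path)
        image = transform.resize(image, (self.image_size, self.image_size))
        self.image_data.append(image)
        self.labels.append(label)

    def process_images(self, image_paths):
        for path in image_paths:
            self._process_image(path)
```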

Building Our Model

Our build model function simply builds us a convolutional net, using the parameters we defined when initializing our class. Explaining the workings of the net is probably beyond the scope of this post, but I will note that creating our model like this allows our classifier to be used with images of any size and datasets with any number of classes. Creating a build model function also makes it easier to load and predict using pre-trained models.
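The exact architecture isn't spelled out in the text, so the layers below are purely illustrative of a small TFLearn convnet parameterised by our image size and class count:

```python
    def _build_model(self):
        net = input_data(shape=[None, self.image_size, self.image_size, 3])
        net = conv_2d(net, 32, 3, activation='relu')
        net = max_pool_2d(net, 2)
        net = conv_2d(net, 64, 3, activation='relu')
        net = max_pool_2d(net, 2)
        net = fully_connected(net, 512, activation='relu')
        net = dropout(net, 0.5)
        net = fully_connected(net, len(self.class_names), activation='softmax')
        net = regression(net, optimizer='adam', learning_rate=self.learning_rate,
                         loss='categorical_crossentropy')
        return tflearn.DNN(net)
```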

Training Our Model

Our train_model function takes a model name, an epochs parameter and a batch size parameter. The epochs parameter determines the number of times the model will be run over the entire dataset. The batch size determines the number of samples to be run through the model at once. Generally, the more epochs, the more accurate the model will be, though too many epochs may mean that your model overfits the dataset and you end up with rather inaccurate predictions when you use it on unseen data. If accuracy hits 100% and loss goes to 0, this is a very strong indication that you have overfit.

We first begin by creating X and y variables from self.image_data and self.labels. We then use our self.test_split value to split the dataset into training and test sets. We then call the build model function, and finally call the fit method on the model, training on the training data and validating against the test set.

Once we have finished training the model, we save it under the passed-in model name and set self.model to our newly trained model.
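A sketch of the training method, reusing the hypothetical helper names from above:

```python
    def train_model(self, model_name, epochs=10, batch_size=32):
        X = np.array(self.image_data)
        y = np.array(self.labels)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=self.test_split)
        model = self._build_model()
        model.fit(X_train, y_train, n_epoch=epochs, batch_size=batch_size,
                  validation_set=(X_test, y_test), show_metric=True)
        model.save(model_name)
        self.model = model
```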

Loading A Pre-Trained Model & Predicting Images

We can define a very simple function to load a model. This will be useful when we need to predict images sometime after we have trained a model. We can load a model by simply passing in the model’s name.

We then need another function to take an image and transform it to something we can use in our prediction function. This is much like our process image function, with the exception that we have no need to label the image.

Our predict image function takes a path to an image file. We call our _image_to_array function and this data can then be fed straight into the model. Our model will then output an array of probabilities. We can line these up with the classes which we provided to the ImageClassify class. We then pull out the most probable label, before returning it along with the list of probabilities.
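Sketches of the loading and prediction methods described above:

```python
    def load_model(self, model_name):
        model = self._build_model()
        model.load(model_name)
        self.model = model

    def _image_to_array(self, image_path):
        image = io.imread(image_path)
        image = transform.resize(image, (self.image_size, self.image_size))
        return image.reshape(1, self.image_size, self.image_size, 3)

    def predict_image(self, image_path):
        data = self._image_to_array(image_path)
        probabilities = self.model.predict(data)[0]
        best_label = self.class_names[int(np.argmax(probabilities))]
        return best_label, list(probabilities)
```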

Example Usage: Training A Model
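The original snippet isn't preserved here; training usage might look something like this (class names and file paths are hypothetical):

```python
import glob

classifier = ImageClassify(['cat', 'dog'], image_size=100)
classifier.process_images(glob.glob('train/*.png'))
classifier.train_model('cats_vs_dogs.tflearn', epochs=10, batch_size=32)
```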

Example Usage: Making A Prediction With An Already Trained Model
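Again with hypothetical class names and file paths:

```python
classifier = ImageClassify(['cat', 'dog'], image_size=100)
classifier.load_model('cats_vs_dogs.tflearn')
label, probabilities = classifier.predict_image('unseen/cat.1001.png')
print(label, probabilities)
```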

Full Code & Example Dataset

The full code and an example dataset can be found on my Github here. The repository also contains another image classification model which makes use of Google’s GoogLeNet model. This model is very accurate but takes a considerable amount of time to train and is likely to need to be run for a greater number of epochs.

Text Classification with Python & NLTK

Machine learning frameworks such as TensorFlow and Keras are currently all the rage, and you can find several tutorials demonstrating the usage of CNNs (Convolutional Neural Networks) to classify text. Often this can be overkill, and in this post we are going to show you how to classify text using Python’s NLTK library. The NLTK (Natural Language Toolkit) provides Python users with a number of different tools for dealing with text content and provides some basic classification capabilities.

Input Data

In the example, I’m using a set of 10,000 tweets which have been classified as being positive or negative. Our classifier is going to take input in CSV format, with the left column containing the tweet and the right column containing the label. An example of the data can be found below:

Using your own data is very simple and merely requires that your left column contains your text document, while the column on the right contains the correct label. This allows our classifier to classify a wide range of documents with labels of your choosing. The data used for this example can be downloaded here.

The Bag of Words Approach

We are going to use a bag of words approach. Simply put, we take a certain number of the most common words found throughout our dataset and then, for each document, check whether the document contains each of these words. The bag of words approach is conceptually simple and doesn’t require us to pad documents to ensure that every document in our sample set is the same length. However, it tends to be less accurate than a word embedding approach. By simply checking whether a document contains a certain set of words, we miss out on a lot of valuable information – including the position of the words in the document. Despite this, we can easily train a classifier which achieves 80%+ accuracy.

Initialising our class and reading our CSV file

Our CSV classifier is going to take several arguments. Firstly, we pass the name of our CSV file. Then, as optional parameters, we pass a featureset_size and a test ratio. By default, our classifier will use the 1,000 most common words found in our dataset to create our feature set, and will test its accuracy against 10% of the items contained in the dataset. We then initialise a few variables which will be used later by our classifier.
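An initialiser along these lines (the class and attribute names are my own):

```python
import pickle
import random
import re

import nltk


class CSVClassifier:
    def __init__(self, csv_file, featureset_size=1000, test_ratio=0.1):
        self.csv_file = csv_file
        self.featureset_size = featureset_size
        self.test_ratio = test_ratio
        self.all_words = []      # every word found across the dataset
        self.documents = []      # (list_of_words, label) tuples
        self.word_features = []  # the most common words, filled in later
        self.classifier = None
```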

We then come on to reading our CSV file. We simply iterate through each line, splitting it by commas. The text after the last comma is the document’s label, while everything to the left is the document itself. By applying a regex to the document, we produce a list of the words it contains. In the example I used a very simple regex to pull out the words, but it is possible to replace this with a more complex tokenizer. For each word in the document, we append it to our list of all words, which will allow us to determine how frequently words occur in our dataset. We also place the list of words found in the document into the variable where we store all the documents found in our dataset.
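A sketch of that reading logic:

```python
    def load_csv(self):
        with open(self.csv_file, encoding='utf-8') as f:
            for line in f:
                text, _, label = line.strip().rpartition(',')
                # Deliberately simple tokenizer; swap in something smarter if needed.
                words = re.findall(r"[a-z']+", text.lower())
                self.documents.append((words, label))
                self.all_words.extend(words)
```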

Extracting Word Features

We then write the functions which handle generating our feature set. Here, we use NLTK’s FreqDist class to store the frequency with which different words were found throughout the dataset. We iterate through all of the words in our documents, creating a new record if we have not seen the word before and incrementing the count if it has already been found. We then limit our bag of words to the feature set size we passed when we initialised the class.
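For example, something like:

```python
    def extract_word_features(self):
        freq = nltk.FreqDist(self.all_words)
        self.word_features = [word for word, _ in freq.most_common(self.featureset_size)]
```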

Now that we have a list of the most frequently found words, we can write a function to generate features for each of the documents in our dataset. As we are using a bag of words approach, we are only interested in whether the document contains each of the 1,000 most frequent words. If we find the word we record True, otherwise False. Eventually, we get a dictionary of 1,000 features which will be used to train the classifier.
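Roughly:

```python
    def document_features(self, document_words):
        words = set(document_words)
        return {word: (word in words) for word in self.word_features}
```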

Training

We start by shuffling the documents. Some algorithms and classifiers can be sensitive to the order of the data, which makes it important to shuffle before training. We then use our feature set function within a list comprehension, which returns a list of tuples containing our feature dictionary and the document’s label. We then calculate where to split our data into training and test sets. The test set allows us to check how the classifier performs against an unseen dataset. We can then pass our training set to NLTK’s Naive Bayes classifier. The actual training may take some time, and will take longer the larger the dataset used. We then check the classifier’s accuracy against both the training and test sets. In all likelihood the classifier will perform significantly better against the training set.
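A sketch of the training method:

```python
    def train(self):
        random.shuffle(self.documents)
        featuresets = [(self.document_features(words), label)
                       for words, label in self.documents]
        split = int(len(featuresets) * self.test_ratio)
        test_set, train_set = featuresets[:split], featuresets[split:]
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)
        print('Train accuracy:', nltk.classify.accuracy(self.classifier, train_set))
        print('Test accuracy:', nltk.classify.accuracy(self.classifier, test_set))
```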

Classifying New Documents

Once we have trained a classifier, we can write a function to classify new documents. If we have not already loaded our CSV file and generated the word features, we will have to do this before classifying the new document. We then simply generate a new set of features for this document and pass it to our classifier’s classify method, which returns the string of the predicted label.
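For instance:

```python
    def classify_document(self, text):
        words = re.findall(r"[a-z']+", text.lower())
        features = self.document_features(words)
        return self.classifier.classify(features)
```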

Saving and Loading Model

Rather than training the model every time we want to classify a sentence, it makes sense to save the model. We can write two simple functions which allow us to reuse our model whenever we want. The save function simply saves our classifier and feature words to files, which can then be reloaded by our load model function.
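Using pickle, for instance (file naming is my own choice):

```python
    def save_model(self, name='model'):
        with open(name + '_classifier.pickle', 'wb') as f:
            pickle.dump(self.classifier, f)
        with open(name + '_features.pickle', 'wb') as f:
            pickle.dump(self.word_features, f)

    def load_model(self, name='model'):
        with open(name + '_classifier.pickle', 'rb') as f:
            self.classifier = pickle.load(f)
        with open(name + '_features.pickle', 'rb') as f:
            self.word_features = pickle.load(f)
```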

Accuracy

Algorithm                       Train    Test
Naive Bayes Classifier (NLTK)   84.09%   72.89%
BernoulliNB (Sklearn)           83.93%   79.78%
MultinomialNB (Sklearn)         84.58%   74.67%
LogisticRegression (Sklearn)    89.05%   75.33%
SGDClassifier (Sklearn)         81.23%   69.32%

The algorithm performs relatively well against our example data, correctly classifying whether a tweet is positive or negative around 72% of the time. NLTK gives its users the option to replace the standard Naive Bayes classifier with a number of other classifiers found in the scikit-learn package. I ran the same test swapping in these classifiers, and a number of them significantly outperformed the standard Naive Bayes classifier. As you can see, the BernoulliNB model performed particularly well, correctly classifying documents around 80% of the time.

The accuracy of the classifier could be further improved by using an ensemble classifier. To build an ensemble classifier we would simply build several models using different classifiers and then classify new documents against all of them, selecting the answer provided by the majority of the classifiers (a hard voting classifier). Such an ensemble would likely outperform using just one of the above classifiers. The full code below provides a function that allows you to try out other Sklearn classifiers.

Example Usage
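The original snippet isn't preserved here; usage might look something like this, assuming the sketched CSVClassifier class above (the CSV file name is hypothetical):

```python
classifier = CSVClassifier('tweets.csv')
classifier.load_csv()
classifier.extract_word_features()
classifier.train()
print(classifier.classify_document("I really enjoyed this film"))
```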

The class is pretty easy to use. The above code outlines all of the steps required to train a classifier and classify an unseen sentence. More usage examples and the full code can be found on Github here.

Full Code

 

Scraping Baidu with Python

 

What’s Baidu?

Baidu is China’s largest search engine and has been ever since Google left the Chinese market. As companies look to move into the Chinese market, there has been more and more interest in scraping search results from Baidu.

Scraping Baidu

Scraping Baidu is a relatively simple task. There is only one minor challenge when scraping its results: the URLs displayed on the Baidu results page are found nowhere in the HTML. Baidu links to the sites displayed on the search results page via its own redirector service, so in order to get the full final URL we have to follow these redirects. In this post we are going to walk through how to scrape the Baidu search results page.

Imports & Class Definition

In order to scrape Baidu, we only need to import two libraries outside of the standard library. Bs4 helps us parse HTML, while requests provides us with a nicer interface for making HTTP requests with Python.

As we are going to scrape multiple pages of Baidu in this tutorial, we are going to initialise a class to hold onto the important information for us.

We initialise a new BaiduBot class with a search term and the number of pages to scrape. We also give ourselves the ability to pass a number of keyword arguments to our class. This allows us to pass a proxy, a custom connection timeout, a custom user agent and an optional delay between each of the results pages we want to scrape. The keyword arguments can be a lot of help if we end up being blocked by Baidu. When initialising the class we also store our base URL, which we use when scraping the subsequent pages.
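A sketch of such a class; the attribute names and base URL format are my own, though Baidu's ‘wd’ (query) and ‘pn’ (result index) parameters are the real ones:

```python
import time

import requests
from bs4 import BeautifulSoup


class BaiduBot:
    def __init__(self, search_term, pages, proxy=None, timeout=10, user_agent=None, delay=0):
        self.search_term = search_term
        self.pages = pages
        self.proxy = {'http': proxy, 'https': proxy} if proxy else None
        self.timeout = timeout
        self.user_agent = user_agent or 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
        self.delay = delay
        self.base_url = 'https://www.baidu.com/s?wd={}&pn={}'
```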

Making Requests & Parsing HTML

We first define a function to scrape a page of Baidu, here we simply try to make a request and check that the response has a 200 Status. Should Baidu start serving us with non-200 status codes, this likely means that they have detected unusual behaviour from our IP and we should probably back off for a while. If there is no issue with the request, we simply return the response object.
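Something along these lines (how the non-200 case is signalled is my own choice):

```python
    def baidu_request(self, url):
        headers = {'User-Agent': self.user_agent}
        response = requests.get(url, headers=headers, proxies=self.proxy,
                                timeout=self.timeout)
        if response.status_code != 200:
            raise RuntimeError('Baidu returned status {}'.format(response.status_code))
        return response
```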

Now that we have a way to make HTTP requests, we need to write a method for parsing the results page. Our parser is going to take in the HTML and return a list of dictionary objects. Each result is handily contained within a ‘div’ with the class ‘c-container’, which makes it very easy for us to pick out each result. We can then iterate across all of the returned results, using relatively simple BeautifulSoup selectors, before appending each result to our results list.
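A rough parser (the exact fields collected are my own guess):

```python
    def parse_results(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        results = []
        for container in soup.find_all('div', class_='c-container'):
            title_tag = container.find('h3')
            link_tag = container.find('a')
            if not title_tag or not link_tag:
                continue
            results.append({
                'title': title_tag.get_text(strip=True),
                'url': link_tag.get('href'),
            })
        return results
```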

Getting the Underlying URL

As previously mentioned, the full underlying URL is not displayed anywhere in Baidu’s search results. This means we must write a couple of functions to extract it. There may be another way to get this URL, but I’m not aware of it. If you know how, please share the method with me in the comments.

Our resolve_urls function is very similar to our Baidu request function. Instead of a response object, we return the final URL by simply following the chain of redirects. Should we encounter any sort of error, we simply return the original URL as found within the search results. This issue is relatively rare, so it shouldn’t impact our data too much.

We then write another function that applies our resolve_urls function over a set of results, updating the URL within each dictionary with the real underlying URL and the rank of the URL in question.
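For example:

```python
    def resolve_url(self, url):
        try:
            return requests.get(url, timeout=self.timeout, allow_redirects=True).url
        except requests.RequestException:
            # Fall back to the redirector URL found on the results page.
            return url

    def resolve_results(self, results, start_rank=1):
        for rank, result in enumerate(results, start=start_rank):
            result['url'] = self.resolve_url(result['url'])
            result['rank'] = rank
        return results
```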

Bringing It All Together

We bring this all together in our scrape_baidu function. We range over our page count variable, and for each loop multiply the loop variable by 10 to get the correct pn value. The pn parameter represents the result index, so our logic ensures we start at 0 and continue on in increments of 10 results. We then format our URL using both our search term and this variable, make the request and parse the page using the functions we have already written, before appending the results to our final results variable. Should we have passed a delay argument, we also sleep for a while before scraping the next page. This helps us avoid getting banned should we want to scrape multiple pages and search terms.
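A sketch pulling the pieces together:

```python
    def scrape_baidu(self):
        all_results = []
        for page in range(self.pages):
            url = self.base_url.format(self.search_term, page * 10)
            response = self.baidu_request(url)
            results = self.parse_results(response.text)
            all_results.extend(self.resolve_results(results, start_rank=page * 10 + 1))
            if self.delay:
                time.sleep(self.delay)
        return all_results
```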

Full Code

 

Optimal Number of Threads in Python

One of the most infamous features of Python is the GIL (Global Interpreter Lock), which means thread performance is significantly limited. The GIL is responsible for protecting access to Python objects, because CPython’s memory management is not thread safe. Threads in Python essentially interrupt one another, meaning that only one thread has access to Python objects at any one time. In many situations this can cause a program to run slower with threads, particularly if those threads are doing CPU-bound work. Despite their limitations, Python threads can be very performant for I/O-bound work such as making a request to a web server. But unlike highly concurrent languages such as Golang and Erlang, we cannot launch thousands of goroutine-style workers or Erlang ‘processes’.

This makes it hard to determine the correct number of threads to use to boost performance. At some point, adding more threads is likely to degrade overall performance. I wanted to take a look at the optimal number of threads for an I/O bound task, namely making an HTTP request to a web server.

The Code

I wrote a short script using Python 3.6, requests and the concurrent.futures library, which makes a GET request to each of the top 1,000 sites according to Amazon’s Alexa web rankings. I then re-ran the script using a different number of threads to see where performance would begin to drop off. To account for uncontrollable variables, I re-ran each thread count 5 times to produce an average.
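The original script isn't reproduced in this copy of the post, but a benchmark in that spirit might look like the following (the URL list is a placeholder for the Alexa top 1,000 domains):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


def fetch(url):
    try:
        requests.get(url, timeout=10)
        return True
    except requests.RequestException:
        return False


def run_benchmark(urls, thread_count):
    start = time.time()
    with ThreadPoolExecutor(max_workers=thread_count) as executor:
        futures = [executor.submit(fetch, url) for url in urls]
        completed = sum(1 for future in as_completed(futures) if future.result())
    elapsed = time.time() - start
    print('{} threads: {} successful requests in {:.1f}s'.format(
        thread_count, completed, elapsed))


if __name__ == '__main__':
    urls = ['http://example.com']  # replace with the Alexa top 1,000 domains
    for thread_count in (1, 5, 10, 20, 40, 60, 100, 200):
        run_benchmark(urls, thread_count)
```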

The Results

As we first start increasing the number of threads used by our demo program, the number of HTTP requests we can make per second increases quite rapidly. The increase in performance starts dropping off once we reach around 50 threads. Once we use a total of 60 threads we actually start to see our HTTP request rate decrease, before it starts to speed up again as we approach a total of 200 threads.

This is presumably due to the GIL. Once we add a significant number of threads, they essentially interfere with one another, slowing down our program. To increase the performance of our code further we would have to look into ways of releasing the GIL, such as writing a C extension or releasing the GIL within Cython code. I highly recommend watching this talk from David Beazley for those looking to get a better understanding of the GIL.

Despite the GIL, our best result saw us make a total of 1,000 HTTP GET requests in just nine seconds. That works out at roughly 111 HTTP requests per second, which isn’t too bad for what is meant to be a slow language.

Caveats

The results from this experiment suggest that those writing threaded Python applications should certainly take some time running tests to determine the optimum number of threads. The example used to run this test was I/O bound, with little CPU overhead. Those running code with a greater amount of CPU-bound work may find that they get less benefit from upping the number of threads. Despite this, I hope this post encourages people to look into using threads within their applications. The performance increase achievable will depend highly on what exactly is being done within the threads.

There is also reason to believe that the optimal number of threads may differ from machine to machine, which is another reason why it is worth taking the time to test a varying number of threads when you need to achieve maximum performance.

Ultimate Introduction to Web Scraping in Python: From Novice to Expert

Python is one of the most accessible fully featured programming languages, which makes it a perfect language for those looking to learn to program. This post aims to introduce the reader to web scraping, allowing them to build their own scrapers and crawlers to collect data from the internet.

Contents

  1. Introduction to Web Scraping
  2. Making HTTP Requests with Python
  3. Handling HTTP Errors
  4. Parsing HTML with BeautifulSoup

MORE TO COME

Introduction to Web Scraping

Web scraping, sometimes referred to as screen scraping, is the practice of using programs to visit websites and extract information from them. This allows users to collect information from the web in a programmatic manner, as opposed to having to manually visit a page and extract the information into some sort of data store. At their core, major search engines such as Google and Bing make use of web scraping to extract information from millions of pages every day.
Web scraping has a wide range of uses, including but not limited to fighting copyright infringement, collecting business intelligence, collecting data for data science, and for use within the fintech industry. This mega post is aimed at teaching you how to build scrapers and crawlers which will allow you to extract data from a wide range of sites.

This post assumes that you have Python 3.5+ installed and you have learnt how to install libraries via Pip. If not, it would be a good time to Google ‘how to install python’ and ‘how to use pip’. Those familiar with the requests library may want to skip ahead several parts.

Making HTTP Requests with Python

When accessing a website, our browser makes a number of HTTP requests in the background. The majority of internet users aren’t aware of the number of HTTP requests required to access a web page. These requests load the page itself and may also fetch resources which are loaded by the page, such as images, videos and style sheets. You can see a breakdown of the requests made by opening up your browser’s developer tools and navigating to the ‘Network’ tab.

The majority of requests made to a website are made using a ‘GET’ request. As the name suggests a ‘GET’ request attempts to retrieve the content available at the specified address. HTTP supports a variety of other methods such as ‘POST’, ‘DELETE’, ‘PUT’ and ‘OPTIONS’. These methods are sometimes referred to as HTTP verbs. We will discuss these methods later.

Python’s standard library contains a module which allows us to make HTTP requests. While this library is perfectly functional the user interface is not particularly friendly. In this mega post we are going to make use of the requests library which provides us with a much friendlier user-interface and can be installed using the command below.
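Installation is a one-liner:

```
pip install requests
```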

Making a HTTP request with Python can be done in a couple of lines. Below we are going to demonstrate how to make a request and walk through the code line by line.
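The original snippet isn't preserved in this copy of the post; a minimal reconstruction based on the walkthrough that follows:

```python
import requests

response_object = requests.get('https://edmundmartin.com')
print(response_object.text)
```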

First, we import the requests library which gives us access to the functions contained within the library. We then make a HTTP request to ‘https://edmundmartin.com’ using the ‘GET’ verb, by calling the get method contained within the requests library. We store the result of this request in a variable named ‘response_object’. The response object contains a number of pieces of information that are useful when scraping the web. Here, we access the text (HTML) of the response which we print to the screen.  Provided the site is up and available users running this script should be greeted with a wall of HTML.

Handling HTTP Errors

When making HTTP requests there is significant room for things to go wrong. Your internet connection may be down, or the site in question may not be reachable. When scraping the internet we typically want to handle these errors and continue on without crashing our program. For this we are going to write a function which allows us to make an HTTP request and deal with any errors. Additionally, by encapsulating this logic within a function we can reuse our code with greater ease, simply calling the function every time we want to make an HTTP request.

The below code is an example of a function which makes a request and deals with a number of common errors we are likely to encounter. The function is explained in more detail below.
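The original code isn't preserved here; the sketch below follows the explanation given in the next paragraphs:

```python
import logging

import requests


def get_request(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response
    except requests.HTTPError:
        logging.error('Received a bad status code from %s', url)
    except requests.ConnectionError:
        logging.error('Could not connect to %s', url)
    except requests.RequestException:
        logging.error('Something went wrong requesting %s', url)


if __name__ == '__main__':
    response = get_request('https://edmundmartin.com')
    if response:
        print(response.text)
```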

Our basic get_request function takes one argument: the string of the URL we want to retrieve. We then make the request just as before, but this time it is wrapped in a try and except block which allows us to catch any errors should something go wrong. After the request, we check the status code of our response. Every time you make a request, the server responds with a code indicating whether the request was a success or not. If everything went fine you will receive a 200 status code; otherwise you are likely to receive a 404 (‘Page Not Found’) or 503 (‘Service Unavailable’). By default, the requests library does not throw an error should a web server respond with a bad status code, but rather continues silently. By using raise_for_status we force an error should we receive a bad status code. If no error is thrown, we then return our response object.

If all did not go so well, we handle the errors. Firstly, we check whether the page responded with a non-200 status code by catching requests.HTTPError. We then check whether the request failed due to a bad connection by catching the requests.ConnectionError exception. Finally, we use the generic requests.RequestException to catch all other exceptions that can be thrown by the requests library. The ordering of our exceptions is important: requests.RequestException is the most generic and would catch either of the other exceptions. Had it been the first exception handled, the other lines of code would never run, regardless of the reason for the exception.

When handling each exception, we use the standard library’s logging library to print out a message of what went wrong when making the request. This is very handy and is a good habit to get into, as it makes debugging programs much easier. If an exception is thrown we return nothing from our function, which we can then check later. Otherwise we return the response.

At the bottom of the script, I have provided a simple example of how this function could be used to print out a page’s HTML response.

Parsing HTML with BeautifulSoup

So far, everything we have done has been rather boring and not particularly useful, because we have just been making requests and then printing the HTML. We can, however, do much more interesting things with our responses.

This is where BeautifulSoup comes in. BeautifulSoup is a library for parsing HTML, allowing us to easily extract the elements of the page that we are most interested in. While BeautifulSoup is not the fastest way to parse a page, it has a very beginner-friendly API. The BeautifulSoup library can be installed using the following:
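```
pip install beautifulsoup4
```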

The code below expands on the code we wrote in the previous section and actually uses our response for something.  A full explanation can be found after the code snippet.
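The original snippet isn't preserved here; a reconstruction based on the explanation that follows (it reuses the get_request function from the previous section):

```python
from bs4 import BeautifulSoup

response = get_request('https://edmundmartin.com')
if response:
    soup = BeautifulSoup(response.text, 'html.parser')
    titles = soup.find_all('h2', {'class': 'entry-title'})
    for title in titles:
        print(title.get_text())
```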

The code snippet above uses the same get_request function as before which I have removed for the sake of brevity. Firstly, we must import the BeautifulSoup library, we do this by adding the line ‘from bs4 import BeautifulSoup’. Doing this gives us access to the BeautifulSoup class which is used for parsing HTML responses. We then generate a ‘soup’ by passing our HTML to BeautifulSoup, here we also pass a string signifying the underlying html parsing library to be used. This is not required but BeautifulSoup will print a rather long winded warning should you omit this.

Once the soup has been created we can then use the ‘find_all’ method to discover all of the elements matching our search. The soup object, also has a method ‘find’ which will only return the first element matching our search. In this example, we first pass in the name of the HTML element we want to select. In this case it’s the heading 2 element represented in HTML by ‘h2’. We then pass a dictionary containing additional information. On my blog all article titles are ‘h2’ elements, with the ‘class’ of ‘entry-title’. This class attribute is what is used by CSS to make the titles stand out from the rest of the page, but can help us in selecting the elements of the page which we want.

Should our selector find anything, we will be returned a list of title elements. We can then write a for loop which goes through each of these titles and prints its text by calling the get_text() method. A note of caution: should a selector match nothing (for example, ‘find’ returning None), calling the get_text() method on the result will throw an exception. Should everything run without errors, the code snippet above should return the titles of the ten most recent articles on my website. This is all that is really required to get started with extracting information from websites, though picking the correct selector can take a little bit of work.

In the next section we are going to write a scraper which will extract information from Google, using what we have learnt so far.

Selenium Tips & Tricks in Python

Selenium is a great tool and can be used for a variety of different purposes. It can sometimes, however, be a bit tricky to make Selenium behave exactly how you want. This article shows you how you can make the most of the library’s advanced features, to make your life easier and help you extract data from websites.

Running Chrome Headless

Provided you have one of the latest versions of Chromedriver, it is now very easy to run Selenium headless. This allows you to run the browser in the background without a visible window. We simply add a couple of lines of code when starting up the browser, and can then access webpages with Selenium running quietly in the background. It should be noted that some sites can detect whether you are running Chrome headless and may block you from accessing content.
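Roughly like so:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://edmundmartin.com')
print(driver.title)
driver.quit()
```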

Using A Proxy With Selenium

There are occasions when you may want to use a proxy with Selenium. To do so, we simply add an argument to Chrome Options when initialising our Selenium instance. Unfortunately, there is no way to change the proxy once it is set. This means that to rotate proxies while using Selenium, you have to either restart the Selenium browser or use a rotating proxy service, which can come with its own set of issues.
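For example (the proxy address below is hypothetical):

```python
from selenium import webdriver

proxy = '11.22.33.44:8080'  # hypothetical proxy address
options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=http://{}'.format(proxy))
driver = webdriver.Chrome(options=options)
```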

Accessing Content Within An Iframe

Sometimes the content we want to extract from a website may be buried within an iframe. By default when you ask Selenium to return you the html content of a page, you will miss out on all the information contained within any iframes on the page. You can however access content contained within the iframe.

To switch to the iframe we want to extract data from, we first use Selenium’s find_element method. I would recommend the find_element_by_css_selector method, which tends to be more reliable than trying to find content using an XPath selector. We then pass our target element to a method which switches the browser’s context to the target iframe. We can then access the HTML and interact with the content within the iframe. If we want to revert to our original context, we simply switch back to the default content.
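Something along these lines (the selector is hypothetical):

```python
iframe = driver.find_element_by_css_selector('iframe#content-frame')  # hypothetical selector
driver.switch_to.frame(iframe)
print(driver.page_source)             # HTML from inside the iframe
driver.switch_to.default_content()    # revert to the original context
```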

Accessing Slow Sites

The modern web is overloaded with JavaScript, and this can cause Selenium to throw a lot of timeout errors, with Selenium timing out if a page takes more than 20 seconds to load. The simplest way to deal with this is to increase Selenium’s default timeout. This is particularly useful when trying to access sites via a proxy, which will slow down your connection speed.
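For instance:

```python
driver.set_page_load_timeout(60)  # allow up to 60 seconds for a page to load
```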

Scrolling

Selenium does not provide a built-in way for users to scroll down pages. The browser automation framework does, however, allow users to execute JavaScript, which makes it very easy to scroll down a page. This is particularly useful when trying to scrape content from a page which continues to load more content as the user scrolls.

For some reason Selenium can be funny with executing window scroll commands, and it is sometimes necessary to call the command in a loop in order to scroll down the entirety of a page.
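A common pattern looks something like this (the number of iterations is arbitrary):

```python
import time

# Scroll repeatedly so that pages which lazy-load content keep loading.
for _ in range(10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)
```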

Executing JavaScript & Returning The Result

While many users of Selenium know that it is possible to run JavaScript, allowing for more complicated interactions with the page, fewer know that it is also possible to return the result of executed JavaScript. This allows your browser to execute functions defined in the page’s DOM and return the results to your Python script, which can be great for extracting data from tough-to-scrape websites.
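For example:

```python
title = driver.execute_script('return document.title;')
link_count = driver.execute_script('return document.querySelectorAll("a").length;')
print(title, link_count)
```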

Multi-Threaded Crawler in Python

Python is a great language for writing web scrapers and web crawlers. Libraries such as BeautifulSoup, requests and lxml make grabbing and parsing a web page very simple. By default, Python programs are single threaded, which can make scraping an entire site with a Python crawler painfully slow: we must wait for each page to load before moving on to the next one. Thankfully, Python supports threads which, while not appropriate for all tasks, can help us increase the performance of our web crawler.

In this post we are going to outline how you can build a simple multi-threaded crawler which will crawl an entire site using requests, BeautifulSoup and the standard library’s concurrent futures library.

Imports

We are going to begin by importing all the libraries we need. Both requests and BeautifulSoup are not included within the Python standard library, so you are going to have to install them if you haven’t done so already. The other libraries should already be available to you if you are using Python 3.
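Roughly the following imports:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
```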

Setting Up Our Class

We start by initialising the class we are going to use to create our web crawler. Our initialisation statement takes only one argument: the start URL. From this we use urlparse from the urllib.parse library to pull out the site’s homepage. This root URL is going to be used later to ensure that our scraper doesn’t end up on other sites.

We also initialise a thread pool. We are later going to submit ‘tasks’ to this thread pool, allowing us to use a callback function to collect our results. This will allow us to continue with execution of our main program, while we await the response from the website.

We also initialise a set which is going to contain all the URLs we have crawled. We will use this to store URLs which have already been crawled and prevent the crawler from visiting the same URL twice.

Finally, we create a Queue which will contain the URLs we wish to crawl; we will continue to grab URLs from our queue until it is empty. We then place our base URL at the start of the queue.
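A sketch of the class so far (the names are my own):

```python
class MultiThreadedCrawler:
    def __init__(self, base_url):
        self.base_url = base_url
        parsed = urlparse(base_url)
        self.root_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.crawled_pages = set()
        self.to_crawl = Queue()
        self.to_crawl.put(base_url)
```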

Parsing Links and Scraping

Next, we write a basic link parser. Our goal here is to extract all of a site’s internal links and not to pull out any external links. Additionally, we want to resolve relative URLs (those starting with ‘/’) and ensure that we don’t crawl the same URL twice.

To do this we generate a soup object using BeautifulSoup. We then use the find_all method to return every ‘a’ element which has a ‘href’ attribute, ensuring that we only return ‘a’ elements which contain a link. We then iterate through the returned elements. First, we pull out the actual href content. We can then check whether the link is relative (starting with a ‘/’) or starts with our root URL. If so, we use urljoin to generate a crawlable URL and put it in our queue, provided we haven’t already crawled it.

I have also included an empty scrape_info method which can be overridden so you can extract the data you want from the site you are crawling.
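Sketches of both methods:

```python
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.crawled_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        # Override this method to pull out whatever data you need from each page.
        pass
```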

Defining Our Callback

The easiest and often most performant way to use a thread pool executor is to add a callback to the function we submit to the thread pool. This function will execute after the previous function has completed and will be passed the result of our previous function as an argument.

By calling .result() on the passed-in argument, we are able to get the contents of our returned value, which in our case will be either ‘None’ or a requests response object. We then check whether we have a result and whether the result has a 200 status code. If both of these are true, we send the HTML to the parse_links method and the currently empty scrape_info method.
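For example:

```python
    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)
```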

Scraping Pages

We then define the function which will be used to scrape the page. This function is very simple: it takes a URL and returns a response object if the request was successful, otherwise it returns ‘None’. By limiting the amount of CPU-bound work we do in this function, we can increase the overall speed of our crawler. Threads are not recommended for CPU-bound work and can actually turn out to be slower than using a single thread.
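Something like the following (the timeout values are arbitrary):

```python
    def scrape_page(self, url):
        try:
            return requests.get(url, timeout=(3, 30))
        except requests.RequestException:
            return None
```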

Run Scraper Function

The run scraper function brings all of our previous work together and manages our thread pool. The run scraper will continue to run while there are still URLs to crawl. We do this by creating a while True loop and ignoring any exceptions except Empty, which will be thrown if our queue has been empty for more than 60 seconds.
We keep pulling URLs from our queue and submitting them to our thread pool for execution. We then add a callback which will run once the function has returned; this callback in turn calls our parse links and scrape info functions. This continues until we run out of URLs.
We simply add a main block at the bottom of our script to run the function.
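A sketch of the run method and the main block:

```python
    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=60)
                if target_url not in self.crawled_pages:
                    self.crawled_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception:
                continue


if __name__ == '__main__':
    crawler = MultiThreadedCrawler('https://edmundmartin.com')
    crawler.run_scraper()
```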

Performance

When testing this script on several sites with performant servers, I was able to crawl several thousand URLs a minute with only 20 threads. Ideally, you would use a lower number of threads to avoid potentially overloading the site you are scraping.

Performance could be further improved by using XPath and ‘lxml’ to extract links from the site. This is due to ‘lxml’ being written in Cython, making it considerably faster than BeautifulSoup’s default pure Python parsing.

Full Code

 

Scraping JavaScript Heavy Pages with Python & Splash

Scraping the modern web can be particularly challenging. Many websites make use of JavaScript frameworks to serve much of a page’s important content. This breaks traditional scrapers, as they are unable to extract the information we need from the initial HTTP request.

So what should we do when we come across a site that makes extensive use of JavaScript? One option is to use Selenium. Selenium provides us with an easy-to-use API with which we can automate a web browser. This is great for tasks where we need to interact with the page, whether that be to scroll or to click certain elements. It is, however, a bit over the top when you simply want to render JavaScript.

Introducing Splash

Splash is a JavaScript rendering service from the creators of the popular Scrapy framework. Splash can be run as a server on your local machine. The server, built with Python and Twisted, lets us scrape pages through its HTTP API, which means we can render JavaScript pages without the need for a full browser. The use of Twisted also means that Splash handles requests asynchronously, so it can render multiple pages at the same time.

Installing Splash

Full instructions for installing Splash can be found in the Splash docs. That being said, it is highly recommended that you use Splash with Docker, which makes starting and stopping the server very easy.
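With Docker installed, it boils down to something like:

```
docker pull scrapinghub/splash
docker run -p 8050:8050 scrapinghub/splash
```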

Building A Custom Python Crawler With Splash

Splash was designed to be used with Scrapy and Scrapinghub, but it can just as easily be used with Python. In this example we are going to build a multi-threaded crawler using requests and Beautiful Soup. We are going to scrape an e-commerce website which uses a popular JavaScript library to load product information on category pages.

Imports & Class Initialisation

To write this scraper we are only going to use two libraries outside of the standard library. If you have ever done any web scraping before, you are likely to have both Requests and BeautifulSoup installed. Otherwise go ahead and grab them using pip.

We then create a SplashScraper class. Our crawler takes only one argument, namely the URL we want to begin our crawl from. We then use the urlparse library to create a string holding the site’s root URL; we use this URL to prevent our crawler from scraping pages outside of our base domain.

One of the main selling points of Splash, is the fact that it is asynchronous. This means that we can render multiple pages at a time, making our crawler significantly more performant than using a standalone instance of Selenium. To make the most of this we are going to use a ThreadPool to scrape pages, allowing us to make up to twenty simultaneous requests.

We create a queue which we are going to use to grab URLs from and send to be executed in our thread pool. We then create a set to hold all the pages we have already queued. Finally, we put the base URL into our queue, ensuring we start crawling from it.
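A sketch of the imports and the initialiser:

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from queue import Queue, Empty
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


class SplashScraper:
    def __init__(self, base_url):
        self.base_url = base_url
        parsed = urlparse(base_url)
        self.root_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        self.pool = ThreadPoolExecutor(max_workers=20)
        self.scraped_pages = set()
        self.to_crawl = Queue()
        self.to_crawl.put(base_url)
```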

Extracting Links & Parsing Page Data

Next we define two methods to use with our scraped HTML. Firstly, we take the HTML and extract all the links which contain a href attribute. We iterate over our list of links pulling out the href element. If the URL starts with a slash or starts with the site’s URL, we call urlparse’s urljoin method which creates an absolute link out of the two strings. If we haven’t already crawled this page, we then add the URL to the queue.

Our scrape_info method simply takes the HTML and scrapes certain information from the rendered page. We then use some relatively rough logic to pull out the name and price information before writing it to a CSV file. This method can be overwritten with custom logic to pull out the particular information you need.
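Sketches of both methods; the product selectors are hypothetical and depend entirely on the target site:

```python
    def parse_links(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        for link in soup.find_all('a', href=True):
            url = link['href']
            if url.startswith('/') or url.startswith(self.root_url):
                url = urljoin(self.root_url, url)
                if url not in self.scraped_pages:
                    self.to_crawl.put(url)

    def scrape_info(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        names = [el.get_text(strip=True) for el in soup.find_all('h2', {'class': 'product-name'})]
        prices = [el.get_text(strip=True) for el in soup.find_all('span', {'class': 'price'})]
        with open('products.csv', 'a', newline='') as f:
            writer = csv.writer(f)
            for name, price in zip(names, prices):
                writer.writerow([name, price])
```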

Grabbing A Page & Defining Our Callback

When using a thread pool executor, one of the best ways of getting the result out of a function run in a thread is to use a callback, which runs once the function in the thread has completed. We define a super simple callback that unpacks our result and then checks whether the page gave us a 200 status code. If it did, we run both our parse_links and scrape_info methods using the page’s HTML.

Our scrape_page function is very simple. As we are making a request to a server running locally, we don’t need any error handling. We simply pass in a URL, which is formatted into the request to the Splash endpoint, and return the response object, which will then be used in our callback function defined above.
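For example, using Splash's render.html endpoint on its default port (the wait value is arbitrary):

```python
    def post_scrape_callback(self, res):
        result = res.result()
        if result and result.status_code == 200:
            self.parse_links(result.text)
            self.scrape_info(result.text)

    def scrape_page(self, url):
        # Ask the local Splash instance to render the page and hand back the HTML.
        return requests.get('http://localhost:8050/render.html',
                            params={'url': url, 'wait': 2})
```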

Our Crawling Method

Our run_scraper method is basically our main thread. We continue to try to get links from our queue; in this particular example we have set a timeout of 120 seconds, meaning that if we are unable to grab a new URL from the queue for that long, we raise an Empty error and quit the program. Once we have a URL, we check that it is not in our set of already scraped pages before adding it to that set. We then send the URL off for scraping and set our callback method to run once the scrape has completed. We ignore any other exceptions and continue scraping until we have run out of pages we haven’t seen before.
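A sketch:

```python
    def run_scraper(self):
        while True:
            try:
                target_url = self.to_crawl.get(timeout=120)
                if target_url not in self.scraped_pages:
                    self.scraped_pages.add(target_url)
                    job = self.pool.submit(self.scrape_page, target_url)
                    job.add_done_callback(self.post_scrape_callback)
            except Empty:
                return
            except Exception:
                continue
```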

The script in its entirety can be found here on Github.

Scraping Google with Golang

I have previously written a post on scraping Google with Python. As I am starting to write more Golang, I thought I should write the same tutorial using Golang to scrape Google. Why not scrape Google search results using Google’s home-grown programming language?

Imports & Setup

This example will only be using one external dependency. While it is possible to parse HTML using Go’s standard library, doing so involves writing a lot of code. So instead we are going to use the very popular Golang library Goquery, which supports jQuery-style selection of HTML elements.
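The imports used throughout the rest of the post are roughly the following:

```go
// go get github.com/PuerkitoBio/goquery

import (
	"fmt"
	"net/http"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)
```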

Defining What To Return

We can get a variety of different information from Google, but we typically want to return a result’s position, URL, title and description. In Golang it makes sense to create a struct representing the data we want to be gathered by our scraper.

We can build a simple struct which will hold an individual search result, when writing our final function we can then set the return value to be a slice of our GoogleResult struct. This will make it very easy for us to manipulate our search results once we have scraped them from Google.
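A struct along these lines (the field names are my own):

```go
// GoogleResult holds a single organic search result.
type GoogleResult struct {
	ResultRank  int
	ResultURL   string
	ResultTitle string
	ResultDesc  string
}
```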

Making A Request To Google

To scrape Google results we have to make a request to Google using a URL containing our search parameters. For instance Google allows you to pass a number of different parameters to a search query. In this particular example we are going to write a function that will generate us a search URL with our desired parameters.

But first we are going to define a “map” of supported Google geo locations. In this post we are only going to support a few major geographical locations, but Google operates in over 100 different geographical locations.

This will allow us to pass a two-letter country code to our scraping function and scrape results from that particular version of Google. Using the different base domains in combination with a language code allows us to scrape results as they appear in the country in question.
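For example:

```go
// googleDomains maps a two-letter country code to a Google search base URL.
// Only a handful of the locations Google operates in are included here.
var googleDomains = map[string]string{
	"com": "https://www.google.com/search?q=",
	"uk":  "https://www.google.co.uk/search?q=",
	"fr":  "https://www.google.fr/search?q=",
	"ru":  "https://www.google.ru/search?q=",
}
```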

We then write a function that allows us to build a Google search URL. The function takes three arguments, all strings, and returns a URL, also a string. We first trim the search term to remove any leading or trailing white-space, and then replace any remaining spaces with ‘+’; the -1 in this line of code means that we replace every single remaining instance of white-space with a plus.

We then look up the country code passed as an argument against the map we defined earlier. If the countryCode is found in our map, we use the respective URL; otherwise we fall back to the default ‘.com’ Google site. We then use the fmt package’s “Sprintf” function to format a string made up of our base URL, our search term and the language code. We don’t check the validity of the language code, which is something we might want to do if we were writing a more fully featured scraper.
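A sketch of the URL builder (the query parameters used are my own minimal set):

```go
func buildGoogleURL(searchTerm string, countryCode string, languageCode string) string {
	searchTerm = strings.Trim(searchTerm, " ")
	searchTerm = strings.Replace(searchTerm, " ", "+", -1)
	base, found := googleDomains[countryCode]
	if !found {
		base = googleDomains["com"]
	}
	return fmt.Sprintf("%s%s&hl=%s", base, searchTerm, languageCode)
}
```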

We can now write a function to make a request. Go has a very easy-to-use and powerful “net/http” library which makes it relatively simple to make HTTP requests. We first get a client to make our request with. We then start building a new HTTP request which will eventually be executed using our client; this allows us to set custom headers to be sent with the request. In this instance we are replicating the User-Agent header of a real browser.

We then execute this request, with the client’s Do method returning us a response and error. If something went wrong with the request we return a nil value and the error. Otherwise we simply return the response object and a nil value to show that we did not encounter an error.
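Roughly:

```go
func googleRequest(searchURL string) (*http.Response, error) {
	client := &http.Client{Timeout: 15 * time.Second}
	req, err := http.NewRequest("GET", searchURL, nil)
	if err != nil {
		return nil, err
	}
	// Replicate the User-Agent header of a real browser.
	req.Header.Set("User-Agent",
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0 Safari/537.36")
	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	return res, nil
}
```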

Parsing the Result

Google results are divided up into ‘div’ elements with the class ‘g’.

Now we move on to parsing the result of our request. Compared with Python, the options when it comes to HTML parsing libraries are not as rich, with nothing coming close to the ease of use of BeautifulSoup. In this example, we are going to use the very popular Goquery package, which uses jQuery-style selectors to allow users to extract data from HTML documents.

We generate a goquery document from our response, and if we encounter any errors we simply return the error and a nil value object. We then create an empty slice of Google results which we will eventually append results to. On a Google results page, each organic result can be found in ‘div’ block with the class of ‘g’. So we can simply use the JQuery selector “div.g” to pick out all of the organic links.

We then loop through each of the found ‘div’ tags, finding the link and its href attribute, as well as extracting the title and meta description information. Provided the link isn’t an empty string or a navigational reference, we create a GoogleResult struct holding our information, which can then be appended to the slice of structs we defined earlier. Finally, we increment the rank so we can tell the order in which the results appeared on the page.
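A sketch of the parser; the ‘h3’ and ‘span.st’ selectors reflect Google's markup at the time of writing and may need updating:

```go
func googleResultParser(response *http.Response) ([]GoogleResult, error) {
	doc, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		return nil, err
	}
	results := []GoogleResult{}
	rank := 1
	doc.Find("div.g").Each(func(i int, s *goquery.Selection) {
		link, _ := s.Find("a").Attr("href")
		title := s.Find("h3").Text()
		desc := s.Find("span.st").Text()
		if link != "" && link != "#" {
			results = append(results, GoogleResult{
				ResultRank:  rank,
				ResultURL:   link,
				ResultTitle: title,
				ResultDesc:  desc,
			})
			rank++
		}
	})
	return results, nil
}
```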

Wrapping It All Up

We can then simply write a function which encompasses all the previous functions. Note that we capitalise the function name to ensure it is exported, which will allow us to use it in other Go programs. We don’t do much in the way of error handling: if anything goes wrong we simply return nil and the error, without any logging along the way. Ideally, we would want some sort of logging to tell us exactly what went wrong with a particular scrape.
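For example:

```go
// GoogleScraper scrapes a single Google results page for the given search term.
func GoogleScraper(searchTerm string, countryCode string, languageCode string) ([]GoogleResult, error) {
	searchURL := buildGoogleURL(searchTerm, countryCode, languageCode)
	res, err := googleRequest(searchURL)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return googleResultParser(res)
}
```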

Example Usage
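The original example isn't preserved in this copy of the post; a reconstruction along the lines described below (the keywords are placeholders):

```go
func main() {
	keywords := []string{"edmund martin", "web scraping", "golang web scraping"}
	for _, keyword := range keywords {
		results, err := GoogleScraper(keyword, "uk", "en")
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}
		for _, result := range results {
			fmt.Println(result.ResultRank, result.ResultTitle, result.ResultURL)
		}
		// Wait between scrapes to reduce the chance of being served a captcha.
		time.Sleep(30 * time.Second)
	}
}
```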

The above program makes use of our GoogleScraper function by working through a list of keywords and scraping search results. After each scrape we wait a total of 30 seconds, which should help us avoid being banned. Should we want to scrape a larger set of keywords, we would want to randomise our User-Agent and change up the proxy we were using for each request. Otherwise we are very likely to run into a Google captcha which would prevent us from gathering any results.

The full Google scraping script can be found here. Feel free to play with it and think about some of the additional functionality that could be added. You might for instance want to scrape the first few pages of Google, or pass in a custom number of results to be returned by the script.