Quick Tour of The Concurrent Futures Library

Threading in Python often gets a bad rap; however, the situation has improved considerably since the Concurrent Futures library was introduced in Python 3.2. That said, Python threads will only give you an increase in performance in certain circumstances, and it is generally only recommended to use them for IO-bound tasks. The reasons for this are too complicated to go into here, but they relate to the workings of Python’s GIL (Global Interpreter Lock). Those looking to improve the performance of CPU-heavy tasks instead need to make use of multiprocessing. Helpfully, the concurrent futures library provides the same interface for working with both threads and processes. The code in this post focuses on using the library with threads, but many of the same patterns can be applied to code making use of a process pool.

Thread Pools

A core part of the concurrent futures library is the ability to create a thread pool executor. Thread pools make it much easier to manage a group of threads. We simply create an instance of a thread pool and set the number of threads we want to use. We can then submit jobs to be run by the thread pool. This first example just shows how to submit a job to the pool; at this point we have no way to work with the results.
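A minimal sketch of what this looks like (the original listing is omitted, so get_page here is a hypothetical stand-in for a blocking IO task such as downloading a page):

```python
from concurrent.futures import ThreadPoolExecutor

def get_page(url):
    # Stand-in for a blocking IO task such as downloading a page.
    return "<html for {}>".format(url)

# Create a pool with four worker threads and submit a single job to it.
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.submit(get_page, "https://example.com")
```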

Futures – As Completed

We are going to begin by looking at the as-completed pattern. This allows us to submit a number of different tasks to our thread pool and retrieve the results as the tasks complete. This construct can be very handy if we want to fire off a batch of blocking IO tasks and then process the results once all the tasks have completed. These examples use the get_page function as before, which has been omitted for brevity.
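A sketch of the pattern, again with a hypothetical get_page stand-in:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def get_page(url):
    # Hypothetical stand-in for a blocking page download.
    return "<html for {}>".format(url)

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
results = []
with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(get_page, url) for url in urls]
    # as_completed yields each future as soon as it finishes;
    # the optional timeout applies to the iteration as a whole.
    for future in as_completed(futures, timeout=30):
        results.append(future.result())
```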

Here we simply submit our tasks to the thread pool executor and then wait for all of our submitted tasks to complete. We can also set an optional timeout, which will cause our tasks to time out if they take too long.

Futures – Mapping

We can also use the thread pool executor’s map function to take a group of tasks and map them across the different threads in our thread pool. The first argument to the call is the function in question, followed by an iterable of arguments to be passed into the function; this returns an iterator which yields each task’s result in order. Getting results from this is relatively easy: we can get the final results of the tasks in question by simply calling list over the returned iterator. This gives us a list of results which we can work with just as if they had been returned from a normal function.
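For example (get_page is again a hypothetical stand-in):

```python
from concurrent.futures import ThreadPoolExecutor

def get_page(url):
    # Hypothetical stand-in for a blocking page download.
    return "<html for {}>".format(url)

urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=4) as executor:
    # map returns an iterator of results, in the same order as the input.
    results = list(executor.map(get_page, urls))
```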

Futures – Callback

Callbacks provide Python users with one of the most powerful methods for working with thread pools. With a callback, we can submit a task to our thread pool and then call add_done_callback on the future object returned from our submission. One slightly tricky detail is that the callback takes only one argument: the completed future itself, not its result. We can then inspect the future, checking whether an exception was thrown or whether the future was cancelled before it was able to complete its task, before finally handling and processing the result. This allows for some more complicated concurrent programming patterns, with callbacks feeding a queue of additional jobs to be processed.
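A sketch of the callback pattern (get_page and handle_result are hypothetical names):

```python
from concurrent.futures import ThreadPoolExecutor

def get_page(url):
    # Hypothetical stand-in for a blocking page download.
    return "<html for {}>".format(url)

collected = []

def handle_result(future):
    # The callback receives the completed future itself, not its result.
    if future.cancelled() or future.exception() is not None:
        return
    collected.append(future.result())

with ThreadPoolExecutor(max_workers=2) as executor:
    future = executor.submit(get_page, "https://example.com")
    future.add_done_callback(handle_result)
```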

Top Python Books

As a self-taught software engineer, books played an important role in my move from digital marketing into software engineering. While many people these days seem to prefer learning via video courses, I have always found books to be the better tool.

For Beginners

Automate The Boring Stuff with Python

This well-known book, written by the affable Al Sweigart, probably represents one of the better books for absolute beginners. The book introduces basic programming concepts such as functions and types while allowing readers to write short programs that do something useful. I think this approach is very appropriate for absolute beginners, who need to see the power of what they can do with code. The only weakness of the book is that subjects such as classes are completely passed over. I can nevertheless highly recommend the book to anyone who is interested in picking up Python and wants to get up to speed relatively quickly. The book can also be read online for free.

Think Python

Think Python started out as a Python translation of a Java textbook aimed at new Computer Science students. The author of the original textbook then went on to update the book with Python as the core language. The book introduces users to the Python language and aims to get them thinking like computer scientists. This book is a great choice for those who are more serious about Python and want to pursue programming as a career. However, it is likely less instantly gratifying than the ‘Automate The Boring Stuff’ book. Think Python can also be found online for free, which makes it another great choice for beginners.

Grokking Algorithms

Grokking Algorithms is technically not a Python book, but all of the examples contained in the book are written in Python. The book provides readers with a good introduction to algorithmic thinking and several core algorithms that are widely used in computer science. This book should definitely not be your first programming book but should you have made progress with both ‘Think Python’ and ‘Automate The Boring Stuff’ this book could certainly be worth reading and is particularly useful if you are looking to interview for junior level roles.


For More Advanced Readers

Fluent Python

Fluent Python is considered a classic among intermediate to advanced Python programmers, with the book taking the reader on a comprehensive tour of Python 3. The book is very well written, and those reading it are likely to come away with a better understanding of the language as well as some new perspectives on more familiar topics. This is certainly a book for more advanced readers, and it took me a while to fully appreciate it. I would thoroughly recommend the book to upper-intermediate Python users and don’t have any real complaints about its content. The only minor gripe is that the asyncio examples in the book are already outdated, using the old 3.4 coroutine syntax which is likely to be deprecated in future versions of Python.

Effective Python

This relatively short but valuable book by Brett Slatkin is effectively a tour of Python best practices. It provides readers with 59 specific ways to write better Python, grouped into chapters in an understandable manner. The breadth covered by the book is impressive, with chapters on everything from Pythonic thinking to concurrency and parallelism. The topics get more advanced the further you progress, meaning that there is something for everyone. Another great feature of this book is the ability to jump into a specific section should you be looking to refresh your memory on a specific subject. Every so often I take a dive back into Effective Python when I want a refresher on a particular topic.

Python Cookbook

The Python Cookbook is not a textbook, nor does it deal with a single subject. It is instead focused on showing readers how they can deal with a range of different programming problems in a Pythonic way. The book is broken into a number of sections dealing with a diverse range of problems, each of which comes with one or more solutions and a discussion of those solutions. Reading through the book is enlightening, and you are likely to learn a significant amount from it. The structure of the book also makes it easy to dive in and out should you only be interested in the solution to a particular problem.

Python Tricks

The name of this book does it a bit of a disservice. It is a great resource for intermediate-level Python developers looking to push their skills to the next level, and those who are relatively new to the language will definitely pick up some useful information on how to use Python more smartly. The book was only released relatively recently but has already picked up a huge number of glowing reviews.

Basic Introduction to Cython

Python is often criticized for being slow. While in many cases pure Python is fast enough, there are certain situations where it may not give you the performance you need. In recent years a fair number of Python programmers have made the jump to Golang for performance reasons. However, there are a number of ways you can improve the performance of your Python code, such as using PyPy, writing C extensions, or trying your hand at Cython.

What is Cython?

Cython is a superset of the Python language, which means that the vast majority of Python code is also valid Cython code. Cython allows users to write Cython modules which are then compiled and can be used within Python code. This means that users can port performance-critical code into Cython and instantly see increases in performance. The great thing about Cython is that you can decide how far to optimize your code. Simply copying and compiling your Python code might see you make performance gains of 8-12%, whereas more serious optimization of your code can lead to significantly better performance.

Installing Cython

Installing Cython on Linux is very easy to do and just requires the ‘pip install cython’ command. Those on Windows devices will likely have a tougher time, with the simplest solution seeming to be installing ‘Visual Studio Community’ and selecting both the C++ and Python support options. You can then install Cython as you would any other Python package.

Using Pure Python

We are going to begin by compiling a pure Python function. This is a very simple task, and can achieve some limited performance benefits, with a more noticeable increase in performance for functions which make use of for and while loops. To begin, we simply save the below code into a file called ‘looping.pyx’.
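The original listing is omitted, so here is a sketch of such a function (multiply_by_index is a hypothetical name); note it is plain Python and therefore also valid Cython:

```python
# looping.pyx -- pure Python, and therefore also valid Cython.
def multiply_by_index(numbers):
    total = 0
    for index, number in enumerate(numbers):
        total += index * number
    return total
```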

This very simple function takes a list of numbers, multiplies each number by its index, and returns the sum of the results. This code is both valid Python and valid Cython. However, it takes no advantage of any Cython optimizations other than the compilation of the code into C.

We run the below command to create a Cython module which can be used in Python:
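The build step is typically driven by a small setup script (a sketch; this assumes Cython is installed and the file is named ‘looping.pyx’ as above):

```python
# setup.py -- minimal build configuration for the Cython module.
# Build with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("looping.pyx"))
```

Running `python setup.py build_ext --inplace` produces a compiled extension module alongside the source file.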

We can then import our Cython module into Python in the following manner:
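Assuming the module built successfully and the function is named as in the earlier sketch, usage looks like importing any other module:

```python
>>> import looping  # the compiled extension built from looping.pyx
>>> looping.multiply_by_index([1, 2, 3])
8
```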

What kind of performance benefits can we expect from just compiling this Python code into a Cython module?

I ran some tests and on average the Cython compiled version of the code took around 10% less time to run over a set of 10,000 numbers.

Adding Types

Cython achieves its optimizations by introducing typing into Python code. Cython supports both Python and C types. Python types tend to be more flexible but give you less in terms of performance benefits. The below example makes use of both C and Python types; however, we have to be very careful when using C types. For instance, we could trigger an overflow error should the list of numbers we pass in be large enough that the result of the multiplication is too large to store in a C long.
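A sketch of the typed variant (the function name is hypothetical, carried over from the earlier example; note the overflow caveat above applies to the C long):

```cython
# looping_typed.pyx -- hybrid typing: a Python type for the argument,
# C types for the internals.
def multiply_by_index(list numbers):
    cdef int length = len(numbers)
    cdef long total = 0
    for index in range(length):
        total += index * numbers[index]
    return total
```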

As you can see, we use the Python type ‘list’ to annotate our input list. We then define two C types, which will be used to store the length of our list and our output, and loop over our list in exactly the same way as we did in our previous example. This shows just how easy it is to start adding C types to Python code with the help of Cython, and illustrates how easily both C and Python types can be mixed together in one extension module.

This hybrid code when tested was between 15-30% faster than the pure Python implementation without taking the most aggressive path of optimization and turning everything into a C type. While these savings may seem small, they can really add up on operations which are repeated hundreds of thousands of times.

Cython Function Types

Unlike standard Python, Cython has three types of functions. These functions differ in how they are defined and where they can be used.

  • Cdef functions – can only be used in Cython code and cannot be imported into Python.
  • Cpdef functions – can be used and imported in both Python and Cython. When called from Cython they behave as a cdef function, and when called from Python they behave as a standard Python function.
  • Def functions – are like your standard Python functions and can be used in and imported into Python code.

The below code block demonstrates how each of these three function types can be defined.
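A sketch of the three definitions (the function names and bodies are hypothetical placeholders):

```cython
cdef long c_only(long x):
    # Callable only from other Cython code.
    return x * 2

cpdef long hybrid(long x):
    # Fast C call from Cython, but also importable from Python.
    return x * 2

def python_style(x):
    # A regular Python function, fully importable from Python.
    return x * 2
```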

This allows you to define highly performant Cdef functions for use within Cython modules, while at the same time allowing you to write functions that are totally compatible with Python.  Cpdef functions are a good middle ground, in the sense that when they are used in Cython code they are highly optimized while remaining compatible with Python, should you want to import them into a Python module.


While this introduction only touches the surface of the Cython language, it should be enough to begin optimizing code using Cython. However, some of the more aggressive optimizations and the full power of C types are well beyond the scope of this post.

Detecting Selenium

When looking to extract information from harder-to-scrape sites, many programmers turn to browser automation tools such as Selenium and iMacros. At the time of writing, Selenium is by far the most popular option for those looking to leverage browser automation for information-retrieval purposes. However, Selenium is very detectable, and site owners are able to block a large percentage of all Selenium users.

Selenium Detection with Chrome

When using Chrome, the Selenium driver injects a webdriver property into the browser’s navigator object. This means it’s possible to write a couple of lines of JavaScript to detect that the user is using Selenium. Such a snippet simply checks whether webdriver is set to true and redirects the user should this be the case. I have never seen this technique used in the wild, but I can confirm that it seems to successfully redirect those using Chrome with Selenium.
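A sketch of the check (detectWebdriver is a hypothetical name; in a real page it would be called against the browser’s own navigator object):

```javascript
// ChromeDriver injects a `webdriver` property into the navigator object.
function detectWebdriver(nav) {
  return Boolean(nav && nav.webdriver);
}

// In a real page:
//   if (detectWebdriver(navigator)) { window.location.href = "/blocked"; }
```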

Selenium Detection with Firefox

Older versions of Firefox used to inject a webdriver attribute into the HTML document. This means that older versions of Firefox could be very simply detected by checking for this attribute. At the time of writing, Firefox no longer adds this attribute to pages when using Selenium.
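A sketch of what such a check looked like (checkDocumentAttribute is a hypothetical name; in a real page it would be called against the browser’s own document object):

```javascript
// Older Firefox WebDriver builds added a webdriver attribute to the
// document element.
function checkDocumentAttribute(doc) {
  return doc.documentElement.getAttribute("webdriver") === "true";
}
```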

Additional methods of detecting Selenium when using Firefox have also been suggested. Testing seems to suggest that these do not work with the latest builds of Firefox. However, the webdriver standard suggests that this may eventually be implemented in Firefox again.

Selenium Detection with PhantomJS

All current versions of PhantomJS add attributes to the window object. This allows site owners to simply check whether these specific PhantomJS attributes are set and redirect the user away when it turns out they are using PhantomJS. It should also be noted that support for the PhantomJS project has been rather inconsistent, and the project makes use of an outdated WebKit version, which is itself detectable and could present a security risk.
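A sketch of such a check (detectPhantom is a hypothetical name; window.callPhantom and window._phantom are the commonly cited PhantomJS additions):

```javascript
// PhantomJS exposes extra properties on the window object.
function detectPhantom(win) {
  return Boolean(win.callPhantom || win._phantom);
}
```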

Avoiding Detection

Your best bet for avoiding detection when using Selenium is to use one of the latest builds of Firefox, which don’t appear to give off any obvious sign that you are using Selenium. Additionally, it may be worth experimenting with both Safari and Opera, which are much less commonly used by those scraping the web. That said, it seems likely that Firefox gives off some less obvious footprint, which would need further investigation to discover.

Scraping & Health Monitoring free proxies with Python

When web scraping, you often need to source a number of proxies in order to avoid being banned or to get around rate limiting imposed by the website in question. This often sees developers purchasing proxies from a commercial provider, which can become quite costly if you only need the proxies for a short period of time. So in this post we are going to look at how you might use proxies from freely available proxy lists to scrape the internet.

Problems With Free Proxies

  • Free Proxies Die Very Quickly
  • Free Proxies Get Blocked By Popular Sites
  • Free Proxies Frequently Timeout

While free proxies are great in the sense that they are free, they tend to be highly unreliable: their up-time is inconsistent, and they get blocked quickly by popular sites such as Google. Our solution is therefore going to build in some monitoring of each proxy’s current status, allowing us to avoid using proxies which are currently broken.

Scraping Proxies

We are going to use free-proxy-list.net as our source for this example, though the example could easily be expanded to cover multiple sources of proxies. We write a simple method which visits the page and pulls out all the proxies from it using our chosen user-agent. We then store the results in a dictionary, with each proxy acting as a key holding the information relating to that particular proxy. We are not doing any error handling here; that will be handled in our ProxyManager class.
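A sketch of the scraping method, assuming the site’s usual table layout (IP, port, code, country, anonymity, …); the parsing is split into its own function and the names are hypothetical:

```python
import requests
from bs4 import BeautifulSoup

def parse_proxy_rows(html):
    """Turn the proxy-list table into {"ip:port": info} entries."""
    proxies = {}
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.select("table tbody tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 5:
            continue
        # Assumed column order: IP, Port, Code, Country, Anonymity, ...
        proxies["{}:{}".format(cells[0], cells[1])] = {
            "country": cells[3],
            "anonymity": cells[4],
            "alive": False,
            "last_checked": None,
        }
    return proxies

def scrape_proxies(user_agent, url="https://free-proxy-list.net/"):
    response = requests.get(url, headers={"User-Agent": user_agent})
    return parse_proxy_rows(response.text)
```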

Proxy Manager

Our proxy manager is a simple class which allows us to get and manage the proxies we find on free-proxy-list.net. We pass in a test URL, which will be used to check whether a proxy is working, and a user agent to be used for both scraping and testing the proxies in question. We also create a thread pool so we can check the status of our scraped proxies more quickly. We then call update_proxy_list, loading the proxies found on free-proxy-list.net into our dictionary of proxies.
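A sketch of the class; note that the scraping function is injected here as a `scraper` argument so the class can be exercised without network access, whereas the original presumably called its scraping function directly:

```python
from concurrent.futures import ThreadPoolExecutor

class ProxyManager:
    def __init__(self, test_url, user_agent, max_workers=20, scraper=None):
        self.test_url = test_url
        self.user_agent = user_agent
        # Thread pool used to check many proxies concurrently.
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        # Injected for testability; hypothetical signature scraper(user_agent).
        self.scraper = scraper
        self.proxies = {}
        self.update_proxy_list()

    def update_proxy_list(self):
        if self.scraper is not None:
            self.proxies.update(self.scraper(self.user_agent))
```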

Checking Proxies

We can now write a couple of methods to test whether a particular proxy works. The first method takes the proxy and the dictionary of information related to it. We immediately set the last-checked variable to the current time, then make a request against our test URL with a relatively short timeout. We also check the status of the response, raising an exception should we receive a non-200 status code. Should anything go wrong, we set the status of the proxy to dead; otherwise we set the status to alive.

We then write our refresh-proxy-status method, which simply calls our check-proxy method for each proxy. We iterate over our dictionary, submitting each proxy and its related info to a thread. If we didn’t use threads to check the status of our proxies, we could be waiting a very long time for our results. We then loop through the completed results and update the status of each proxy in question.
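A sketch of both steps; the `get` callable is injected (normally you would pass requests.get) so the logic can be exercised without a live proxy, and the function names are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_proxy(proxy, info, test_url, user_agent, get, timeout=5):
    # `get` is a requests.get-style callable, injected for testability.
    info["last_checked"] = time.time()
    try:
        response = get(
            test_url,
            headers={"User-Agent": user_agent},
            proxies={"http": "http://" + proxy, "https": "http://" + proxy},
            timeout=timeout,
        )
        response.raise_for_status()  # non-200 counts as a failure
        info["alive"] = True
    except Exception:
        info["alive"] = False
    return proxy, info

def refresh_proxy_status(proxies, test_url, user_agent, get, max_workers=20):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(check_proxy, proxy, info, test_url, user_agent, get)
            for proxy, info in proxies.items()
        ]
        for future in as_completed(futures):
            proxy, info = future.result()
            proxies[proxy] = info
    return proxies
```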

Getting A Proxy

We then write two methods for getting ourselves a proxy. Our first method allows us to get a list of proxies by passing in a relevant key and value. This allows us to get a list of proxies that relate to a particular country or boast a particular level of anonymity, which can be useful should we be interested in particular properties of a proxy.

We also have a simple method that allows us to return a single working proxy. This returns the first working proxy found within our proxy dictionary by looping over all the items in the dictionary, and returning the first proxy where ‘alive’ is equal to true.
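Sketches of both helpers, written here as standalone functions over the proxy dictionary (the names are hypothetical):

```python
def get_proxies_by(proxies, key, value):
    # e.g. key="country" or key="anonymity"
    return [proxy for proxy, info in proxies.items() if info.get(key) == value]

def get_working_proxy(proxies):
    # Return the first proxy whose "alive" flag is true.
    for proxy, info in proxies.items():
        if info.get("alive"):
            return proxy
    return None
```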

Example Usage

Using the library is pretty simple. We create the class, passing in our test URL (using Google.com here) and our selected user-agent. We then call refresh_proxy_status, updating the status of the scraped proxies by running them against our test URL. We can then pull out an individual working proxy. Should we not be satisfied with the proxies we currently have access to, we can update our proxy list with a fresh scrape of our source.

Full Code

Scraping Instagram with Python

In today’s post we are going to look at how you can extract information from a user’s Instagram profile. It’s surprisingly easy to extract profile information, such as the number of followers a user has, along with information and image files for a user’s most recent posts. With a bit of effort it would be relatively easy to extract large chunks of data regarding a user. This could then be applied at a very broad scale to extract a large chunk of all the public posts featured on Instagram’s site.

Imports & Setup

We begin by making our imports and writing the dunder init method for our class. Our code requires two packages not included in the standard library: requests, for making HTTP requests, and BeautifulSoup, to make HTML parsing more user friendly. If you do not already have these libraries installed, you can use the following pip command:
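```shell
pip install requests beautifulsoup4
```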

The init method of our class takes two optional keyword arguments, which we simply store in self. This will allow us to override the default user agent list and use a proxy should we wish to avoid detection.

We then write two helper methods. First, we write a very simple method that returns a random user-agent. Switching user agents is often a best practice when web scraping and can help you avoid detection. Should the caller of our class have provided their own list of user agents, we take a random agent from the provided list. Otherwise we return our default user agent.
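A sketch of the init method and the user-agent helper (the class name, default agent string, and method names are hypothetical reconstructions):

```python
import random

# Hypothetical default; the original presumably shipped its own string.
DEFAULT_USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

class InstagramScraper:
    def __init__(self, user_agents=None, proxy=None):
        # Both arguments are optional overrides.
        self.user_agents = user_agents
        self.proxy = proxy

    def random_user_agent(self):
        if self.user_agents:
            return random.choice(self.user_agents)
        return DEFAULT_USER_AGENT
```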

Our second helper method is simply a wrapper around requests. We pass in a URL and try to make a request using the provided user agent and proxy. If we are unable to make the request, or Instagram responds with a non-200 status code, we simply re-raise the error. If everything goes fine, we return the HTML of the page in question.

Extracting JSON from JavaScript

Instagram serves all of the information regarding a user in the form of a JavaScript object. This means that we can extract all of a user’s profile information and their recent posts by just making an HTTP request to their profile page. We simply need to turn this JavaScript object into JSON, which is very easy to do.

We can write a very hacky but effective method to extract the JSON from a user profile. We apply the static method decorator to this function, as it’s possible to use it without initializing our class. We simply create a soup from the HTML, select the body of the content, and then pull out the first ‘script’ tag. We can then do a couple of text replacements on the script tag to derive a string which can be loaded into a dictionary object using the json.loads method.
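A sketch of the extraction, written here as a standalone function; it assumes the data is embedded as a `window._sharedData = {...};` assignment in the first script tag of the body, which is how Instagram served it at the time of writing:

```python
import json
from bs4 import BeautifulSoup

def extract_json(html):
    soup = BeautifulSoup(html, "html.parser")
    script = soup.select_one("body script").get_text()
    # Strip the JavaScript assignment wrapper, leaving plain JSON.
    raw = script.replace("window._sharedData = ", "").rstrip().rstrip(";")
    return json.loads(raw)
```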

Bringing it all together

We then bring it all together in two functions which we can use to extract information from this very large JSON object. We first make a request to the page, before extracting the JSON result. We then use two different selectors to pull out the relevant bits of information, as the default JSON object has lots of information we don’t really need.

When extracting profile information we extract all attributes from the “user” object, excluding their recent posts. In the “recent posts” function, we use a slightly different selector and pull out all the information about all of the recent posts made by our targeted user.
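A sketch of the two selectors; the key layout shown is how Instagram structured the shared-data object at the time of writing and is subject to change:

```python
def parse_profile(shared_data):
    user = shared_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
    # Everything about the user except the recent-posts subtree.
    return {k: v for k, v in user.items() if k != "edge_owner_to_timeline_media"}

def parse_recent_posts(shared_data):
    user = shared_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
    return user["edge_owner_to_timeline_media"]["edges"]
```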

Example Usage

We can then use the Instagram scraper in a very simple fashion to pull out the most recent posts from our favorite users. You could do lots of things with the resulting data: it could be used in an Instagram analytics app, for instance, or you could simply programmatically download all the images relating to a particular user.

There is certainly room for improvement and modification. It would also be possible to use Instagram’s graph API to pull out further posts from a particular user, or to pull out lists of a user’s recent followers, allowing you to collect large amounts of data without having to deal with Facebook’s restrictive API limitations and policies.

Full Code

Writing a web crawler in Python 3.5+ using asyncio

The asyncio library was introduced to Python in version 3.4. However, the async/await syntax was not introduced into the language until Python 3.5. This functionality allows us to write asynchronous web crawlers without having to use threads. Getting used to asynchronous programming can take a while, so in this tutorial we are going to build a fully functional web crawler using asyncio and aiohttp.

Fan In & Fan Out Concurrency Pattern

We are going to write a web crawler which will continue to crawl a particular site until we reach a defined maximum depth. We are going to make use of a fan-in/fan-out concurrency pattern: essentially, this involves gathering together a set of tasks, distributing them across a bunch of threads (or, in our case, across coroutines), then gathering all the results together again before processing them and fanning out a new group of tasks. I would highly recommend Brett Slatkin’s 2014 talk, which inspired this particular post.

Initializing Our Crawler

We begin by importing the libraries required for our asyncio crawler. We are using a couple of libraries which are not included in Python’s standard library. These can be installed using the following pip command:
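```shell
pip install aiohttp lxml
```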

We can then start defining our class. Our crawler takes two positional arguments and one optional keyword argument. We pass in the start URL, which is the URL we begin our crawl with, and we also set the maximum depth of the crawl. Finally, we pass in a maximum concurrency level, which by default prevents our crawler from making more than 200 concurrent requests at a time.

The start URL is then parsed to give us the base URL for the site in question. We also create a set of URLs which we have already seen, to ensure that we don’t end up crawling the same URL more than once, and we create a session using aiohttp.ClientSession so that we can skip having to create a session every time we scrape a URL. Doing this will throw a warning, but the creation of a client session is synchronous, so it can safely be done outside of a coroutine. We also set up an asyncio BoundedSemaphore using our max concurrency variable; we will use this to prevent our crawler from making too many concurrent requests at one time.
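A sketch of the constructor (attribute names are hypothetical reconstructions):

```python
import asyncio
from urllib.parse import urlparse

import aiohttp

class AsyncCrawler:
    def __init__(self, start_url, max_depth, max_concurrency=200):
        self.start_url = start_url
        self.max_depth = max_depth
        parsed = urlparse(start_url)
        self.base_url = "{}://{}".format(parsed.scheme, parsed.netloc)
        self.seen_urls = set()
        # Creating the session outside a coroutine triggers a warning, but
        # the constructor itself is synchronous, so this is safe.
        self.session = aiohttp.ClientSession()
        self.semaphore = asyncio.BoundedSemaphore(max_concurrency)
```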

Making An Async HTTP Request

We can then write a function to make an asynchronous HTTP request. Making a single asynchronous request is pretty similar to making a standard HTTP request. As you can see, we write “async” prior to the function definition. We begin with an async context manager, using the bounded semaphore created when we initialized our class. This limits concurrent requests to whatever value we passed in when creating an instance of the AsyncCrawler class.

We then use another async context manager within a try/except block to make a request to the URL and await the response, before finally returning the HTML.
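A sketch of the coroutine, written here as a standalone function taking the session and semaphore as arguments (the name http_request is hypothetical):

```python
import asyncio

async def http_request(session, semaphore, url):
    # The semaphore caps how many requests are in flight at once.
    async with semaphore:
        try:
            async with session.get(url) as response:
                return await response.text()
        except Exception:
            return None
```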

Extracting URLs

We can then write a standard function to extract all the URLs from an HTML response. We create a DOM (Document Object Model) object from our HTML using lxml’s html sub-module. Once we have our document model, we are able to query it using either XPath or CSS selectors. Here we use a simple XPath selector to pull out the ‘href’ attribute of every link found on the page in question.

We can then use urllib.parse’s urljoin function with our base URL and the found href. This gives us an absolute URL, automatically resolving any relative URLs we may have found on the page. If we haven’t already crawled this URL and it belongs to the site we are crawling, we add it to our list of found URLs.
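A sketch of the URL-extraction step, written as a standalone function (the name find_urls is hypothetical):

```python
from urllib.parse import urljoin

from lxml import html as lxml_html

def find_urls(base_url, seen_urls, html_text):
    found = []
    dom = lxml_html.fromstring(html_text)
    for href in dom.xpath("//a/@href"):
        url = urljoin(base_url, href)  # resolves relative links
        # Keep only unseen URLs that belong to the site being crawled.
        if url not in seen_urls and url.startswith(base_url):
            seen_urls.add(url)
            found.append(url)
    return found
```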

The extract async function is a simple wrapper around our HTTP request and find URL functions. Should we encounter any error, we simply ignore it. Otherwise we use the HTML to create a list of URLs found on that page.

Fanning In/Out

Our extract_multi_async function is where we fan out. The function takes a list of URLs to be crawled. We begin by creating two empty lists: the first holds the futures which refer to jobs to be done, while the second holds the results of these completed futures. We start by adding a call to our self.extract_async function for each URL passed into the function. These are futures in the sense that they are tasks which will be completed in the future.

To gather the results from these futures, we use asyncio’s as_completed function, which yields each future as it finishes. Iterating over it therefore doesn’t finish until all of the futures are completed, meaning that we end up returning a list of completed results.
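A sketch of the fan-out/fan-in step, written as a standalone coroutine with the per-URL coroutine injected as an argument (in the full crawler this would be self.extract_async):

```python
import asyncio

async def extract_multi_async(urls, extract_async):
    # Fan out: schedule one future per URL.
    futures = [asyncio.ensure_future(extract_async(url)) for url in urls]
    results = []
    # Fan in: collect each result as its future completes.
    for future in asyncio.as_completed(futures):
        results.append(await future)
    return results
```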

Running Our Crawler

We have a parser function defined here which will by default raise a NotImplementedError. So in order to use our crawler, we will have to subclass it and write our own parsing function, which we will do in a minute.

Our main function kicks everything off. We start by scraping our start URL and returning a batch of results. We then iterate over these results, pulling out the URL, data, and new URLs from each one. We send the HTML off to be parsed, append the relevant data to our list of results, and add the new URLs to our to_fetch variable. We continue this process until we have reached our maximum crawl depth, and return all the results collected during the crawl.
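The depth-limited loop can be sketched as follows; the batch function is injected here for clarity (in the full crawler it would be the fan-out/fan-in coroutine), and it is assumed to return (url, data, found_urls) tuples:

```python
import asyncio

async def crawl(start_url, max_depth, extract_batch):
    results = []
    to_fetch = [start_url]
    for _depth in range(max_depth):
        # Fan out over the current batch of URLs.
        batch = await extract_batch(to_fetch)
        to_fetch = []
        for url, data, new_urls in batch:
            results.append((url, data))
            to_fetch.extend(new_urls)  # next depth level
    return results
```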

Sub Classing & Running the Crawler

Sub-classing the crawler is very simple, as we are able to write any function we wish to handle the HTML data returned by our crawler. Our example parse function simply tries to extract the title from each page found by the crawler.

We can then call the crawler in a similar way to how we would call an individual asyncio function. We first initialize our class, before creating a future with the asyncio.Task function, passing in our crawl_async function. We then need an event loop to run this function in, which we create and run until the function has completed. We then close the loop and grab the results from our future by calling .result() on the completed future.
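A sketch of both steps: a subclass with a naive title parser, plus the loop-running pattern described above (crawl_stub stands in for the real crawl coroutine, and new_event_loop is used here to avoid deprecation warnings on newer Python versions):

```python
import asyncio

class BaseCrawler:
    def parse(self, data):
        # Subclasses must override this to handle crawled HTML.
        raise NotImplementedError

class TitleCrawler(BaseCrawler):
    def parse(self, data):
        # Naive title extraction, for illustration only.
        start = data.find("<title>")
        end = data.find("</title>")
        if start == -1 or end == -1:
            return None
        return data[start + len("<title>"):end]

async def crawl_stub():
    # Stand-in for the real crawl coroutine.
    return TitleCrawler().parse("<html><title>Example</title></html>")

loop = asyncio.new_event_loop()
future = loop.create_task(crawl_stub())
loop.run_until_complete(future)
loop.close()
result = future.result()
```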

Full Code


Image Classification with TFLearn and Python

In today’s post we are going to walk through how to build a flexible image classifier using TFLearn and Python. For those not familiar with TFLearn, it is a wrapper around the very popular TensorFlow library from Google. Building an image classifier with TFLearn is relatively simple, and we are going to walk through it step by step.


We are going to need to import a number of different libraries in order to build our classifier. For users on Windows, the easiest way to install the SciPy library is to use the pre-compiled wheel which can be found here. Once you have installed all the required libraries, we can start building our ImageClassify class.

Initializing Our Class

When initializing our class, we are going to need to know a few pieces of information. We are going to need a list of class names. These are the names of the different objects that our classifier is going to classify. We also need to pass in an image size, the classifier will automatically resize our images into a square image of the specified size. So, if we pass in a value of 100, our classifier will end up resizing our images to be 100×100 pixels in size.

Generally, the larger the image size the better the classification we will end up with. This is provided that your images are larger than the specified value. It should be warned that using larger images will increase the time taken to train the algorithm. We store this value in self.image_size.

We also pass in default values for our learning rate and test split. The learning rate dictates how quickly the model adjusts its weights during training; as a default, a value of 0.001 tends to work well. The test split defines what percentage of samples we will use to validate our model against. Again, using around ten percent of samples for your test split works pretty well.

We also create empty lists which will end up holding our image data and their respective labels.
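As a rough sketch, the constructor described above might look something like the following. The class and attribute names here are illustrative, not taken from the original code:

```python
class ImageClassify:
    """A minimal sketch of the classifier's constructor; names are illustrative."""

    def __init__(self, class_names, image_size, learning_rate=0.001, test_split=0.1):
        self.class_names = class_names      # e.g. ["cat", "dog"]
        self.image_size = image_size        # images are resized to image_size x image_size
        self.learning_rate = learning_rate  # 0.001 is a sensible default
        self.test_split = test_split        # fraction of samples held out for validation
        self.image_data = []                # filled in as images are processed
        self.labels = []                    # one-hot labels, parallel to image_data
```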

Labeling An Image

Our extract label function takes an image and extracts the label from it. We begin by creating an array of zeros, with one zero for each class to be trained. We then split the file name of the image. Our extract label function expects images in the following format “class.number.png”. Using this format allows us to extract the class name directly from the file name. We then look up the index of the class label and set that value in our array of zeros to a 1, before returning the array itself.
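A stand-alone sketch of this one-hot labelling logic, assuming the “class.number.png” naming scheme described above (the function name is illustrative):

```python
import os

def extract_label(class_names, image_path):
    # One zero per class we are training on.
    label = [0] * len(class_names)
    # File names are expected in the format "class.number.png",
    # so the class name is everything before the first dot.
    class_name = os.path.basename(image_path).split(".")[0]
    # Flip the matching class's slot to 1, then return the one-hot array.
    label[class_names.index(class_name)] = 1
    return label
```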

Processing Images

Our process image function first calls our label function. We then read the image using skimage’s io.imread function and resize it to the size specified when we initialized the class. Finally, we append the image data and the labels to self.image_data and self.labels respectively.

Processing all of our images simply involves running our process image function on every single image we provide to our image classification class.

Building Our Model

Our build model function simply builds us a convolutional net model, using the parameters we defined when initializing our class. Explaining the workings of the net is probably beyond the scope of this post, but I will note that creating our model like this allows our classifier to be used with images of any size and datasets with any number of classes. Creating a build model function also makes it easier to load and predict using pre-trained models.

Training Our Model

Our train_model function takes a model name, epochs and a batch size parameter. The epochs parameter determines the number of times the model will be run over the entirety of the dataset. The batch size determines the number of samples to be run through the model at once. Generally, the more epochs the more accurate the model will be, though too many epochs may mean that your model overfits the dataset and you end up with rather inaccurate predictions when you use the model on unseen data. If accuracy hits 100% and loss goes to 0, this is a very strong indication that you have overfit.

We begin by creating X, y variables from the self.image_data and self.labels variables. We then use our self.test_split value to split the dataset up into training and test sets, before calling the build model function. Finally, we call the fit method on the model, training on the training set and using the test set for validation.
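The split arithmetic can be sketched as follows. This is a minimal stand-in for illustration only; in the real class TFLearn's model.fit handles the actual training and validation:

```python
def train_test_split(X, y, test_split=0.1):
    # Hold back the last `test_split` fraction of samples for validation.
    cut = int(len(X) * (1 - test_split))
    return X[:cut], X[cut:], y[:cut], y[cut:]
```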

Once we have finished training the model, we save it using the passed in model name and set self.model to our newly trained model.

Loading A Pre-Trained Model & Predicting Images

We can define a very simple function to load a model. This will be useful when we need to predict images sometime after we have trained a model. We can load a model by simply passing in the model’s name.

We then need another function to take an image and transform it to something we can use in our prediction function. This is much like our process image function, with the exception that we have no need to label the image.

Our predict image function takes a path to an image file. We call our _image_to_array function, and the resulting data can then be fed straight into the model. Our model will then output an array of probabilities. We can line these up with the classes which we provided to the Image Classify class, then pull out the most probable label, before returning this and the list of probabilities.
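The “line up and pick the most probable” step can be sketched like this (the function name is illustrative; in the real class the probabilities come from the trained model):

```python
def most_probable_label(class_names, probabilities):
    # Pair each class name with the model's output probability
    # and pick the index with the highest probability.
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best_index], list(zip(class_names, probabilities))
```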

Example Usage: Training A Model

Example Usage: Making A Prediction With An Already Trained Model

Full Code & Example Dataset

The full code and an example data set can be found on my Github here. The Github also contains another image classification model which makes use of Google’s Googlenet model. This model is highly accurate but takes a considerable amount of time to train and is likely to need to be run for a greater number of epochs.

Text Classification with Python & NLTK

Machine learning frameworks such as Tensorflow and Keras are currently all the rage, and you can find several tutorials demonstrating the usage of CNNs (Convolutional Neural Nets) to classify text. Often this can be overkill, and in this post we are going to show you how to classify text using Python’s NLTK library. The NLTK (Natural Language Toolkit) provides Python users with a number of different tools to deal with text content, including some basic classification capabilities.

Input Data

In the example, I’m using a set of 10,000 tweets which have been classified as being positive or negative. Our classifier is going to take input in CSV format, with the left column containing the tweet and the right column containing the label. An example of the data can be found below:

Using your own data is very simple: it just requires that your left column contains your text document, while the column on the right contains the correct label. This allows our classifier to classify a wide range of documents with labels of your choosing. The data used for this example can be downloaded here.

The Bag of Words Approach

We are going to use a bag of words approach. Simply put, we take a certain number of the most common words found throughout our data set and then, for each document, we check whether the document contains each of these words. The bag of words approach is conceptually simple and doesn’t require us to pad documents to ensure that every document in our sample set is the same length. However, it tends to be less accurate than a word embedding approach. By simply checking whether a document contains a certain set of words we miss out on a lot of valuable information, including the position of the words in the document. Despite this, we can easily train a classifier which can achieve 80%+ accuracy.

Initialising our class and reading our CSV file

Our CSV classifier is going to take several arguments. Firstly, we pass the name of our CSV file. Then, as optional parameters, we pass a featureset_size and a test ratio. By default, our classifier will use the 1,000 most common words found in our dataset to create our feature set, and will test the accuracy of our classifier against 10% of the items contained in our data set. We then initialise a few variables which will be used later by our classifier.

We then come on to reading our CSV file. We simply iterate through each line, splitting it by commas. The text after the last comma is the document’s label, while everything to the left is the document itself. By applying a regex to our document, we produce a list of words contained in said document. In the example, I used a very simple regex to pull out the words, but it is possible to replace this with a more complex tokenizer. For each word in the document, we append it to the list of words; this will allow us to determine the frequency with which words occur in our dataset. We also place the list of words found in each document into the variable where we store all the documents found in our dataset.
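The reading step can be sketched as below. The function name and the exact regex are illustrative; the original class reads from a file rather than a string, but the splitting logic is the same:

```python
import csv
import io
import re

def read_documents(csv_text):
    """Parse "document,label" rows into (words, label) pairs plus a flat word list."""
    documents, all_words = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue
        # Everything after the last comma is the label; the rest is the document.
        label = row[-1].strip()
        text = ",".join(row[:-1])
        # A deliberately simple tokenizer; swap in something smarter if needed.
        words = re.findall(r"[a-zA-Z]+", text.lower())
        documents.append((words, label))
        all_words.extend(words)
    return documents, all_words
```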

Extracting Word Features

We then write the functions which handle generating our feature set. Here, we use NLTK’s FreqDist class to store the frequency with which different words were found throughout the dataset. We iterate through all of the words in each document, creating a new record should we not have seen the word before and incrementing the count should it have already been found. We then limit our bag of words to be equal to the feature set size we passed when we initialised the class.

Now that we have a list of the most frequently found words, we can write a function to generate features for each of the documents in our dataset. As we are using a bag of words approach, we are only interested in whether the document contains each word in the 1,000 most frequent words. If we find the word we return True, otherwise we return False. Eventually, we get a dictionary of 1,000 features which will be used to train the classifier.
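The two steps above can be sketched as follows. Note that collections.Counter stands in for NLTK's FreqDist here so the example is self-contained; the real class uses FreqDist, which behaves analogously:

```python
from collections import Counter

def build_word_features(all_words, featureset_size=1000):
    # Keep only the N most frequent words across the whole dataset.
    return [word for word, _ in Counter(all_words).most_common(featureset_size)]

def document_features(word_features, document_words):
    # Bag of words: one boolean feature per frequent word, True if the
    # document contains that word and False otherwise.
    words = set(document_words)
    return {word: (word in words) for word in word_features}
```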


We start by shuffling the documents. Some algorithms and classifiers can be sensitive to the order of data, which makes it important to shuffle our data before training. We then use our feature set function within a list comprehension, which returns us a list of tuples containing our feature set dictionary and the document’s label. We then calculate where to split our data into training and test sets; the test set allows us to check how the classifier performs against an unseen dataset. We can then pass our training set to NLTK’s naïve Bayes classifier. The actual training may take some time, and will take longer the larger the dataset used. We then check the classifier’s accuracy against both the training and test set. In all likelihood the classifier will perform significantly better against the training set.

Classifying New Documents

Once we have trained a classifier, we can write a function to classify new documents. If we have not already loaded our CSV file and generated the word features, we will have to do this before classifying our new document. We then simply generate a new set of features for this document and pass it to our classifier’s classify method. The function returns the string of the predicted label.

Saving and Loading Model

Rather than training the model every time we want to classify a sentence, it makes sense to save the model. We can write two simple functions to allow us to reuse our model whenever we want. The save function simply saves our classifier and feature words objects to files, which can then be reloaded by our load model function.


Algorithm                       Train    Test
Naive Bayes Classifier (NLTK)   84.09%   72.89%
BernoulliNB (Sklearn)           83.93%   79.78%
MultinomialNB (Sklearn)         84.58%   74.67%
LogisticRegression (Sklearn)    89.05%   75.33%
SGDClassifier (Sklearn)         81.23%   69.32%

The algorithm performs relatively well against our example data, correctly classifying whether a Tweet is positive or negative around 72% of the time. NLTK gives its users the option to replace the standard Naive Bayes Classifier with a number of other classifiers found in the Scikit-learn package. I ran the same test swapping in these classifiers for the Naive Bayes Classifier, and a number of them significantly outperformed the standard naive classifier. As you can see, the BernoulliNB model performed particularly well, correctly classifying documents around 80% of the time.

The accuracy of the classifier could be further improved by using an ensemble classifier. To build an ensemble classifier we would simply build several models using different classifiers, and then classify new documents against all of them. We could then select the answer which was provided by the majority of our classifiers (a hard voting classifier). Such a classifier would likely outperform just using one of the above classifiers. The full code below provides a function that allows you to try out other Sklearn classifiers.

Example Usage

The class is pretty easy to use. The above code outlines all of the steps required to train a classifier and classify an unseen sentence. More usage examples and the full code can be found on Github here.

Full Code


Scraping Baidu with Python


What’s Baidu?

Baidu is China’s largest search engine and has been since Google left the market in 2010. As companies look to move into the Chinese market, there has been more and more interest in scraping search results from Baidu.

Scraping Baidu

Scraping Baidu is a relatively simple task. There is only one minor challenge: the URLs displayed on the Baidu results page are found nowhere in the HTML. Baidu links to the sites displayed on the search results page via its own redirector service, so in order to get the full final URL we have to follow these redirects. In this post we are going to walk through how to scrape the Baidu search results page.

Imports & Class Definition

In order to scrape Baidu, we only need to import two libraries outside of the standard library. Bs4 helps us parse HTML, while requests provides us with a nicer interface for making HTTP requests with Python.

As we are going to scrape multiple pages of Baidu in this tutorial, we are going to initialise a class to hold onto the important information for us.

We initialise a new instance of the BaiduBot class with a search term and the number of pages to scrape. We also give ourselves the ability to pass a number of keyword arguments to our class. This allows us to pass a proxy, a custom connection timeout, a custom user agent and an optional delay between each of the results pages we want to scrape. The keyword arguments can be a lot of help if we end up being blocked by Baidu. When initialising the class we also store our base URL, which we use when scraping the subsequent pages.
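A sketch of that constructor might look like the following. The keyword argument names and the exact base URL format are assumptions for illustration:

```python
class BaiduBot:
    """Minimal sketch of the scraper's constructor; names are illustrative."""

    def __init__(self, search_term, pages, proxy=None, timeout=10,
                 user_agent=None, delay=0):
        self.search_term = search_term
        self.pages = pages            # number of results pages to scrape
        self.proxy = proxy            # e.g. {"http": "http://127.0.0.1:8080"}
        self.timeout = timeout        # connection timeout in seconds
        self.user_agent = user_agent  # custom User-Agent header
        self.delay = delay            # seconds to sleep between results pages
        # Base URL used when formatting each subsequent results page.
        self.base_url = "https://www.baidu.com/s?wd={}&pn={}"
```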

Making Requests & Parsing HTML

We first define a function to scrape a page of Baidu, here we simply try to make a request and check that the response has a 200 Status. Should Baidu start serving us with non-200 status codes, this likely means that they have detected unusual behaviour from our IP and we should probably back off for a while. If there is no issue with the request, we simply return the response object.

Now that we have a way to make HTTP requests, we need to write a method for parsing the results page. Our parser is going to take in the HTML and return us a list of dictionary objects. Each result is handily contained within a ‘div’ called ‘c-container’, which makes it very easy for us to pick out each result. We can then iterate across all of our returned results, using relatively simple BeautifulSoup selectors, before appending each result to our results list.
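A sketch of that parsing logic is below. The ‘c-container’ class comes from the post; the specific fields extracted (title from the result's h3, URL from its first link) are assumptions, and Baidu's markup may of course change over time:

```python
from bs4 import BeautifulSoup

def parse_results(html):
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Each organic result sits in a div with the class "c-container".
    for container in soup.find_all("div", class_="c-container"):
        title_tag = container.find("h3")
        link_tag = container.find("a")
        results.append({
            "title": title_tag.get_text(strip=True) if title_tag else "",
            "url": link_tag.get("href") if link_tag else "",
        })
    return results
```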

Getting the Underlying URL

As previously mentioned the full underlying URL is not displayed anywhere in Baidu’s search results. This means we must write a couple of functions to extract the full underlying URL. There may be another way to get this URL, but I’m not aware of it. If you know how, please share the method with me in the comments.

Our resolve_urls function is very similar to our Baidu request function. Instead of a response object, we return the final URL by simply following the chain of redirects. Should we encounter any sort of error, we simply return the original URL as found within the search results. This issue is relatively rare, so it shouldn’t impact our data too much.

We then write another function that allows us to use our resolve_urls function over a set of results, updating the URL within our dictionary with the real underlying URL and the rank of the URL in question.
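That update step can be sketched as follows. The resolver is passed in as a callable here purely so the example is self-contained; in the real scraper it would be the resolve_urls function following Baidu's redirects over the network:

```python
def resolve_result_urls(results, resolver):
    # `resolver` is any callable mapping a redirect URL to its final URL.
    # Each result dict gets its real URL plus its 1-based rank on the page.
    for rank, result in enumerate(results, start=1):
        result["url"] = resolver(result["url"])
        result["rank"] = rank
    return results
```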

Bringing It All Together

We bring this all together in our scrape_baidu function. We range over our page count variable. For each loop we run through, we multiply our loop variable by 10 to get the correct pn variable. The pn variable represents the result index, so our logic ensures we start at 0 and continue on in 10 result increments. We then format our URL using both our search term and this variable. We then simply make the request and parse the page using the functions we have already written, before appending the results to our final results variable. Should we have passed a delay argument, we will also sleep for a while before scraping the next page. This will help us avoid getting banned should we want to scrape multiple pages and search terms.
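The pn arithmetic described above can be sketched in isolation like this (the function name and base URL format are illustrative):

```python
def results_page_url(base_url, search_term, page):
    # Baidu's pn parameter is the index of the first result on the page,
    # so page 0 -> pn=0, page 1 -> pn=10, and so on in 10-result increments.
    return base_url.format(search_term, page * 10)
```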

Full Code