Quick Tour of The Concurrent Futures Library

Threading in Python often gets a bad rap; however, the situation has improved considerably since the concurrent.futures library was introduced in Python 3.2. Python threads will only give you a performance boost in certain circumstances, and it is generally only recommended to use threads for IO-bound tasks. The reasons for this are too complicated to go into here, but they relate to the workings of Python's GIL (Global Interpreter Lock). Those looking to improve the performance of CPU-heavy tasks need to make use of multiprocessing instead. Helpfully, the concurrent.futures library provides the same interface for working with both threads and processes. The code in this post focuses on using the library with threads, but many of the same patterns can be applied to code making use of a process pool.

Thread Pools

A core part of the concurrent.futures library is the thread pool executor. Thread pools make it much easier to manage a bunch of threads. We simply create an instance of a thread pool and set the number of worker threads we want to use. We can then submit jobs to be run by the thread pool. This first example just shows how to submit a job to the pool; at this point we have no way to work with the results.
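The original snippet isn't reproduced here, but a minimal sketch of submitting a job might look like the following, with a hypothetical get_page function standing in for an IO-bound task:

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def get_page(url):
    # Stand-in IO-bound task: fetch a page and return its HTML
    return requests.get(url, timeout=10).text


# Create a pool of five worker threads and submit a single job to it
with ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.submit(get_page, 'http://edmundmartin.com')
```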

Futures – As Completed

We are going to begin by looking at the as-completed pattern. This allows us to submit a bunch of different tasks to our thread pool and retrieve the results as the tasks complete. This construct can be very handy if we want to fire off a bunch of blocking IO tasks and then process the results once they have finished. The example uses the get_page function as before, which is omitted for brevity.

Here we simply submit our tasks to the thread pool executor and then wait for them all to be completed. We can also set an optional timeout, which will cause our tasks to time out if they take too long.
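A hedged sketch of the as-completed pattern, reusing the get_page function from above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ['http://edmundmartin.com', 'https://www.python.org', 'https://www.google.com']

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(get_page, url) for url in urls]
    # as_completed yields each future as soon as it finishes; the optional
    # timeout raises TimeoutError if the whole batch takes too long
    for future in as_completed(futures, timeout=30):
        print(len(future.result()))
```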

Futures – Mapping

We can also use the thread pool executor's map method to distribute a group of tasks across the threads in our pool. The first argument to the call is the function in question, followed by an iterable of arguments to be passed into that function. Rather than futures, map returns an iterator of results, so getting at the results is relatively easy: we can simply call list over the returned iterator. This gives us a list of results which we can work with just as if they had been returned from a normal function.
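A minimal sketch of the map pattern, again assuming the get_page function and urls list from above:

```python
with ThreadPoolExecutor(max_workers=5) as executor:
    # map returns an iterator of results in the same order as the input URLs
    results = list(executor.map(get_page, urls))

for html in results:
    print(len(html))
```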

Futures – Callback

Callbacks provide Python users with one of the most powerful methods for working with thread pools. With a callback, we can submit a task to our thread pool and then call add_done_callback on the future object returned by our submission. The slightly tricky part is that the callback takes a single argument: the completed future itself, not its result. We can then perform various checks on that future, such as whether an exception was thrown or whether the future was cancelled before it was able to complete its task, before finally handling and processing the result. This allows for some more complicated concurrent programming patterns, with callbacks feeding a queue of additional jobs to be processed.
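A sketch of the callback pattern; the handle_result name is my own:

```python
def handle_result(future):
    # The callback receives the completed future itself, not its result
    if future.cancelled():
        print('Task was cancelled')
    elif future.exception():
        print('Task raised an exception:', future.exception())
    else:
        print('Fetched', len(future.result()), 'characters')


with ThreadPoolExecutor(max_workers=5) as executor:
    future = executor.submit(get_page, 'http://edmundmartin.com')
    future.add_done_callback(handle_result)
```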

Top Python Books

As a self-taught software engineer, books played an important role in my move from digital marketing into software engineering. While many people these days seem to prefer learning via video courses, I have always found books the better tool.

For Beginners

Automate The Boring Stuff with Python

This well-known book, written by the affable Al Sweigart, is probably one of the better books for absolute beginners. The book introduces basic programming concepts such as functions and types while allowing readers to write short programs that do something useful. I think this approach is very appropriate for absolute beginners, who need to see the power of what they can do with code. The only weakness of the book is that subjects such as classes are completely passed over. I can nevertheless highly recommend the book to anyone who is interested in picking up Python and wants to get up to speed relatively quickly. The book can also be read online for free.

Think Python

Think Python started out as a Python translation of a Java textbook aimed at new Computer Science students. The author of the original textbook then went on to update the book with Python as the core language. The book introduces users to the Python language and aims to get them thinking like computer scientists. This book is a great choice for those who are more serious about Python and want to pursue programming as a career. However, it is likely less instantly gratifying than the ‘Automate The Boring Stuff’ book. Think Python can also be found online for free, which makes it another great choice for beginners.

Grokking Algorithms

Grokking Algorithms is technically not a Python book, but all of the examples contained in the book are written in Python. The book provides readers with a good introduction to algorithmic thinking and several core algorithms that are widely used in computer science. This book should definitely not be your first programming book but should you have made progress with both ‘Think Python’ and ‘Automate The Boring Stuff’ this book could certainly be worth reading and is particularly useful if you are looking to interview for junior level roles.

 

For More Advanced Readers

Fluent Python

Fluent Python is considered a classic among intermediate to advanced Python programmers, taking the reader on a comprehensive tour of Python 3. The book is very well written, and those reading it are likely to come away with a better understanding of the language as well as some new perspectives on more familiar topics. This is certainly a book for more advanced readers, and it took me a while to fully appreciate it. I would thoroughly recommend the book to upper-intermediate Python users and don't have any real complaints about the book's content. The only minor gripe is that the asyncio examples are already outdated, using the old Python 3.4 coroutine syntax which is likely to be deprecated in future versions of Python.

Effective Python

This relatively short but valuable book by Brett Slatkin is effectively a tour of Python best practices. The book provides readers with 59 specific ways to write better Python, broken into chapters that gather related points together in an understandable manner. The breadth covered by the book is impressive, with chapters on everything from Pythonic thinking to concurrency and parallelism. The topics get more advanced the further you progress, meaning that there is something for everyone. Another great feature of this book is the ability to jump into a specific section should you be looking to refresh your memory on a particular subject. Every so often I take a dive back into Effective Python when I want a refresher on a particular topic.

Python Cookbook

The Python Cookbook is not a textbook, nor does it focus on a single subject area. It is instead focused on showing readers how they can deal with a range of different programming problems in a Pythonic way. The book is broken into a number of sections dealing with a diverse range of problems, and each problem comes with one or more solutions plus a discussion of the solution. Reading through the book is enlightening, and you are likely to learn a significant amount from it. The structure of the book also makes it easy to dive in and out should you only be interested in solutions to a particular problem.

Python Tricks

The name of this book does it a bit of a disservice. It is a great resource for intermediate-level Python developers looking to push their skills to the next level, and those who are relatively new to the language will definitely pick up some useful information on how to use Python more smartly. The book was only released relatively recently but has already picked up a huge number of glowing reviews.

Using Yandex’s CatBoost

Many people won't have heard of Yandex, but the company is a major player in the search space in Russia and the former Soviet Union. Yandex has launched several open source projects, one of the most interesting being CatBoost.

CatBoost is a machine learning library from Yandex which is particularly targeted at classification tasks that deal with categorical data. Many datasets contain lots of information which is categorical in nature, and CatBoost allows you to build models without having to encode this data into one-hot arrays and the like. The library can also be used alongside other machine learning libraries such as Keras and TensorFlow. I am going to focus on how the library can be used to build models for classifying categorical data.
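As a rough illustration of the core API, a minimal classifier over the bundled Titanic dataset might look like this (the dtype-based guess at categorical columns is my own convention):

```python
from catboost import CatBoostClassifier
from catboost.datasets import titanic

train_df, _ = titanic()
train_df = train_df.fillna(-999)
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']

# Columns with non-numeric dtypes are passed to CatBoost as categorical
# features, no one-hot encoding required
cat_features = [i for i, dtype in enumerate(X.dtypes) if dtype == object]

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(X, y, cat_features=cat_features)
```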

I highly recommend you watch the talk from one of the creators of the library, where she goes into greater detail about the library and how it can be used in a variety of different contexts.

Building A Generic Model

The example in this post is going to use one of the demo datasets included with the CatBoost library: the Titanic dataset, which contains information about passengers on the Titanic and allows us to predict whether someone would survive based on a number of different features. While the example code uses the demo dataset, it should be generic enough to swap in your own dataset with only minor modifications.

We begin by initializing our CatTrainer class. We simply pass in the Pandas DataFrame we are interested in using to train our model. We also initialize several other variables which for the time being we set to None; these will be used later in our code when preparing and training our model.
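The post's CatTrainer code is not shown here; a sketch of how the constructor might look (attribute names are assumptions):

```python
import catboost
from sklearn.model_selection import train_test_split


class CatTrainer:

    def __init__(self, df):
        # DataFrame used to train the model
        self.df = df
        # Populated later, when preparing the data and building the model
        self.model = None
        self.x = None
        self.y = None
        self.categorical_indexes = None
```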

Next, our protected replace-null-values method is a simple helper that replaces any null values with the value -999. This value can be overridden should the user have a more appropriate default in mind.

We then write our preparation method, which prepares our X and y values. We pass in our label column and, optionally, a default null value. This then creates our X and y values without much overhead on our part.
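Continuing the CatTrainer sketch, the null-replacement helper and preparation method might look like this:

```python
    def _replace_null_values(self, default_value=-999):
        # CatBoost cannot cope with NaNs in categorical columns,
        # so we substitute a sentinel value
        self.df = self.df.fillna(default_value)

    def prepare_data(self, label, default_null_value=-999):
        self._replace_null_values(default_null_value)
        # The label column becomes y; everything else becomes X
        self.y = self.df[label]
        self.x = self.df.drop(label, axis=1)
```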

We then come to the task of creating our model. For this we are going to write two functions. The first is a simple function which either falls back on some sane defaults or overrides those values with the user's input, should the user want to control specific aspects of the model. We then simply use these values to create a model and assign it to the self.model variable.

We can then write the function that trains our model. Again we pass in several arguments with relatively sane defaults. We split our X and y values into training and test data. If the user has not specified which columns hold our categorical data, we automatically try to determine this by checking the type of each column. Should we have not already created or loaded an external model, we then call the create_model function. Finally, we call the fit method using all the relevant information.
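A hedged sketch of the model-creation and training methods; the default hyperparameters here are illustrative, not the post's exact values:

```python
    def create_model(self, iterations=500, learning_rate=0.03, depth=6,
                     loss_function='Logloss'):
        # Sane defaults which the caller can override
        self.model = catboost.CatBoostClassifier(
            iterations=iterations, learning_rate=learning_rate,
            depth=depth, loss_function=loss_function)

    def train_model(self, test_size=0.2, categorical_indexes=None):
        x_train, x_test, y_train, y_test = train_test_split(
            self.x, self.y, test_size=test_size)
        if categorical_indexes is None:
            # Guess the categorical columns by looking at each column's dtype
            categorical_indexes = [i for i, dtype in enumerate(self.x.dtypes)
                                   if dtype == object]
        self.categorical_indexes = categorical_indexes
        if self.model is None:
            self.create_model()
        self.model.fit(x_train, y_train, cat_features=categorical_indexes,
                       eval_set=(x_test, y_test), verbose=False)
```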

We also write a quick cross-validation function which allows us to verify how accurate our trained model actually is. This allows us to quickly benchmark the performance of the model we wish to train.
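One way such a cross-validation helper could be written, using CatBoost's built-in cv function:

```python
    def cross_validate(self, folds=5):
        pool = catboost.Pool(self.x, self.y,
                             cat_features=self.categorical_indexes or [])
        params = self.model.get_params() if self.model else {'loss_function': 'Logloss'}
        # Returns a DataFrame of per-iteration metrics averaged across folds
        return catboost.cv(pool, params, fold_count=folds)
```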

Two additional functions allow for the saving and loading of models; these simply wrap functions contained in the CatBoost library.

Finally, we have a simple method that allows us to predict the labels of a passed-in DataFrame. We replace the missing values in the same way as before, then call the underlying predict function with the passed DataFrame and return the predictions.
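The save, load and predict methods might be sketched as follows:

```python
    def save_model(self, path='catboost.model'):
        self.model.save_model(path)

    def load_model(self, path='catboost.model'):
        self.model = catboost.CatBoostClassifier()
        self.model.load_model(path)

    def predict(self, df, default_null_value=-999):
        # Missing values are replaced in the same way as the training data
        return self.model.predict(df.fillna(default_null_value))
```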

Using the code

Below is an example of how this class can be used with the included Titanic dataset. With minor changes it should be possible to use the class with other datasets. This shows just how easy it is to produce powerful models with relatively little code with the help of CatBoost. You can find the full code on GitHub, and feel free to ask any questions below in the comments.
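A hedged usage sketch, built on the hypothetical CatTrainer class above:

```python
from catboost.datasets import titanic

train_df, test_df = titanic()

trainer = CatTrainer(train_df)
trainer.prepare_data(label='Survived')
trainer.train_model()
print(trainer.cross_validate().head())

# Predict survival for the held-out passengers
predictions = trainer.predict(test_df)
```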

Basic Introduction to Cython

Python is often criticized for being slow. In many cases pure Python is fast enough, but there are certain cases where it may not give you the performance you need. In recent years a fair number of Python programmers have made the jump to Golang for performance reasons. However, there are a number of ways you can improve the performance of your Python code, such as using PyPy, writing C extensions or trying your hand at Cython.

What is Cython?

Cython is a superset of the Python language. This means that the vast majority of Python code is also valid Cython code. Cython allows users to write Cython modules which are then compiled and can be used within Python code. This means that users can port performance-critical code into Cython and instantly see increases in performance. The great thing about Cython is that you can decide how far to optimize your code. Simply copying and compiling your Python code might see you make performance gains of 8-12%, whereas more serious optimization can lead to significantly better performance.

Installing Cython

Installing Cython on Linux is very easy to do and just requires the 'pip install cython' command. Those on Windows will likely have a tougher time, with the simplest solution seeming to be installing Visual Studio Community and selecting both the C++ and Python support options. You can then install Cython like you would any other Python package.

Using Pure Python

We are going to begin by compiling a pure Python function. This is a very simple task and can achieve some limited performance benefits, with a more noticeable increase for functions which make use of for and while loops. To begin, we simply save the below code into a file called 'looping.pyx'.
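The original looping.pyx snippet is not reproduced here; based on the description below, it would be something along these lines:

```python
# looping.pyx - plain Python that Cython can compile unchanged
def multiply_by_index(numbers):
    total = 0
    for index, number in enumerate(numbers):
        total += index * number
    return total
```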

This very simple function takes a list of numbers, multiplies each number by its index and returns the sum of the results. This code is both valid Python and valid Cython. However, it takes no advantage of any Cython optimizations other than the compilation of the code into C.

We run the below command to create Cython module which can be used in Python:
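One common way to do this is with a small setup.py driven by Cython's cythonize helper:

```python
# setup.py - build the extension with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize('looping.pyx'))
```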

We can then import our Cython module into Python in the following manner:
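For example:

```python
from looping import multiply_by_index

print(multiply_by_index(list(range(10000))))
```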

What kind of performance benefits can we expect from just compiling this Python code into a Cython module?

I ran some tests and on average the Cython compiled version of the code took around 10% less time to run over a set of 10,000 numbers.

Adding Types

Cython achieves its optimizations by introducing typing to Python code. Cython supports both a range of Python and C types. Python types tend to be more flexible but give you less in terms of performance. The below example makes use of both C and Python types; however, we have to be careful when using C types. For instance, we could hit an overflow error should the list of numbers we pass in be large enough that the result of the multiplication is too large to store in a C long.
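A sketch of such a hybrid version (names and exact types are my own choices):

```cython
# looping_typed.pyx
def multiply_by_index(list numbers):
    # C types for the loop counter, the list length and the running total
    cdef int i
    cdef int length = len(numbers)
    cdef long total = 0
    for i in range(length):
        total += i * numbers[i]
    return total
```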

As you can see, we use the Python type 'list' to annotate our input list. We then define two C types which will be used to store the length of our list and our output. We then loop over our list in exactly the same way as we did in our previous example. This shows just how easy it is to start adding C types to Python code, and it also illustrates how easy it is to mix both C and Python types together in one extension module.

When tested, this hybrid code was between 15-30% faster than the pure Python implementation, without taking the most aggressive path of optimization and turning everything into a C type. While these savings may seem small, they can really add up for operations which are repeated hundreds of thousands of times.

Cython Function Types

Unlike standard Python, Cython has three types of functions. These functions differ in how they are defined and where they can be used.

  • Cdef functions – can only be used in Cython code and cannot be imported into Python.
  • Cpdef functions – can be used and imported in both Python and Cython. If used in Cython they behave as a cdef function, and if used in Python they behave like a standard Python function.
  • Def functions – are like your standard Python functions and can be used and imported into Python code.

The below code block demonstrates how each of these three function types can be defined.
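A minimal illustration of the three flavours might look like this:

```cython
# Only callable from other Cython code
cdef int c_add(int a, int b):
    return a + b

# Callable from both Cython and Python
cpdef int cp_add(int a, int b):
    return a + b

# A normal Python function that happens to be defined in a Cython module
def py_add(a, b):
    return a + b
```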

This allows you to define highly performant Cdef functions for use within Cython modules, while at the same time allowing you to write functions that are totally compatible with Python.  Cpdef functions are a good middle ground, in the sense that when they are used in Cython code they are highly optimized while remaining compatible with Python, should you want to import them into a Python module.

Conclusion

While this introduction only touches the surface of the Cython language, it should be enough to begin optimizing code using Cython. However, some of the more aggressive optimizations and the full power of C types are well beyond the scope of this post.

Detecting Selenium

When looking to extract information from harder-to-scrape sites, many programmers turn to browser automation tools such as Selenium and iMacros. At the time of writing, Selenium is by far the most popular option for those looking to leverage browser automation for information retrieval. However, Selenium is very detectable, and site owners could block a large percentage of all Selenium users.

Selenium Detection with Chrome

When using Chrome, the Selenium driver injects a webdriver property into the browser's navigator object. This means it's possible to write a couple of lines of JavaScript to detect that the user is using Selenium. A short snippet can simply check whether navigator.webdriver is set to true and redirect the user should this be the case. I have never seen this technique used in the wild, but I can confirm that it seems to successfully redirect those using Chrome with Selenium.

Selenium Detection with Firefox

Older versions of Firefox used to inject a webdriver attribute into the HTML document, which meant they could be detected with a similarly simple check. At the time of writing, Firefox no longer adds this attribute to pages when using Selenium.

Additional methods of detecting Selenium when using Firefox have also been suggested. Testing seems to suggest that these do not work with the latest builds of Firefox. However, the webdriver standard suggests that this may eventually be implemented in Firefox again.

Selenium Detection with PhantomJS

All current versions of PhantomJS add attributes to the window object. This allows site owners to simply check whether these PhantomJS-specific attributes are set and redirect the user away when it turns out they are using PhantomJS. It should also be noted that support for the PhantomJS project has been rather inconsistent, and the project makes use of an outdated WebKit version which is also detectable and could present a security risk.

Avoiding Detection

Your best bet for avoiding detection when using Selenium is to use one of the latest builds of Firefox, which don't appear to give off any obvious sign that you are automating the browser. Additionally, it may be worth experimenting with both Safari and Opera, which are much less commonly used by those scraping the web. It also seems likely that Firefox gives off some less obvious footprint which would need further investigation to discover.

Scraping & Health Monitoring free proxies with Python

When web scraping, you often need to source a number of proxies in order to avoid being banned or to get around rate limiting imposed by the website in question. This often sees developers purchasing proxies from a commercial provider, which can become quite costly if you only need the proxies for a short period of time. So in this post we are going to look at how you might use proxies from freely available proxy lists to scrape the internet.

Problems With Free Proxies

  • Free Proxies Die Very Quickly
  • Free Proxies Get Blocked By Popular Sites
  • Free Proxies Frequently Timeout

While free proxies are great in the sense that they are free, they tend to be highly unreliable. Their up-time is inconsistent and they get blocked quickly by popular sites such as Google. Our solution is therefore going to build in some monitoring of each proxy's current status, allowing us to avoid using proxies which are currently broken.

Scraping Proxies

We are going to use free-proxy-list.net as our source for this example, but the code could easily be expanded to cover multiple sources of proxies. We simply write a method which visits the page and pulls out all the proxies it lists, using our chosen user agent. We then store the results in a dictionary, with each proxy acting as a key holding the information relating to that particular proxy. We are not doing any error handling here; this will be handled in our ProxyManager class.
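The scraping code is not shown here; a sketch of such a function might look like the following. The column layout of free-proxy-list.net is an assumption and may well have changed:

```python
import requests
from bs4 import BeautifulSoup


def scrape_proxies(user_agent):
    response = requests.get('https://free-proxy-list.net/',
                            headers={'User-Agent': user_agent}, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    proxies = {}
    for row in soup.select('table tbody tr'):
        cells = [td.text.strip() for td in row.find_all('td')]
        if len(cells) < 8:
            continue
        ip, port, _code, country, anonymity, _google, https, _checked = cells[:8]
        proxies['{}:{}'.format(ip, port)] = {
            'country': country,
            'anonymity': anonymity,
            'https': https,
            'alive': False,
            'last_checked': None,
        }
    return proxies
```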

Proxy Manager

Our ProxyManager is a simple class which allows us to get and manage the proxies we find on free-proxy-list.net. We pass in a test URL, which will be used to check whether a proxy is working, and a user agent to be used for both scraping and testing the proxies in question. We also create a thread pool so we can check the status of our proxies more quickly. We then call update_proxy_list, which loads the proxies found on free-proxy-list.net into our dictionary of proxies.
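A sketch of how the ProxyManager might be initialized, reusing the scrape_proxies helper above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests


class ProxyManager:

    def __init__(self, test_url, user_agent, max_workers=20):
        self.test_url = test_url
        self.user_agent = user_agent
        # Thread pool used to check many proxies at once
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.proxies = {}
        self.update_proxy_list()

    def update_proxy_list(self):
        self.proxies.update(scrape_proxies(self.user_agent))
```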

Checking Proxies

We can now write a couple of methods to test whether a particular proxy works. The first method takes the proxy and the dictionary of information related to it. We immediately set the last-checked variable to the current time. We then make a request against our test URL with a relatively short timeout, and check the status of the request, raising an exception should we receive a non-200 status code. Should anything go wrong, we set the status of the proxy to dead; otherwise we set the status to alive.

We then write our refresh-proxy-status method, which simply calls our check-proxy method for each proxy. We iterate over our dictionary, submitting each proxy and its related info to a thread; if we didn't use threads to check the status of our proxies, we could be waiting a very long time for our results. We then loop through the results and update the status of each proxy in question.
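Continuing the ProxyManager sketch, the checking methods might look like this:

```python
    def check_proxy(self, proxy, info):
        info['last_checked'] = time.time()
        proxy_url = 'http://' + proxy
        try:
            response = requests.get(
                self.test_url,
                headers={'User-Agent': self.user_agent},
                proxies={'http': proxy_url, 'https': proxy_url},
                timeout=5)
            # Treat any non-200 response as a dead proxy
            response.raise_for_status()
            info['alive'] = True
        except Exception:
            info['alive'] = False

    def refresh_proxy_status(self):
        # Check every proxy concurrently; doing this serially would take minutes
        futures = [self.pool.submit(self.check_proxy, proxy, info)
                   for proxy, info in self.proxies.items()]
        for future in futures:
            future.result()
```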

Getting A Proxy

We then write two methods for getting ourselves a proxy. Our first method allows us to get a list of proxies by passing in a relevant key and value. This lets us get a list of proxies that relate to a particular country or boast a particular level of anonymity, which can be useful should we be interested in particular properties of a proxy.

We also have a simple method that returns a single working proxy. This returns the first working proxy found within our proxy dictionary by looping over all the items in the dictionary and returning the first proxy where 'alive' is equal to True.
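The two getter methods could be sketched as:

```python
    def get_proxies(self, key, value):
        # e.g. get_proxies('country', 'Germany') or get_proxies('anonymity', 'elite proxy')
        return [proxy for proxy, info in self.proxies.items()
                if info.get(key) == value]

    def get_proxy(self):
        # Return the first proxy currently marked as alive, or None
        for proxy, info in self.proxies.items():
            if info.get('alive'):
                return proxy
        return None
```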

Example Usage

Using the library is pretty simple. We just create the class, passing in our test URL (Google.com here) and our selected user agent. We then call refresh_proxy_status, updating the status of the scraped proxies by running them against our test URL. We can then pull out an individual working proxy. Should we not be satisfied with the proxies we currently have access to, we can update our proxy list with a fresh scrape of our source.
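A hedged usage sketch of the class described above:

```python
manager = ProxyManager('https://www.google.com',
                       'Mozilla/5.0 (X11; Linux x86_64)')
manager.refresh_proxy_status()

print('Working proxy:', manager.get_proxy())

# Scrape a fresh batch should we not be happy with the current set
manager.update_proxy_list()
manager.refresh_proxy_status()
```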

Full Code

Scraping Instagram with Python

In today's post we are going to look at how you can extract information from a user's Instagram profile. It's surprisingly easy to extract profile information, such as the number of followers a user has, as well as information and image files for a user's most recent posts. With a bit of effort it would be relatively easy to extract large chunks of data regarding a user. This could then be applied at a very broad scale to extract a large chunk of all the public posts featured on Instagram's site.

Imports & Setup

We begin by making our imports and writing the dunder init method for our class. Our code requires two packages not included in the standard library: requests for making HTTP requests and BeautifulSoup to make HTML parsing more user friendly. If you do not already have these libraries installed, you can grab them with pip (pip install requests beautifulsoup4).

The init method of our class takes two optional keyword arguments, which we simply store on self. This allows us to override the default user agent list and use a proxy should we wish to avoid detection.

We then write two helper methods. First, we write a very simple method that returns a random user agent. Switching user agents is often a best practice when web scraping and can help you avoid detection. Should the caller of our class have provided their own list of user agents, we take a random agent from the provided list; otherwise we return our default user agent.

Our second helper method is simply a wrapper around requests. We pass in a URL and try to make a request using the provided user agent and proxy. If we are unable to make the request, or Instagram responds with a non-200 status code, we simply re-raise the error. If everything goes fine, we return the HTML of the page in question.
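A sketch of the constructor and the two helpers; the default user agent string is a placeholder:

```python
import json
import random

import requests
from bs4 import BeautifulSoup

DEFAULT_USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/91.0 Safari/537.36')


class InstagramScraper:

    def __init__(self, user_agents=None, proxy=None):
        self.user_agents = user_agents
        self.proxy = proxy

    def _random_agent(self):
        # Prefer a caller-supplied list of user agents, fall back to the default
        if self.user_agents:
            return random.choice(self.user_agents)
        return DEFAULT_USER_AGENT

    def _request_url(self, url):
        try:
            response = requests.get(
                url,
                headers={'User-Agent': self._random_agent()},
                proxies={'http': self.proxy, 'https': self.proxy})
            response.raise_for_status()
        except requests.RequestException:
            raise
        return response.text
```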

Extracting JSON from JavaScript

Instagram serves all of the information regarding a user in the form of a JavaScript object. This means that we can extract all of a user's profile information and their recent posts by just making an HTTP request to their profile page. We simply need to turn this JavaScript object into JSON, which is very easy to do.

We can write a very hacky but effective method to extract JSON from a user's profile. We apply the static method decorator to this function, as it's possible to use this method without initializing our class. We simply create a soup from the HTML, select the body of the content and then pull out the first 'script' tag. We can then do a couple of text replacements on the script tag's contents to derive a string which can be loaded into a dictionary object using the json.loads method.
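A sketch of that extraction method; the window._sharedData layout reflects how Instagram served profile pages at the time and is likely to change:

```python
    @staticmethod
    def extract_json(html):
        soup = BeautifulSoup(html, 'html.parser')
        # The first script tag in the body held the shared data object
        script = soup.find('body').find('script').text
        raw_json = script.replace('window._sharedData = ', '').rstrip().rstrip(';')
        return json.loads(raw_json)
```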

Bringing it all together

We then bring it all together in two functions which we can use to extract information from this very large JSON object. We first make a request to the page before extracting the JSON result. We then use two different selectors to pull out the relevant bits of information, as the default JSON object contains lots of information we don't really need.

When extracting profile information, we extract all attributes from the "user" object, excluding their recent posts. In the recent-posts function, we use a slightly different selector and pull out all the information about the recent posts made by our targeted user.
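A sketch of the two functions; the exact JSON paths and field names are assumptions based on the old _sharedData layout:

```python
    def profile_info(self, username):
        html = self._request_url('https://www.instagram.com/{}/'.format(username))
        data = self.extract_json(html)
        user = data['entry_data']['ProfilePage'][0]['graphql']['user']
        # Everything about the profile except the recent posts themselves
        return {key: value for key, value in user.items()
                if key != 'edge_owner_to_timeline_media'}

    def recent_posts(self, username):
        html = self._request_url('https://www.instagram.com/{}/'.format(username))
        data = self.extract_json(html)
        user = data['entry_data']['ProfilePage'][0]['graphql']['user']
        return user['edge_owner_to_timeline_media']['edges']
```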

Example Usage

We can then use the Instagram scraper in a very simple fashion to pull out the most recent posts from our favorite users. You could do lots of things with the resulting data, which could be used in an Instagram analytics app, or you could simply programmatically download all the images relating to that user.
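A usage sketch, with field names again assumed from the old response format:

```python
scraper = InstagramScraper()

profile = scraper.profile_info('instagram')
print(profile.get('edge_followed_by', {}).get('count'), 'followers')

for post in scraper.recent_posts('instagram'):
    node = post['node']
    print(node.get('display_url'))
```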

There is certainly room for improvement and modification. It would also be possible to use Instagram's graph API to pull out further posts from a particular user or lists of a user's recent followers, allowing you to collect large amounts of data without having to deal with Facebook's restrictive API limitations and policies.

Full Code

Writing a web crawler in Python 3.5+ using asyncio

The asyncio library was introduced to Python in version 3.4. However, the async/await syntax was not introduced into the language until Python 3.5. This functionality allows us to write asynchronous web crawlers without having to use threads. Getting used to asynchronous programming can take a while, so in this tutorial we are going to build a fully functional web crawler using asyncio and aiohttp.

Fan In & Fan Out Concurrency Pattern


We are going to write a web crawler which will continue to crawl a particular site until we reach a defined maximum depth. We are going to make use of a fan-in/fan-out concurrency pattern. Essentially, this involves gathering together a set of tasks and then distributing them across a bunch of threads, or across coroutines in our case. We then gather all the results together again before processing them and fanning out a new group of tasks. I would highly recommend Brett Slatkin's 2014 talk, which inspired this particular post.

Initializing Our Crawler

We begin by importing the libraries required for our asyncio crawler. We are using a couple of libraries which are not included in Python's standard library, aiohttp and lxml, which can be installed with the usual pip command (pip install aiohttp lxml).

We can then start defining our class. Our crawler takes two positional arguments and one optional keyword argument. We pass in the start URL, which is the URL we begin our crawl with, and we also set the maximum depth of the crawl. We additionally pass in a maximum concurrency level, which by default prevents our crawler from making more than 200 concurrent requests at any one time.

The start URL is then parsed to give us the base URL for the site in question. We also create a set of URLs which we have already seen, to ensure that we don't end up crawling the same URL more than once. We create a session using aiohttp.ClientSession so that we can skip having to create a session every time we scrape a URL; doing this will throw a warning, but the creation of a client session is synchronous, so it can be safely done outside of a coroutine. We also set up an asyncio BoundedSemaphore using our max concurrency variable, which we will use to prevent our crawler from making too many concurrent requests at one time.
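A sketch of how the crawler might be initialized (attribute names are my own):

```python
import asyncio
from urllib.parse import urljoin, urlparse

import aiohttp
from lxml import html as lxml_html


class AsyncCrawler:

    def __init__(self, start_url, max_depth, max_concurrency=200):
        self.start_url = start_url
        self.base_url = '{0.scheme}://{0.netloc}'.format(urlparse(start_url))
        self.max_depth = max_depth
        self.seen_urls = set()
        # Creating the session outside a coroutine triggers a warning but is safe
        self.session = aiohttp.ClientSession()
        self.bounded_semaphore = asyncio.BoundedSemaphore(max_concurrency)
```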

Making An Async HTTP Request

We can then write a function to make an asynchronous HTTP request. Making a single asynchronous request is pretty similar to making a standard HTTP request. As you can see, we write "async" before the function definition. We begin with an async context manager, using the bounded semaphore created when we initialized our class. This will limit asynchronous requests to whatever limit we passed in when creating an instance of the AsyncCrawler class.

We then use another async context manager within a try/except block to make a request to the URL and await the response, before finally returning the HTML.
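Continuing the sketch, the request coroutine might look like this:

```python
    async def _http_request(self, url):
        print('Fetching:', url)
        # The semaphore caps how many requests are in flight at once
        async with self.bounded_semaphore:
            try:
                async with self.session.get(url, timeout=30) as response:
                    return await response.text()
            except Exception as exc:
                print('Request failed for {}: {}'.format(url, exc))
```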

Extracting URLs

We can then write a standard function to extract all the URLs from an HTML response. We create a DOM (Document Object Model) object from our HTML using lxml's html sub-module. Once we have our document model, we are able to query it using either XPath or CSS selectors. Here we use a simple XPath selector to pull out the 'href' attribute of every link found on the page in question.

We can then use urllib.parse's urljoin function with our base URL and the found href. This gives us an absolute URL, automatically resolving any relative URLs we may have found on the page. If we haven't already crawled this URL and it belongs to the site we are crawling, we add it to our list of found URLs.

The extract_async function is a simple wrapper around our HTTP request and find-URLs functions. Should we encounter any error, we simply ignore it; otherwise we use the HTML to create a list of URLs found on that page.
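A sketch of the URL-extraction helper and its async wrapper:

```python
    def find_urls(self, html):
        found_urls = []
        dom = lxml_html.fromstring(html)
        for href in dom.xpath('//a/@href'):
            url = urljoin(self.base_url, href)
            # Only keep URLs on the same site that we have not seen before
            if url not in self.seen_urls and url.startswith(self.base_url):
                found_urls.append(url)
        return found_urls

    async def extract_async(self, url):
        data = await self._http_request(url)
        found_urls = []
        if data:
            found_urls = self.find_urls(data)
        return url, data, found_urls
```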

Fanning In/Out

Our extract_multi_async function is where we fan out. The function takes a list of URLs to be crawled. We begin by creating two empty lists: the first holds the futures which refer to jobs to be done, while the second holds the results of these completed futures. We add a call to our self.extract_async function for each URL passed in; these are futures in the sense that they are tasks which will be completed in the future.

To gather the results from these futures, we use asyncio's as_completed function, which iterates over the futures as they complete and gathers the results into our results list. The function essentially blocks until all of the futures are completed, meaning that we end up returning a list of completed results.
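The fan-out coroutine might be sketched as:

```python
    async def extract_multi_async(self, to_fetch):
        futures, results = [], []
        for url in to_fetch:
            if url in self.seen_urls:
                continue
            self.seen_urls.add(url)
            futures.append(self.extract_async(url))
        # Gather each result as soon as its future completes
        for future in asyncio.as_completed(futures):
            try:
                results.append(await future)
            except Exception:
                pass
        return results
```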

Running Our Crawler

We have a parser function defined here which will by default raise a NotImplementedError. So, in order to use our crawler, we will have to subclass it and write our own parsing function, which we will do in a minute.

Our main function kicks everything off. We start by scraping our start URL and returning a batch of results. We then iterate over these results, pulling out the URL, data and new URLs from each result. We send the HTML off to be parsed before appending the relevant data to our list of results, while adding the new URLs to our to_fetch variable. We keep repeating this process until we have reached our maximum crawl depth, and then return all the results collected during the crawl.
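A sketch of the parser stub and the main crawl loop:

```python
    def parse(self, data):
        # Sub-classes override this to extract whatever they care about
        raise NotImplementedError

    async def crawl_async(self):
        to_fetch = [self.start_url]
        results = []
        for _depth in range(self.max_depth + 1):
            batch = await self.extract_multi_async(to_fetch)
            to_fetch = []
            for url, data, found_urls in batch:
                if data is None:
                    continue
                results.append((url, self.parse(data)))
                to_fetch.extend(found_urls)
        await self.session.close()
        return results
```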

Sub Classing & Running the Crawler

Sub-classing the crawler is very simple, as we are able to write any function we wish to handle the HTML data returned by our crawler. The example parser below simply tries to extract the title from each page found by our crawler.

We can then call the crawler in a similar way to how we would call an individual asyncio function. We first initialize our class, before creating a future with the asyncio.Task function, passing in our crawl_async function. We then need an event loop to run this function in, which we create and run until the function has completed. We then close the loop and grab the results from our future by calling .result() on the completed future.
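A sketch of a title-extracting subclass and the event-loop boilerplate described above:

```python
class TitleCrawler(AsyncCrawler):

    def parse(self, data):
        # Pull the page title out of the raw HTML
        dom = lxml_html.fromstring(data)
        titles = dom.xpath('//title/text()')
        return titles[0].strip() if titles else ''


crawler = TitleCrawler('https://edmundmartin.com', max_depth=2)
future = asyncio.Task(crawler.crawl_async())

loop = asyncio.get_event_loop()
loop.run_until_complete(future)
loop.close()

print(future.result())
```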

Full Code

 

Writing A Web Crawler in Golang

I have previously written a piece looking at how to write a web crawler using Go and the popular framework Colly. However, it is relatively simple to write a fairly powerful web crawler in Golang without the help of any frameworks. In this post, we are going to write a web crawler using just Golang and the goquery package to extract HTML elements. All in all, we can write a fast but relatively basic web crawler in around 130 lines of code.

Defining Our Parser Interface

First, we import all the packages we need from the standard library. We then pull in goquery, which we will use to extract data from the HTML returned by our crawler. If you don’t already have goquery, you will need to go grab it with go get.

We then define our ScrapeResult struct, which contains some very simple data regarding the page. This could easily be expanded to return more useful information or to extract certain valuable information. We then define a Parser interface which allows users of our democrawl package to define their own parser to use with the basic crawling logic.

Making HTTP Requests

We are going to write a function which simply attempts to grab a page by making a GET request. The function takes in a URL and makes a request using the default Googlebot user agent, to hopefully avoid any detection. Should we encounter no issues, we return a pointer to the http.Response. Should something go wrong, we return nil and the error thrown by the GET request.

Extracting Links And Resolving Relative URLs

Our crawl is going to restrict itself to crawling URLs found on the domain of our start URL. To achieve this, we are going to write two functions. Firstly, we are going to write a function which discovers all the links on a page. Then we will need a function to resolve relative URLs (URLs starting with “/”).

Our extract-links function takes in a pointer to a goquery Document and returns a slice of strings. This is relatively easy to do. We simply create a new slice of strings, and provided we have been passed a document, we find each link element and extract its href attribute, which is then added to our slice of URLs.

We then have our resolveRelative function. As the name suggests, this function resolves relative links and returns a slice of all the internal links we found on a page. We simply iterate over our slice of foundUrls: if the URL starts with the site's baseURL we add it straight to our slice, and if the URL begins with "/" we do some string formatting to get the absolute URL in question. Should the URL not belong to the domain we are crawling, we simply skip it.

Crawling A Page

We can then start bringing all of our work together with a function that crawls a single page. This function takes a number of arguments: we pass in our base URL and the URL we want to scrape, as well as the parser we have defined in our main.go file. We also pass in a channel of empty structs, which we use as a semaphore. This allows us to limit the number of requests we make in parallel, as reading from a channel in this manner is blocking.

We make our request, then create a goquery Document from the response. This document is used by both our ParsePage function and our extractLinks function. We then resolve the found URLs before returning them along with the results found by our parser.

Getting Our Base URL

We can pull out our baseURL by using the net/url package's Parse function. This lets us parse the start URL passed into our main Crawl function. After we parse the URL, we join together the scheme and host using basic string formatting.

Crawl Function

Our Crawl function brings together all the other functions we have written and contains quite a lot of its own logic. We begin by creating an empty slice of ScrapeResults. We then create a workList channel which will contain a list of URLs to scrape, and initialize an integer counter to one. We also create a channel of tokens which will be passed into our crawl-page function and limit the total concurrency as defined when we launch the crawler. We then parse our start URL to get our baseDomain, which is used in multiple places within our crawling logic.

Our main for loop is rather complicated, but we essentially create a new goroutine for each item in our work list. This doesn't mean we scrape every page at once, due to the fact that we use our tokens channel as a semaphore. We call our crawlPage function, pulling out the results from our parser and all the internal links found. These found links are then put into our workList, and the process continues until we run out of new links to crawl.

Our main.go file

We can then write a very simple main.go file where we create an instance of our parser and simply call our Crawl function, then watch our crawler go out and collect results. It should be noted that the crawler is very fast and should be used with very low levels of concurrency in most instances. The democrawl repo can be found on my GitHub; feel free to use the code and expand and modify it to fit your needs.

Writing a Web Crawler with Golang and Colly

This blog features multiple posts on building Python web crawlers, but the subject of building a crawler in Golang has never been touched upon. There are a couple of frameworks for building web crawlers in Golang, but today we are going to look at building one using Colly. When I first started playing with the framework, I was shocked at how quick and easy it was to build a highly functional crawler with very few lines of Go code.

In this post we are going to build a crawler, which crawls this site and extracts the URL, title and code snippets from every Python post on the site. To write such a crawler we only need to write a total of 60 lines of code! Colly requires an understanding of CSS Selectors which is beyond the scope of this post, but I recommend you take a look at this cheat sheet.

Setting Up A Crawler

To begin with, we are going to set up our crawler and create the data structure to store our results in. First of all, we need to install Colly using the go get command. Once this is done, we create a new struct which will represent an article and contains all the fields we are going to be collecting with our simple example crawler.

With this done, we can begin writing our main function. To create a new crawler we must call NewCollector, which returns a Collector instance. The NewCollector function takes a list of functions which are used to initialize our crawler. In our case we are only passing one option to NewCollector, which limits our crawler to pages found on "edmundmartin.com".

Having done this, we then place some limits on our crawler. As Golang is very performant and many websites run on relatively slow servers, we probably want to limit the speed of our crawler. Here we are setting up a limiter which matches everything containing "edmundmartin" in the URL. By setting the parallelism to 1 and adding a delay of a second, we ensure that we only crawl one URL per second.

Basic Crawling Logic

To collect data from our target site, we need to create a clone of our Colly collector. We also create a slice of our 'Article' struct to store the results we will be collecting. We then add a callback to our crawler which fires every time we make a new request; this callback just prints the URL which our crawler will be visiting.

We then add an "OnHTML" callback which is fired every time HTML is returned to us. This is attached to our original Colly collector instance and not the clone. Here we pass in a CSS selector which pulls out all of the hrefs on the page. We can also use some logic contained within the Colly framework to resolve the URL in question. If the URL contains 'python', we submit it to our cloned collector, while if 'python' is absent from the URL we simply visit the page in question. This cloning of our collector allows us to define different OnHTML parsers for each clone of the original crawler.

Extracting Details From A Post

We can now add an "OnHTML" callback to our "detailCollector" clone. Again we use a CSS selector to pull out the content of each post contained on the page. From this we extract the text contained within the post's "H1" tag. We then pick out all of the "div" elements containing the class "crayon-main" and iterate over them, pulling out our code snippets. Finally, we add our collected data to our slice of Articles.

All that is left to do is start the crawler by calling our original collector's "Visit" function with our start URL. The example crawler should finish within around 20 seconds. Colly makes it very easy to write powerful crawlers with relatively little code, though it does take a little while to get used to the callback style of programming.

Full Code