Writing A Web Crawler in Golang

I have previously written a piece looking at how to write a web crawler using Go and the popular framework Colly. However, it is quite possible to write a powerful web crawler in Golang without the help of any frameworks. In this post, we are going to write a web crawler using just Golang and the Goquery package to extract HTML elements. All in all, we can write a fast but relatively basic web crawler in around 130 lines of code.

Defining Our Parser Interface

First, we import all the packages we need from the standard library. We then pull in goquery, which we will use to extract data from the HTML returned by our crawler. If you don’t already have goquery, you will need to go grab it with go get.

We then define our ScrapeResult struct, which contains some very simple data regarding the page. This could easily be expanded to return more useful or valuable information. We then define a Parser interface, which allows users of our democrawl package to define their own parser to use with the basic crawling logic.
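
Putting that into code, a minimal sketch of the package might look something like this. The field names on ScrapeResult and the exact shape of the Parser interface are assumptions, and the imports listed here cover all of the snippets that follow:

```go
package democrawl

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"sync"
	"time"

	"github.com/PuerkitoBio/goquery"
)

// ScrapeResult holds some very simple data about a crawled page.
type ScrapeResult struct {
	URL   string
	Title string
	H1    string
}

// Parser lets users of the democrawl package plug their own page parsing
// logic into the shared crawling logic.
type Parser interface {
	ParsePage(*goquery.Document) ScrapeResult
}
```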

Making HTTP Requests

We are going to write a function which attempts to grab a page by making a GET request. The function takes in a URL and makes a request using a Googlebot user agent, in the hope of avoiding detection. Should we encounter no issues, we return a pointer to the http.Response. Should something go wrong, we return nil and the error returned by the GET request.
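
A sketch of that request function, assuming the makeRequest name and this particular Googlebot user agent string:

```go
// makeRequest grabs a page with a GET request, sending a Googlebot
// user agent in the hope of avoiding detection.
func makeRequest(targetURL string) (*http.Response, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	req, err := http.NewRequest("GET", targetURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", "Googlebot/2.1 (+http://www.google.com/bot.html)")
	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	return res, nil
}
```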

Extracting Links And Resolving Relative URLs

Our crawler is going to restrict itself to crawling URLs found on the domain of our start URL. To achieve this, we are going to write two functions. Firstly, we will write a function which discovers all the links on a page. Then we will need a function to resolve relative URLs (URLs starting with “/”).

Our extractLinks function takes in a pointer to a goquery Document and returns a slice of strings. This is relatively easy to do. We create a new slice of strings and, provided the document is not nil, find each link element and extract its href attribute. Each href is then added to our slice of URLs.
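
Something along these lines, continuing the sketch above:

```go
// extractLinks pulls the href attribute from every anchor tag in the
// document and returns them as a slice of strings.
func extractLinks(doc *goquery.Document) []string {
	foundUrls := []string{}
	if doc != nil {
		doc.Find("a").Each(func(i int, s *goquery.Selection) {
			if href, exists := s.Attr("href"); exists {
				foundUrls = append(foundUrls, href)
			}
		})
	}
	return foundUrls
}
```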

We then have our resolveRelative function. As the name suggests, this function resolves relative links and returns a slice of all the internal links we found on a page. We iterate over our slice of foundUrls: if a URL starts with the site’s baseURL we add it straight to our slice, and if it begins with “/”, we do some string formatting to build the absolute URL in question. Should the URL not belong to the domain we are crawling, we simply skip it.
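
A sketch of how that might look:

```go
// resolveRelative turns relative links ("/about") into absolute URLs and
// drops any link that does not belong to the domain we are crawling.
func resolveRelative(baseURL string, foundUrls []string) []string {
	internalUrls := []string{}
	for _, href := range foundUrls {
		if strings.HasPrefix(href, baseURL) {
			internalUrls = append(internalUrls, href)
			continue
		}
		if strings.HasPrefix(href, "/") {
			internalUrls = append(internalUrls, fmt.Sprintf("%s%s", baseURL, href))
		}
	}
	return internalUrls
}
```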

Crawling A Page

We can then start bringing all of our work together with a function that crawls a single page. This function takes a number of arguments: we pass in our base URL, the URL we want to scrape and the parser we have defined in our main.go file. We also pass in a channel of empty structs, which we use as a semaphore. Because operations on this channel block once its buffer is full, it lets us cap the number of requests we make in parallel.

We make our request, then create a goquery Document from the response. This document is used by both our ParsePage function and our extractLinks function. We then resolve the found URLs, before returning them along with the results produced by our parser.
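
A rough sketch of the page-crawling function; the exact signature is an assumption:

```go
// crawlPage fetches a single page and hands the resulting document to the
// supplied parser. The tokens channel acts as a semaphore: the send blocks
// once its buffer is full, capping the number of requests in flight.
func crawlPage(baseURL, targetURL string, parser Parser, tokens chan struct{}) ([]string, ScrapeResult) {
	tokens <- struct{}{}        // acquire a slot
	defer func() { <-tokens }() // release it when we are done

	resp, err := makeRequest(targetURL)
	if err != nil {
		return nil, ScrapeResult{}
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, ScrapeResult{}
	}
	pageResults := parser.ParsePage(doc)
	links := resolveRelative(baseURL, extractLinks(doc))
	return links, pageResults
}
```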

Getting Our Base URL

We can pull out our baseURL by using the net/url package’s Parse function. This allows us to simply pass our start URL into our main Crawl function. After we parse the URL, we join the scheme and host back together using basic string formatting.
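
Something like the following:

```go
// parseStartURL pulls the scheme and host out of the start URL, giving us
// the base URL that relative links are resolved against.
func parseStartURL(u string) string {
	parsed, err := url.Parse(u)
	if err != nil {
		return ""
	}
	return fmt.Sprintf("%s://%s", parsed.Scheme, parsed.Host)
}
```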

Crawl Function

Our Crawl function brings together all the other functions we have written and contains quite a lot of its own logic. We begin by creating an empty slice of ScrapeResults. We then create a workList channel which will carry lists of URLs to scrape, and initialize an integer counter to one to track how many items are still pending. We also create a channel of tokens, which is passed into our crawlPage function and limits the total concurrency to the value defined when we launch the crawler. Finally, we parse our start URL to get our baseDomain, which is used in multiple places within our crawling logic.

Our main for loop is rather complicated, but we essentially create a new goroutine for each item in our work list. This doesn’t mean we scrape every page at once, because we use our tokens channel as a semaphore. We call our crawlPage function, pulling out the results from our parser and all the internal links found. These foundLinks are then put into our workList, and the process continues until we run out of new links to crawl.
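
A sketch of the Crawl function. The visited-set and the mutex are not discussed above, but they are needed to avoid re-crawling the same pages and to keep appends to the results slice safe across goroutines:

```go
// Crawl visits every internal link reachable from the start URL, limiting
// concurrency with a buffered channel of tokens.
func Crawl(startURL string, parser Parser, concurrency int) []ScrapeResult {
	results := []ScrapeResult{}
	workList := make(chan []string)
	n := 1 // number of pending sends to workList
	tokens := make(chan struct{}, concurrency)
	baseDomain := parseStartURL(startURL)

	go func() { workList <- []string{startURL} }()

	var mu sync.Mutex
	seen := make(map[string]bool)
	for ; n > 0; n-- {
		list := <-workList
		for _, link := range list {
			if seen[link] {
				continue
			}
			seen[link] = true
			n++
			go func(link string) {
				foundLinks, pageResults := crawlPage(baseDomain, link, parser, tokens)
				mu.Lock()
				results = append(results, pageResults)
				mu.Unlock()
				workList <- foundLinks
			}(link)
		}
	}
	return results
}
```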

Our main.go file

We can then write a very simple main.go file where we create an instance of our parser and simply call our Crawl function, then watch our crawler go out and collect results. It should be noted that the crawler is very fast and should be used with very low levels of concurrency in most instances. The democrawl repo can be found on my GitHub; feel free to use the code and expand or modify it to fit your needs.
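
A hypothetical main.go might look like this; the DefaultParser type and the democrawl import path are stand-ins for your own:

```go
package main

import (
	"fmt"

	"github.com/PuerkitoBio/goquery"
	"github.com/yourname/democrawl" // hypothetical import path
)

// DefaultParser is a stand-in Parser that just grabs the title and first H1.
type DefaultParser struct{}

func (p DefaultParser) ParsePage(doc *goquery.Document) democrawl.ScrapeResult {
	return democrawl.ScrapeResult{
		Title: doc.Find("title").First().Text(),
		H1:    doc.Find("h1").First().Text(),
	}
}

func main() {
	results := democrawl.Crawl("https://example.com", DefaultParser{}, 2)
	for _, r := range results {
		fmt.Println(r.Title)
	}
}
```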

Writing a Web Crawler with Golang and Colly

This blog features multiple posts regarding building Python web crawlers, but the subject of building a crawler in Golang has never been touched upon. There are a couple of frameworks for building web crawlers in Golang, but today we are going to look at building a web crawler using Colly. When I first started playing with the framework, I was shocked at how quick and easy it was to build a highly functional crawler with very few lines of Go code.

In this post we are going to build a crawler which crawls this site and extracts the URL, title and code snippets from every Python post on the site. To write such a crawler we only need a total of 60 lines of code! Colly requires an understanding of CSS selectors, which is beyond the scope of this post, but I recommend you take a look at this cheat sheet.

Setting Up A Crawler

To begin with, we are going to set up our crawler and create the data structure to store our results in. First of all, we need to install Colly using the go get command. Once this is done, we create a new struct which represents an article and contains all the fields we are going to be collecting with our simple example crawler.
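
Assuming Colly has been installed with go get github.com/gocolly/colly, the struct might look something like this (the field names are my own):

```go
// Article stores the details collected for each Python post.
type Article struct {
	ArticleTitle string
	URL          string
	CodeSnippets []string
}
```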

With this done, we can begin writing our main function. To create a new crawler we call colly.NewCollector, which returns a Collector instance. The NewCollector function takes a list of option functions which are used to initialize our crawler. In our case we pass only one option to NewCollector, which limits our crawler to pages found on “edmundmartin.com”.

Having done this, we then place some limits on our crawler. As Golang is very performant and many websites are running on relatively slow servers, we probably want to limit the speed of our crawler. Here, we set up a limit rule which matches every URL containing “edmundmartin”. By setting the parallelism to 1 and adding a delay of a second, we ensure that we only crawl one URL a second.
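
Inside main, the collector setup and limit rule might look roughly like this:

```go
c := colly.NewCollector(
	// Restrict the crawl to pages found on edmundmartin.com.
	colly.AllowedDomains("edmundmartin.com"),
)

// One request at a time, with a one second delay between requests, for any
// URL matching the domain glob.
c.Limit(&colly.LimitRule{
	DomainGlob:  "*edmundmartin*",
	Parallelism: 1,
	Delay:       1 * time.Second,
})
```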

Basic Crawling Logic

To collect data from our target site, we need to create a clone of our Colly collector. We also create a slice of our ‘Article’ struct to store the results we will be collecting. Finally, we add a callback to our crawler which fires every time we make a new request; this callback just prints the URL our crawler will be visiting.
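
Continuing inside main:

```go
detailCollector := c.Clone()
articles := []Article{}

// Print every URL the crawler is about to visit.
c.OnRequest(func(r *colly.Request) {
	fmt.Println("Visiting:", r.URL.String())
})
```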

We then add another “OnHTML” callback which is fired every time HTML is returned to us. This is attached to our original Colly collector instance and not the clone of the Collector. Here we pass in a CSS selector which pulls out all of the hrefs on the page. We can also use some logic contained within the Colly framework which allows us to resolve the URL in question. If the URL contains ‘python’, we submit it to our cloned Collector, while if ‘python’ is absent from the URL we simply visit the page in question. Cloning our collector allows us to define different OnHTML parsers for each clone of the original crawler.
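
A sketch of that callback; the a[href] selector and the ‘python’ check follow the description above:

```go
// Resolve every link to an absolute URL. Python posts go to the detail
// collector, everything else is simply crawled on.
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
	foundURL := e.Request.AbsoluteURL(e.Attr("href"))
	if strings.Contains(foundURL, "python") {
		detailCollector.Visit(foundURL)
	} else {
		e.Request.Visit(foundURL)
	}
})
```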

Extracting Details From A Post

We can now add an ‘OnHTML’ callback to our ‘detailCollector’ clone. Again we use a CSS selector to pull out the content of each post contained on the page. From this we can extract the text contained within the post’s “H1” tag. Finally, we pick out all of the ‘div’ elements with the class ‘crayon-main’ and iterate over them, pulling out our code snippets. We then add our collected data to our slice of Articles.
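
Roughly as follows; the div.post-content selector is an assumption and will differ depending on the site’s markup:

```go
// Pull the post title and any code snippets out of a post page.
detailCollector.OnHTML("div.post-content", func(e *colly.HTMLElement) {
	article := Article{
		ArticleTitle: e.ChildText("h1"),
		URL:          e.Request.URL.String(),
	}
	e.ForEach("div.crayon-main", func(_ int, el *colly.HTMLElement) {
		article.CodeSnippets = append(article.CodeSnippets, el.Text)
	})
	articles = append(articles, article)
})
```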

All that is left to do is start the crawler by calling our original collector’s ‘Visit’ function with our start URL. The example crawler should finish within around 20 seconds. Colly makes it very easy to write powerful crawlers with relatively little code. It does, however, take a little while to get used to the callback style of programming.

Full Code

 
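
The original listing is not reproduced here, but putting the pieces described above together, the whole crawler might look roughly like this (the selectors and field names are assumptions):

```go
package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/gocolly/colly"
)

// Article stores the details collected for each Python post.
type Article struct {
	ArticleTitle string
	URL          string
	CodeSnippets []string
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("edmundmartin.com"),
	)
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*edmundmartin*",
		Parallelism: 1,
		Delay:       1 * time.Second,
	})

	detailCollector := c.Clone()
	articles := []Article{}

	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting:", r.URL.String())
	})

	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		foundURL := e.Request.AbsoluteURL(e.Attr("href"))
		if strings.Contains(foundURL, "python") {
			detailCollector.Visit(foundURL)
		} else {
			e.Request.Visit(foundURL)
		}
	})

	detailCollector.OnHTML("div.post-content", func(e *colly.HTMLElement) {
		article := Article{
			ArticleTitle: e.ChildText("h1"),
			URL:          e.Request.URL.String(),
		}
		e.ForEach("div.crayon-main", func(_ int, el *colly.HTMLElement) {
			article.CodeSnippets = append(article.CodeSnippets, el.Text)
		})
		articles = append(articles, article)
	})

	c.Visit("https://edmundmartin.com")

	for _, a := range articles {
		fmt.Println(a.ArticleTitle, a.URL)
	}
}
```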

Scraping Google with Golang

I have previously written a post on scraping Google with Python. As I am starting to write more Golang, I thought I should write the same tutorial using Golang to scrape Google. Why not scrape Google search results using Google’s home-grown programming language?

Imports & Setup

This example will only be using one external dependency. While it is possible to parse HTML using Go’s standard library, this involves writing a lot of code. So instead we are going to use the very popular Golang library Goquery, which supports jQuery-style selection of HTML elements.
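
Assuming goquery has been fetched with go get github.com/PuerkitoBio/goquery, the imports for the finished scraper end up looking something like this; they cover all of the snippets that follow:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"

	"github.com/PuerkitoBio/goquery"
)
```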

Defining What To Return

We can get a variety of different information from Google, but we typically want to return a result’s position, URL, title and description. In Golang it makes sense to create a struct representing the data we want to be gathered by our scraper.

We can build a simple struct which will hold an individual search result; when writing our final function we can then set the return value to be a slice of our GoogleResult struct. This will make it very easy for us to manipulate our search results once we have scraped them from Google.
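
A simple version of that struct (the field names are my own):

```go
// GoogleResult represents a single organic search result.
type GoogleResult struct {
	ResultRank  int
	ResultURL   string
	ResultTitle string
	ResultDesc  string
}
```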

Making A Request To Google

To scrape Google results we have to make a request to Google using a URL containing our search parameters; Google allows you to pass a number of different parameters to a search query. In this particular example we are going to write a function that generates a search URL with our desired parameters.

But first we are going to define a “map” of supported Google geo locations. In this post we are only going to support a few major geographical locations, but Google operates in over 100 different geographical locations.

This will allow us to pass a two-letter country code to our scraping function and scrape results from that particular version of Google. Using the different base domains in combination with a language code allows us to scrape results as they appear in the country in question.
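
A small map of this kind might look as follows; the handful of country codes and domains here are just examples:

```go
// googleDomains maps two letter country codes to the matching Google
// search URL. Google supports far more locations than are listed here.
var googleDomains = map[string]string{
	"us": "https://www.google.com/search?q=",
	"uk": "https://www.google.co.uk/search?q=",
	"de": "https://www.google.de/search?q=",
	"fr": "https://www.google.fr/search?q=",
}
```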

We then write a function that allows us to build a Google search URL. The function takes in three arguments, all of the string type, and returns a URL, also a string. We first trim the search term to remove any leading or trailing white-space. We then replace any of the remaining spaces with ‘+’; the -1 passed to strings.Replace means that we replace every single remaining instance of white-space with a plus.

We then look up the country code passed as an argument against the map we defined earlier. If the countryCode is found in our map, we use the respective URL from the map, otherwise we use the default ‘.com’ Google site. We then use the fmt package’s “Sprintf” function to format a string made up of our base URL, our search term and language code. We don’t check the validity of the language code, which is something we might want to do if we were writing a more fully featured scraper.
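
A sketch of the URL-building function:

```go
// buildGoogleUrl turns a search term, country code and language code into
// a Google search URL.
func buildGoogleUrl(searchTerm, countryCode, languageCode string) string {
	searchTerm = strings.TrimSpace(searchTerm)
	searchTerm = strings.Replace(searchTerm, " ", "+", -1)
	if googleBase, found := googleDomains[countryCode]; found {
		return fmt.Sprintf("%s%s&hl=%s", googleBase, searchTerm, languageCode)
	}
	return fmt.Sprintf("https://www.google.com/search?q=%s&hl=%s", searchTerm, languageCode)
}
```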

We can now write a function to make a request. Go has a very easy to use and powerful “net/http” library which makes it relatively easy to make HTTP requests. We first get a client to make our request with. We then start building a new HTTP request which will eventually be executed using our client. This allows us to set custom headers to be sent with our request. In this instance we are replicating the User-Agent header of a real browser.

We then execute this request, with the client’s Do method returning us a response and error. If something went wrong with the request we return a nil value and the error. Otherwise we simply return the response object and a nil value to show that we did not encounter an error.
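
Something along these lines, with the User-Agent string standing in for whichever browser you want to imitate:

```go
// googleRequest performs the GET request with a browser-like User-Agent.
func googleRequest(searchURL string) (*http.Response, error) {
	client := &http.Client{}
	req, err := http.NewRequest("GET", searchURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent",
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36")
	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	return res, nil
}
```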

Parsing the Result

Google results are divided up into ‘div’ elements with the class ‘g’.

Now we move on to parsing the result of our request. Compared with Python, the options when it comes to HTML parsing libraries are not as plentiful, with nothing coming close to the ease of use of BeautifulSoup. In this example, we are going to use the very popular Goquery package, which uses jQuery-style selectors to allow users to extract data from HTML documents.

We generate a goquery document from our response, and if we encounter any errors we simply return the error and a nil value. We then create an empty slice of Google results which we will eventually append results to. On a Google results page, each organic result can be found in a ‘div’ block with the class of ‘g’. So we can simply use the jQuery selector “div.g” to pick out all of the organic links.

We then loop through each of these found ‘div’ tags, finding the link and its href attribute, as well as extracting the title and meta description information. Providing the link isn’t an empty string or a navigational reference, we create a GoogleResult struct holding our information. This can then be appended to the slice of structs which we defined earlier. Finally, we increment the rank so we can tell the order in which the results appeared on the page.
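
A sketch of the parser; the h3 and span.st selectors are assumptions, and Google changes its markup regularly:

```go
// googleResultParser pulls the organic results out of a results page.
func googleResultParser(response *http.Response) ([]GoogleResult, error) {
	defer response.Body.Close()
	doc, err := goquery.NewDocumentFromReader(response.Body)
	if err != nil {
		return nil, err
	}
	results := []GoogleResult{}
	sel := doc.Find("div.g")
	rank := 1
	for i := range sel.Nodes {
		item := sel.Eq(i)
		link, _ := item.Find("a").Attr("href")
		title := item.Find("h3").Text()
		desc := item.Find("span.st").Text()
		link = strings.TrimSpace(link)
		// Skip empty links and navigational references.
		if link != "" && link != "#" && !strings.HasPrefix(link, "/") {
			results = append(results, GoogleResult{
				ResultRank:  rank,
				ResultURL:   link,
				ResultTitle: title,
				ResultDesc:  desc,
			})
			rank++
		}
	}
	return results, nil
}
```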

Wrapping It All Up

We can then write a function which ties together all the previous functions. Note that we capitalise the function name to ensure that it is exported, which will allow us to use it from other Go programs. We don’t do much in the way of error handling: if anything goes wrong we simply return nil and the error, without any logging along the way. Ideally, we would want some sort of logging to tell us exactly what went wrong with a particular scrape.
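
The wrapper then looks something like this:

```go
// GoogleScraper ties the URL building, request and parsing together. The
// capitalised name means the function is exported.
func GoogleScraper(searchTerm, countryCode, languageCode string) ([]GoogleResult, error) {
	googleURL := buildGoogleUrl(searchTerm, countryCode, languageCode)
	res, err := googleRequest(googleURL)
	if err != nil {
		return nil, err
	}
	return googleResultParser(res)
}
```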

Example Usage

The example program below makes use of our GoogleScraper function by working through a list of keywords and scraping search results. After each scrape we wait a total of 30 seconds; this should help us avoid being banned. Should we want to scrape a larger set of keywords, we would want to randomise our User-Agent and change the proxy we were using with each request. Otherwise, we are very likely to run into a Google captcha which would prevent us from gathering any results.
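
A minimal driver along those lines, with placeholder keywords:

```go
func main() {
	keywords := []string{"web scraping", "golang crawler", "goquery tutorial"}
	for _, keyword := range keywords {
		results, err := GoogleScraper(keyword, "uk", "en")
		if err != nil {
			fmt.Println("failed to scrape:", keyword, err)
		} else {
			fmt.Println(keyword, results)
		}
		// Wait 30 seconds between scrapes to reduce the chance of a ban.
		time.Sleep(30 * time.Second)
	}
}
```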

The full Google scraping script can be found here. Feel free to play with it and think about some of the additional functionality that could be added. You might for instance want to scrape the first few pages of Google, or pass in a custom number of results to be returned by the script.

Golang – Random User-Agents & Proxies

When scraping the internet it often makes sense to rotate both the proxy and the user agent sent along with the HTTP request. By rotating proxies and user agents we can decrease the chance of detection, avoid having our IP addresses banned, and avoid being rate limited by the site in question. This is relatively easy to do in Golang, though it is not quite as simple as in Python.

Generating A Random Choice

First we need to write a small function which takes a slice of strings and returns a single string. We seed Go’s random number generator with the computer’s current Unix time. Strictly speaking this makes our results only pseudo-random, as it would be possible to produce the same result given the same seed. This doesn’t matter, as pseudo-randomness is good enough for the purposes of web scraping.
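
A sketch of that helper, together with the imports used by the rest of this example:

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/url"
	"time"
)

// RandomString seeds the generator with the current Unix time and returns
// a pseudo-random element from the slice.
func RandomString(options []string) string {
	rand.Seed(time.Now().Unix())
	return options[rand.Intn(len(options))]
}
```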

Creating A Client

Making an HTTP request in Go requires that we create a client to make the request with. The function sketched below takes in the URL we are looking to scrape, our slice of proxies and our slice of User-Agent strings. We use our RandomString function to pick both our proxy and our User-Agent. The chosen proxy string is parsed and passed into our client by creating a custom transport that uses that proxy. We then create a request to be executed by the client and add our custom User-Agent header to it. Finally, we use the client to make the specified request and return the HTTP response, provided everything went fine.
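
A sketch of that request function:

```go
// makeRequest sends a GET request through a randomly chosen proxy with a
// randomly chosen User-Agent header.
func makeRequest(targetURL string, proxies, userAgents []string) (*http.Response, error) {
	proxyURL, err := url.Parse(RandomString(proxies))
	if err != nil {
		return nil, err
	}
	client := &http.Client{
		// Route all traffic through the chosen proxy.
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   10 * time.Second,
	}
	req, err := http.NewRequest("GET", targetURL, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent", RandomString(userAgents))
	return client.Do(req)
}
```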

Example Usage

We can then use the function as shown in the sketch below. By simply creating slices of strings to hold our User-Agents and our proxies, we can pass these slices straight to the function. In the example, we don’t do anything interesting with the response, but you will be able to manipulate the http.Response any way you want. The full code used can be found here.
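
For example, with placeholder proxies and a couple of User-Agent strings:

```go
func main() {
	proxies := []string{
		"http://proxy1.example.com:8080", // placeholder proxy addresses
		"http://proxy2.example.com:8080",
	}
	userAgents := []string{
		"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
		"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1.2 Safari/605.1.15",
	}

	res, err := makeRequest("https://example.com", proxies, userAgents)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer res.Body.Close()
	fmt.Println("status:", res.StatusCode)
}
```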