Writing A Web Crawler in Golang

I have previously written a piece looking at how to write a web crawler using Go and the popular Colly framework. However, it is perfectly possible to write a reasonably powerful web crawler in Golang without the help of any frameworks. In this post, we are going to write a web crawler using just Golang and the goquery package to extract HTML elements. All in all, we can write a fast but relatively basic web crawler in around 130 lines of code.

Defining Our Parser Interface

First, we import all the packages we need from the standard library. We then pull in goquery, which we will use to extract data from the HTML returned by our crawler. If you don’t already have goquery, you will need to go grab it with go get.
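The exact imports will depend on how you flesh things out, but a set along the following lines covers everything used in the sketches below:

package democrawl

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
	"sync"
	"time"

	"github.com/PuerkitoBio/goquery"
)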

We then define our ScrapeResult struct, which contains some very simple data regarding the page. This could easily be expanded to return more useful information or to extract particular pieces of valuable data. We then define a Parser interface, which allows users of our democrawl package to define their own parser to use with the basic crawling logic.
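As a rough sketch, the struct and interface might look like the following; the URL, Title and H1 fields are purely illustrative, so swap in whatever data you actually care about:

// ScrapeResult holds the simple data we collect about a single page.
type ScrapeResult struct {
	URL   string
	Title string
	H1    string
}

// Parser lets users of democrawl plug their own extraction logic
// into the shared crawling code.
type Parser interface {
	ParsePage(doc *goquery.Document) ScrapeResult
}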

Making HTTP Requests

We are going to write a function which attempts to grab a page by making a GET request. The function takes in a URL and makes a request using a Googlebot user agent, in the hope of avoiding detection. Should we encounter no issues, we return a pointer to the http.Response. Should something go wrong, we return nil and the error returned by the GET request.
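A minimal version of this function (here called makeRequest, with an assumed ten second timeout) might look like this:

func makeRequest(targetURL string) (*http.Response, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	req, err := http.NewRequest("GET", targetURL, nil)
	if err != nil {
		return nil, err
	}
	// Pretend to be Googlebot in the hope of avoiding detection.
	req.Header.Set("User-Agent", "Googlebot/2.1 (+http://www.google.com/bot.html)")
	res, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	return res, nil
}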

Extracting Links And Resolving Relative URLs

Our crawler is going to restrict itself to URLs found on the domain of our start URL. To achieve this, we are going to write two functions. First, we will write a function which discovers all the links on a page. Then we will need a function to resolve relative URLs (URLs starting with “/”).

Our extractLinks function takes in a pointer to a goquery Document and returns a slice of strings. This is relatively easy to do. We create a new slice of strings, then, provided we were passed a non-nil document, we find each link element and extract its href attribute. Each href is then added to our slice of URLs.
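A sketch of extractLinks along those lines:

func extractLinks(doc *goquery.Document) []string {
	foundUrls := []string{}
	if doc != nil {
		doc.Find("a").Each(func(i int, s *goquery.Selection) {
			href, exists := s.Attr("href")
			if exists {
				foundUrls = append(foundUrls, href)
			}
		})
	}
	return foundUrls
}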

We then have our resolveRelative function. As the name suggests, this function resolves relative links and returns a slice of all the internal links found on a page. We iterate over our slice of foundUrls: if a URL starts with the site’s baseURL, we add it straight to our slice; if it begins with “/”, we do some string formatting to build the absolute URL in question. Should a URL not belong to the domain we are crawling, we simply skip it.
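Something along these lines does the job:

func resolveRelative(baseURL string, foundUrls []string) []string {
	var internalUrls []string
	for _, href := range foundUrls {
		if strings.HasPrefix(href, baseURL) {
			// Already an absolute URL on our domain.
			internalUrls = append(internalUrls, href)
			continue
		}
		if strings.HasPrefix(href, "/") {
			// Relative URL: prefix it with the base URL.
			internalUrls = append(internalUrls, fmt.Sprintf("%s%s", baseURL, href))
		}
		// Anything else points off-site, so we drop it.
	}
	return internalUrls
}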

Crawling A Page

We can then start bringing all of our work together with a function that crawls a single page. This function takes a number of arguments: we pass in our base URL and the URL we want to scrape, along with the parser we have defined in our main.go file. We also pass in a buffered channel of empty structs, which we use as a semaphore. This allows us to limit the number of requests we make in parallel, as sending on the channel blocks once its buffer is full.

We make our request, then create a goquery Document from the response body. This document is used by both our parser’s ParsePage method and our extractLinks function. We then resolve the found URLs, before returning them along with the results produced by our parser.
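Putting that together, a crawlPage sketch might look like the following; tagging the result with the URL we requested is my own addition:

func crawlPage(baseURL, targetURL string, parser Parser, tokens chan struct{}) ([]string, ScrapeResult) {
	tokens <- struct{}{} // acquire a slot; blocks while the crawler is at full concurrency
	resp, err := makeRequest(targetURL)
	<-tokens // release the slot as soon as the request has completed
	if err != nil {
		return nil, ScrapeResult{}
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, ScrapeResult{}
	}

	pageResults := parser.ParsePage(doc)
	pageResults.URL = targetURL // record which page these results came from
	links := extractLinks(doc)
	foundUrls := resolveRelative(baseURL, links)
	return foundUrls, pageResults
}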

Getting Our Base URL

We can pull out our baseURL by using the net/url package’s Parse function. This means we only need to pass our start URL into our main Crawl function. After we parse the URL, we join the scheme and host back together using basic string formatting.
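For example (the name parseStartURL is my own, and error handling is skipped for brevity):

func parseStartURL(u string) string {
	// In real code you would want to handle the error rather than ignore it.
	parsed, _ := url.Parse(u)
	return fmt.Sprintf("%s://%s", parsed.Scheme, parsed.Host)
}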

Crawl Function

Our Crawl function brings together all the other functions we have written and contains quite a lot of its own logic. We begin by creating an empty slice of ScrapeResults. We then create a workList channel, which will carry lists of URLs to scrape, and initialise an integer counter to one, which tracks how many lists of links are still due to arrive on that channel. We also create a channel of tokens, which will be passed into our crawlPage function and limits the total concurrency to the level defined when we launch the crawler. Finally, we parse our start URL to get our baseDomain, which is used in multiple places within our crawling logic.

Our main for loop is rather complicated, but we essentially create a new goroutine for each URL in our work list. This doesn’t mean we scrape every page at once, because we use our tokens channel as a semaphore. Each goroutine calls our crawlPage function, pulling out the results from our parser and all the internal links found. These foundLinks are then put back onto our workList, and the process continues until we run out of new links to crawl.
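A sketch of the whole Crawl function is below. The seen map and the mutex around the results slice are my own additions: the map stops us revisiting pages (and guarantees the loop eventually terminates), and the mutex keeps the concurrent appends safe.

func Crawl(startURL string, concurrency int, parser Parser) []ScrapeResult {
	results := []ScrapeResult{}
	worklist := make(chan []string)

	// n counts how many lists of links are still due to arrive on worklist.
	n := 1
	go func() { worklist <- []string{startURL} }()

	tokens := make(chan struct{}, concurrency)
	baseDomain := parseStartURL(startURL)

	seen := make(map[string]bool)
	var mu sync.Mutex

	for ; n > 0; n-- {
		list := <-worklist
		for _, link := range list {
			if seen[link] {
				continue
			}
			seen[link] = true
			n++
			go func(link string) {
				foundUrls, pageResults := crawlPage(baseDomain, link, parser, tokens)
				mu.Lock()
				results = append(results, pageResults)
				mu.Unlock()
				worklist <- foundUrls
			}(link)
		}
	}
	return results
}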

Our main.go file

We can then write a very simple main.go file where we create an instance of our parser and simply call our Crawl function, then watch our crawler go out and collect results. It should be noted that the crawler is very fast and should be used with very low levels of concurrency in most instances. The democrawl repo can be found on my GitHub; feel free to use the code and expand and modify it to fit your needs.
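A minimal main.go, assuming the package lives at an import path of your choosing and using a made-up DefaultParser that just grabs the title and first h1, could look something like this:

package main

import (
	"fmt"

	"github.com/PuerkitoBio/goquery"

	// Replace this with wherever you keep the democrawl package.
	"github.com/yourname/democrawl"
)

// DefaultParser is an example Parser that pulls the title and first h1 from each page.
type DefaultParser struct{}

func (p DefaultParser) ParsePage(doc *goquery.Document) democrawl.ScrapeResult {
	return democrawl.ScrapeResult{
		Title: doc.Find("title").First().Text(),
		H1:    doc.Find("h1").First().Text(),
	}
}

func main() {
	results := democrawl.Crawl("https://example.com", 3, DefaultParser{})
	for _, result := range results {
		fmt.Printf("%+v\n", result)
	}
}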
