Scraping Google with Python

In this post we are going to look at scraping Google search results using Python. There are a number of reasons why you might want to scrape Google’s search results. Some people scrape these results to determine how their sites are performing in Google’s organic rankings, while others use the data to look for security weaknesses; there are plenty of other things you can do with the data once you have it.

Scraping Google

Google allows users to pass a number of parameters when accessing its search service, which lets us customise the results we receive back from the search engine. In this tutorial, we are going to write a script that allows us to pass in a search term, the number of results we want and a language filter.
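
As a concrete illustration, the kind of URL our script will end up requesting looks something like the one below, where q is the search term, num the number of results and hl the language code (Google supports many other parameters that we won’t use here):

```python
# Illustrative only -- the script below builds URLs of this shape.
example_url = 'https://www.google.com/search?q=learn+python&num=20&hl=en'
```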

Requirements

There are a couple of requirements we are going to need to build our Google scraper. Firstly, you are going to need Python 3. In addition to Python 3, we are going to need to install a couple of popular libraries, namely requests and BeautifulSoup (bs4). If you are already a Python user, you are likely to have both of these libraries installed.

Grabbing Results From Google

First, we are going to write a function that grabs the HTML from a Google.com search results page. The function will take three arguments: a search term, the number of results to be displayed and a language code.

The first two lines of our fetch_results function assert that the provided search term is a string and that the number of results argument is an integer. This means our function will throw an AssertionError should it be called with arguments of the wrong type.

We then escape our search term, as Google requires that spaces in search phrases be replaced with a ‘+’ character. We then use string formatting to build up a URL containing all of the parameters originally passed into the function.

Using the requests library, we make a GET request to the URL in question. We also pass a User-Agent header with the request to avoid being blocked by Google for making automated requests. Without a User-Agent, you are likely to be blocked after only a few requests.

Once we get a response back from the server, we check the response’s status code with raise_for_status. If all went well, the status code returned should be 200 OK. If, however, Google has realised we are making automated requests, we will be greeted by a captcha page and a 503 (Service Unavailable) status, in which case an exception will be raised. Finally, our function returns the search term passed in and the HTML of the results page.
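
The full function is embedded in the original post; as a rough sketch, a fetch_results function covering the steps described above might look like the following. The exact User-Agent string and URL format here are illustrative rather than taken from the original script.

```python
import requests
from urllib.parse import quote_plus

# Illustrative User-Agent string -- any recent browser UA will do.
USER_AGENT = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}


def fetch_results(search_term, number_results, language_code):
    # AssertionErrors are raised if the arguments are of the wrong type.
    assert isinstance(search_term, str), 'Search term must be a string'
    assert isinstance(number_results, int), 'Number of results must be an integer'

    # Escape the search term so that spaces become '+' characters.
    escaped_search_term = quote_plus(search_term)

    # Build the search URL from the parameters passed into the function.
    google_url = 'https://www.google.com/search?q={}&num={}&hl={}'.format(
        escaped_search_term, number_results, language_code)

    # The User-Agent header reduces the chance of being blocked immediately.
    response = requests.get(google_url, headers=USER_AGENT)
    response.raise_for_status()  # Raises requests.HTTPError on 4xx/5xx responses

    return search_term, response.text
```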

Parsing the HTML

Now that we have grabbed the HTML, we need to parse it. Parsing the HTML will allow us to extract the elements we want from the Google results page. For this we are using BeautifulSoup, a library which makes it very easy to extract the data we want from a webpage.

All the organic search results on the Google search results page are contained within ‘div’ tags with the class of ‘g’. This makes it very easy for us to pick out all of the organic results on a particular search page.

Our parse_results function begins by making a ‘soup’ out of the HTML we pass to it. This essentially creates a DOM-like object out of the HTML string, allowing us to select and navigate through different page elements. We then initialise our results variable, which is going to be a list of dictionaries. Making the results a list of dictionaries makes it very easy to use the data in a variety of different ways.

We then pick out the result blocks using the selector already mentioned. Once we have these result blocks, we iterate through the list and try to pick out the link, title and description for each block. If we find both a link and a title, we know that we have an organic search result. We then grab the href attribute of the link and the text of the description. Provided the found link is not equal to ‘#’, we simply add a dictionary to our list of found results.
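
Again as a sketch only, and with the caveat that Google’s markup changes frequently, a parse_results function following the description above might look like this (the description’s ‘st’ class name is illustrative):

```python
from bs4 import BeautifulSoup


def parse_results(html, keyword):
    soup = BeautifulSoup(html, 'html.parser')
    found_results = []

    # Organic results live inside <div> tags with the class 'g'.
    for result in soup.find_all('div', attrs={'class': 'g'}):
        link = result.find('a', href=True)
        title = result.find('h3')
        # The description markup changes over time -- 'st' is illustrative only.
        description = result.find('span', attrs={'class': 'st'})
        if link and title:
            href = link['href']
            if href != '#':
                found_results.append({
                    'keyword': keyword,
                    'title': title.get_text(),
                    'link': href,
                    'description': description.get_text() if description else '',
                })
    return found_results
```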

Error Handling

We are now going to add error handling. There are a number of different errors that could be thrown, and we look to catch all of these possible exceptions. Firstly, if you pass data of the wrong type to the fetch_results function, an AssertionError will be thrown. The function can also throw two more errors: should we get banned, we will be presented with an HTTPError, and should we have some sort of connection issue, we will catch it using the more generic RequestException.
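
A minimal sketch of how the two functions might be tied together with that error handling, assuming the fetch_results and parse_results sketches above (the scrape_google wrapper name is our own, not necessarily the original script’s):

```python
import requests


def scrape_google(search_term, number_results, language_code):
    try:
        keyword, html = fetch_results(search_term, number_results, language_code)
        return parse_results(html, keyword)
    except AssertionError:
        print('Arguments of the wrong type were passed to fetch_results')
        raise
    except requests.HTTPError:
        # raise_for_status signalled a 4xx/5xx -- most likely we have been blocked.
        print('You appear to have been blocked by Google')
        raise
    except requests.RequestException:
        # Catch-all for other connection-level problems.
        print('There appears to be a problem with your connection')
        raise
```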

We can then use this script in a number of different situations to scrape results from Google. The fact that our results data is a list of dictionaries makes it very easy to write the data to a CSV file, or to write the results to a database. The full script can be found here.
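
For example, writing the results out to a CSV file takes only a few lines with csv.DictWriter. This assumes the scrape_google and parse_results sketches above, so the fieldnames match the dictionary keys used there:

```python
import csv

results = scrape_google('web scraping', 10, 'en')

with open('results.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file,
                            fieldnames=['keyword', 'title', 'link', 'description'])
    writer.writeheader()
    writer.writerows(results)
```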

Log File Parsing for SEOs

What is a log file?

Every time a request is made to a server, the request is logged in the site’s log files. This means that log files record every single request, whether it is made by a bot or by a real visitor to your site.

These logs record the following useful information:

  • IP Address of the visitor
  • The Date & Time of the request
  • The resource requested
  • The status code of this resource
  • The number of bytes downloaded
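
For reference, a single line from a typical Apache-style access log looks something like the example below, with each of the fields above separated by spaces (the exact layout depends on how your server is configured):

```
66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /category/widgets/ HTTP/1.1" 200 5316 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```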

Why does this matter to SEOs?

SEOs need to know how search engine spiders are interacting with their sites. Google Search Console provides some limited data, including the number of pages visited and the amount of information downloaded. But this limited information doesn’t give us any real insight into how Googlebot is interacting with different templates and different directories.

Looking at this log file data often turns up some interesting insights. When looking at one client’s site, we discovered that Googlebot was spending a large portion of its time crawling pages which drove only 3% of overall organic traffic. These pages were essentially eating into the site’s overall crawl budget, contributing to a decline in overall organic traffic. This insight would simply not be available to someone who only looked at the top-level crawl data in Google’s Search Console.

What Can We Discover in Log Files?

Log files allow us to discover what resources are being requested by Google. This can help us achieve the following:

  • See how frequently important URLs are crawled
  • See the full list of errors discovered by Googlebot during crawls
  • See whether Google is fully rendering pages using front-end frameworks such as React and AngularJS
  • See whether site structure can be improved to ensure commercially valuable URLs are crawled
  • Check the implementation of robots.txt rules, ensuring blocked pages are not being crawled and that certain page types are not being unexpectedly blocked.

Options Available 

There are a number of different options available for SEOs who want to dive into their log file data.

Analysing Data in Excel

As log files are essentially space-separated text files, it’s possible to open them up in Excel and analyse them there. This can be fine when you are working with a very small site, but with bigger sites it isn’t a viable option. Just verifying whether traffic is really from Googlebot can be quite a pain, and Excel is not really designed for this type of analysis.

Commercial Options

There are a number of commercial options available for those who want to analyse their log files. Botify offer a log file analysis service to users, depending on their subscription package. However, a Botify subscription is very pricey and is not going to be a viable option for many SEOs.

Screaming Frog also offer a log file analysis tool, which is available for £99 a year. As with Screaming Frog’s SEO Spider, the mileage you get with this tool really depends on the amount of RAM you have available and how big the site you are dealing with is.

Parsing Log Files with Python

Log files are plain text files and can easily be parsed with Python. The valuable information contained on each line of a log file is delimited by spaces. This means that we are able to run through the file line by line and pull out the relevant data.

There are a couple of challenges when parsing a log file for SEO purposes:

  • We need to pull out Googlebot requests and verify that they actually come from a Google IP address (a verification sketch follows this list).
  • We need to normalise date formats to make our analysis of our log files easier.
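
The verification step follows the approach Google itself recommends: do a reverse DNS lookup on the requesting IP, check that the hostname belongs to googlebot.com or google.com, then forward-resolve that hostname to confirm it points back to the same IP. A minimal sketch using only the standard library (the helper name is our own):

```python
import socket


def is_real_googlebot(ip_address):
    """Verify a claimed Googlebot hit with a reverse then forward DNS lookup."""
    try:
        # Reverse DNS: genuine Googlebot IPs resolve to *.googlebot.com or *.google.com.
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        if not hostname.endswith(('.googlebot.com', '.google.com')):
            return False
        # Forward-confirm the hostname to guard against spoofed reverse records.
        return ip_address in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```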

I have written a Python script which deals with both of these issues and outputs the underlying data to a SQLite database. Once we have the data in SQLite, we can either export it to a CSV file for further analysis or query the database to pull out specific information.
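
The script itself is linked in the original post; the sketch below only illustrates the general approach, assuming an Apache-style log line like the example earlier and the is_real_googlebot helper above. The filename, database name and table layout here are placeholders of our own choosing, whereas the linked script works over every log file with a given extension.

```python
import sqlite3
from datetime import datetime


def parse_log_line(line):
    """Split one Apache-style log line into the fields we care about."""
    ip_address = line.split(' ')[0]
    # The timestamp sits between '[' and ']', e.g. [10/Oct/2023:13:55:36 +0000]
    raw_date = line.split('[')[1].split(']')[0]
    # Normalise the date into ISO format so every row can be compared easily.
    date = datetime.strptime(raw_date, '%d/%b/%Y:%H:%M:%S %z').isoformat()
    # The request line sits between the first pair of double quotes.
    resource = line.split('"')[1].split(' ')[1]
    status_code = line.split('"')[2].split()[0]
    return ip_address, date, resource, status_code


connection = sqlite3.connect('log_data.db')  # database name is a placeholder
cursor = connection.cursor()
cursor.execute('CREATE TABLE IF NOT EXISTS requests '
               '(ip TEXT, date TEXT, resource TEXT, status TEXT)')

with open('access.log') as log_file:  # placeholder filename
    for line in log_file:
        # Only keep hits that claim to be Googlebot and pass DNS verification.
        if 'Googlebot' in line and is_real_googlebot(line.split(' ')[0]):
            cursor.execute('INSERT INTO requests VALUES (?, ?, ?, ?)',
                           parse_log_line(line))

connection.commit()
connection.close()
```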

Our Script

The above script can be used without any Python knowledge: simply move the script to the directory where you have downloaded your log files. On the last line of the script, change ‘example-extension’ to the extension your log files have been saved with and replace ‘exampledomain.com’ with the domain in question.

As the script does not rely on any libraries outside of Python’s standard library, anyone with Python installed should be able to save the script and simply run it.
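
Purely as an illustration, the final line mentioned above might look something like the following; the parse_logs function name is hypothetical, and the real script’s signature may differ.

```python
# Hypothetical entry point -- the real script's function name and arguments
# may differ. Swap in the extension your logs use and your own domain.
parse_logs('example-extension', 'exampledomain.com')
```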