Scraping Google with Python

In this post we are going to look at scraping Google search results using Python. There are a number of reasons why you might want to scrape Google’s search results: some people scrape them to track how their sites are performing in Google’s organic rankings, others use the data to look for security weaknesses, and there are plenty of other things you can do with the data once you have it.

Scraping Google

Google allows users to pass a number of parameters when accessing its search service, which lets us customise the results we receive back from the search engine. In this tutorial, we are going to write a script that allows us to pass in a search term, the number of results we want, and a language filter.

Requirements

There are a couple of requirements we are going to need to build our Google scraper. Firstly, you are going to need Python 3. In addition to Python 3, we are going to need to install a couple of popular libraries, namely requests and BeautifulSoup (bs4). If you are already a Python user, you are likely to have both of these libraries installed.

Grabbing Results From Google

First, we are going to write a function that grabs the HTML from a Google.com search results page. The function will take three arguments: a search term, the number of results to be displayed, and a language code.

The first two lines of our fetch_results function assert that the provided search term is a string and that the number of results argument is an integer. This means our function will throw an AssertionError should it be called with arguments of the wrong type.

We then escape our search term, as Google requires that search phrases containing spaces be escaped with a plus (‘+’) character. We then use string formatting to build up a URL containing all the parameters originally passed into the function.

Using the requests library, we make a GET request to the URL in question. We also pass a User-Agent header with the request to avoid being blocked by Google for making automated requests. Without a browser-like User-Agent, you are likely to be blocked after only a few requests.

Once we get a response back from the server, we call raise_for_status() on it. If all went well, the status code returned should be 200 OK. If, however, Google has realised we are making automated requests, we will be greeted by a captcha page and a 503 response, in which case an exception will be raised. Finally, our function returns the search term passed in along with the HTML of the results page.
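Putting those steps together, a sketch of the fetch_results function might look like the following (the User-Agent string is just an example of a browser-like value):

import requests

# Example browser-like User-Agent string; any current browser value will do
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) '
              'Chrome/91.0 Safari/537.36')


def fetch_results(search_term, number_results, language_code):
    assert isinstance(search_term, str), 'Search term must be a string'
    assert isinstance(number_results, int), 'Number of results must be an integer'

    # Google expects spaces in the query to be escaped with '+'
    escaped_search_term = search_term.replace(' ', '+')
    google_url = 'https://www.google.com/search?q={}&num={}&hl={}'.format(
        escaped_search_term, number_results, language_code)

    # Sending a browser-like User-Agent makes the request less likely to be blocked
    response = requests.get(google_url, headers={'User-Agent': USER_AGENT})
    response.raise_for_status()

    return search_term, response.text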

Parsing the HTML

Now that we have grabbed the HTML, we need to parse it. Parsing the HTML will allow us to extract the elements we want from the Google results page. For this we are using BeautifulSoup; this library makes it very easy to extract the data we want from a webpage.

All the organic search results on the Google search results page are contained within ‘div’ tags with the class of ‘g’. This makes it very easy for us to pick out all of the organic results on a particular search page.

Our parse_results function begins by making a ‘soup’ out of the HTML we pass to it. This essentially creates a DOM-like object out of an HTML string, allowing us to select and navigate through different page elements. We then initialise our results variable, which is going to be a list of dictionaries. Making the results a list of dictionaries makes it very easy to use the data in a variety of different ways.

We then pick out the result blocks using the selector already mentioned. Once we have these blocks, we iterate through the list and try to pick out the link, title and description for each one. If we find both a link and a title, we know that we have an organic search block. We then grab the href attribute of the link and the text of the description. Provided the found link is not equal to ‘#’, we simply add a dictionary to our list of found results.
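A sketch of a parse_results function along these lines is shown below. Google changes its result markup fairly often, so the title and description selectors used here are assumptions that may need adjusting.

from bs4 import BeautifulSoup


def parse_results(html, keyword):
    soup = BeautifulSoup(html, 'html.parser')
    found_results = []

    # Organic results live in div tags with the class 'g'
    result_blocks = soup.find_all('div', attrs={'class': 'g'})
    for result in result_blocks:
        link = result.find('a', href=True)
        title = result.find('h3')  # assumed title selector
        description = result.find('span', attrs={'class': 'st'})  # assumed description selector
        if link and title:
            link = link['href']
            title = title.get_text()
            if description:
                description = description.get_text()
            # Links equal to '#' are not organic results
            if link != '#':
                found_results.append({'keyword': keyword, 'title': title,
                                      'link': link, 'description': description})
    return found_results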

Error Handling

We are now going to add error handling. There are a number of different errors that could be thrown, and we want to catch all of these possible exceptions. Firstly, if you pass data of the wrong type to the fetch_results function, an AssertionError will be thrown. The function can also raise two more errors: should we get banned, we will be presented with an HTTPError, and should we have some sort of connection issue, we will catch it using the generic requests exception (RequestException).
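As an illustration, a wrapper that ties the two functions together and translates these exceptions might look like this (the scrape_google name is just a suggestion):

import requests


def scrape_google(search_term, number_results, language_code):
    try:
        keyword, html = fetch_results(search_term, number_results, language_code)
        return parse_results(html, keyword)
    except AssertionError:
        # Arguments of the wrong type were passed to fetch_results
        raise Exception('Arguments provided to the function are of the wrong type')
    except requests.HTTPError:
        # raise_for_status() failed, typically because Google served a captcha page
        raise Exception('You appear to have been blocked by Google')
    except requests.RequestException:
        # Any other connection-related problem
        raise Exception('There appears to be an issue with your connection')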

We can then use this script in a number of different situations to scrape results from Google. The fact that our results data is a list of dictionaries makes it very easy to write the data to a CSV file or insert the results into a database. The full script can be found here.
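For example, writing the results out to CSV with csv.DictWriter only takes a few lines (the search term and file name below are illustrative, using the scrape_google wrapper sketched above):

import csv

results = scrape_google('web scraping', 10, 'en')

with open('results.csv', 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file,
                            fieldnames=['keyword', 'title', 'link', 'description'])
    writer.writeheader()
    writer.writerows(results)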

23 thoughts to “Scraping Google with Python”

    1. You should be able to add this to the script, but depending on how you want to return your results you might have to make some other edits.

  1. Hello, I am looking to connect this script to a SQLite database using Flask on a Linux server. Do you have any tips on connecting this script to a SQLite DB?

    1. Hello,

      It should be pretty easy to do. As the result is simply a list of dictionaries, it should be simple to insert into an SQLite DB with an ORM such as peewee or SQLAlchemy. How this is done depends on how you have opted to lay out the app and which database technology you have ultimately opted for.
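      As a rough sketch with peewee (the model and field names are illustrative, and assume the result dictionaries use matching keys):

      from peewee import Model, SqliteDatabase, CharField, TextField

      db = SqliteDatabase('results.db')

      class Result(Model):
          keyword = CharField()
          title = TextField()
          link = TextField()
          description = TextField()

          class Meta:
              database = db

      db.connect()
      db.create_tables([Result])

      # found_results is the list of dictionaries produced by the scraper
      Result.insert_many(found_results).execute()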

    1. A User-Agent is simply a string which you send when you make HTTP requests. The User-Agent helps websites identify your browser and operating system, and gives sites the ability to customise the experience based on your browser’s features. By default, the requests library uses a header which identifies itself as the Python requests library, which makes it very easy for websites to simply block requests carrying this header. We can modify our headers to make it appear that we are using a real browser when making requests, but there are still ways of detecting requests made with a library such as requests, as opposed to a real browser.
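      For example, you can inspect the default header requests sends and swap it for a browser-like one (the User-Agent string below is just an example):

      import requests

      session = requests.Session()
      print(session.headers['User-Agent'])  # something like 'python-requests/2.x.x'

      # Override the header so requests look like they come from a normal browser
      session.headers['User-Agent'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                                       'Chrome/91.0 Safari/537.36')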

      1. Google will block you if it deems that you are making automated requests. Google will do this regardless of the method of scraping, if your IP address is deemed to have made too many requests. There are two main ways to tackle this. One option is simply to sleep for a significant amount of time between each request; sleeping 30-60 seconds between requests has, in my experience, allowed me to query hundreds of keywords, though it obviously slows you down. The second option is to use a variety of different proxies to make your requests. By switching up the proxy used, you are able to consistently extract results from Google. The faster you want to go, the more proxies you are going to need.
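        A minimal sketch of the first approach (the keyword list is illustrative, and scrape_google is the wrapper sketched in the post):

        import random
        import time

        keywords = ['python web scraping', 'scrape google results']

        for keyword in keywords:
            results = scrape_google(keyword, 10, 'en')
            print(keyword, len(results))
            # Wait a randomised 30-60 seconds before the next query
            time.sleep(random.uniform(30, 60))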

        1. Thanks for the tutorial. I’m wondering whether you could point me in the right direction to find some resources on using randomly varied proxies in Python.

          1. Unfortunately, I don’t have any resources on that topic. There are, however, a number of services that provide a rotating proxy behind a single proxy endpoint, though these tend to be quite unreliable. If you have a bunch of proxies, it is quite easy to write a small service or script which rotates through them.
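            As a rough sketch of such a rotating-proxy helper (the proxy addresses below are placeholders for proxies you actually have access to):

            import itertools
            import requests

            # Placeholder proxy addresses; swap in your own proxies
            PROXIES = ['http://203.0.113.10:8080',
                       'http://203.0.113.11:8080',
                       'http://203.0.113.12:8080']
            proxy_pool = itertools.cycle(PROXIES)

            def fetch_with_rotating_proxy(url, headers=None):
                # Each call uses the next proxy in the cycle
                proxy = next(proxy_pool)
                return requests.get(url, headers=headers,
                                    proxies={'http': proxy, 'https': proxy},
                                    timeout=10)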

  2. Thanks for the reply. For some reason, it wouldn’t let me respond to your reply… I actually solved my own problem using the http-request-randomizer package, which scrapes four websites to generate a list of presently available public proxies and also randomises your User-Agent header. The modification to your code is simple. Once you’ve installed http-request-randomizer, you can wrap it in a simple function which is callable from your fetch_results() function:

    from http_request_randomizer.requests.proxy.requestProxy import RequestProxy
    import time

    def pRequest(url):
        # Keep trying randomly chosen public proxies until one returns a response
        req_proxy = RequestProxy()
        while True:
            response = req_proxy.generate_proxied_request(url)
            if response is not None:
                return response
            time.sleep(10)

    def fetch_results(search_term, number_results, language_code):
        assert isinstance(search_term, str), 'Search term must be a string'
        assert isinstance(number_results, int), 'Number of results must be an integer'
        escaped_search_term = search_term.replace(' ', '+')
        google_url = 'https://www.google.com/search?q={}&num={}&hl={}'.format(
            escaped_search_term, number_results, language_code)
        response = pRequest(google_url)
        response.raise_for_status()
        return search_term, response.text

    Also, you might want to add some error handling at line 15 of parse_results(), as some results did not have descriptions matching the description selector you set out (i.e. some of the summary boxes that show up at the top of certain searches), which was causing the code to fail for some search terms. I solved it (very simply) by replacing line 15 of parse_results() with:

    try:
        description = description.get_text()
    except AttributeError:
        description = ''

    1. I have updated the code to fix the error. It’s better to do the following, though:

      if description:
          description = description.get_text()

      Catching an exception is not really necessary here, and it is more costly from a performance perspective.

        1. It shouldn’t do. Bs4 returns None by default if no element is found. It’s perfectly safe to add None to a dictionary or list object, though remember to avoid NoneType errors when accessing the contents of your chosen data structure.

    2. Hi, I’ve been having issues with the package you used above. I found this site (https://www.my-proxy.com/blog/google-proxies-dead) saying Google will block public proxies. Is this still working for you, or have you had to get around it in other ways? When I use the package I get a NoneType returned to me. My goal is to get information from the results page every 30 seconds. Thanks!

    1. The selector used in the script should pick up the descriptions as they are displayed in the search result page. Have you tried changing the description selector?

  3. I would suggest the following replacement (assuming Python 3):
    Change
    escaped_search_term = search_term.replace(' ', '+')
    To
    from urllib.parse import quote_plus
    escaped_search_term = quote_plus(search_term)

  4. Thx for posting this, it looks like it might help me a lot! I’ve gotten the sample code to work (I’m handwaving right now over issues with proxies and delays). I combined the sections into one file, but when I run it, the results look like “unprocessed” soup (showing part of a 0.5 MB output):
    edmund martin – Google Search(function(){window.google={kEI:’khSIW76cK4mqtQX6o6TwBQ’,kEXPI:’31’,authuser:0,kscs:’c9c918f0_khSIW76cK4mqtQX6o6TwBQ’,kGL:’US’};google.kHL=’en’;})();google.time=function(){return(new Date).getTime()};google.timers={};google.startTick=function(c,b){var a=b&&google.timers[b].t?google.timers[b].t.start:google.time();google.timers[c]={t:{start:a},e:{},m:{}};

    Q: how do I fix this? Is the assumption that the output of the “Grabbing Results From Google” section is written to a file that is then used by the “Parsing the HTML” and error-handling sections?

    I can find URLs from a manual google search in the output, so am assuming it’s something I’m doing wrong with the parse section. THX!

    1. Hello, thanks for your comment. It’s hard to tell without seeing all of your code. I think it’s likely something went wrong when you combined the snippets together. You can find a link to the complete script here. Let me know if that helps?

    1. You need to scale back the rate at which you are scraping Google and sleep between each request you make. Or alternatively you can make use of proxies and rotate them between requests.
