Golang – Random User-Agents & Proxies

When scraping the internet it often makes sense to rotate both the proxy and the user agent sent along with each HTTP request. By rotating proxies and user agents we can reduce the chance of detection, and avoid having our IP addresses banned or being rate limited by the site in question. This is relatively easy to do in Golang, though it is not quite as simple as in Python.

Generating A Random Choice

First we need to write a small function which takes a slice of strings and returns a single string. We first seed Go’s random number generator with the computer’s current Unix time. Strictly speaking this makes our results only pseudo-random, as it would be possible to reproduce the same result given the same seed. This doesn’t matter, as pseudo-randomness is good enough for the purposes of web scraping.

Creating A Client

Making an HTTP request in Go requires that we create a client to make the request with. The function takes in the URL we are looking to scrape, our slice of proxies and our slice of User-Agent strings. We then use our RandomString function to pick both our proxy and our User-Agent. The parsed proxy string is passed into our client by creating a custom transport which uses that proxy. We then create a request to be executed by the client and add our custom User-Agent header to it. Finally, we use the client to make the specified request and return the HTTP response, should everything go well.

Example Usage

We can then use the function by simply creating a slice of strings to hold our User-Agents and another to hold our proxies, and passing these slices straight to the function. In the example we don’t do anything interesting with the response, but you will be able to manipulate the http.Response any way you want. The full code used can be found here.

Scraping Google with Python

In this post we are going to look at scraping Google search results using Python. There are a number of reasons why you might want to scrape Google’s search results. Some people scrape these results to determine how their sites are performing in Google’s organic rankings, while others use the data to look for security weaknesses; there are plenty of different things you can do with the data available to you.

Scraping Google

Google allows users to pass a number of parameters when accessing their search service, which lets us customise the results we receive back from the search engine. In this tutorial, we are going to write a script allowing us to pass a search term, the number of results and a language filter.

Requirements

There are a couple of requirements we are going to need to build our Google scraper. Firstly, you are going to need Python 3. In addition to Python 3, we are going to need to install a couple of popular libraries, namely requests and bs4 (BeautifulSoup). If you are already a Python user, you are likely to have both of these libraries installed.

Grabbing Results From Google

First, we are going to write a function that grabs the HTML from a Google.com search results page. The function will take three arguments: a search term, the number of results to be displayed and a language code.

The first two lines of our fetch_results function assert whether the provided search term is a string and whether the number of results argument is an integer. This will see our function throw an AssertionError should it be called with arguments of the wrong type.

We then escape our search term, with Google requiring that search phrases containing spaces be escaped with an addition (‘+’) character. We then use string formatting to build up a URL containing all the parameters originally passed into the function.

Using the requests library, we make a GET request to the URL in question. We also pass in a User-Agent with the request to avoid being blocked by Google for making automated requests. Without passing a User-Agent, you are likely to be blocked after only a few requests.

Once we get a response back from the server, we raise the response for its status code. If all went well the status code returned should be 200 OK. If however Google has realised we are making automated requests, we will be greeted by a captcha page served with a 503 status code. If this happens an exception will be raised. Finally, our function returns the search term passed in and the HTML of the results page.
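The original post embeds the full function, which is not reproduced here. The sketch below shows roughly what fetch_results might look like; the URL format and parameter names are my assumptions rather than the author’s exact code.

```python
import requests

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')


def fetch_results(search_term, number_results, language_code):
    # Throw an AssertionError if we are passed arguments of the wrong type
    assert isinstance(search_term, str), 'Search term must be a string'
    assert isinstance(number_results, int), 'Number of results must be an integer'

    # Google expects spaces in the query to be escaped with '+' characters
    escaped_search_term = search_term.replace(' ', '+')

    google_url = 'https://www.google.com/search?q={}&num={}&hl={}'.format(
        escaped_search_term, number_results, language_code)
    response = requests.get(google_url, headers={'User-Agent': USER_AGENT})
    # Raises requests.HTTPError if Google serves a captcha/503 page
    response.raise_for_status()

    return search_term, response.text
```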

Parsing the HTML

Now that we have grabbed the HTML, we need to parse it. Parsing the HTML will allow us to extract the elements we want from the Google results page. For this we are using BeautifulSoup; this library makes it very easy to extract the data we want from a webpage.

All the organic search results on the Google search results page are contained within ‘div’ tags with the class of ‘g’. This makes it very easy for us to pick out all of the organic results on a particular search page.

Our parse_results function begins by making a ‘soup’ out of the HTML we pass to it. This essentially creates a DOM object out of the HTML string, allowing us to select and navigate through the different page elements. We then initialise our results variable, which is going to be a list of dictionaries. By making the results a list of dictionaries we make it very easy to use the data in a variety of different ways.

We then pick out the result blocks using the selector already mentioned. Once we have these result blocks we iterate through the list, trying to pick out the link, title and description for each block. If we find both a link and a title, we know that we have an organic search block. We then grab the href attribute of the link and the text of the description. Provided the found link is not equal to ‘#’, we simply add a dictionary to our list of found results.
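Again, the post’s embedded code isn’t included in this copy, so the following is a rough sketch of how parse_results could be written with BeautifulSoup. Google’s markup changes over time, so treat the ‘h3’ and ‘span.st’ selectors here as assumptions.

```python
from bs4 import BeautifulSoup


def parse_results(html, keyword):
    soup = BeautifulSoup(html, 'html.parser')

    found_results = []
    # Organic results sit inside div tags with the class 'g'
    for result in soup.find_all('div', attrs={'class': 'g'}):
        link = result.find('a', href=True)
        title = result.find('h3')
        description = result.find('span', attrs={'class': 'st'})
        if link and title:
            href = link['href']
            description_text = description.get_text() if description else ''
            if href != '#':
                found_results.append({'keyword': keyword,
                                      'title': title.get_text(),
                                      'description': description_text,
                                      'link': href})
    return found_results
```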

Error Handling

We are now going to add error handling. There are a number of different errors that could be thrown, and we look to catch all of these possible exceptions. Firstly, if you pass data of the wrong type to the fetch_results function, an AssertionError will be thrown. The function can also throw two more errors: should we get banned we will be presented with an HTTPError, and should we have some sort of connection issue we will catch this using the generic requests exception.

We can then use this script in a number of different situations to scrape results from Google. The fact that our results data is a list of dictionary items makes it very easy to write the data to a CSV file, or to write the results to a database. The full script can be found here.
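As a rough illustration of the error handling and CSV output described above (building on the two sketches earlier; the wrapper name and CSV columns are my own assumptions, not the author’s exact script):

```python
import csv

import requests


def scrape_google(search_term, number_results, language_code):
    try:
        keyword, html = fetch_results(search_term, number_results, language_code)
        return parse_results(html, keyword)
    except AssertionError:
        # Raised when fetch_results is called with arguments of the wrong type
        raise
    except requests.HTTPError:
        print('You appear to have been blocked by Google')
        raise
    except requests.RequestException:
        print('There appears to be an issue with your connection')
        raise


if __name__ == '__main__':
    results = scrape_google('web scraping', 10, 'en')
    with open('results.csv', 'w', newline='') as output:
        writer = csv.DictWriter(output, fieldnames=['keyword', 'title', 'description', 'link'])
        writer.writeheader()
        writer.writerows(results)
```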

Aiohttp – Background Tasks

Python gets a lot of flak for its performance story. However, the introduction of Asyncio into the standard library goes some way to resolving some of those performance problems. There is now a wide choice of libraries which make use of the new async/await syntax, including a number of server implementations.

The Aiohttp library comes with both a client and a server. However, today I want to focus on the server and one of my favourite features – background tasks. Typically, when building a Python based micro-service with Flask, you might run background tasks in something like Celery. While aiohttp’s background tasks are more limited than Celery tasks, they allow you to run tasks in the background while still receiving requests.

A Simple Example

I have written code which provides a simple example of how you can use such a background task. We are going to write a server that has one endpoint. This endpoint allows a user to post a JSON dictionary containing a URL. The URL is then sent to a thread pool where it is immediately scraped without blocking the user’s request. Once we have the data we need from the URL, it is placed in a queue which is then processed by our background task, which simply posts the data to another endpoint.

Get & Post Requests

For this we are going to need to implement a POST and a GET request helper. Our GET request is going to be run in a thread pool, so we can use the ever-popular requests library to grab our page. However, our POST request is going to be made inside an async background task, so it must itself be asynchronous; otherwise we would end up blocking the event loop.

Both of these small functions are pretty basic and don’t do much in the way of error handling or logging, but are enough to demonstrate the workings of a background task.
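The post’s snippets aren’t reproduced here, so below is a minimal sketch of what the two helpers might look like: a synchronous GET using requests (to be run in a thread pool) and an asynchronous POST using aiohttp’s client. The names and the error handling are my own assumptions.

```python
import requests
from aiohttp import ClientSession


def get_request(url):
    # Runs synchronously inside a thread pool, so plain requests is fine here
    try:
        response = requests.get(url, timeout=10)
        return {'url': url, 'status': response.status_code, 'html': response.text}
    except requests.RequestException:
        return None


async def post_results(data, endpoint):
    # Runs inside the event loop, so it must be asynchronous
    async with ClientSession() as session:
        async with session.post(endpoint, json=data) as response:
            return response.status
```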

Initialising the Server and Setting Up Our Route

We begin by initialising our server class, passing in a port and host. We also define a thread pool, which we will use to run our synchronous GET requests. If you had a long-running CPU-bound task, you could instead use a process pool in much the same way. We then create a queue using deque from the collections module, allowing us to easily append and pop data from our queue. It is this queue that our background task will process. Finally, we have the example endpoint which we will post our data off to.

We then move on to defining our async view. This particular view is very simple: we await the JSON from the incoming request, then attempt to grab the URL from the provided JSON. If the JSON contains a URL, we send the URL to our get_request function, which is executed within the thread pool. This allows us to return a response to the person making the request without blocking. We add a callback which will be executed once the request is completed; the callback simply puts the data in our queue, which will be processed by our background task.
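A sketch of such a view, written here as a plain handler function rather than the post’s class-based setup. It reuses the get_request helper sketched above, and the ‘executor’ and ‘queue’ keys on the app object are my assumptions about where the shared state lives.

```python
import asyncio

from aiohttp import web


async def scrape_handler(request):
    data = await request.json()
    url = data.get('url')
    if not url:
        return web.json_response({'error': 'no url provided'}, status=400)

    loop = asyncio.get_event_loop()
    # Run the blocking get_request in the thread pool so the view can return immediately
    future = loop.run_in_executor(request.app['executor'], get_request, url)
    # Once scraping finishes, push the result onto the queue for the background task
    future.add_done_callback(lambda f: request.app['queue'].append(f.result()))

    return web.json_response({'status': 'queued', 'url': url})
```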

Creating & Registering The Background Task

Our background task is very simple. It is an async function containing a while True loop. Inside this loop we check whether there are any items in the queue to be posted to our dummy server. If there are, we pop these items and make an async POST request. If there are no items, we await asyncio.sleep. This is very important: without the await statement here, we could end up in a situation where our background task never gives up the event loop to incoming server requests.

We then define two async functions which take in our yet-to-be-created app and add the task to the event loop. This allows the background task to be run in the same event loop as the server and to be cancelled when the server itself is shut down.
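Continuing the sketch, the background worker and the start/cleanup hooks might look like this (again, the function names and queue keys are assumptions, and post_results comes from the helper sketch above):

```python
import asyncio


async def queue_worker(app):
    while True:
        if app['queue']:
            item = app['queue'].popleft()
            if item is not None:
                # Post the scraped data off to the example endpoint
                await post_results(item, app['post_endpoint'])
        else:
            # Crucial: yield control back to the event loop when there is nothing to do
            await asyncio.sleep(1)


async def start_background_tasks(app):
    app['queue_worker'] = asyncio.ensure_future(queue_worker(app))


async def cleanup_background_tasks(app):
    app['queue_worker'].cancel()
    try:
        await app['queue_worker']
    except asyncio.CancelledError:
        pass
```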

The Most Complicated Bit: Creating Our App

This part of the code is the most confusing. Our create app function simply returns a web app with our route added to the server’s homepage. In the run app method we then run this application forever within our event loop, appending the tasks which are to be run on start up and shut down of the server. We finally pass our app to the web.run_app function to be run on our specified host and port.
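Pulling the pieces above together, a simplified, function-based version of the wiring (using aiohttp’s on_startup/on_cleanup hooks rather than the post’s exact class) could look like this:

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor

from aiohttp import web


def create_app():
    app = web.Application()
    # Shared state used by the view and the background task
    app['executor'] = ThreadPoolExecutor(max_workers=4)
    app['queue'] = deque()
    app['post_endpoint'] = 'http://localhost:9000/example'  # hypothetical dummy endpoint
    app.router.add_post('/', scrape_handler)
    # Register the background task to start and stop with the server
    app.on_startup.append(start_background_tasks)
    app.on_cleanup.append(cleanup_background_tasks)
    return app


if __name__ == '__main__':
    web.run_app(create_app(), host='127.0.0.1', port=8080)
```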

Complete Code

We now have a simple server which takes requests, deals with them and then processes them in the background. This can be very powerful and can be used to create servers which handle long-running tasks by using background tasks in conjunction with thread and process pools.

Selenium Based Crawler in Python

Today, we are going to walk through creating a basic crawler making use of Selenium.

Why Build A Selenium Web Crawler?

First, we should probably address why you might want to build a web crawler using Selenium. The modern web increasingly uses front-end frameworks such as AngularJS and React, which means much of the data you might want to extract will not be readily available without rendering the page’s JavaScript. In instances like this you should first look into whether the site has an underlying private API that you can easily make use of.

Additionally, you may find some sites which run checks to ensure that users are running JavaScript. While there are other ways to get around this, running Selenium will typically make your crawler look like a real browser instance. This is just one way you can work around scraping detection methods.

While Selenium is really a package designed for testing web pages, we can easily build a web crawler on top of it.

Imports & Class Initialisation

To begin, we import the libraries we are going to need. Only two of the libraries we are using here aren’t contained within Python’s standard library. bs4 and selenium can both be installed using the pip command, and installing them should be relatively pain free.

We then begin by creating and initialising our SeleniumCrawler class, passing a number of arguments to __init__.

Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass ‘https://edmundmartin.com’ as the base_url argument.

We then take a list of any URLs or URL paths we may want to exclude. If we wanted to exclude any dynamic and sign-in pages, we would pass something like [‘?’, ’signin’] as the exclusion list argument. URLs matching these patterns would then never be added to our crawl queue.

We have an output file argument, which is simply the file we will write our crawl results into. And then finally, we have a start URL argument which allows you to start a crawl from a different URL than the site’s base URL.

Getting Pages With Selenium

We then create a get_page method. This simply takes a URL which is passed as an argument and returns the page’s HTML. If we have any issues with a particular page, we simply log the exception and return nothing.

Creating a Soup

This is again a very simple method, which simply checks that we have some HTML and creates a BeautifulSoup object from it. We then use this soup to extract the URLs to crawl and the information we are collecting.

Getting Links

Our get_links method takes our soup and finds all the links which we haven’t previously found. First, we find all the ‘a’ elements which have an ‘href’ attribute. We then check whether these links contain anything from our exclusion list; if the URL should be excluded, we move on to the next ‘href’. We use urljoin with urldefrag to resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue. If the URL matches our base domain, we finally add it to our queue.

Getting Data

We then use our soup again to get the title of the page in question. If we come across any issues getting the title, we simply return the string ‘None’. This method could be expanded to collect any other data you require from the page in question.

Writing to CSV

We simply pass our URL and title to this method and then use the standard library’s csv module to output the data to our target file.

Run Crawler Method

The run crawler method really just brings together all of our already defined methods. While we have unseen URLs, we continue to crawl, taking an element from the left of our queue. We then add it to our crawled list and request the page.

Should the final URL be different from the URL we originally requested, this URL is also added to the crawled list. This means we don’t visit URLs twice when a redirect has been put in place.

We then grab the soup from the HTML and, provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
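Since the post’s code isn’t reproduced in this copy, below is a condensed sketch of the whole SeleniumCrawler class as described above. The method names, the choice of Chrome as the driver and the details of the checks are my assumptions rather than the author’s exact code.

```python
import csv
import logging
from collections import deque
from urllib.parse import urljoin, urldefrag

from bs4 import BeautifulSoup
from selenium import webdriver

logging.basicConfig(level=logging.INFO)


class SeleniumCrawler:

    def __init__(self, base_url, exclusion_list, output_file='results.csv', start_url=None):
        self.browser = webdriver.Chrome()  # assumes chromedriver is on your PATH
        self.base_url = base_url
        self.exclusion_list = exclusion_list
        self.output_file = output_file
        self.start_url = start_url if start_url else base_url
        self.crawled_urls = []
        self.url_queue = deque([self.start_url])

    def get_page(self, url):
        try:
            self.browser.get(url)
            return self.browser.page_source
        except Exception as e:
            logging.exception(e)
            return

    def get_soup(self, html):
        if html is not None:
            return BeautifulSoup(html, 'html.parser')
        return

    def get_links(self, soup):
        for link in soup.find_all('a', href=True):
            # Skip anything matching the exclusion list
            if any(excluded in link['href'] for excluded in self.exclusion_list):
                continue
            # Resolve relative URLs and strip any fragment
            url = urljoin(self.base_url, urldefrag(link['href'])[0])
            if url not in self.crawled_urls and url not in self.url_queue:
                if url.startswith(self.base_url):
                    self.url_queue.append(url)

    def get_data(self, soup):
        try:
            return soup.find('title').get_text().strip()
        except AttributeError:
            return 'None'

    def csv_output(self, url, title):
        with open(self.output_file, 'a', newline='') as output:
            writer = csv.writer(output)
            writer.writerow([url, title])

    def run_crawler(self):
        while self.url_queue:
            current_url = self.url_queue.popleft()
            self.crawled_urls.append(current_url)
            html = self.get_page(current_url)
            # Record the final URL too, so redirected pages are not crawled twice
            if self.browser.current_url != current_url:
                self.crawled_urls.append(self.browser.current_url)
            soup = self.get_soup(html)
            if soup is not None:
                self.get_links(soup)
                self.csv_output(current_url, self.get_data(soup))


if __name__ == '__main__':
    crawler = SeleniumCrawler('https://edmundmartin.com', ['?', 'signin'])
    crawler.run_crawler()
```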

What Can Be Improved?

There are a number of things that can be improved on in this example Selenium crawler.

While I ran a test across over 1,000 URLs, the get_page method may be liable to break. To counter this, it would be recommended to use more sophisticated error handling and to import Selenium’s common exceptions module. Additionally, this method could get stuck waiting forever if JavaScript fails to fully load. It would therefore be recommended to add timeouts on the rendering of JavaScript, which is relatively easy with the Selenium library.

Additionally, this crawler is going to be relatively slow. It’s single threaded and uses bs4 to parse pages, which is relatively slow compared with using lxml. Both of the methods using bs4 could quite easily be changed to use lxml.

The full code for this post can be found on my Github; feel free to fork it, make pull requests and see what you can do with this underlying basic recipe.

Beautiful Soup vs. lxml – Speed

When comparing Python parsing frameworks, you often hear people complaining that Beautiful Soup is considerably slower than using lxml. Thus, some people conclude that lxml should be used in any performance critical project. Having used Beautiful Soup in a large number of web scraping projects and never having had any real trouble with its performance, I wanted to properly measure the performance of the popular parsing library.

The Test

To test the two libraries, I wrote a simple single-threaded crawler which crawls a total of 100 URLs and then simply extracts links and the page title from each page, implementing two different parser methods: one using lxml and one using Beautiful Soup. I also tested the speed of Beautiful Soup with various non-default parsers.

Each of the various setups was tested a total of five times to account for varying internet and server response times, with the results below outlining the performance differences by library and underlying parser.
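The original benchmarking crawler isn’t reproduced here, but a minimal parse-only timing harness along the same lines might look like the following. It times repeated parsing of a single downloaded page rather than a live 100-URL crawl, so it only approximates the methodology; the html5lib case also requires the html5lib package to be installed.

```python
import time

import requests
from bs4 import BeautifulSoup
from lxml import html as lxml_html


def parse_with_lxml(page):
    tree = lxml_html.fromstring(page)
    title = tree.findtext('.//title')
    links = [a.get('href') for a in tree.xpath('//a[@href]')]
    return title, links


def parse_with_bs4(page, parser):
    soup = BeautifulSoup(page, parser)
    title = soup.title.get_text() if soup.title else None
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return title, links


if __name__ == '__main__':
    page = requests.get('https://edmundmartin.com').text
    runs = 100

    start = time.perf_counter()
    for _ in range(runs):
        parse_with_lxml(page)
    print('lxml: {:.2f}s'.format(time.perf_counter() - start))

    for parser in ('html.parser', 'html5lib', 'lxml'):
        start = time.perf_counter()
        for _ in range(runs):
            parse_with_bs4(page, parser)
        print('bs4 ({}): {:.2f}s'.format(parser, time.perf_counter() - start))
```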

The Results

Parser             Run #1  Run #2  Run #3  Run #4  Run #5  Avg. Speed  Overhead Per Page (Seconds)
lxml               38.49   36.82   39.63   39.02   39.84   38.76       N/A
Bs4 (html.parser)  49.17   52.1    45.35   47.38   47.07   48.21       0.09
Bs4 (html5lib)     54.61   53.17   53.37   56.35   54.12   54.32       0.16
Bs4 (lxml)         42.97   43.71   46.65   44.51   47.9    45.15       0.06

As you can see, lxml is significantly faster than Beautiful Soup. A pure lxml solution is several seconds faster than using Beautiful Soup with lxml as the underlying parser. Python’s built-in parser (html.parser) is around 10 seconds slower, while the extremely lenient html5lib is slower still. The overhead per page parsed is nevertheless relatively small, with both Bs4 (html.parser) and Bs4 (lxml) adding less than 0.1 seconds per page parsed.

Parser             Overhead Per Page (Seconds)  100,000 URLs (Extra Hours)  500,000 URLs (Extra Hours)
Bs4 (html.parser)  0.09454                      2.6                         13.1
Bs4 (html5lib)     0.15564                      4.3                         21.6
Bs4 (lxml)         0.06388                      1.8                         8.9

While the overhead seems very low, when you try to scale up a crawler, using Beautiful Soup adds a significant overhead. Even using Beautiful Soup with lxml adds significant overhead when you are trying to scale to hundreds of thousands of URLs. It should be noted that the above table assumes a crawler running in a single thread. Anyone looking to crawl more than 100,000 URLs would be highly recommended to build a concurrent crawler making use of a library such as Twisted, Asyncio or concurrent.futures.

So, whether Beautiful Soup is suitable for your project really depends on the scale and nature of the project. Replacing Beautiful Soup with lxml is likely to see you achieve a small (but considerable at scale) performance improvement. This does however come at the cost of losing the Beautiful Soup API, which makes selecting on-page elements a breeze.

Web Scraping: Avoiding Detection

 

This post avoids the legal and ethical questions surrounding web scraping and simply focuses on the technical aspect of avoiding detection. We are going to look at some of the most effective ways to avoid being detected while crawling/scraping the modern web.

Switching User Agents

Switching or randomly selecting user agents is one of the most effective tactics for avoiding detection. Many sysadmins and IT managers monitor the number of requests made by different user agents. If they see an abnormally large number of requests from one IP and user agent, the decision is a very simple one – block the offending user agent/IP.

You see this in effect when scraping using the standard headers provided by common HTTP libraries. Try requesting an Amazon page using Python requests’ standard headers and you will instantly be served a 503 error.

This makes it key to change up the user agents used by your crawler/scraper. Typically, I would recommend randomly selecting a user agent from a list of commonly used user agents. I have written a short post on how to do this using Python’s requests library.

Other Request Headers

Even when you make the effort to switch up user agents, it may still be obvious that you are running a crawler/scraper. Sometimes other elements of your headers give you away: HTTP libraries tend to send different Accept and Accept-Encoding headers to those sent by real browsers. It can be worth modifying these headers to ensure that you look as much like a real browser as possible.
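For example, with requests you might send a fuller set of browser-like headers rather than just a User-Agent. The values below are a plausible Chrome-style example, not an exhaustive or current browser fingerprint:

```python
import requests

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.8,en;q=0.6',
}

response = requests.get('https://example.com', headers=headers)
```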

Going Slow

Many times when scraping is detected, it’s a matter of having made too many requests in too little time. It’s abnormal for a very large number of requests to be made from one IP in a short space of time, making any scraper or crawler trying to go too fast a prime target. Simply waiting a few seconds between each request will likely mean that you fly under the radar of anyone trying to stop you. In some instances going slower may mean you are not able to collect the data you need quickly enough. If this is the case you probably need to be using proxies.

Proxies

In some situations, proxies are going to be a must. When a site is actively discouraging scraping, proxies make it appear that your requests are coming from multiple sources. This typically allows you to make a larger number of requests than you otherwise would be allowed to make. There are a large number of SaaS companies providing SEOs and digital marketing firms with Google ranking data; these firms frequently rotate and monitor the health of their proxies in order to extract huge amounts of data from Google.
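With requests, routing a request through a randomly chosen proxy is straightforward; the proxy addresses below are placeholders:

```python
import random

import requests

proxies = [
    'http://103.10.10.10:8080',  # placeholder proxy addresses
    'http://109.20.20.20:3128',
    'http://164.30.30.30:8080',
]

proxy = random.choice(proxies)
response = requests.get('https://example.com',
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)
```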

Rendering JavaScript

JavaScript is used pretty much everywhere, and the proportion of users browsing with JavaScript disabled is less than 1%. This means that some sites have looked to block IPs making large numbers of requests without rendering JavaScript. The simple solution is just to render the JavaScript, using a headless browser and a browser automation suite such as Selenium.

Increasingly, companies such as Cloudflare are checking whether users making requests to a site are rendering JavaScript. By using this technique, they hope to block bots making requests to the site in question. However, several libraries now exist which help you get around this kind of protection. Python’s cloudflare-scrape library is a wrapper around the requests library which simply runs Cloudflare’s JavaScript test within a Node environment should it detect that such a protection has been put in place.

Alternatively, you can use a lightweight headless browser such as Splash to do the scraping for you. The specialist headless browser even lets you implement AdBlock Plus rules allowing you to render pages faster and can be used alongside the popular Scrapy framework.

Backing Off

What many crawlers and scrapers fail to do is back off when they start getting served 403 and 503 errors. By simply pressing on and requesting more pages after coming across a batch of error pages, it becomes pretty clear that you are in fact a bot. Slowing down and backing off when you get a bunch of forbidden errors can help you avoid a permanent ban.
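A simple way to implement this with requests is an exponential back-off around each request; the thresholds and delays here are arbitrary examples rather than recommended values:

```python
import time

import requests


def get_with_backoff(url, headers=None, max_retries=5):
    delay = 5  # initial wait in seconds, doubled after each blocked response
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code not in (403, 503):
            return response
        # We appear to have been flagged - slow right down before retrying
        time.sleep(delay)
        delay *= 2
    return None
```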

Avoiding Honeypots/Bot Traps

Some webmasters implement honeypot traps which seek to catch bots by directing them to pages whose sole purpose is to determine that the visitor is a bot. There is a very popular WordPress plugin which simply creates an empty ‘/blackhole/’ directory on your site. The link to this directory is hidden in the site’s footer and is not visible to those using a browser. When designing a scraper or crawler for a particular site, it is worth checking whether any links are hidden from users loading the page with a standard browser.

Obeying Robots.txt

Simply obeying robots.txt while crawling can save you a lot of hassle. While the robots.txt file itself provides no protection against scrapers/crawlers, some webmasters will simply block any IP which makes many requests to pages disallowed within the robots.txt file. The proportion of webmasters who actively do this is relatively small, but obeying robots.txt can definitely save you some significant trouble. If the content you need to reach is blocked off by the robots.txt file, you may just have to ignore the robots.txt file.
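Python’s standard library makes checking robots.txt rules easy via urllib.robotparser; a minimal example (the user agent name here is just a placeholder):

```python
from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.set_url('https://edmundmartin.com/robots.txt')
parser.read()

# Returns True if the given user agent is allowed to fetch the URL
print(parser.can_fetch('MyCrawler', 'https://edmundmartin.com/some-page/'))
```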

Cookies

In some circumstances, it may be worth collecting and holding onto cookies. When scraping services such as Google, the results returned by the search engine can be influenced by cookies. The majority of people scraping Google search results send no cookie information with their requests, which is abnormal from a behavioural perspective. Provided that you do not mind receiving personalised results, it may be a good idea for some scrapers to send cookies along with their requests.

Captchas

Captchas are one of the more difficult anti-scraping measures to crack. Fortunately, captchas are incredibly annoying to real users, which means not many sites use them, and when used they are normally limited to forms. Breaking captchas can either be done via computer vision tools such as tesseract-ocr, or solutions can be purchased from a number of API services which use humans to solve the underlying captchas. These services are available for even the latest Google image reCAPTCHAs, and simply impose an additional cost on the person scraping.

By combining some of the advice above you should be able to scrape the vast majority of sites without ever coming across any issues.

Concurrent Crawling in Python

Python would seem to be the perfect language for writing web scrapers and crawlers. Libraries such as BeautifulSoup, Requests and lxml give programmers solid APIs for making requests and parsing the data given back by web pages.

The only issue is that, by default, Python web scrapers and crawlers are relatively slow. This is due to the issues Python has with concurrency, owing to the language’s GIL (Global Interpreter Lock). Compared with languages such as Golang and runtimes such as NodeJS, building truly concurrent crawlers in Python is more challenging.

This lack of concurrency slows down crawlers, as your script simply idles while it awaits the response from the web server in question. This is particularly frustrating if some of the pages discovered are particularly slow.

In this post we are going to look at three different versions of the same script. The first version lacks any concurrency and simply requests each of the websites one after the other. The second version makes use of concurrent.futures’ ThreadPoolExecutor, allowing us to send concurrent requests by means of threads. Finally, we are going to take a look at a version of the script using asyncio and aiohttp, allowing us to make concurrent requests by means of an event loop.

Non-Concurrent Scraper

A standard crawler/scraper using requests and BeautifulSoup is single threaded. This makes it very slow, as with every request we have to wait for the server to respond before we can carry on with processing the results and moving on to the next URL.

A non-concurrent scraper is the simplest to code and involves the least effort. For many tasks such a crawler/scraper is more than enough for the task at hand.

The below code is an example of a very basic non-concurrent scraper which simply requests each page and grabs its title. It is this code that we will be expanding on during the post.
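The embedded snippet isn’t shown in this copy of the post; a minimal version of such a non-concurrent scraper might look like this (the URL list is just an example):

```python
import requests
from bs4 import BeautifulSoup

URLS = ['https://edmundmartin.com', 'https://www.python.org', 'https://www.example.com']


def get_title(url):
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        return soup.title.get_text() if soup.title else None
    except requests.RequestException:
        return None


if __name__ == '__main__':
    results = {url: get_title(url) for url in URLS}
    print(results)
```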

 

Concurrent Futures

Concurrent.futures is available as part of Python’s standard library and gives Python users a way to make concurrent requests by means of a ThreadPoolExecutor.
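The post’s class isn’t reproduced here, so the following is a rough reconstruction of the approach it describes; the class and method names are assumptions rather than the author’s exact code.

```python
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup


class ThreadedScraper:

    def __init__(self, urls, max_threads=5):
        self.urls = urls
        self.max_threads = max_threads
        self.results = {}

    def _get_page(self, url):
        response = requests.get(url, timeout=10)
        return response.text

    def _parse_title(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        return soup.title.get_text() if soup.title else None

    def _scrape(self, url):
        # Wrapper submitted to the executor: fetch, parse and store the result
        try:
            self.results[url] = self._parse_title(self._get_page(url))
        except requests.RequestException:
            self.results[url] = None

    def run_script(self):
        # Never start more threads than we have URLs
        workers = min(self.max_threads, len(self.urls))
        with ThreadPoolExecutor(max_workers=workers) as executor:
            [executor.submit(self._scrape, url) for url in self.urls]
        print(self.results)


if __name__ == '__main__':
    ThreadedScraper(['https://edmundmartin.com', 'https://www.python.org']).run_script()
```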

In the above example we initialise a class which takes a list of URLs and a maximum number of threads as its initial arguments. The class then has two hidden methods which handle making requests to the provided URLs and parsing the titles from the HTML, returning the results to a dictionary.

These two methods are then placed in a wrapper which is called in our run_script method. This is where we get the ThreadPoolExecutor involved, creating a list of jobs from the URLs passed to the crawler on initialisation. We ensure that we do not start more threads than there are URLs in our list by using Python’s built-in min function. A list comprehension is then used to submit the function and its argument (a URL) to the executor. We then print the results of our simple crawl, which have been collected in a dictionary.

Asyncio & Aiohttp

Asyncio was introduced into the Python standard library in version 3.4. Its introduction seriously improves Python’s concurrency credentials, and there are already a number of community-maintained packages expanding on its functionality. Using Asyncio and Aiohttp is a little more complicated, but offers increased power and even better performance.
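Again the embedded code isn’t included in this copy, so here is a condensed sketch of the asyncio/aiohttp version described below; the queue handling and names approximate the approach rather than reproducing the author’s exact script.

```python
import asyncio

import aiohttp
from bs4 import BeautifulSoup

URLS = ['https://edmundmartin.com', 'https://www.python.org', 'https://www.example.com']
results = {}


async def get_page(session, url):
    async with session.get(url) as response:
        # Await the body so the event loop can service other requests meanwhile
        return await response.text()


async def parse_title(url, html):
    soup = BeautifulSoup(html, 'html.parser')
    results[url] = soup.title.get_text() if soup.title else None


async def handle_task(queue, session):
    while not queue.empty():
        url = await queue.get()
        try:
            html = await get_page(session, url)
            await parse_title(url, html)
        except aiohttp.ClientError:
            results[url] = None


async def main(max_workers=5):
    queue = asyncio.Queue()
    for url in URLS:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        workers = [handle_task(queue, session) for _ in range(max_workers)]
        await asyncio.gather(*workers)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    print(results)
```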

What you will probably immediately notice about the above code is that we have written a number of function definitions prefaced with ‘async’. Python 3.5 introduced this async def syntax as syntactic sugar for the older co-routine decorator that the asyncio library previously relied on.

Every time we want to write a function we intend to run asynchronously, we need to either use the asyncio.coroutine decorator or prepend async to our function definition.

The other noticeable difference is the ‘await’ keyword. When calling an asynchronous function we must ‘await’ the result. This allows other functions to run at the same time without blocking one another. Once we have made the HTTP request, we await the response being read by our client, which allows the event loop to make other outgoing requests.

Our handle task function simply gets a URL from the asyncio queue and then calls our other functions, which make the request and deal with parsing the page. You will notice that when getting an item from the queue we have to await, just as with the calling of all other asyncio functions.

While it looks more complicated, the event loop function begins by creating a queue and enqueuing our URL list. We then establish an event loop and use a list comprehension to pass items from the queue to our main function. Finally, we pass this to the event loop, which handles the execution of our code until there are no more URLs to handle.

Speed Comparisons

 

URLs      No-Concurrency     Concurrent Futures  Asyncio & Aiohttp
5 URLs    4.021 seconds      1.098 seconds       1.3197 seconds
50 URLs   79.2116 seconds    28.82 seconds       31.5012 seconds
100 URLs  157.5677 seconds   60.1970 seconds     45.4405 seconds

Running the above scripts using five threads where applicable, we can see that both of the concurrent scripts are far faster than our GIL-blocked example, and that at any larger scale you would be recommended to go with a concurrent script.

Log File Parsing for SEOs

What is a log file?

Every time a request is made to a server this request is logged in the site’s log files. This means that log files record every single request whether it is made by a bot or by a real visitor to your site.

These logs record the following useful information:

  • IP Address of the visitor
  • The Date & Time of the request
  • The resource requested
  • The status code of this resource
  • The number of bytes downloaded

Why does this matter to SEOs?

SEOs need to know how search engine spiders are interacting with their sites. Google Search Console provides some limited data, including the number of pages visited and the amount of information downloaded. But this limited information doesn’t give us any real insight into how Googlebot is interacting with different templates and different directories.

Looking at this log file data often turns up some interesting insights. When looking at one client’s site, we discovered that Googlebot was spending a large portion of its time crawling pages which drove only 3% of overall organic traffic. These pages were essentially eating massively into the site’s overall crawl budget, leading to a decrease in overall organic traffic. This data point would simply not be available to someone who only looked at the top-level crawl data contained in Google’s Search Console.

What Can We Discover in Log Files

Log files allow us to discover what resources are being requested by Google. This can help us achieve the following:

  • See how frequently important URLs are crawled
  • See the full list of errors discovered by Googlebot during crawls
  • See whether Google is fully rendering pages using front-end frameworks such as React and AngularJS
  • See whether site structure can be improved to ensure commercially valuable URLs are crawled
  • Check implementation of Robots.txt rules, ensuring blocked pages are not being crawled and that certain page types are not being unexpectedly blocked.

Options Available 

There are a number of different options available for SEOs who want to dive into their log file data.

Analysing Data in Excel

As log files are essentially space-separated text files, it’s possible to open them up in Excel and analyse them there. This can be fine when you are working with a very small site, but with bigger sites it isn’t a viable option. Just verifying whether traffic is really from Googlebot can be quite a pain, and Excel is not really designed for this type of analysis.

Commercial Options

There are a number of commercial options available for those who want to undertake analysis of their log files. Botify offer a log file analysis service to users, depending on their subscription package. Of course, the Botify subscription is very pricey and is not going to be a viable option for many SEOs.

Screaming Frog also offer a log file parsing tool, which is available for £99 a year. As with Screaming Frog’s SEO Spider the mileage you will get with this tool really depends on the size of your available RAM and how big the site you are dealing with is.

Parsing Log Files with Python

Log files are plain text files and can easily be parsed by Python. The valuable information contained on each line of a log file is delimited by spaces. This means that we are able to run through the file line by line and pull out the relevant data.

There are a couple of challenges when parsing a log file for SEO purposes:

  • We need to pull out Googlebot results and verify that they actually come from a Google IP address.
  • We need to normalise date formats to make our analysis of our log files easier.

I have written a Python script which deals with both of these issues and outputs the underlying data to an SQLite database. Once we have the data in SQL format, we can either export it to a CSV file for further analysis or query the database to pull out specific information.
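The script itself isn’t reproduced in this copy of the post, but a simplified sketch of the approach – splitting each line, verifying Googlebot hits with a reverse (and forward) DNS lookup, normalising the date and writing rows to SQLite – might look like the following. The log format assumed is the common Apache/Nginx combined format, so the field positions are assumptions.

```python
import glob
import socket
import sqlite3
from datetime import datetime


def is_googlebot(ip):
    # Verify the hit really came from Google via reverse then forward DNS lookup
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not (host.endswith('.googlebot.com') or host.endswith('.google.com')):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False


def parse_logs(extension, domain):
    conn = sqlite3.connect('{}-log-data.db'.format(domain))
    conn.execute('CREATE TABLE IF NOT EXISTS hits '
                 '(ip TEXT, date TEXT, resource TEXT, status TEXT, bytes TEXT)')
    for log_file in glob.glob('*.{}'.format(extension)):
        with open(log_file) as f:
            for line in f:
                fields = line.split(' ')
                if 'Googlebot' not in line or not is_googlebot(fields[0]):
                    continue
                # Normalise the date, e.g. [10/Oct/2017:13:55:36 +0000] -> 2017-10-10
                raw_date = fields[3].lstrip('[')
                date = datetime.strptime(raw_date, '%d/%b/%Y:%H:%M:%S').date().isoformat()
                conn.execute('INSERT INTO hits VALUES (?, ?, ?, ?, ?)',
                             (fields[0], date, fields[6], fields[8], fields[9]))
    conn.commit()
    conn.close()


parse_logs('example-extension', 'exampledomain.com')
```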

Our Script

The above script can be used without any Python knowledge, by simply moving the script to the directory where you have downloaded your log files. On the last line of the script, the user changes out ‘example-extension’ for the extension their log files have been saved with and updates ‘exampledomain.com’ with the domain in question.

As the above script is not reliant on libraries outside of Python’s standard library, anyone with Python should be able to save the script and simply run it.

Scraping difficult sites using private APIs

The Problem With The Modern Web

The widespread use of front-end JavaScript frameworks such as AngularJS and React is making the web more difficult to scrape using traditional techniques. When the content we want to access is rendered after the initial request, simply making an old-fashioned HTTP request and then parsing the resulting content is not going to do us much good.

Browser Automation

Typically, those who are struggling to scrape data from ‘difficult to scrape’ sites resort to browser automation.  There are a myriad of different tools which allow developers to automate a browser environment. iMacros and Selenium are among the most popular tools used to extract data from these ‘difficult’ sites. A significant number of popular programming languages are supported by various ‘Selenium’ drivers. In the Python community, the standard response is that users should simply use Selenium to automate the browser of their choice and collect data that way.

Automating the browser presents its own challenges. Setting appropriate timeouts and ensuring that some unexpected error doesn’t stop your crawl in its tracks can be quite the challenge. In a lot of cases we can avoid the tricky task of browser automation and simply extract our data by leveraging underlying APIs.

Making Use of Private APIs


Ryan Mitchell, the author of ‘Web Scraping with Python‘, gave a very good talk on this very subject at DefCon 24. She talked extensively about how the task of scraping ‘difficult’ websites can be avoided by simply leveraging the underlying APIs which power these modern web applications. She provided one specific example of a site that a client had asked her to scrape, which had its own underlying API.

The site used by Ryan Mitchell as an example was the ‘Official Crossfit Affiliates Map’ site, which provides visitors to the Crossfit site with a way to navigate around a map containing the location of every single Crossfit affiliate. Anyone wanting to extract this information by automating the browser would likely have a torrid time trying to zoom and click on the various points on the map.

The site is in fact powered by a very simple unprotected API. Every time an individual location is clicked, this triggers a request to the underlying API, which returns a JSON file containing all of the required information. In order to discover this API endpoint, all one needs to do is have the developer console open while playing around with the map.

This makes our job particularly easy, and we can collect the data from the map with a simple script. All we need to do is iterate through each of the IDs and then extract the required data from the returned JSON. We can then store this data in a CSV file, or in a simple SQLite database. This is just one specific example, but there are many other sites where it is possible to pull useful information via a private API.
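As an illustration of the general pattern (the endpoint URL, the ID range and the JSON field names below are entirely hypothetical placeholders, not the real Crossfit API):

```python
import csv

import requests

API_URL = 'https://www.example.com/api/locations/{}'  # hypothetical endpoint

with open('locations.csv', 'w', newline='') as output:
    writer = csv.writer(output)
    for location_id in range(1, 501):
        response = requests.get(API_URL.format(location_id))
        if response.status_code != 200:
            continue
        data = response.json()
        # Field names are assumptions for the sake of the example
        writer.writerow([location_id, data.get('name'), data.get('address')])
```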

Instagram’s Private API

The Instagram web application also has a very easy to access API. Again, discovering these API endpoints is not particularly difficult: you simply have to browse around the Instagram site with your console open to see the calls made to the private Instagram API.

Once you are logged into Instagram, you are able to make requests to this API and receive significant amounts of data in return. This is particularly useful as the official Instagram API requires significant amounts of approval and vetting before it can be used beyond a limited number of trial accounts.

When using the browser with the developer console active, you will notice a number of XHR requests pop up in your console. Many of these can be manipulated to pull out useful and interesting data points.

https://www.instagram.com/graphql/query/?query_id=17851374694183129&id=18428658&first=4000

The above URL contains a user’s unique Instagram ID and allows us to pull out the user’s most recent 4,000 followers. This can be used to discover and target individuals who follow a particular Instagram user. Should you be logged into Instagram’s web platform, the above URL should return Kim Kardashian’s 4,000 most recent followers.
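Programmatically, you would make the same request with your logged-in session cookies attached. The cookie values below are placeholders you would need to copy from your own browser session, and the structure of the returned JSON is left unexamined here:

```python
import requests

QUERY_URL = ('https://www.instagram.com/graphql/query/'
             '?query_id=17851374694183129&id=18428658&first=4000')

# Copy these values from your own logged-in browser session (placeholders shown)
cookies = {'sessionid': 'YOUR_SESSION_ID', 'csrftoken': 'YOUR_CSRF_TOKEN'}

response = requests.get(QUERY_URL, cookies=cookies,
                        headers={'User-Agent': 'Mozilla/5.0'})
print(response.json())
```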

There are a number of other useful APIs easily accessible from the web application, but I will let the reader explore these for themselves.

Takeaways

Should you be looking to extract data from ‘difficult’ to scrape sites, it’s definitely worth looking for underlying private APIs. Often the data you need can be accessed from these API endpoints. Grabbing this well-formatted and consistent data will save you a lot of time and avoids some of the headaches associated with browser automation.

Random User-Agent in Requests (Python)

When using the Python requests library to extract data from websites, you may well want to minimise the chances of your scraping activities being detected.

Setting a Custom User-Agent

To lower the chances of detection it is often recommended that users set a custom User-Agent header. The requests library makes it very easy to set a custom user agent. Often this is enough to avoid detection, with system administrators only looking for default user agents when adding server-side blocking rules.
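For instance, overriding the default user agent takes only a couple of lines with requests (the UA string here is just an example):

```python
import requests

headers = {'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')}
response = requests.get('https://example.com', headers=headers)
```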

Setting a Random User-Agent

If engaged in commercially sensitive scraping, you may want to take additional precautions and randomise the User-Agent sent with each request.
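The snippet referred to below isn’t embedded in this copy of the post, so here is a reconstruction of the idea: a short list of common desktop user agents paired with Chrome’s default Accept header. The exact strings are illustrative and will date quickly.

```python
import random

desktop_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:56.0) Gecko/20100101 Firefox/56.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36 Edge/16.16299',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0',
]


def random_headers():
    # Pair a random desktop user agent with Chrome's default Accept header
    return {'User-Agent': random.choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
```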

The above snippet of code returns a random user agent and Chrome’s default ‘Accept’ header. When writing this code snippet I took care to include the ten most commonly used desktop browsers. It would probably be worth updating the browser list from time to time, to ensure that the user agents included in the list are up to date.

I have seen others loading a large list of hundreds and hundreds of user-agents. But I think this approach is misguided as it may see your crawlers make thousands of requests from very rarely used user agents.

Anyone looking closely at the ‘Accept’ headers will quickly realise that all of the different user agents are using the same ‘Accept’ header. Thankfully, the majority of system administrators completely ignore the intricacies of the sent ‘Accept’ headers and simply check whether browsers are sending something plausible. Should it really be necessary, it would also be possible to send an accurate ‘Accept’ header with each request. I have never personally had to resort to this extreme measure.