The Problem With The Modern Web
Typically, those who struggle to scrape data from ‘difficult to scrape’ sites resort to browser automation. There are a myriad of tools which allow developers to automate a browser environment, with iMacros and Selenium among the most popular for extracting data from these ‘difficult’ sites. Selenium drivers are available for most popular programming languages, and in the Python community the standard response is that users should simply use Selenium to automate the browser of their choice and collect data that way.
Automating the browser presents its own challenges. Setting appropriate timeouts and ensuring that an unexpected error doesn’t stop your crawl in its tracks can be quite difficult. In a lot of cases we can avoid the tricky task of browser automation entirely and simply extract our data by leveraging the underlying APIs.
Making Use of Private APIs
Ryan Mitchell, the author of ‘Web Scraping with Python’, gave a very good talk on this very subject at DEF CON 24. She talked extensively about how the task of scraping ‘difficult’ websites can be avoided by simply looking to leverage the underlying APIs which power these modern web applications. She provided one specific example of a site that a client had asked her to scrape, which had its own underlying API.
The site used by Ryan Mitchell as an example was the ‘Official Crossfit Affiliate Map’ site, which provides visitors to the Crossfit site with a way to navigate around a map containing the location of every single Crossfit affiliate. Anyone wanting to extract this information by automating the browser would likely have a torrid time trying to zoom and click on the various points on the map.
The site is in fact powered by a very simple, unprotected API. Every time an individual location is clicked, a request is triggered to the underlying API, which returns a JSON response containing all of the required information. In order to discover this API endpoint, all one needs to do is keep the developer console open while playing around with the map.
This makes our job particularly easy, and we can collect the data from the map with a simple script. All we need to do is iterate through each of the IDs and then extract the required data from the JSON. We can then store this data as CSV, or in a simple SQLite database. This is just one specific example, but there are many other sites where it is possible to pull useful information via a private API.
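A minimal sketch of that iterate-and-extract approach might look like the following. Note that the endpoint URL and the JSON field names here are placeholders for illustration, not the real Crossfit API:

```python
import csv
import json
import urllib.error
import urllib.request

# Hypothetical endpoint pattern -- substitute the real URL you find
# in the developer console.
BASE_URL = "https://example.com/api/affiliates/{}.json"

def flatten_record(record):
    """Pick out the fields we care about from one affiliate's JSON record.
    These field names are assumptions standing in for the real schema."""
    return {
        "id": record.get("id"),
        "name": record.get("name"),
        "city": record.get("city"),
        "country": record.get("country"),
    }

def scrape_to_csv(first_id, last_id, out_path):
    """Iterate through each ID, fetch the JSON, and append a row to a CSV."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "city", "country"])
        writer.writeheader()
        for affiliate_id in range(first_id, last_id + 1):
            try:
                with urllib.request.urlopen(
                    BASE_URL.format(affiliate_id), timeout=10
                ) as resp:
                    record = json.load(resp)
            except urllib.error.HTTPError:
                continue  # some IDs won't exist; skip them rather than crash
            writer.writerow(flatten_record(record))
```

Keeping the JSON-flattening step in its own function makes it easy to adjust once you see the real response shape in the console.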
Instagram’s Private API
The Instagram web application also has a very easy-to-access API. Again, discovering these API endpoints is not particularly difficult: you simply have to browse around the Instagram site with your console open to see the calls made to the private Instagram API.
Once you are logged into Instagram, you are able to make requests to this API and receive significant amounts of data in return. This is particularly useful, as the official Instagram API requires significant approval and vetting before it can be used beyond a limited number of trial accounts.
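One common way to make those logged-in requests from a script is to copy your session cookie out of the developer console and attach it to every request, so the API sees you as the same browser session. A minimal sketch (the cookie value is whatever your own logged-in session shows; the header names are standard HTTP, but treat the overall approach as an assumption about how the site checks sessions):

```python
import urllib.request

def build_opener_with_cookie(cookie_header):
    """Return an opener that sends the given Cookie header with every
    request, mimicking the logged-in browser session."""
    opener = urllib.request.build_opener()
    opener.addheaders = [
        ("Cookie", cookie_header),
        # Many sites reject requests without a browser-like User-Agent.
        ("User-Agent", "Mozilla/5.0"),
    ]
    return opener
```

You would then call `opener.open(url)` instead of `urllib.request.urlopen(url)`, and every request carries your session.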
When using the browser with the developer console active, you will notice a number of XHR requests pop up in the console. Many of these can be manipulated to pull out useful and interesting data points.
The above URL contains the user’s unique Instagram ID and allows you to pull out that user’s most recent 4,000 followers. This can be used to discover and target individuals who follow a particular Instagram user. Should you be logged into Instagram’s web platform, the above URL should return Kim Kardashian’s 4,000 most recent followers.
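Endpoints like this typically return results a page at a time, with a cursor pointing at the next page. It helps to keep the pagination logic separate from the HTTP call itself, which also makes it easy to test without touching the network. The response shape below (`items`, `next_cursor`) is a hypothetical schema standing in for whatever the real endpoint returns:

```python
def paginate(fetch_page, max_pages=100):
    """Follow a cursor-paginated API until it runs out of pages.

    fetch_page(cursor) should return a dict shaped like
    {"items": [...], "next_cursor": <str or None>} -- an assumed
    schema, not Instagram's actual one. Yields items one at a time.
    """
    cursor = None
    for _ in range(max_pages):  # hard cap so a buggy API can't loop forever
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break
```

Plugging in a real `fetch_page` is then just a matter of formatting the URL with the cursor and parsing the JSON response.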
There are a number of other useful APIs easily accessible from the web application, but I will let the reader explore these for themselves.
Should you be looking to extract data from ‘difficult to scrape’ sites, it’s definitely worth looking for underlying private APIs. Often the data you need can be accessed from these API endpoints. Grabbing this well-formatted and consistent data will save you a lot of time and avoid some of the headaches associated with browser automation.