Scraping JavaScript Heavy Pages with Python & Splash

Scraping the modern web can be particularly challenging. These days many websites use JavaScript frameworks to serve much of a page's important content. This breaks traditional scrapers, as we are unable to extract the information we need from the initial HTTP response alone.

So what should we do when we come across a site that makes extensive use of JavaScript? One option is to use Selenium, which provides an easy-to-use API for automating a web browser. This is great for tasks where we need to interact with the page, whether that be to scroll or to click certain elements. It is, however, a bit over the top when you simply want to render JavaScript.

Introducing Splash

Splash is a JavaScript rendering service from the creators of the popular Scrapy framework. Splash can be run as a server on your local machine. The server, built with Python and Twisted, lets us scrape pages through its HTTP API. This means we can render JavaScript pages without the need for a full browser. The use of Twisted also means the service is asynchronous, so it can render several pages at the same time.

Installing Splash

Full instructions for installing Splash can be found in the Splash docs. That being said, it is highly recommended that you run Splash with Docker, which makes starting and stopping the server very easy (docker run -p 8050:8050 scrapinghub/splash).
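To give a flavour of the HTTP API before we build the crawler, the snippet below asks a locally running Splash instance (on the default port 8050) to render a page via its render.html endpoint and prints the post-JavaScript HTML. It is a minimal sketch; the example URL and the wait time are placeholders.

```python
import requests

# Ask the local Splash server to render a page and return the final HTML.
# render.html is Splash's simplest endpoint; 'wait' gives JavaScript time to run.
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "http://example.com", "wait": 2},
)
print(response.text)  # HTML after JavaScript has executed
```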

Building A Custom Python Crawler With Splash

Splash was designed to be used with Scrapy and Scrapinghub, but it can just as easily be used with plain Python. In this example we are going to build a multi-threaded crawler using Requests and Beautiful Soup. We are going to scrape an e-commerce website which uses a popular JavaScript library to load product information on its category pages.

Imports & Class Initialisation

To write this scraper we are only going to use two libraries outside of the standard library. If you have ever done any web scraping before, you are likely to have both Requests and BeautifulSoup installed already. Otherwise go ahead and grab them with pip (pip install requests beautifulsoup4).

We then create a SplashScraper class. Our crawler takes only one argument: the URL we want to begin our crawl from. We use urllib.parse to build a string holding the site's root URL, which we later use to stop our crawler from scraping pages outside our base domain.

One of the main selling points of Splash is the fact that it is asynchronous. This means we can render multiple pages at a time, making our crawler significantly more performant than a standalone instance of Selenium. To make the most of this we are going to use a thread pool to scrape pages, allowing us to make up to twenty simultaneous requests.

We create a queue which we are going to use to hold URLs waiting to be sent to the thread pool. We then create a set to keep track of all the pages we have already queued. Finally, we put the base URL into our queue, ensuring we start crawling from the base URL.
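Putting those pieces together, the imports and class initialisation might look something like the sketch below. The attribute names (root_url, to_crawl, scraped_pages and so on) are my own choices rather than the exact names used in the original script.

```python
import csv
from queue import Queue, Empty
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse, urljoin

import requests
from bs4 import BeautifulSoup


class SplashScraper:

    def __init__(self, base_url):
        self.base_url = base_url
        # Root URL (scheme + domain) used to keep the crawl on the target site
        parsed = urlparse(base_url)
        self.root_url = '{}://{}'.format(parsed.scheme, parsed.netloc)
        # Thread pool allowing up to twenty pages to be rendered at once
        self.pool = ThreadPoolExecutor(max_workers=20)
        # URLs waiting to be crawled, plus a record of pages already queued
        self.to_crawl = Queue()
        self.scraped_pages = set()
        self.to_crawl.put(self.base_url)
```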

Extracting Links & Parsing Page Data

Next we define two methods to use with our scraped HTML. Firstly, we take the HTML and extract all the links which contain an href attribute. We iterate over this list of links, pulling out the href value. If the URL starts with a slash or with the site's root URL, we call urljoin from urllib.parse to build an absolute link from the two strings. If we haven't already crawled the page, we add the URL to the queue.
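A link-extraction method along those lines could look like this. It is a sketch of the logic described above; the method name and the html.parser backend are assumptions on my part.

```python
# (method of the SplashScraper class defined above)
def parse_links(self, html):
    # Pull every anchor tag that carries an href attribute
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', href=True):
        url = link['href']
        # Only follow internal links, turning relative paths into absolute URLs
        if url.startswith('/') or url.startswith(self.root_url):
            url = urljoin(self.root_url, url)
            if url not in self.scraped_pages:
                self.to_crawl.put(url)
```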

Our scrape_info method simply takes the rendered HTML and scrapes certain information from it. We use some relatively rough logic to pull out name and price information before writing it to a CSV file. This method can be overridden with custom logic to pull out the particular information you need.
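A sketch of such a method is shown below. The selectors ('product', 'h3', 'price') and the output file name are hypothetical placeholders and will need adjusting to whatever markup the target site actually uses.

```python
# (method of the SplashScraper class defined above)
def scrape_info(self, html):
    soup = BeautifulSoup(html, 'html.parser')
    # Rough, site-specific logic: the selectors below are placeholders
    with open('products.csv', 'a', newline='') as output:
        writer = csv.writer(output)
        for product in soup.find_all('div', class_='product'):
            name = product.find('h3')
            price = product.find('span', class_='price')
            if name and price:
                writer.writerow([name.get_text(strip=True),
                                 price.get_text(strip=True)])
```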

Grabbing A Page & Defining Our Callback

When using a thread pool executor, one of the best ways to get the result out of a function run in a thread is to use a callback, which runs once the function in the thread has completed. We define a very simple callback that unpacks our result and checks whether the page gave us a 200 status code. If it did, we run both our parse_links and scrape_info methods on the page's HTML.
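Such a callback, attached to the future returned by the executor, might look like the sketch below; the method name post_scrape_callback is my own.

```python
# (method of the SplashScraper class defined above)
def post_scrape_callback(self, res):
    # res is the Future returned by the executor; result() yields the response
    result = res.result()
    if result and result.status_code == 200:
        self.parse_links(result.text)
        self.scrape_info(result.text)
```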

Our scrape_page function is very simple. As we are only making a request to a server running locally, we skip any error handling. We simply pass in a URL, which is formatted into the request to Splash, and return the response object, which is then used in the callback defined above.
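Assuming Splash is running on the default port, the request might be put together like this; the wait value is an assumption.

```python
# (method of the SplashScraper class defined above)
def scrape_page(self, url):
    # Pass the target URL to the local Splash instance's render.html endpoint
    # and hand back the response containing the post-JavaScript HTML.
    res = requests.get(
        'http://localhost:8050/render.html',
        params={'url': url, 'wait': 2},
    )
    return res
```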

Our Crawling Method

Our run_scraper method is effectively our main thread. We repeatedly try to get links from our queue. In this particular example we set a timeout of 120 seconds, meaning that if we are unable to grab a new URL from the queue within that time, an Empty exception is raised and the program exits. Once we have a URL, we check that it is not in our set of already scraped pages before adding it to that set. We then send the URL off for scraping and set our callback to run once the scrape has completed. We ignore any other exception and carry on until we have run out of pages we haven't seen before.
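The main loop could be sketched as follows, with a minimal usage example underneath. The 120-second timeout comes straight from the description above, while the entry-point URL is just a placeholder.

```python
# (method of the SplashScraper class defined above)
def run_scraper(self):
    while True:
        try:
            # Wait up to 120 seconds for a new URL before giving up
            target_url = self.to_crawl.get(timeout=120)
            if target_url not in self.scraped_pages:
                self.scraped_pages.add(target_url)
                # Render the page in the thread pool, process it in the callback
                job = self.pool.submit(self.scrape_page, target_url)
                job.add_done_callback(self.post_scrape_callback)
        except Empty:
            return
        except Exception:
            # Ignore any other error and keep crawling
            continue


if __name__ == '__main__':
    scraper = SplashScraper('http://www.example.com')  # placeholder start URL
    scraper.run_scraper()
```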

The script in its entirety can be found here on GitHub.
