Ultimate Introduction to Web Scraping in Python: From Novice to Expert

Python is one of the most accessible fully-featured programming languages, which makes it a perfect choice for those looking to learn to program. This post aims to introduce the reader to web scraping, allowing them to build their own scrapers and crawlers to collect data from the internet.

Contents

  1. Introduction to Web Scraping
  2. Making HTTP Requests with Python
  3. Handling HTTP Errors
  4. Parsing HTML with BeautifulSoup

MORE TO COME

Introduction to Web Scraping

Web scraping, sometimes referred to as screen scraping, is the practice of using programs to visit websites and extract information from them. This allows users to collect information from the web programmatically, as opposed to manually visiting a page and copying the information into some sort of data store. At their core, major search engines such as Google and Bing make use of web scraping to extract information from millions of pages every day.
Web scraping has a wide range of uses, including but not limited to fighting copyright infringement, collecting business intelligence, collecting data for data science, and for use within the fintech industry. This mega post is aimed at teaching you how to build scrapers and crawlers which will allow you to extract data from a wide range of sites.

This post assumes that you have Python 3.5+ installed and that you have learnt how to install libraries via pip. If not, now would be a good time to Google ‘how to install python’ and ‘how to use pip’. Those familiar with the requests library may want to skip ahead several sections.

Making HTTP Requests with Python

When accessing a website our browser makes a number of HTTP requests in the background. The majority of internet users aren’t aware of the number of HTTP requests required to access a web page. The first request loads the page itself, and the browser then makes additional requests for resources loaded by the page, such as images, videos and style sheets. You can see a breakdown of these requests by opening up your browser’s developer tools and navigating to the ‘Network’ tab.

The majority of requests made to a website use the ‘GET’ method. As the name suggests, a ‘GET’ request attempts to retrieve the content available at the specified address. HTTP supports a variety of other methods such as ‘POST’, ‘DELETE’, ‘PUT’ and ‘OPTIONS’. These methods are sometimes referred to as HTTP verbs, and we will discuss them later.

Python’s standard library contains a module, urllib.request, which allows us to make HTTP requests. While this module is perfectly functional, its interface is not particularly friendly. In this mega post we are going to make use of the requests library, which provides a much friendlier interface and can be installed using the command below.
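```
pip install requests
```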

Making an HTTP request with Python can be done in a couple of lines. Below we demonstrate how to make a request and walk through the code line by line.
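A minimal version of that snippet, matching the walkthrough that follows, might look like this:

```python
import requests

# Make a GET request to the page and store the result
response_object = requests.get('http://edmundmartin.com')

# Print the text (HTML) of the response
print(response_object.text)
```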

First, we import the requests library, which gives us access to the functions contained within the library. We then make an HTTP request to ‘http://edmundmartin.com’ using the ‘GET’ verb, by calling the get method contained within the requests library. We store the result of this request in a variable named ‘response_object’. The response object contains a number of pieces of information that are useful when scraping the web. Here, we access the text (HTML) of the response, which we print to the screen. Provided the site is up and available, users running this script should be greeted with a wall of HTML.

Handling HTTP Errors

When making HTTP requests there is significant room for things to go wrong. Your internet connection may be down, or the site in question may not be reachable. When scraping the internet we typically want to handle these errors and continue on without crashing our program. For this we are going to write a function which will allow us to make an HTTP request and deal with any errors. Additionally, by encapsulating this logic within a function we can reuse our code with greater ease, simply calling the function every time we want to make an HTTP request.

The code below is an example of a function which makes a request and deals with a number of the common errors we are likely to encounter. The function is explained in more detail afterwards.
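Here is a sketch of such a function; the exact wording of the log messages is just illustrative.

```python
import logging

import requests


def get_request(url):
    """Make a GET request to the passed URL, returning None if anything goes wrong."""
    try:
        response = requests.get(url)
        # Raise an exception if the server returned a bad status code
        response.raise_for_status()
        return response
    except requests.HTTPError:
        logging.error('Received a bad status code from: {}'.format(url))
    except requests.ConnectionError:
        logging.error('Could not connect to: {}'.format(url))
    except requests.RequestException:
        logging.error('Something went wrong when requesting: {}'.format(url))
    return None


if __name__ == '__main__':
    result = get_request('http://edmundmartin.com')
    if result:
        print(result.text)
```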

Our basic get_request function takes one argument, the string of the URL we want to retrieve. We then make the request just as before. This time, however, our request is wrapped in a try and except block, which allows us to catch any errors should something go wrong. After the request we then check the status code of our response. Every time you make a request the server in question will respond with a code indicating whether the request has been a success or not. If everything went fine you will receive a 200 status code; otherwise you are likely to receive something like a 404 (‘Page Not Found’) or 503 (‘Service Unavailable’). By default, the requests library does not throw an error when a web server responds with a bad status code, but rather continues silently. By calling raise_for_status we force an error should we receive a bad status code. If no error is thrown, we then return our response object.

If all did not go so well, we then handle our errors. Firstly, we check whether the page responded with a bad status code, by catching requests.HTTPError. We then check whether the request failed due to a bad connection, by catching the requests.ConnectionError exception. Finally, we use the generic requests.RequestException to catch all other exceptions that can be thrown by the requests library. The ordering of our exception handlers is important: requests.RequestException is the most generic and would catch either of the other exceptions. Had it been the first exception handled, the other except blocks would never run, regardless of the reason for the exception.

When handling each exception, we use the standard library’s logging module to log a message describing what went wrong when making the request. This is very handy and is a good habit to get into, as it makes debugging programs much easier. If an exception is caught we return None from our function, which we can then check for later. Otherwise we return the response object.

At the bottom of the script, I have provided a simple example of how this function could be used to print out a page’s HTML response.

Parsing HTML with BeautifulSoup

So far, everything we have done has been rather boring and not particularly useful, as we have just been making requests and then printing the HTML. We can, however, do much more interesting things with our responses.

This is where BeautifulSoup comes in. BeautifulSoup is a library for parsing HTML, allowing us to easily extract the elements of the page that we are most interested in. While BeautifulSoup is not the fastest way to parse a page, it has a very beginner-friendly API. The BeautifulSoup library can be installed using the following command:
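```
pip install beautifulsoup4
```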

The code below expands on the code we wrote in the previous section and actually uses our response for something.  A full explanation can be found after the code snippet.
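Here is a sketch of that snippet, assuming the built-in ‘html.parser’ backend; get_request is the function from the previous section.

```python
from bs4 import BeautifulSoup

# get_request is the function we wrote in the previous section
# (its definition is omitted here for brevity)

response = get_request('http://edmundmartin.com')

if response:
    # Parse the HTML of the response, naming the underlying parser to use
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find every 'h2' element with the class 'entry-title'
    titles = soup.find_all('h2', {'class': 'entry-title'})
    for title in titles:
        print(title.get_text())
```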

The code snippet above uses the same get_request function as before, which I have omitted for the sake of brevity. Firstly, we must import the BeautifulSoup library; we do this by adding the line ‘from bs4 import BeautifulSoup’. This gives us access to the BeautifulSoup class, which is used for parsing HTML responses. We then generate a ‘soup’ by passing our HTML to BeautifulSoup; here we also pass a string signifying the underlying HTML parser to be used. This argument is not required, but BeautifulSoup will print a rather long-winded warning should you omit it.

Once the soup has been created we can use the ‘find_all’ method to discover all of the elements matching our search. The soup object also has a ‘find’ method, which returns only the first element matching our search. In this example, we first pass in the name of the HTML element we want to select. In this case it’s the heading 2 element, represented in HTML by ‘h2’. We then pass a dictionary of attributes to match. On my blog all article titles are ‘h2’ elements with the class ‘entry-title’. This class attribute is used by CSS to make the titles stand out from the rest of the page, but it also helps us select exactly the elements of the page we want.

The ‘find_all’ method returns a list of the matching title elements. We can then write a for loop which goes through each of these titles and prints its text by calling the get_text() method. A note of caution: if nothing matches, ‘find_all’ returns an empty list, whereas ‘find’ returns None, and calling get_text() on None will throw an exception. Should everything run without any errors, the code snippet above should print the titles of the ten most recent articles from my website. This is all that is really required to get started with extracting information from websites, though picking the correct selector can take a little bit of work.

In the next section we are going to write a scraper which will extract information from Google, using what we have learnt so far.
