Scraping Baidu with Python

 

What’s Baidu?

Baidu is China’s largest search engine and has been since Google left the market in 2010. As companies look to move into the Chinese market, there has been growing interest in scraping search results from Baidu.

Scraping Baidu

Scraping Baidu is a relatively simple task. There is only one minor challenge: the final URLs displayed on the Baidu results page are found nowhere in the HTML. Baidu links to the sites on the search results page via its own redirector service, so in order to get the full final URL we have to follow these redirects. In this post we are going to walk through how to scrape the Baidu search results page.

Imports & Class Definition

In order to scrape Baidu, we only need to import two libraries outside of the standard library. BeautifulSoup (bs4) helps us parse HTML, while requests gives us a nicer interface for making HTTP requests in Python.

As we are going to scrape multiple pages of Baidu in this tutorial, we are going to initialise a class to hold onto the important information for us.

We initialise a new BaiduBot class with a search term and the number of pages to scrape. We also give ourselves the ability to pass a number of keyword arguments to our class. This allows us to pass a proxy, a custom connection timeout, a custom user agent and an optional delay between each results page we want to scrape. These keyword arguments can be a great help if we end up being blocked by Baidu. When initialising the class we also store our base URL, which we use when scraping the subsequent pages.
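A minimal sketch of such a class might look like the following. The exact attribute names, the default user agent and the base URL format here are illustrative assumptions rather than a definitive implementation.

```python
import time

import requests
from bs4 import BeautifulSoup


class BaiduBot:
    def __init__(self, search_term, pages, proxy=None, timeout=10,
                 user_agent=None, delay=0):
        self.search_term = search_term
        self.pages = pages
        # Optional keyword arguments: proxy, connection timeout,
        # custom user agent and a delay between result pages.
        self.proxy = {'http': proxy, 'https': proxy} if proxy else None
        self.timeout = timeout
        self.user_agent = user_agent or (
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36')
        self.delay = delay
        # Base URL used when scraping subsequent pages:
        # wd is the query, pn is the result offset.
        self.base_url = 'https://www.baidu.com/s?wd={}&pn={}'
```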

Making Requests & Parsing HTML

We first define a function to scrape a page of Baidu. Here we simply try to make a request and check that the response has a 200 status code. Should Baidu start serving us non-200 status codes, it likely means that they have detected unusual behaviour from our IP and we should probably back off for a while. If there is no issue with the request, we simply return the response object.
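A sketch of what this request method could look like, continuing the BaiduBot class above. Raising an exception on a non-200 status is my own choice here; backing off and retrying would work just as well.

```python
    # Method of the BaiduBot class defined above.
    def baidu_request(self, url):
        headers = {'User-Agent': self.user_agent}
        response = requests.get(url, headers=headers, proxies=self.proxy,
                                timeout=self.timeout)
        # A non-200 status code usually means Baidu has flagged our traffic,
        # so we stop rather than try to parse the page.
        if response.status_code != 200:
            raise requests.HTTPError(
                'Unexpected status code from Baidu: {}'.format(response.status_code))
        return response
```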

Now that we have a way to make HTTP requests, we need to write a method for parsing the results page. Our parser takes in the HTML and returns a list of dictionary objects. Each result is handily contained within a ‘div’ with the class ‘c-container’, which makes it very easy for us to pick out each result. We can then iterate over the returned results using relatively simple BeautifulSoup selectors, before appending each result to our results list.
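A sketch of such a parser, again as a method on the class. The selectors are assumptions based on Baidu’s ‘c-container’ markup and may need adjusting if the page layout changes.

```python
    # Method of the BaiduBot class defined above.
    def parse_results(self, html):
        soup = BeautifulSoup(html, 'html.parser')
        results = []
        # Each organic result sits inside a div with the class 'c-container'.
        for container in soup.select('div.c-container'):
            title_tag = container.find('h3')
            link_tag = container.find('a')
            if title_tag is None or link_tag is None:
                continue
            results.append({
                'title': title_tag.get_text(strip=True),
                # This href points at Baidu's redirector, not the final site.
                'url': link_tag.get('href'),
            })
        return results
```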

Getting the Underlying URL

As previously mentioned, the full underlying URL is not displayed anywhere in Baidu’s search results. This means we must write a couple of functions to extract the full underlying URL. There may be another way to get this URL, but I’m not aware of it. If you know how, please share the method with me in the comments.

Our resolve_urls function is very similar to our Baidu request function. Instead of a response object, we return the final URL by simply following the chain of redirects. Should we encounter any sort of error, we simply return the original URL as found within the search results. This issue is relatively rare, so it shouldn’t impact our data too much.
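Something along these lines would do the job, continuing the class. Requests follows redirects by default and exposes the final destination on the response’s url attribute.

```python
    # Method of the BaiduBot class defined above.
    def resolve_urls(self, url):
        try:
            response = requests.get(url, headers={'User-Agent': self.user_agent},
                                    proxies=self.proxy, timeout=self.timeout)
            # requests follows the redirect chain for us; response.url holds
            # the final destination after all redirects.
            return response.url
        except requests.RequestException:
            # On any error, fall back to the redirector URL from the results page.
            return url
```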

Then we write another function that allows us to use our resolve_urls function over a set of results, updating the URL within each dictionary with the real underlying URL and the rank of the URL in question.
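A sketch of that helper; the method name here is my own assumption.

```python
    # Method of the BaiduBot class defined above; the name is illustrative.
    def resolve_baidu_urls(self, results):
        for rank, result in enumerate(results, start=1):
            # Record the position of the result and swap the redirector link
            # for the real underlying URL.
            result['rank'] = rank
            result['url'] = self.resolve_urls(result['url'])
        return results
```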

Bringing It All Together

We bring this all together in our scrape_baidu function. We range over our page count variable, and on each pass through the loop we multiply the loop variable by 10 to get the correct pn value. The pn variable represents the result index, so our logic ensures we start at 0 and continue on in increments of 10 results. We then format our URL using both our search term and this variable, make the request, and parse the page using the functions we have already written, before appending the results to our final results variable. Should we have passed a delay argument, we also sleep for a while before scraping the next page. This helps us avoid getting banned should we want to scrape multiple pages and search terms.
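Putting those pieces together, the main method might look roughly like this, under the same assumptions as the sketches above.

```python
    # Method of the BaiduBot class defined above.
    def scrape_baidu(self):
        all_results = []
        for page in range(self.pages):
            # Baidu's pn parameter is the zero-based result offset: 0, 10, 20, ...
            pn = page * 10
            url = self.base_url.format(self.search_term, pn)
            response = self.baidu_request(url)
            results = self.parse_results(response.text)
            all_results.extend(self.resolve_baidu_urls(results))
            # An optional pause between pages makes a block less likely.
            if self.delay:
                time.sleep(self.delay)
        return all_results
```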

Full Code

 
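Assembling the sketches above into a single BaiduBot class, usage might look something like this; the search term and settings are placeholders.

```python
if __name__ == '__main__':
    bot = BaiduBot('web scraping', pages=3, delay=5)
    for result in bot.scrape_baidu():
        print(result['rank'], result['title'], result['url'])
```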
