Web Scraping: Avoiding Detection


This post avoids the legal and ethical questions surrounding web scraping and simply focuses on the technical aspect of avoiding detection. We are going to look at some of the most effective ways to avoid being detected while crawling/scraping the modern web.

Switching User Agents

Switching or randomly selecting user agents is one of the most effective tactics for avoiding detection. Many sys admins and IT managers monitor the number of requests made by different user agents. If they see an abnormally large number of requests from one IP and user agent, the decision is a very simple one: block the offending user agent/IP.

You see this in effect when scraping using the standard headers provided by common HTTP libraries. Try to request an Amazon page using the default headers of Python’s requests library and you will instantly be served a 503 error.

This makes it key to change up the user agents used by your crawler/scraper. Typically, I would recommend randomly selecting a user agent from a list of commonly used user agents. I have written a short post on how to do this using Python’s requests library.
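
As a rough sketch, assuming a placeholder target URL and a small illustrative list of user-agent strings (in practice you would keep a larger, up-to-date list), the idea looks something like this:

import random
import requests

# Illustrative user-agent strings; substitute a current list of real ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Pick a different user agent for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = fetch("https://example.com/")
print(response.status_code)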

Other Request Headers

Even when you make an effort to switch up user agents, it may still be obvious that you are running a crawler/scraper. Sometimes other elements of your headers are giving you away: HTTP libraries tend to send different Accept and Accept-Encoding headers to those sent by real browsers. It can be worth modifying these headers to ensure that you look as much like a real browser as possible.
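
As a minimal sketch, with header values that are only examples of what a desktop browser might send (copy the exact values from your own browser’s network inspector for best results):

import requests

# Browser-like headers; the values below are illustrative examples.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
}

response = requests.get("https://example.com/", headers=headers)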

Going Slow

Many times, when scraping is detected, it’s a matter of having made too many requests in too little time. It’s abnormal for a very large number of requests to be made from one IP in a short space of time, making any scraper or crawler trying to go too fast a prime target. Simply waiting a few seconds between each request will likely mean that you fly under the radar of anyone trying to stop you. In some instances going slower may mean you are not able to collect the data you need quickly enough. If this is the case, you probably need to be using proxies.
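
A simple sketch of what this looks like in practice, with placeholder URLs and an arbitrary two-to-five second delay:

import random
import time
import requests

urls = ["https://example.com/page/{}".format(i) for i in range(1, 6)]

for url in urls:
    response = requests.get(url)
    # Wait a few seconds, with some jitter, before the next request.
    time.sleep(random.uniform(2, 5))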

Proxies

In some situations, proxies are going to be a must. When a site is actively discouraging scraping, proxies make it appear that your requests are coming from multiple sources. This typically allows you to make a larger number of requests than you otherwise would be allowed to. There are a large number of SaaS companies providing SEOs and digital marketing firms with Google ranking data, and these firms frequently rotate and monitor the health of their proxies in order to extract huge amounts of data from Google.
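
A basic sketch of rotating requests through a proxy pool; the proxy addresses below are placeholders and you would substitute your own pool of healthy proxies:

import random
import requests

# Placeholder proxy addresses; replace with your own proxy pool.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

def fetch_via_proxy(url):
    proxy = random.choice(PROXIES)
    # Route both HTTP and HTTPS traffic through the chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)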

Rendering JavaScript

JavaScript is used pretty much everywhere, and the proportion of humans browsing with JavaScript disabled is under 1%. This means that some sites have looked to block IPs making large numbers of requests without rendering JavaScript. The simple solution is just to render the JavaScript, using a headless browser and a browser automation suite such as Selenium.
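
A minimal Selenium sketch, assuming Chrome and a matching chromedriver are installed and using a placeholder URL:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Assumes Chrome and a matching chromedriver are available on this machine.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get("https://example.com/")
html = driver.page_source  # the fully rendered page, JavaScript included
driver.quit()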

Increasingly, companies such as Cloudflare check whether users making requests to a site are rendering JavaScript. By using this technique, they hope to block bots making requests to the site in question. However, several libraries now exist which help you get around this kind of protection. Python’s cloudflare-scrape library is a wrapper around the requests library which simply runs Cloudflare’s JavaScript test within a Node.js environment should it detect that such protection has been put in place.
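
Usage is close to plain requests; a short sketch based on the library’s documented interface (it needs Node.js installed, and the URL here is a placeholder):

import cfscrape  # pip install cfscrape; requires Node.js on the system

# create_scraper() returns a requests.Session that solves Cloudflare's
# JavaScript challenge automatically when one is encountered.
scraper = cfscrape.create_scraper()
response = scraper.get("https://example.com/")
print(response.status_code)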

Alternatively, you can use a lightweight headless browser such as Splash to do the rendering for you. This specialist headless browser even lets you apply AdBlock Plus rules, allowing you to render pages faster, and it can be used alongside the popular Scrapy framework.
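
Splash runs as a separate HTTP service, so the simplest sketch is just a request to its render endpoint; this assumes a local Splash instance on port 8050 (for example started with docker run -p 8050:8050 scrapinghub/splash) and a placeholder target URL:

import requests

# Ask the local Splash instance to render the page and return the HTML.
response = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com/", "wait": 2},
)
rendered_html = response.text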

Backing Off

What many crawlers and scrapers fail to do is back off when they start getting served 403 and 503 errors. By simply ploughing on and requesting more pages after coming across a batch of error pages, you make it pretty clear that you are in fact a bot. Slowing down and backing off when you get a bunch of forbidden errors can help you avoid a permanent ban.
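
A simple sketch of this idea, doubling the wait after each 403/503 response (the starting delay and retry count are arbitrary choices):

import time
import requests

def fetch_with_backoff(url, max_retries=5):
    delay = 5
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code not in (403, 503):
            return response
        # Back off: wait progressively longer after each error response.
        time.sleep(delay)
        delay *= 2
    return None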

Avoiding Honeypots/Bot Traps

Some webmasters implement honeypot traps which seek to catch bots by directing them to pages whose sole purpose is to confirm that the visitor is a bot. There is a very popular WordPress plugin which simply creates an empty ‘/blackhole/’ directory on the site. The link to this directory is then hidden in the site’s footer and is not visible to anyone using a normal browser. When designing a scraper or crawler for a particular site, it is worth checking whether any links are hidden from users loading the page with a standard browser.
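
A rough sketch of one such check using BeautifulSoup; this is only a heuristic (it catches inline styles and the ‘/blackhole/’ path mentioned above, not links hidden via external stylesheets), and the URL is a placeholder:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/")
soup = BeautifulSoup(response.text, "html.parser")

safe_links = []
for a in soup.find_all("a", href=True):
    style = (a.get("style") or "").replace(" ", "").lower()
    # Skip links hidden with inline CSS or pointing at a known trap path.
    if "display:none" in style or "visibility:hidden" in style:
        continue
    if "/blackhole/" in a["href"]:
        continue
    safe_links.append(a["href"])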

Obeying Robots.txt

Simply obeying robots.txt while crawling can save you a lot of hassle. While the robots.txt file itself provides no protection against scrapers/crawlers, some webmasters will simply block any IP which makes many requests to pages disallowed in the robots.txt file. The proportion of webmasters who actively do this is relatively small, but obeying robots.txt can definitely save you some significant trouble. If the content you need to reach is blocked off by the robots.txt file, you may just have to ignore the robots.txt file.
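
Python’s standard library can do the checking for you; a short sketch with placeholder URLs and a hypothetical crawler name:

from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# Only fetch the page if the site's robots.txt allows our user agent to.
if robots.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
    ...  # make the request here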

Cookies

In some circumstances, it may be worth collecting and holding onto cookies. When scraping services such as Google, the results returned by the search engine can be influenced by cookies. The majority of people scraping Google search results are not sending any cookie information with their requests, which is abnormal from a behaviour perspective. Provided that you do not mind receiving personalised results, it may be a good idea for some scrapers to send cookies along with their requests.
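
With requests, holding onto cookies is just a matter of using a Session; a brief sketch (Google is used here purely as an illustration and may still block automated requests):

import requests

# A Session stores cookies set by earlier responses and sends them back
# automatically on later requests, much like a real browser does.
session = requests.Session()
session.get("https://www.google.com/")  # picks up cookies
results = session.get("https://www.google.com/search", params={"q": "web scraping"})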

Captchas

Captchas are one of the more difficult anti-scraping measures to crack. Fortunately, captchas are incredibly annoying to real users, which means not many sites use them, and when they are used they are normally limited to forms. Breaking captchas can either be done via computer vision tools such as tesseract-ocr, or solutions can be purchased from a number of API services which use humans to solve the underlying captchas. These services are available even for the latest Google image reCAPTCHAs and simply impose an additional cost on the person scraping.
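
For the simplest text-based captchas, a tesseract sketch might look like the following; this assumes the tesseract binary is installed and a captcha image saved locally, and it will not help with modern image reCAPTCHAs:

from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract; needs the tesseract binary

# Only realistic for simple, text-based captchas.
text = pytesseract.image_to_string(Image.open("captcha.png"))
print(text)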

By combining some of the advice above you should be able to scrape the vast majority of sites without ever coming across any issues.
