This post avoids the legal and ethical questions surrounding web scraping and simply focuses on the technical aspect of avoiding detection. We are going to look at some of the most effective ways to avoid being detected while crawling/scraping the modern web.
Switching User Agent’s
Switching or randomly selecting user agent’s is one of the most effective tactics in avoiding detection. Many sys admins and IT managers monitor the number of requests made by different user agents. If they see an abnormally large number of requests from one IP & User Agent, it makes the decision a very simple one – block the offending user agent/IP.
You see this in effect when scraping using the standard headers provided by common HTTP libraries. Try and request an Amazon page using Python’s requests standard headers and you will instantly be served a 503 error.
This makes it key to change up the user agents used by your crawler/scraper. Typically, I would recommend randomly selecting a user-agent from a list of commonly used user-agents. I have written a short post on how to do this using Python’s request library.
Other Request Headers
Even when you make effort to switch up user agents it may be obvious that you are running a crawler/scraper. Sometimes it can be that other elements of your header are giving you away. HTTP libraries tend to send different accept and accept encoding headers to those sent by real browsers. It can be worth modifying these headers to ensure that you like as much like a real browser as possible.
Many times, when scraping is detected it’s a matter of having made to many requests in too little time. It’s abnormal for a very large number of requests to be made from one IP in short space of time, making any scraper or crawler trying to going too fast a prime target. Simply waiting a few seconds between each request will likely mean that you will fly under the radar of anyone trying to stop you. It some instances going slower may mean you are not able to collect the data you need quickly enough. If this is the case you probably need to be using proxies.
In some situations, proxies are going to be must. When a site is actively discouraging scraping, proxies makes it appear that you are making requests are coming from multiple sources. This typically allows you to make a larger number of requests than you typically would be allowed to make. There are a large number of SaaS companies providing SEO’s and digital marketing firms with Google ranking data, these firms frequently rotate and monitor the health of their proxies in order to extract huge amounts of data from Google.
Alternatively, you can use a lightweight headless browser such as Splash to do the scraping for you. The specialist headless browser even lets you implement AdBlock Plus rules allowing you to render pages faster and can be used alongside the popular Scrapy framework.
What many crawlers and scrapers fail to do is back-off when they start getting served with 403 & 503 errors. By simply plugging on and requesting more pages after coming across a batch of error pages, it becomes pretty clear that you are in fact a bot. Slowing down and backing off when you get a bunch of forbidden errors can help you avoid a permanent ban.
Avoiding Honeypots/Bot Traps
Some webmasters implement honey traps which seek to capture bots by directing them to pages which sole purpose is to determine they are a bot. There is a very popular WordPress plugin which simply creates an empty ‘/blackhole/’ directory on your site. The link to this directory is then hidden in the site’s footer not visible to those using browsers. When designing a scraper or crawler for a particular site it is worth looking to determine whether any links are hidden to users loading the page with a standard browser.
Simply obeying robots.txt while crawling can save you a lot of hassle. While the robots.txt file itself provides no protection against scrapers/crawlers, some webmasters will simply block any IP which makes many requests to pages blocked within the robots.txt file. The proportion that of webmasters which actively do this is relatively small but obeying robots.txt can definitely save you some significant trouble. If the content you need to reach blocked off by the robots.txt file, you may just have to ignore the robots.txt file.
In some circumstances, it may be worth collecting and holding onto cookies. When scraping services such as Google, results returned by the search engine can be influenced by cookies. The majority of people scraping Google search results are not sending any cookie information with their request which is abnormal from a behaviour perspective. Provided that you do not mind about receiving personalised results it may be a good idea for some scrapers to send cookies along with their request.
Captchas are one of the more difficult to crack anti-scraping measures, fortunately captchas are incredibly annoying to real users. This means not many sites use them, and when used they are normally limited forms. Breaking captchas can either be done via computer vision tools such as tesseract-ocr or solutions can be purchased from a number API services which use humans to solve the underlying captchas. These services are available for even the latest Google image recaptcha’s and simply impose an additional cost on the person scraping.
By combining some of the advice above you should be able to scrape the vast majority of sites without ever coming across any issues.