When using the Python requests library to extract data from websites, you may want to minimise the chances of your scraping activity being detected.
Setting a Custom User-Agent
import requests

example_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
r = requests.get('http://edmund.com', headers=example_headers)
To lower the chances of detection it is often recommended that users set a custom user-agent. The requests library makes it very easy to set a custom user-agent header. Often this alone is enough to avoid detection, as system administrators frequently only look for the default user-agents when adding server-side blocking rules.
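To see why this matters: with no custom header, requests announces itself as 'python-requests/<version>', which is an obvious target for blocking rules. The minimal sketch below shows the difference, assuming httpbin.org is reachable from your machine and is only used here as a convenient way to echo back the headers the server receives.

import requests

# Default behaviour: the server sees 'python-requests/<version>' as the User-Agent.
default = requests.get('https://httpbin.org/headers')
print(default.json()['headers']['User-Agent'])

# With a custom header, the server sees an ordinary browser string instead.
custom_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
custom = requests.get('https://httpbin.org/headers', headers=custom_headers)
print(custom.json()['headers']['User-Agent'])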
Setting a Random User-Agent
If engaged in commercially sensitive scraping, you may want to take additional precautions and randomise the User-Agent sent with each request.
from random import choice
import requests

desktop_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                  'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0']

def random_headers():
    return {'User-Agent': choice(desktop_agents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

r = requests.get('https://edmundmartin.com', headers=random_headers())
The above snippet of code returns a random user-agent together with Chrome's default 'Accept' header. When writing this snippet I made an effort to include ten of the most commonly used desktop browsers. It is probably worth refreshing the list from time to time, to ensure that the user-agents it contains remain up to date.
I have seen others loading lists of hundreds of user-agents. I think this approach is misguided, as it may see your crawlers make thousands of requests from very rarely used user-agents, which is itself an unusual pattern for a site to receive.
Anyone looking closely at the 'Accept' headers will quickly realise that all of the different user-agents are sending the same 'Accept' header. Thankfully, the majority of system administrators completely ignore the intricacies of the 'Accept' header and simply check whether browsers are sending something plausible. Should it really be necessary, it would also be possible to send an accurate 'Accept' header matched to each user-agent, as sketched below. I have never personally had to resort to this extreme measure.
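As a rough illustration of that idea, the snippet below pairs each user-agent with an 'Accept' header typical of its browser family. The header values are approximations for the browser versions listed above rather than authoritative strings, so treat them as placeholders to verify against real browsers before relying on them.

from random import choice
import requests

# Illustrative (approximate) 'Accept' headers per browser family.
accept_headers = {
    'chrome': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'firefox': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'safari': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Each entry pairs a user-agent string with its browser family.
agents = [
    ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36', 'chrome'),
    ('Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0', 'firefox'),
    ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14', 'safari'),
]

def matched_headers():
    # Pick a user-agent at random and send the 'Accept' header that matches its family.
    user_agent, family = choice(agents)
    return {'User-Agent': user_agent, 'Accept': accept_headers[family]}

r = requests.get('https://edmundmartin.com', headers=matched_headers())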
I would personally do it like this:
from fake_useragent import UserAgent

def random_headers():
    return {'User-Agent': UserAgent().random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
Hope it helps
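For anyone following that suggestion, here is a self-contained usage sketch, assuming the fake-useragent package has been installed (for example with pip install fake-useragent) and using this blog's URL purely as a stand-in target:

import requests
from fake_useragent import UserAgent

def random_headers():
    # fake-useragent draws a user-agent string from its own database of real browser strings.
    return {'User-Agent': UserAgent().random,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

r = requests.get('https://edmundmartin.com', headers=random_headers())
print(r.status_code)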