Scraping & Health Monitoring Free Proxies With Python

When web scraping, you often need a pool of proxies to avoid being banned or to get around rate limiting imposed by the target website. Developers often purchase proxies from a commercial provider, which can become quite costly if you only need the proxies for a short period of time. In this post we are going to look at how you might use proxies from freely available proxy lists to scrape the internet.

Problems With Free Proxies

  • Free Proxies Die Very Quickly
  • Free Proxies Get Blocked By Popular Sites
  • Free Proxies Frequently Time Out

While free proxies are great in the sense that they are free, they tend to be highly unreliable: their uptime is inconsistent, and they get blocked quickly by popular sites such as Google. Our solution therefore builds in some monitoring of the current status of each proxy, allowing us to avoid proxies that are currently broken.

Scraping Proxies

We are going to use free-proxy-list.net as our source for this example, but the example could easily be expanded to cover multiple sources of proxies. We write a simple method that visits the page using our chosen user-agent and pulls out all the proxies listed on it. We then store the results in a dictionary, with each proxy acting as a key holding the information relating to that particular proxy. We do not do any error handling here; that is handled in our ProxyManager class.
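As a rough sketch of that scraping step, the snippet below fetches the page and turns each table row into a dictionary entry keyed by "ip:port". It uses only the standard library to stay self-contained. The function names (parse_proxies, get_proxies) and the column names are assumptions made for illustration, and the column order should be checked against the live page before relying on it.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumed column order of the proxy table; verify against the live page.
COLUMNS = ["ip", "port", "code", "country", "anonymity",
           "google", "https", "source_last_checked"]


class _TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain no <td> cells
                self.rows.append(self._row)
            self._row = None


def parse_proxies(html):
    """Turn table rows into {"ip:port": {column: value, ...}}."""
    parser = _TableRowParser()
    parser.feed(html)
    proxies = {}
    for row in parser.rows:
        if len(row) < len(COLUMNS):
            continue  # skip rows that do not look like proxy entries
        info = dict(zip(COLUMNS, row))
        proxies[f"{info['ip']}:{info['port']}"] = info
    return proxies


def get_proxies(user_agent, url="https://free-proxy-list.net/"):
    """Fetch the page with our user-agent; errors propagate to the caller."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return parse_proxies(response.read().decode("utf-8", errors="replace"))
```

Keeping the parsing separate from the download makes it straightforward to add further proxy sources later, since each source only needs its own parser.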

Proxy Manager

Our proxy manager is a simple class which allows us to get and manage the proxies we find on free-proxy-list.net. We pass in a test URL, which will be used to test whether a proxy is working, and a user agent to be used for both scraping and testing the proxies. We also create a thread pool so we can check the status of the scraped proxies more quickly. We then call our update_proxy_list method, which loads the proxies found on free-proxy-list.net into our dictionary of proxies.
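A minimal sketch of that constructor might look like the following. The scraper argument is an assumption added here so the snippet stands on its own: it is any callable that takes a user agent and returns the proxy dictionary. The max_workers default is likewise an arbitrary illustrative value.

```python
from concurrent.futures import ThreadPoolExecutor


class ProxyManager:
    def __init__(self, test_url, user_agent, scraper, max_workers=20):
        self.test_url = test_url      # URL used to test whether a proxy works
        self.user_agent = user_agent  # used for both scraping and testing
        self.scraper = scraper        # hypothetical hook: user_agent -> dict
        self.thread_pool = ThreadPoolExecutor(max_workers=max_workers)
        self.proxies = {}
        self.update_proxy_list()

    def update_proxy_list(self):
        """Replace the proxy dictionary with a fresh scrape of the source."""
        try:
            self.proxies = self.scraper(self.user_agent)
        except Exception as exc:
            # The scraping method does no error handling, so we do it here
            # and keep whatever proxies we already had.
            print(f"Could not update proxy list: {exc}")
```

Catching the exception in update_proxy_list means a failed scrape leaves the previous list in place rather than crashing the caller.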

Checking Proxies

We can now write a couple of methods to test whether a particular proxy works. The first method takes the proxy and the dictionary of information related to that proxy. We immediately set the last checked variable to the current time. We then make a request against our test URL with a relatively short timeout and check the status of the response, raising an exception should we receive a non-200 status code. Should anything go wrong, we set the status of the proxy to dead; otherwise we set it to alive.
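The check described above could be sketched as follows, written as a standalone function rather than a method so it runs on its own. The names fetch_status and the fetch parameter are illustrative assumptions, as are the "last_checked" and "alive" keys and the five second timeout.

```python
import time
from urllib.request import ProxyHandler, Request, build_opener


def fetch_status(proxy, test_url, user_agent, timeout):
    """Request test_url through the proxy and return the HTTP status code."""
    address = f"http://{proxy}"
    opener = build_opener(ProxyHandler({"http": address, "https": address}))
    request = Request(test_url, headers={"User-Agent": user_agent})
    with opener.open(request, timeout=timeout) as response:
        return response.status


def check_proxy_status(proxy, info, test_url, user_agent,
                       timeout=5, fetch=fetch_status):
    """Mark a proxy as alive or dead and record when it was checked."""
    info["last_checked"] = time.time()
    try:
        status = fetch(proxy, test_url, user_agent, timeout)
        if status != 200:
            raise ValueError(f"unexpected status {status}")
        info["alive"] = True
    except Exception:
        # Timeouts, refused connections and bad status codes all mean "dead".
        info["alive"] = False
    return proxy, info
```

Returning the proxy together with its updated info makes the result easy to collect when the check runs in a worker thread.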

We then write our refresh proxy status method, which simply calls our check proxy status method for every proxy. We iterate over our dictionary, submitting each proxy and its related info to a thread. If we didn't use threads to check the status of our proxies, we could be waiting a very long time for our results, since each dead proxy costs us a full timeout. We then loop through the results and update the status of each proxy.
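One way to sketch the threaded refresh is shown below, again as a standalone function. It assumes check is a callable taking (proxy, info) and returning the same pair with an updated status, and that it never raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def refresh_proxy_status(proxies, check, thread_pool):
    """Check every proxy concurrently and store the updated info."""
    # Snapshot the items so the dictionary can be updated while we wait.
    futures = [
        thread_pool.submit(check, proxy, info)
        for proxy, info in list(proxies.items())
    ]
    # Results arrive in completion order, not submission order.
    for future in as_completed(futures):
        proxy, info = future.result()
        proxies[proxy] = info
```

With a pool of N workers, the total wait is roughly the number of proxies divided by N, multiplied by the timeout in the worst case, instead of one timeout per dead proxy in sequence.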

Getting A Proxy

We then write two methods for getting ourselves a proxy. Our first method allows us to get a list of proxies by passing in a relevant key and value. This lets us select proxies that relate to a particular country or boast a particular level of anonymity, which is useful when we care about specific properties of a proxy.
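The key and value lookup might be sketched like this. The function name get_proxies_key_value and the "country" and "anonymity" keys in the sample data are assumptions for illustration.

```python
def get_proxies_key_value(proxies, key, value):
    """Return every proxy whose info has the given value under key."""
    return [proxy for proxy, info in proxies.items() if info.get(key) == value]


sample = {
    "10.0.0.1:8080": {"country": "Germany", "anonymity": "elite proxy"},
    "10.0.0.2:3128": {"country": "France", "anonymity": "anonymous"},
    "10.0.0.3:80": {"country": "Germany", "anonymity": "anonymous"},
}
german = get_proxies_key_value(sample, "country", "Germany")
# german == ["10.0.0.1:8080", "10.0.0.3:80"]
```

Using info.get rather than info[key] means proxies that lack the requested field are simply skipped.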

We also have a simple method that allows us to return a single working proxy. This returns the first working proxy found within our proxy dictionary by looping over all the items in the dictionary, and returning the first proxy where ‘alive’ is equal to true.
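A sketch of that lookup, assuming the status is stored under an "alive" key and that None is an acceptable result when no proxy is working:

```python
def get_proxy(proxies):
    """Return the first proxy marked alive, or None if there is none."""
    for proxy, info in proxies.items():
        if info.get("alive") is True:
            return proxy
    return None


sample = {
    "10.0.0.1:8080": {"alive": False},
    "10.0.0.2:3128": {"alive": True},
    "10.0.0.3:80": {},  # never checked, so not treated as alive
}
working = get_proxy(sample)
# working == "10.0.0.2:3128"
```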

Example Usage

Using the library is pretty simple. We create an instance of the class, passing in our test URL (Google.com here) and our selected user-agent. We then call refresh_proxy_status, which updates the status of the scraped proxies by running them against our test URL. After that we can pull out an individual working proxy. Should we not be satisfied with the proxies we currently have, we can update our proxy list with a fresh scrape of our source.

Full Code
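The listing below is a condensed sketch that stitches the steps described in the earlier sections into one class, followed by the usage flow from the previous section. It is not the original listing. The scraper and fetch hooks are assumptions added so the demonstration can run offline with stand-in functions; in real use, scraper would download and parse free-proxy-list.net and fetch would be left at its default.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import ProxyHandler, Request, build_opener


class ProxyManager:
    def __init__(self, test_url, user_agent, scraper, fetch=None, max_workers=20):
        self.test_url = test_url
        self.user_agent = user_agent
        self.scraper = scraper  # hypothetical hook: user_agent -> proxy dict
        self.fetch = fetch or self._fetch_status  # hypothetical hook
        self.thread_pool = ThreadPoolExecutor(max_workers=max_workers)
        self.proxies = {}
        self.update_proxy_list()

    def update_proxy_list(self):
        try:
            self.proxies = self.scraper(self.user_agent)
        except Exception as exc:
            print(f"Could not update proxy list: {exc}")

    def _fetch_status(self, proxy, timeout):
        """Request the test URL through the proxy; return the status code."""
        address = f"http://{proxy}"
        opener = build_opener(ProxyHandler({"http": address, "https": address}))
        request = Request(self.test_url, headers={"User-Agent": self.user_agent})
        with opener.open(request, timeout=timeout) as response:
            return response.status

    def check_proxy_status(self, proxy, info, timeout=5):
        info["last_checked"] = time.time()
        try:
            if self.fetch(proxy, timeout) != 200:
                raise ValueError("non-200 response")
            info["alive"] = True
        except Exception:
            info["alive"] = False
        return proxy, info

    def refresh_proxy_status(self):
        futures = [
            self.thread_pool.submit(self.check_proxy_status, proxy, info)
            for proxy, info in list(self.proxies.items())
        ]
        for future in as_completed(futures):
            proxy, info = future.result()
            self.proxies[proxy] = info

    def get_proxies_key_value(self, key, value):
        return [p for p, i in self.proxies.items() if i.get(key) == value]

    def get_proxy(self):
        for proxy, info in self.proxies.items():
            if info.get("alive") is True:
                return proxy
        return None


# Stand-ins so the example runs without network access.
def fake_scraper(user_agent):
    return {
        "10.0.0.1:8080": {"country": "Germany"},
        "10.0.0.2:3128": {"country": "France"},
    }


def fake_fetch(proxy, timeout):
    return 200 if proxy == "10.0.0.2:3128" else 503


manager = ProxyManager("https://www.google.com", "Mozilla/5.0",
                       fake_scraper, fetch=fake_fetch)
manager.refresh_proxy_status()
print(manager.get_proxy())                                  # 10.0.0.2:3128
print(manager.get_proxies_key_value("country", "Germany"))  # ['10.0.0.1:8080']
manager.update_proxy_list()  # fresh scrape; statuses need refreshing again
```

Note that a fresh scrape replaces the dictionary, so refresh_proxy_status has to be called again before get_proxy can return anything.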
