Today, we are going to walk through creating a basic crawler making use of Selenium.
Why Build A Selenium Web Crawler?
First, we should probably address why you might want to build a web crawler using Selenium. The modern web increasingly uses front-end frameworks such as AngularJS and React, which means much of the data you might want to extract will not be readily available without rendering the page’s JavaScript. In instances like this you should first look into whether the site has an underlying private API that you can easily make use of.
Additionally, you may find some sites which run checks to ensure that users are running JavaScript. While there are other ways to get around this, running Selenium will typically make your crawler look like a real browser instance. This is just one way you can work around scraping detection methods.
While Selenium is really a package designed for testing web pages, we can easily build our web crawler on top of it.
Imports & Class Initialisation
import logging
import csv

from selenium import webdriver
from urllib.parse import urldefrag, urljoin
from collections import deque

from bs4 import BeautifulSoup


class SeleniumCrawler(object):

    def __init__(self, base_url, exclusion_list, output_file='example.csv', start_url=None):

        assert isinstance(exclusion_list, list), 'Exclusion list - needs to be a list'

        self.browser = webdriver.Chrome()  # Add path to your Chromedriver

        self.base = base_url
        self.start = start_url if start_url else base_url  # If no start URL is passed use the base_url
        self.exclusions = exclusion_list  # List of URL patterns we want to exclude
        self.crawled_urls = []  # List to keep track of URLs we have already visited
        self.url_queue = deque([self.start])  # Add the start URL to our list of URLs to crawl
        self.output_file = output_file
To begin, we import the libraries we are going to need. Only two of the libraries we are using here aren’t contained within Python’s standard library: bs4 (Beautiful Soup) and Selenium can both be installed using pip, and installing them should be relatively pain-free.
We then begin by creating and initialising our SeleniumCrawler class, passing a number of arguments to __init__.
Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass ‘https://edmundmartin.com’ as the base_url argument.
We then take a list of any URLs or URL paths we may want to exclude. If we wanted to exclude any dynamic and sign-in pages, we would pass something like ['?', 'signin'] to the exclusion_list argument. URLs matching these patterns would then never be added to our crawl queue.
We have an output_file argument already defined, which is just the file into which we will write our crawl results. And then finally, we have a start_url argument which allows you to start a crawl from a URL other than the site’s base URL.
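As a rough illustration, the class might be instantiated as below. The exclusion patterns, output filename and start URL here are purely hypothetical values chosen for the example:

crawler = SeleniumCrawler('https://edmundmartin.com',
                          ['?', 'signin'],
                          output_file='results.csv',
                          start_url='https://edmundmartin.com/some-article/')  # Illustrative values only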
Getting Pages With Selenium
    def get_page(self, url):
        try:
            self.browser.get(url)
            return self.browser.page_source
        except Exception as e:
            logging.exception(e)
            return
We then create a get_page method. This simply fetches the URL passed as an argument and returns the page’s HTML. If we have any issues with a particular page we simply log the exception and return nothing.
Creating a Soup
    def get_soup(self, html):
        if html is not None:
            soup = BeautifulSoup(html, 'lxml')
            return soup
        else:
            return
This is again a very simple method which checks that we have some HTML and creates a BeautifulSoup object from it. We are then going to use this soup to extract URLs to crawl and the information we are collecting.
Getting Links
    def get_links(self, soup):
        for link in soup.find_all('a', href=True):  # All links which have an href attribute
            link = link['href']  # The actual href value of the link
            if any(e in link for e in self.exclusions):  # Check if the link matches our exclusion list
                continue  # If it does we do not proceed with the link
            url = urljoin(self.base, urldefrag(link)[0])  # Resolve relative links using base and urldefrag
            if url not in self.url_queue and url not in self.crawled_urls:  # Check if link is in queue or already crawled
                if url.startswith(self.base):  # If the URL belongs to the same domain
                    self.url_queue.append(url)  # Add the URL to our queue
Our get_links method takes our soup and finds all the links which we haven’t previously discovered. First, we find all the ‘a’ tags which have an ‘href’ attribute. We then check whether these links contain anything from our exclusion list; if a URL should be excluded we move on to the next ‘href’. We use urljoin with urldefrag to resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue. If the URL also matches our base domain, we finally add it to our queue.
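To make the URL handling a little more concrete, here is a small standalone example of what urldefrag and urljoin do. The link itself is made up for illustration:

from urllib.parse import urldefrag, urljoin

base = 'https://edmundmartin.com'
link = '/example-post/#comments'  # A made-up relative link with a fragment

print(urldefrag(link)[0])                   # '/example-post/' - the #comments fragment is stripped
print(urljoin(base, urldefrag(link)[0]))    # 'https://edmundmartin.com/example-post/'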
Getting Data
    def get_data(self, soup):
        try:
            title = soup.find('title').get_text().strip().replace('\n', '')
        except:
            title = None
        return title
We then use our soup again to get the title of the page in question. If we come across any issues with getting the title we simply return None. This method could be expanded to collect any other data you require from the page in question.
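As a purely hypothetical sketch of such an expansion, the version below also pulls the page’s meta description; note that csv_output and run_crawler would need a small tweak to handle the extra value:

    def get_data(self, soup):
        try:
            title = soup.find('title').get_text().strip().replace('\n', '')
        except AttributeError:
            title = None
        try:
            # Hypothetical extra field - the meta description, if the page has one
            description = soup.find('meta', attrs={'name': 'description'})['content']
        except (TypeError, KeyError):
            description = None
        return title, description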
Writing to CSV
    def csv_output(self, url, title):
        with open(self.output_file, 'a', encoding='utf-8') as outputfile:
            writer = csv.writer(outputfile)
            writer.writerow([url, title])
We simply pass our URL and title to this method and then use the standard library’s csv module to append the data to our target file.
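One caveat worth flagging: the csv documentation recommends opening files with newline='' when writing, otherwise extra blank rows can appear between records on Windows. A slightly more defensive version of the method might therefore look like this:

    def csv_output(self, url, title):
        # newline='' stops the csv module inserting blank lines between rows on Windows
        with open(self.output_file, 'a', newline='', encoding='utf-8') as outputfile:
            writer = csv.writer(outputfile)
            writer.writerow([url, title])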
Run Crawler Method
    def run_crawler(self):
        while len(self.url_queue):  # While we have URLs to crawl - we crawl
            current_url = self.url_queue.popleft()  # We grab a URL from the left of the queue
            self.crawled_urls.append(current_url)  # We then add this URL to our crawled list
            html = self.get_page(current_url)
            if self.browser.current_url != current_url:  # If the end URL differs from the requested URL - mark it as crawled too
                self.crawled_urls.append(self.browser.current_url)
            soup = self.get_soup(html)
            if soup is not None:  # If we have soup - parse and write to our csv file
                self.get_links(soup)
                title = self.get_data(soup)
                self.csv_output(current_url, title)
The run_crawler method really just brings together all of our already defined methods. While we have unseen URLs, we continue to crawl, taking an element from the left of our queue. We then add this URL to our crawled list and request the page.
Should the end URL be different from the URL we originally requested, the final URL is also added to the crawled list. This means we don’t visit URLs twice when a redirect has been put in place.
We then grab the soup from the html, and provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
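To kick the whole thing off, you might wrap it up along these lines; closing the browser in a finally block is just a sensible precaution rather than something the class does for you:

if __name__ == '__main__':
    crawler = SeleniumCrawler('https://edmundmartin.com', ['?', 'signin'])
    try:
        crawler.run_crawler()
    finally:
        crawler.browser.quit()  # Make sure the Chrome instance is shut down when we finish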
What Can Be Improved?
There are a number of things that can be improved on in this example Selenium crawler.
While I ran a test across over 1,000 URLs, the get_page method may still be liable to break. To counter this, it would be recommended to use more sophisticated error handling, importing Selenium’s common exceptions module. Additionally, this method could get stuck waiting forever if JavaScript fails to fully load. It would therefore be recommended to add some timeouts on the rendering of JavaScript, which is relatively easy with the Selenium library.
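As a rough sketch of what that might look like (the 30-second timeout is an arbitrary choice, and only the relevant parts of the class are shown), you could set a page-load timeout on the driver and catch Selenium’s own exception types instead of the blanket Exception used above:

import logging

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, WebDriverException

class SeleniumCrawler(object):

    def __init__(self, base_url, exclusion_list, output_file='example.csv', start_url=None):
        self.browser = webdriver.Chrome()
        self.browser.set_page_load_timeout(30)  # Give up on pages that take longer than 30 seconds
        # ... remaining attributes exactly as in the original __init__ ...

    def get_page(self, url):
        try:
            self.browser.get(url)
            return self.browser.page_source
        except TimeoutException:
            logging.warning('Timed out fetching %s', url)
            return
        except WebDriverException as e:
            logging.exception(e)
            return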
Additionally, this crawler is going to be relatively slow. It’s single threaded and uses bs4 to parse pages, which is relatively slow compared with using lxml directly. Both of the methods using bs4 could quite easily be changed to use lxml.
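As a rough illustration of what that swap might look like (these are standalone helper functions I have named myself, not drop-in replacements for the class methods above), lxml can parse the raw HTML and answer the same questions with XPath:

import lxml.html

def get_title(html):
    # Parse the raw HTML string directly with lxml instead of BeautifulSoup
    tree = lxml.html.fromstring(html)
    titles = tree.xpath('//title/text()')
    return titles[0].strip().replace('\n', '') if titles else None

def get_hrefs(html):
    # Pull every href attribute out of the page in a single XPath query
    tree = lxml.html.fromstring(html)
    return tree.xpath('//a/@href')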
The full code for this post can be found on my Github; feel free to fork it, make pull requests, and see what you can do with this basic underlying recipe.