Today, we are going to walk through creating a basic crawler making use of Selenium.
Why Build A Selenium Web Crawler?
While Selenium is really a package designed for testing web pages, we can easily build a web crawler on top of it.
Imports & Class Initialisation
import csv
import logging
from collections import deque
from urllib.parse import urldefrag, urljoin

from bs4 import BeautifulSoup
from selenium import webdriver

def __init__(self, base_url, exclusion_list, output_file='example.csv', start_url=None):
    assert isinstance(exclusion_list, list), 'Exclusion list - needs to be a list'
    self.browser = webdriver.Chrome()  # Add the path to your Chromedriver
    self.base = base_url
    self.start = start_url if start_url else base_url  # If no start URL is passed, use the base_url
    self.exclusions = exclusion_list  # List of URL patterns we want to exclude
    self.crawled_urls = []  # List to keep track of URLs we have already visited
    self.url_queue = deque([self.start])  # Add the start URL to our queue of URLs to crawl
    self.output_file = output_file
To begin, we import the libraries we are going to need. Only two of the libraries we are using here aren't contained within Python's standard library. bs4 and Selenium can both be installed using pip, and installing these libraries should be relatively pain free.
We then begin with creating and initialising our SeleniumCrawler class. We pass a number of arguments to __init__.
Firstly, we define a base URL, which we use to ensure that any links discovered during our crawl lie within the same domain/sub-domain. If you were crawling this site, you would pass ‘https://edmundmartin.com’ to the base_url argument.
We then take a list of any URLs or URL paths we may want to exclude. If we wanted to exclude any dynamic and sign in pages, we would pass something like [‘?’,’signin’] to the exclusion list argument. URLs matching these patterns would then never be added to our crawl queue.
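As a quick standalone illustration (not part of the crawler class, and with a helper name of my own), the substring check we will later apply to each discovered link behaves like this:

```python
def is_excluded(link, exclusions):
    # True if the link contains any of our exclusion patterns
    return any(e in link for e in exclusions)

exclusions = ['?', 'signin']

is_excluded('https://edmundmartin.com/signin', exclusions)       # True - matches 'signin'
is_excluded('https://edmundmartin.com/post?page=2', exclusions)  # True - matches '?'
is_excluded('https://edmundmartin.com/about', exclusions)        # False
```

Note that this is a simple substring match, so ‘?’ excludes every URL with a query string, not just dynamic sign-in pages.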
We have an output_file argument already defined, which is just the file into which we will write our crawl results. And then finally, we have a start_url argument which allows you to start a crawl from a URL other than the site's base URL.
Getting Pages With Selenium
def get_page(self, url):
    try:
        self.browser.get(url)
        return self.browser.page_source
    except Exception as e:
        logging.exception(e)
        return
We then create a get_page method. This simply grabs the URL passed as an argument and returns the page's HTML. If we have any issues with a particular page, we simply log the exception and return nothing.
Creating a Soup
def get_soup(self, html):
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        return soup
    else:
        return
This is again a very simple method which simply checks that we have some HTML and creates a BeautifulSoup object from it. We are then going to use this soup to extract URLs to crawl and the information we are collecting.
def get_links(self, soup):
    for link in soup.find_all('a', href=True):  # All links which have a href attribute
        link = link['href']  # The actual href of the link
        if any(e in link for e in self.exclusions):  # Check if the link matches our exclusion list
            continue  # If it does, we do not proceed with the link
        url = urljoin(self.base, urldefrag(link)[0])  # Strip the fragment and resolve relative links against the base
        if url not in self.url_queue and url not in self.crawled_urls:  # Check if the link is queued or already crawled
            if url.startswith(self.base):  # If the URL belongs to the same domain
                self.url_queue.append(url)  # Add the URL to our queue
Our get_links method takes our soup and finds all the links which we haven't previously found. First, we find all the ‘a’ tags which have an ‘href’ attribute. We then check whether each link contains anything in our exclusion list. If the URL should be excluded, we move on to the next ‘href’. We use urljoin with urldefrag to strip fragments and resolve any relative URLs. We then check whether the URL has already been crawled or is already in our queue. If the URL matches our base domain, we then finally add it to our queue.
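To make the URL handling concrete, here is a small standalone sketch of what urldefrag and urljoin actually do (the example URLs are illustrative):

```python
from urllib.parse import urldefrag, urljoin

base = 'https://edmundmartin.com'

# urldefrag returns a (url, fragment) tuple - we keep only the URL part,
# so '/about#team' and '/about' are treated as the same page
clean = urldefrag('/about#team')[0]   # '/about'

# urljoin resolves relative links against our base URL
absolute = urljoin(base, clean)       # 'https://edmundmartin.com/about'

# Absolute links to other domains pass through urljoin unchanged,
# which is why we still need the startswith check against the base
external = urljoin(base, 'https://example.com/page')  # 'https://example.com/page'
```

This is also why indexing into the urldefrag result matters: passing the whole tuple to urljoin would raise a TypeError.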
def get_data(self, soup):
    try:
        title = soup.find('title').get_text().strip().replace('\n', '')
    except Exception:
        title = None
    return title
We then use our soup again to get the title of the page in question. If we come across any issues getting the title, we simply return None. This method could be expanded to collect any other data you require from the page.
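If you would rather avoid the bs4 dependency for this particular step, the same pattern - extract the title, fall back to None on any problem - can be sketched with the standard library's html.parser. This is a hypothetical alternative for illustration, not what the crawler above uses:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and self.title is None:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.title = data.strip().replace('\n', '')
            self.in_title = False

def get_title(html):
    try:
        parser = TitleParser()
        parser.feed(html)
        return parser.title
    except Exception:
        return None  # Mirrors the crawler's fallback when anything goes wrong

get_title('<html><head><title> My Page \n</title></head></html>')  # 'My Page'
get_title(None)  # None
```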
Writing to CSV
def csv_output(self, url, title):
    with open(self.output_file, 'a', encoding='utf-8', newline='') as outputfile:
        writer = csv.writer(outputfile)
        writer.writerow([url, title])
We simply pass our URL and title to this method and then use the standard library's csv module to output the data to our target file.
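A quick standalone demonstration of the write-and-read-back pattern (the file path here is illustrative). Opening the file with newline='' stops the csv module writing blank rows on Windows, and fields containing commas are quoted automatically:

```python
import csv
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'crawl_demo.csv')

with open(path, 'w', encoding='utf-8', newline='') as outputfile:
    writer = csv.writer(outputfile)
    writer.writerow(['https://edmundmartin.com/', 'Edmund Martin'])
    writer.writerow(['https://edmundmartin.com/about', 'About, and more'])  # comma is quoted

with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
# rows == [['https://edmundmartin.com/', 'Edmund Martin'],
#          ['https://edmundmartin.com/about', 'About, and more']]
```

Note the crawler opens its output file in append ('a') mode, so results accumulate across runs; the demo uses 'w' so it starts clean each time.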
Run Crawler Method
def run_crawler(self):
    while len(self.url_queue):  # While we have URLs to crawl - we crawl
        current_url = self.url_queue.popleft()  # We grab a URL from the left of the queue
        self.crawled_urls.append(current_url)  # We then add this URL to our crawled list
        html = self.get_page(current_url)
        if self.browser.current_url != current_url:  # If the final URL differs from the requested URL - add it to the crawled list too
            self.crawled_urls.append(self.browser.current_url)
        soup = self.get_soup(html)
        if soup is not None:  # If we have soup - parse, collect links and write to our csv file
            title = self.get_data(soup)
            self.get_links(soup)
            self.csv_output(current_url, title)
The run_crawler method really just brings together all of our previously defined methods. While we have unseen URLs, we continue to crawl, taking an element from the left of our queue. We then add this URL to our crawled list and request the page.
Should the end URL be different from the URL we originally requested this URL is also added to the crawled list. This means we don’t visit URLs twice when a redirect has been put in place.
We then grab the soup from the html, and provided we have a soup object, we parse the links, grab the title and output the results to our CSV file.
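The whole loop, including the redirect bookkeeping, can be sketched without a browser by stubbing out the fetch step. The URLs and pages below are made up for illustration:

```python
from collections import deque

# Stub standing in for the browser: requested URL -> (final URL after any
# redirect, links found on the resulting page)
PAGES = {
    'https://site.test/':    ('https://site.test/',    ['https://site.test/a']),
    'https://site.test/a':   ('https://site.test/a',   ['https://site.test/old']),
    # /old redirects to /new; the page there links to /new and back to /a
    'https://site.test/old': ('https://site.test/new', ['https://site.test/new',
                                                        'https://site.test/a']),
}

def crawl(start):
    crawled, queue = [], deque([start])
    while queue:
        current = queue.popleft()
        crawled.append(current)
        final_url, links = PAGES[current]
        if final_url != current:
            crawled.append(final_url)  # record the redirect target so it is never queued
        for link in links:
            if link not in queue and link not in crawled:
                queue.append(link)
    return crawled

crawl('https://site.test/')
# ['https://site.test/', 'https://site.test/a', 'https://site.test/old', 'https://site.test/new']
```

Note how /new appears in the crawled list without ever being fetched directly: it was recorded as the redirect target of /old, so the later link to it is skipped.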
What Can Be Improved?
There are a number of things that can be improved on in this example Selenium crawler.
Additionally, this crawler is going to be relatively slow. It's single-threaded and uses bs4 to parse pages, which is relatively slow compared with using lxml directly. Both of the methods using bs4 could quite easily be changed to use lxml.
The full code for this post can be found on my Github. Feel free to fork it, make pull requests, and see what you can do with this basic underlying recipe.