When comparing Python parsing libraries, you often hear complaints that Beautiful Soup is considerably slower than lxml, and some people conclude that lxml should be used in any performance-critical project. Having used Beautiful Soup in a large number of web scraping projects without ever having real trouble with its performance, I wanted to properly measure how this popular parsing library performs.
To test the two libraries, I wrote a simple single-threaded crawler that crawls a total of 100 URLs and extracts the links and the page title from each page. The crawler implements two parser methods, one using lxml and one using Beautiful Soup. I also tested the speed of Beautiful Soup with various non-default parsers.
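The crawler code itself isn't shown here, but the two parser methods might look roughly like the minimal sketch below. The function names and the sample markup are illustrative assumptions, not the code used in the test, and both `beautifulsoup4` and `lxml` need to be installed.

```python
from bs4 import BeautifulSoup
import lxml.html


def parse_with_bs4(html, parser="html.parser"):
    """Extract the page title and all link targets with Beautiful Soup.

    `parser` can be swapped for "lxml" or "html5lib" if those are installed.
    """
    soup = BeautifulSoup(html, parser)
    title = soup.title.string if soup.title else None
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links


def parse_with_lxml(html):
    """Extract the page title and all link targets with lxml directly."""
    tree = lxml.html.fromstring(html)
    title = tree.findtext(".//title")
    links = [str(href) for href in tree.xpath("//a/@href")]
    return title, links


# Made-up sample page, used only to show that both functions agree.
SAMPLE = (
    "<html><head><title>Example</title></head>"
    '<body><a href="/a">A</a><a href="/b">B</a></body></html>'
)

if __name__ == "__main__":
    print(parse_with_bs4(SAMPLE))
    print(parse_with_lxml(SAMPLE))
```

Keeping the parser name as a parameter means the same Beautiful Soup function can be reused to compare the different underlying parsers.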
Each setup was tested five times to account for varying internet and server response times. The results below show how performance differs by library and underlying parser.
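A repeated-run measurement of this kind can be sketched with the standard library alone. Note the assumption: the hypothetical `time_runs` helper below times only the parsing step on pages that have already been downloaded, whereas the runs described here also included fetching each page over the network.

```python
import time
from statistics import mean


def time_runs(parse, pages, runs=5):
    """Time `parse` over every page, `runs` times; return seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for page in pages:
            parse(page)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in parser and pages, just to show the call pattern.
    pages = ["<html><title>t</title></html>"] * 100
    timings = time_runs(len, pages)
    print(f"average over {len(timings)} runs: {mean(timings):.6f} s")
```

Timing pre-fetched pages removes network noise entirely, at the cost of no longer measuring a real crawl end to end.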
| Run #1 | Run #2 | Run #3 | Run #4 | Run #5 | Avg. Speed | Overhead Per Page (Seconds) |
As you can see, lxml is significantly faster than Beautiful Soup. A pure lxml solution is several seconds faster than Beautiful Soup with lxml as the underlying parser. Beautiful Soup with Python's built-in parser (html.parser) is around 10 seconds slower, and the extremely liberal html5lib is slower still. The overhead per page parsed is still relatively small, with both bs4(html.parser) and bs4(lxml) adding less than 0.1 seconds per page.
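The per-page overhead, and what it adds up to over a larger crawl, is simple arithmetic: divide the difference in average run time by the 100 pages crawled, then multiply by the number of URLs. The sketch below assumes pure lxml is the baseline; the function names are hypothetical and the example averages are made-up placeholders, not the measured results.

```python
PAGES_PER_RUN = 100  # each test run crawled 100 URLs


def overhead_per_page(setup_avg_seconds, baseline_avg_seconds, pages=PAGES_PER_RUN):
    """Extra seconds a setup spends per page compared with the baseline."""
    return (setup_avg_seconds - baseline_avg_seconds) / pages


def extra_hours(per_page_overhead, url_count):
    """Extra hours a single-threaded crawl of `url_count` URLs would take."""
    return per_page_overhead * url_count / 3600


if __name__ == "__main__":
    # Hypothetical average run times in seconds, NOT the figures from the test.
    example_baseline = 100.0
    example_setup = 105.0
    per_page = overhead_per_page(example_setup, example_baseline)
    print(f"{per_page:.3f} s overhead per page")
    print(f"{extra_hours(per_page, 100_000):.2f} extra hours at 100,000 URLs")
```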
| Overhead | 100,000 URLs (Extra Hours) | 500,000 URLs (Extra Hours) |
While the per-page overhead seems very low, it adds up when you try to scale a crawler. Even Beautiful Soup with lxml adds significant overhead once you are crawling hundreds of thousands of URLs. Note that the table above assumes a crawler running on a single thread. Anyone looking to crawl more than 100,000 URLs would be well advised to build a concurrent crawler using a library such as Twisted, asyncio, or concurrent.futures.
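As a rough idea of what the concurrent route could look like, here is a minimal sketch built on `concurrent.futures.ThreadPoolExecutor` from the standard library. The `crawl` and `fetch` names, the worker count, and the timeout are assumptions chosen for illustration; a real crawler would also need error handling, retries, and rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, parse, fetch=fetch, max_workers=8):
    """Fetch and parse `urls` on a thread pool; return {url: parsed result}."""

    def fetch_and_parse(url):
        return url, parse(fetch(url))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        return dict(pool.map(fetch_and_parse, urls))


if __name__ == "__main__":
    # Offline demonstration with a fake fetcher, so no network is needed.
    fake_pages = {"http://example.com/a": b"<html>a</html>"}
    print(crawl(fake_pages, parse=len, fetch=fake_pages.get))
```

Threads mainly help by overlapping the time spent waiting on the network. The parsing work itself is CPU-bound, so the parser's per-page cost does not simply disappear once the crawler is concurrent.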
So whether Beautiful Soup is suitable for your project really depends on the project's scale and nature. Replacing Beautiful Soup with lxml is likely to give you a small (but considerable at scale) performance improvement. This does, however, come at the cost of losing the Beautiful Soup API, which makes selecting on-page elements a breeze.