Crawlab AI: Building Intelligent Web Scrapers with Large Language Models (LLM)
"If I had asked people what they wanted, they would have said faster horses" -- Henry Ford
Preface
When I first entered the workforce, as a data analyst, I accidentally experienced the ability of web crawlers to automatically extract webpage data, and since then I've been fascinated by this magical technology. As I continued to delve into web scraping technology, I gradually understood the core technologies of web crawling, including web parsing - the process of analyzing webpage HTML structure to build data extraction rules based on XPath or CSS Selectors. This process has long required manual intervention. While relatively simple for scraping engineers, if large-scale extraction is needed, this process is very time-consuming, and as webpage structures change, it increases crawler maintenance costs. This article will introduce my LLM-based intelligent web scraping product: Crawlab AI. Although it's still in early development, it has already shown great potential and promises to make data acquisition easy for data practitioners.
Related Work
As the founder of the web scraping management platform Crawlab, I've always been passionate about making data acquisition simple and easy. Through constant communication with data practitioners, I realized the massive demand for intelligent scrapers (or universal scrapers) - extracting target data from any website without manually writing parsing rules. Of course, I'm not the only one researching and trying to solve this problem: In January 2020, Qingnan released the universal article parsing library GeneralNewsExtractor based on punctuation density, which can implement universal news crawlers with 4 lines of code; In July 2020, Cui Qingcai released GerapyAutoExtractor, implementing list page data extraction based on SVM algorithms; In April 2023, I developed Webspot through high-dimensional vector clustering algorithms, which can also automatically extract list pages. The main problem with these open-source software is that their recognition accuracy has some gaps compared to manually written crawler rules.
Additionally, commercial scraping software Diffbot and Octoparse have also implemented some universal data extraction functionality through proprietary machine learning algorithms. Unfortunately, their usage costs are relatively high. For example, Diffbot's lowest plan requires a monthly subscription fee of $299.