Archiving Deferred Representations Using a Two-Tiered Crawling Approach


Web resources are increasingly interactive, resulting in resources that are increasingly difficult to archive. The archival difficulty is based on the use of client-side technologies (e.g., JavaScript) to change the client-side state of a representation after it has initially loaded. We refer to these representations as deferred representations. We can better archive deferred representations using tools like headless browsing clients. We use 10,000 seed Uniform Resource Identifiers (URIs) to explore the impact of including PhantomJS – a headless browsing tool – into the crawling process by comparing the performance of wget (the baseline), PhantomJS, and Heritrix. Heritrix crawled 2.065 URIs per second, 12.15 times faster than PhantomJS and 2.4 times faster than wget. However, PhantomJS discovered 531,484 URIs, 1.75 times more than Heritrix and 4.11 times more than wget. To take advantage of the performance benefits of Heritrix and the URI discovery of PhantomJS, we recommend a tiered crawling strategy in which a classifier predicts whether a representation will be deferred or not, and only resources with deferred representations are crawled with PhantomJS while resources without deferred representations are crawled with Heritrix. We show that this approach is 5.2 times faster than using only PhantomJS and creates a frontier (set of URIs to be crawled) 1.8 times larger than using only Heritrix.


💡 Research Summary

The paper addresses a fundamental challenge in web archiving: the preservation of “deferred representations,” i.e., pages whose final content is assembled client‑side through JavaScript, Ajax, or other dynamic techniques after the initial HTTP response. Traditional crawlers such as wget and Heritrix do not execute JavaScript, so they miss many embedded resources, resulting in incomplete mementos, “zombie” pages, or HTTP 404 errors when archived versions are later replayed.

To quantify this problem, the authors built a dataset of 10,000 seed URIs by generating random Bitly links and following their redirects. The set was split into twenty 500‑URI batches, and each batch was crawled simultaneously by three tools: wget (a command‑line downloader), Heritrix (the Internet Archive’s multi‑threaded crawler), and PhantomJS (a headless browser that fully renders pages and executes JavaScript). Each crawl was repeated ten times to obtain stable averages. Two primary metrics were collected: (1) URIs per second (t_URI), measuring raw crawl speed, and (2) frontier size (|F|), the total number of distinct URIs discovered, including those embedded via client‑side code.
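The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' measurement code: the `(timestamp, uri)` log format and the `crawl_metrics` helper are hypothetical stand-ins for whatever each crawler actually emits.

```python
from urllib.parse import urldefrag

def crawl_metrics(log, batch_size):
    """Compute the paper's two metrics from a list of
    (timestamp_seconds, discovered_uri) crawl-log entries.
    The log format here is hypothetical; real crawlers differ."""
    if not log:
        return 0.0, 0
    elapsed = max(t for t, _ in log) - min(t for t, _ in log)
    # Frontier size |F|: distinct URIs discovered (fragments stripped,
    # so http://x/a#top and http://x/a count once).
    frontier = {urldefrag(uri)[0] for _, uri in log}
    # Crawl speed t_URI: seed URIs dereferenced per second of wall time.
    speed = batch_size / elapsed if elapsed else float("inf")
    return speed, len(frontier)

# Toy example: a 2-seed batch crawled over 10 seconds,
# discovering 3 distinct resources (the last two collapse to one).
log = [
    (0.0, "http://example.com/"),
    (4.0, "http://example.com/app.js"),
    (10.0, "http://example.com/style.css#top"),
    (10.0, "http://example.com/style.css"),
]
speed, frontier_size = crawl_metrics(log, batch_size=2)
print(speed, frontier_size)  # 0.2 URIs/s, frontier of 3
```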

Results show a stark trade‑off. Heritrix processes 2.065 URIs/s, making it 12.15× faster than PhantomJS (0.170 URIs/s) and 2.39× faster than wget (0.864 URIs/s). However, Heritrix discovers far fewer resources: only about 304,000 distinct URIs, whereas PhantomJS uncovers 531,484, a 1.75× increase over Heritrix and a 4.11× increase over wget. The authors illustrate this with a concrete example (truthinshredding.com) where PhantomJS captures two of three JavaScript‑loaded assets while Heritrix captures none.
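The reported ratios follow directly from the raw measurements and can be checked with simple arithmetic (the ~304,000 Heritrix frontier count is an approximation taken from this summary):

```python
# Crawl speeds in URIs/second, as reported in the paper.
heritrix_speed, phantomjs_speed, wget_speed = 2.065, 0.170, 0.864

print(round(heritrix_speed / phantomjs_speed, 2))  # 12.15x faster than PhantomJS
print(round(heritrix_speed / wget_speed, 2))       # 2.39x faster than wget

# Frontier sizes: Heritrix figure is approximate (~304k).
phantomjs_frontier, heritrix_frontier = 531_484, 304_000
print(round(phantomjs_frontier / heritrix_frontier, 2))  # 1.75x larger frontier
```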

Recognizing that neither extreme is ideal, the authors propose a two‑tiered crawling strategy. A classifier—trained on features indicative of deferred representations—first predicts whether a page will require JavaScript execution. Pages flagged as “deferred” are fetched with PhantomJS; all others are processed with the high‑speed Heritrix. Simulated deployment of this hybrid approach yields a 5.2× speed improvement over a pure PhantomJS crawl while expanding the frontier by 1.8× compared to a pure Heritrix crawl. This demonstrates that selective use of headless browsing can combine the best of both worlds: high throughput and high completeness.
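The dispatch logic of the two-tiered strategy can be sketched as below. This is a toy illustration under stated assumptions: the paper's classifier is trained on real features, whereas `is_deferred` here is a hypothetical stand-in that just scans the initial HTML for JavaScript markers, and the two `crawl_*` functions are placeholders for invoking the actual tools.

```python
# Hypothetical stand-in for the paper's trained classifier:
# flag pages whose initial HTML hints at client-side state changes.
def is_deferred(html: str) -> bool:
    markers = ("XMLHttpRequest", "fetch(", "<script")
    return any(m in html for m in markers)

# Placeholder crawler hooks; in practice these would launch
# Heritrix (fast, no JS) or PhantomJS (slow, executes JS).
def crawl_heritrix(uri: str) -> str:
    return f"heritrix:{uri}"

def crawl_phantomjs(uri: str) -> str:
    return f"phantomjs:{uri}"

def tiered_crawl(pages):
    """Route each (uri, initial_html) pair to the appropriate tier."""
    results = []
    for uri, html in pages:
        tool = crawl_phantomjs if is_deferred(html) else crawl_heritrix
        results.append(tool(uri))
    return results

pages = [
    ("http://static.example/", "<html><body>plain</body></html>"),
    ("http://app.example/", "<html><script>fetch('/api')</script></html>"),
]
print(tiered_crawl(pages))
# ['heritrix:http://static.example/', 'phantomjs:http://app.example/']
```

The speed/completeness trade-off hinges on classifier accuracy: each false positive pays PhantomJS's ~12× slowdown for nothing, while each false negative leaves JavaScript-loaded resources out of the frontier.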

The paper situates its contribution within prior work on archivability metrics, JavaScript‑heavy page indexing, and Rich Internet Application (RIA) crawling models. Unlike earlier studies focused on search‑engine indexing or state‑machine extraction, this work directly measures archival outcomes—both crawl time and memento completeness. Limitations are acknowledged: PhantomJS’s single‑threaded nature, reliance on timeout heuristics that may miss late‑loading resources, and the dependence on classifier accuracy (false positives waste resources; false negatives leave gaps). The authors suggest future directions such as employing modern headless browsers (Puppeteer, Playwright) with parallel instances, refining timeout strategies, and enhancing the classifier with richer features.

In conclusion, the study provides empirical evidence that a tiered crawling architecture—leveraging a fast traditional crawler for static pages and a headless browser for dynamic, deferred pages—significantly improves web archive quality without prohibitive performance penalties. This approach offers a practical roadmap for large‑scale preservation initiatives seeking to capture the modern, JavaScript‑driven web.

