Enriching Existing Test Collections with OXPath
Extending TREC-style test collections by incorporating external resources is a time-consuming and challenging task. Making use of freely available web data requires technical skills to work with APIs or to create a web scraping program specifically tailored to the task at hand. We present a lightweight alternative that employs the web data extraction language OXPath to harvest data from web resources and add it to an existing test collection. We demonstrate this by creating an extended version of GIRT4 called GIRT4-XT with additional metadata fields harvested via OXPath from the social sciences portal Sowiport. This allows the collection to be reused for other evaluation purposes such as bibliometrics-enhanced retrieval. The demonstrated method can be applied to a variety of similar scenarios and is not limited to extending existing collections; it can also be used to create completely new ones with little effort.
💡 Research Summary
The paper addresses the labor‑intensive task of extending TREC‑style test collections with external resources. Traditional approaches require either custom API integration or the development of bespoke web crawlers, both of which demand substantial programming expertise and maintenance effort. The authors propose a lightweight alternative based on OXPath, an open‑source language that extends XPath with actions such as clicking, form filling, page navigation, hierarchical marking, and a Kleene star operator for iterating over paginated results.
To demonstrate the feasibility of the approach, the authors selected the GIRT4 collection, a multilingual CLEF test collection originally consisting only of document titles and short abstracts. They aimed to enrich GIRT4 with additional bibliographic metadata (editor, ISSN, ISBN, publisher, location, page numbers) by harvesting information from the social‑science portal Sowiport, specifically its SOLIS database. An OXPath script of roughly ten lines was written to: (1) open the Sowiport homepage, (2) apply a filter to restrict results to the SOLIS collection, (3) set the results page size to 100 items, (4) iterate through all result pages using the “next” button, (5) click each record’s title to open a detailed view, and (6) extract the desired fields using XPath expressions. The script also captures the acquisition ID present in SOLIS records, which matches the DOCID field in GIRT4, enabling a straightforward join operation.
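The join described above, matching the SOLIS acquisition ID against the GIRT4 DOCID, could look roughly like the following Python sketch. It is not the authors' code: the record layout (plain dictionaries), the key names `acquisition_id`, `editor`, `issn`, `isbn`, `publisher`, `location`, and `pages`, and the function name `enrich` are all assumptions made for illustration. Only the idea of joining on the shared identifier comes from the summary.

```python
from typing import Dict, List, Tuple

# The six metadata fields named in the summary; the key spellings are assumed.
NEW_FIELDS = ("editor", "issn", "isbn", "publisher", "location", "pages")


def enrich(
    collection: List[Dict[str, str]],
    harvested: List[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], int]:
    """Left-join harvested records onto collection documents by DOCID.

    Returns the enriched documents and the number of documents that
    found a matching harvested record.
    """
    # Index harvested records by their (assumed) acquisition ID key.
    by_id = {rec["acquisition_id"]: rec for rec in harvested}

    enriched = []
    matched = 0
    for doc in collection:
        merged = dict(doc)  # copy so the original document stays untouched
        rec = by_id.get(doc["DOCID"])
        if rec is not None:
            matched += 1
            # Copy only the new metadata fields that were actually harvested.
            for field in NEW_FIELDS:
                if field in rec:
                    merged[field] = rec[field]
        enriched.append(merged)
    return enriched, matched
```

A left join keeps every original document even when no harvested record matches, which fits the statement that the topics and relevance judgments remain usable on the full collection. The returned match count gives the coverage figure directly.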
The harvesting process succeeded for 13,214 out of 15,319 GIRT4 documents (approximately 86%). The enriched collection, named GIRT4‑XT, now contains six new structured fields that were absent from the original dataset. This enrichment enables new evaluation scenarios, such as bibliometrics‑enhanced retrieval, where ISSN/ISBN can be used for journal‑level analysis, editors for author‑centric studies, and page numbers for length‑based ranking features. Importantly, the original topics and relevance judgments of GIRT4 remain unchanged, allowing researchers to reuse the collection for comparative experiments without re‑creating the entire test suite.
The authors discuss several limitations of OXPath. Because the language renders the full HTML page and processes the DOM to simulate human interaction, extraction is relatively slow; processing hundreds of thousands of pages can take days on a single machine. OXPath does not provide built‑in parallelism, so scaling requires external orchestration (e.g., multithreading, distributed job queues). Tooling is also limited; the only dedicated development aid mentioned is an Atom editor extension. Despite these drawbacks, OXPath’s memory efficiency and declarative nature make it attractive for non‑technical users who need to harvest semi‑structured data quickly.
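The external orchestration mentioned above might be sketched in Python as follows. This is an illustration under stated assumptions, not part of the paper: the summary does not say how the harvest is split into jobs, so the batching by ID list, the function names `chunk` and `harvest_in_parallel`, and the `worker` callable are hypothetical. In practice `worker` would wrap whatever launches one OXPath run for a batch; here it is left as a parameter because that invocation is not described in the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def chunk(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def harvest_in_parallel(
    ids: Sequence[str],
    worker: Callable[[List[str]], List[Dict[str, str]]],
    batch_size: int = 100,
    max_workers: int = 4,
) -> List[Dict[str, str]]:
    """Run `worker` on each batch of IDs concurrently and flatten the results.

    `worker` is a hypothetical stand-in for one extraction job; it takes
    a batch of IDs and returns the records harvested for that batch.
    """
    batches = chunk(ids, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input batches.
        results = pool.map(worker, batches)
        return [record for batch in results for record in batch]
```

Threads are a reasonable fit when each job mostly waits on an external process or on the network. For CPU-bound work, or for runs spread across machines, a process pool or a job queue would be the more natural choice.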
Future work outlined includes (1) integrating parallel execution mechanisms to accelerate large‑scale harvesting, (2) extending the methodology to other scholarly databases such as PubMed or CrossRef, thereby enabling the enrichment of TREC Genomics Track collections, and (3) developing higher‑level GUI tools to lower the entry barrier further. The overarching vision is to create a pipeline where a test collection can be built or enriched entirely from web resources with minimal coding, supporting a broad range of IR research, from cross‑lingual retrieval to bibliometric analysis.