DAPFAM: A Domain-Specific Patent Retrieval Dataset Compiled at the Family Level
June 30, 2025 | by magnews24.com

Introducing DAPFAM: A New Frontier in Patent Retrieval Datasets
Demand for efficient patent retrieval has grown significantly in recent years, and with it the need for well-structured evaluation datasets. Existing patent retrieval datasets, however, often lack explicit in-domain and out-of-domain labels, cover too few jurisdictions, and represent query domains unevenly. These gaps make comprehensive retrieval evaluation difficult, particularly for researchers and practitioners working with limited computational resources.
To address these shortcomings, we present the Domain-Aware Patent Retrieval Dataset, or DAPFAM. Constructed at the simple-family level, DAPFAM comprises 1,247 domain-balanced full-text query families paired with 45,336 full-text target families. Relevance judgments are clearly defined: forward and backward citations serve as positive links, and a set of random negatives is added alongside them.
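To make the relevance scheme concrete, the following is a minimal Python sketch of how citation-based positives and random negatives could be assembled for one query family. It is an illustration under stated assumptions, not the DAPFAM pipeline: the function name build_judgments, its parameters, the family identifiers, and the sampling strategy are all hypothetical.

```python
import random


def build_judgments(query_id, backward_citations, forward_citations,
                    candidate_pool, n_negatives, seed=0):
    """Return {target_family_id: label} for one query family.

    Label 1 marks a citation link (positive), label 0 a random negative.
    All names and the sampling strategy are illustrative assumptions.
    """
    # Positives: families the query cites (backward) or that cite it (forward).
    positives = set(backward_citations) | set(forward_citations)
    positives.discard(query_id)

    # Negatives: sampled at random from candidates with no citation link.
    eligible = sorted(set(candidate_pool) - positives - {query_id})
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    negatives = rng.sample(eligible, min(n_negatives, len(eligible)))

    judgments = {family: 1 for family in positives}
    judgments.update({family: 0 for family in negatives})
    return judgments


judgments = build_judgments(
    query_id="Q1",
    backward_citations=["A", "B"],
    forward_citations=["C"],
    candidate_pool=["A", "B", "C", "D", "E", "F"],
    n_negatives=2,
)
```

Sorting the eligible candidates before sampling keeps the result independent of set iteration order, so the same seed always yields the same negatives.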
A distinguishing feature of DAPFAM is a new labeling scheme based on International Patent Classification (IPC) codes, which makes the distinction between in-domain and out-of-domain relationships explicit. The resulting dataset contains 49,869 evaluation pairs.
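One plausible way to derive such labels is to compare truncated IPC codes of the query and target families. The sketch below is an assumption made for illustration: this article does not state which IPC level the scheme uses or how overlap is defined, so the three-character prefix, the "share at least one prefix" rule, and the function names are hypothetical.

```python
def ipc_prefixes(codes, length=3):
    """Truncate IPC codes to their leading characters, e.g. 'H04L 9/32' -> 'H04'."""
    return {code.replace(" ", "")[:length].upper() for code in codes if code.strip()}


def domain_label(query_ipc, target_ipc, length=3):
    """Label a query/target pair by IPC overlap (assumed rule, not the paper's)."""
    shared = ipc_prefixes(query_ipc, length) & ipc_prefixes(target_ipc, length)
    return "in-domain" if shared else "out-of-domain"


same_field = domain_label(["H04L 9/32"], ["H04W 12/06", "G06F 21/00"])
other_field = domain_label(["H04L 9/32"], ["A61K 31/00"])
```

Under this assumed rule, the first pair shares the prefix H04 and is labeled in-domain, while the second pair shares no prefix and is labeled out-of-domain.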
DAPFAM also offers multi-jurisdictional coverage and requires minimal preprocessing for retrieval evaluation. The dataset is sized to remain manageable for groups with limited computational resources, so sub-document level retrieval experiments can be run without excessive computational cost. This accessibility opens patent retrieval research to a wider range of institutions and individuals.
Our publication describes the three-step data-curation pipeline in detail and reports comprehensive dataset statistics. Baseline experiments with both lexical and neural retrieval methods show that cross-domain patent retrieval remains a significant challenge.
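Reporting a metric separately for in-domain and out-of-domain positives is one way to expose a cross-domain gap of this kind. The sketch below uses recall at rank k as an example. The choice of metric, the function names, and the toy ranking are assumptions for illustration and are not taken from the reported experiments.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant targets found in the top k; None if none exist."""
    relevant = set(relevant_ids)
    if not relevant:
        return None
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def recall_by_domain(ranked_ids, positives, k):
    """positives maps target id -> 'in-domain' or 'out-of-domain'."""
    scores = {}
    for label in ("in-domain", "out-of-domain"):
        relevant = {target for target, dom in positives.items() if dom == label}
        scores[label] = recall_at_k(ranked_ids, relevant, k)
    return scores


# Toy ranking produced by some retriever for a single query (made-up ids).
ranking = ["T1", "T5", "T2", "T9"]
positives = {"T1": "in-domain", "T2": "in-domain",
             "T3": "out-of-domain", "T9": "out-of-domain"}
scores = recall_by_domain(ranking, positives, k=3)
```

In this toy case both in-domain positives appear in the top three results while neither out-of-domain positive does, so the two recall values diverge sharply.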
As patent research evolves, structured and accessible datasets like DAPFAM are both timely and necessary. Researchers and organizations that want to advance their patent retrieval capabilities can access the dataset publicly via this repository.
In summary, DAPFAM fills critical gaps in current patent retrieval evaluation and provides a resource for advancing research and practice in technological innovation and intellectual property management.