BEIR (Benchmarking IR) is a heterogeneous benchmark containing different information retrieval (IR) tasks. Through BEIR, it is possible to systematically study the zero-shot generalization capabilities of multiple neural retrieval approaches.
The benchmark contains a total of 9 information retrieval tasks (Fact Checking, Citation Prediction, Duplicate Question Retrieval, Argument Retrieval, News Retrieval, Question Answering, Tweet Retrieval, Biomedical IR, Entity Retrieval) from 19 different datasets:
MS MARCO, TREC-COVID, NFCorpus, BioASQ, Natural Questions, HotpotQA, FiQA-2018, Signal-1M, TREC-News, ArguAna, Touche 2020, CQADupStack, Quora Question Pairs, DBPedia, SciDocs, FEVER, Climate-FEVER, SciFact, Robust04
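For reference, here is a minimal sketch of loading one of these datasets with the `beir` Python package (`pip install beir`); the SciFact download URL below reflects the commonly used BEIR hosting location and may change.

```python
# Minimal sketch: load a BEIR dataset (here SciFact) with the beir package.
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unzip the dataset archive (hosting URL is an assumption that may change).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: {doc_id: {"title": ..., "text": ...}}, queries: {qid: text},
# qrels: {qid: {doc_id: relevance}}
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
print(len(corpus), len(queries), len(qrels))
```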
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A collection of information retrieval benchmarks covering 15 corpora (1.9 billion documents) on which 32 well-known shared tasks are based. We filled the leaderboards with Docker images of 50 standard retrieval approaches. Within this setup, we were able to automatically run and evaluate the 50 approaches on the 32 tasks (1,600 runs). All benchmarks are added as training datasets because their qrels are already publicly available. Please find a detailed tutorial on how to submit approaches on GitHub.
View on TIRA: https://tira.io/task-overview/ir-benchmarks
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
In scientific research, the ability to effectively retrieve relevant documents based on complex, multifaceted queries is critical. Existing evaluation datasets for this task are limited, primarily due to the high costs and effort required to annotate resources that effectively represent complex queries. To address this, we propose a novel task, Scientific DOcument Retrieval using Multi-level Aspect-based quEries (DORIS-MAE), which is designed to handle the complex nature of user queries in scientific research.
Documentation for the DORIS-MAE dataset is publicly available at https://github.com/Real-Doris-Mae/Doris-Mae-Dataset. This upload contains both DORIS-MAE dataset version 1 and ada-002 vector embeddings for all queries and related abstracts (used in candidate pool creation). DORIS-MAE dataset version 1 comprises four main sub-datasets, each serving distinct purposes.
The Query dataset contains 100 human-crafted complex queries spanning five categories: ML, NLP, CV, AI, and Composite. Each category has 20 associated queries. Queries are broken down into aspects (ranging from 3 to 9 per query) and sub-aspects (from 0 to 6 per aspect, with 0 signifying that no further breakdown is required). For each query, a corresponding candidate pool of relevant paper abstracts, ranging from 99 to 138, is provided.
The Corpus dataset is composed of 363,133 abstracts from computer science papers, published between 2011 and 2021 and sourced from arXiv. Each entry includes the title, original abstract, URL, primary and secondary categories, as well as citation information retrieved from Semantic Scholar. A masked version of each abstract is also provided, facilitating the automated creation of queries.
The Annotation dataset includes generated annotations for all 165,144 question pairs, each comprising an aspect/sub-aspect and a corresponding paper abstract from the query's candidate pool. It includes the original text generated by ChatGPT (version chatgpt-3.5-turbo-0301) explaining its decision-making process, along with a three-level relevance score (0, 1, or 2) representing ChatGPT's final decision.
Finally, the Test Set dataset contains human annotations for a random selection of 250 question pairs used in hypothesis testing. It includes each of the three human annotators' final decisions, recorded as a three-level relevance score (0, 1, or 2).
The file "ada_embedding_for_DORIS-MAE_v1.pickle" contains text embeddings for the DORIS-MAE dataset, generated by OpenAI's ada-002 model. The structure of the file is as follows:
ada_embedding_for_DORIS-MAE_v1.pickle
├── "Query"
│   ├── query_id_1 (Embedding of query_1)
│   ├── query_id_2 (Embedding of query_2)
│   ├── query_id_3 (Embedding of query_3)
│   └── ...
└── "Corpus"
    ├── corpus_id_1 (Embedding of abstract_1)
    ├── corpus_id_2 (Embedding of abstract_2)
    ├── corpus_id_3 (Embedding of abstract_3)
    └── ...
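A minimal sketch for reading this pickle file in Python; the key layout follows the tree above, while the exact id strings are placeholders.

```python
# Minimal sketch: inspect the ada-002 embedding file described above.
import pickle

with open("ada_embedding_for_DORIS-MAE_v1.pickle", "rb") as f:
    embeddings = pickle.load(f)

query_embeddings = embeddings["Query"]    # {query_id: embedding vector}
corpus_embeddings = embeddings["Corpus"]  # {corpus_id: embedding vector}

# Peek at one entry; ada-002 embeddings are 1536-dimensional vectors.
some_query_id = next(iter(query_embeddings))
print(some_query_id, len(query_embeddings[some_query_id]))
```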
CoIR (Code Information Retrieval) is a benchmark designed to evaluate code retrieval capabilities. CoIR includes 10 curated code datasets, covering 8 retrieval tasks across 7 domains, and encompasses two million documents in total. It also provides a common, easy-to-use Python framework, installable via pip, and shares the same data schema as benchmarks like MTEB and BEIR for easy cross-benchmark evaluation.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is derived from the GermanQuAD dataset: it takes the test set and represents it as qrels in the BEIR information retrieval benchmark format. Corpus and query IDs have been added. The corresponding corpus can be found here. Full credit for the original dataset goes to the authors of the GermanQuAD dataset. The original dataset is licensed under CC BY-SA 4.0. Citation for the original dataset: @misc{möller2021germanquad, title={GermanQuAD and GermanDPR: Improving… See the full description on the dataset page: https://huggingface.co/datasets/mteb/germanquad-retrieval-qrels.
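A hedged sketch of pulling these qrels from the Hugging Face Hub with the `datasets` library; the available splits and column names should be verified against the dataset page, as the usual BEIR qrels layout (query-id, corpus-id, score) is assumed here.

```python
# Hedged sketch: load the qrels repository referenced above and inspect its splits/columns.
from datasets import load_dataset

qrels = load_dataset("mteb/germanquad-retrieval-qrels")
print(qrels)
```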
Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
French Taxation Embedding Benchmark (retrieval)
This dataset is designed for the task of retrieving relevant tax articles or content based on queries in the French language. It can be used for benchmarking information retrieval systems, particularly in the legal and financial domains.
Massive Text Embedding Benchmark for French Taxation
In this notebook, we will explore the process of adding a new task to the Massive Text Embedding Benchmark (MTEB). The MTEB is an… See the full description on the dataset page: https://huggingface.co/datasets/louisbrulenaudet/tax-retrieval-benchmark.
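As a rough illustration of how such a task would be run once registered, here is a hedged sketch using the `mteb` package; the task name "TaxRetrievalBenchmark" and the model identifier are placeholders, not names confirmed by the dataset page.

```python
# Hedged sketch: evaluate an embedding model on an MTEB retrieval task.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder model; any sentence-transformers model with French coverage could be used.
model = SentenceTransformer("intfloat/multilingual-e5-small")

# "TaxRetrievalBenchmark" is a hypothetical task name standing in for the
# custom French-taxation retrieval task described above.
evaluation = MTEB(tasks=["TaxRetrievalBenchmark"])
evaluation.run(model, output_folder="results")
```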
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Legal Case Passage Extraction and Retrieval benchmark is an information retrieval collection for court-case passage retrieval. Specifically, it is a collection for evaluating Cited Case Passage Retrieval (CCPR) and contains case passages from the Austrian building regulations domain (source: RIS). The following files are included in the dataset (a loading sketch follows the file descriptions):
A tab-separated file containing the passage texts of court cases from the building regulations domain. Column 1 contains the ID of the passage, Column 2 the passage text, and Column 3 the case ID (Geschäftszahl) of the passage's origin case.
A tab-separated file containing the queries/topics for which relevance assessments exist in this collection. Column 1 contains the ID of the query, Column 2 the query passage text, and Column 3 the case ID (Geschäftszahl) of the cited case. For the CCPR task, results are intended to be additionally filtered by exact matches of the case ID: for each query, relevance assessments only exist for passages that match the case ID in Column 3.
Contains relevance assessments for each query. In this dictionary, a passage from the full collection is relevant for a query if qrel[
A conversion of the qrel.json file to be compatible with trec_eval.
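The sketch below shows one way to read the collection and apply the case-ID filtering described above; the file names ("passages.tsv", "queries.tsv", "qrel.json") are assumptions, while the column layout follows the descriptions (ID, text, Geschäftszahl).

```python
# Hedged sketch: read the passage/query TSV files and restrict candidates by case ID.
import csv
import json

def read_tsv(path):
    with open(path, encoding="utf-8") as f:
        return [row for row in csv.reader(f, delimiter="\t")]

passages = read_tsv("passages.tsv")  # rows: [passage_id, passage_text, case_id]
queries = read_tsv("queries.tsv")    # rows: [query_id, query_text, cited_case_id]

with open("qrel.json", encoding="utf-8") as f:
    qrels = json.load(f)

# For CCPR, candidate passages are restricted to the cited case of each query.
query_id, _, cited_case = queries[0]
candidates = [p for p in passages if p[2] == cited_case]
print(query_id, len(candidates), "candidate passages")
```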
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Music recommender systems can offer users personalized and contextualized recommendations and are therefore important for music information retrieval. An increasing number of datasets have been compiled to facilitate research on different topics, such as content-based, context-based, or next-song recommendation. However, these topics are usually addressed separately using different datasets, due to the lack of a unified dataset that contains a large variety of feature types such as item features, user contexts, and timestamps. To address this issue, we propose a large-scale benchmark dataset called #nowplaying-RS, which contains 11.6 million music listening events (LEs) of 139K users and 346K tracks collected from Twitter. The dataset comes with a rich set of item content features and user context features, and the timestamps of the LEs. Moreover, some of the user context features imply the cultural origin of the users, and others, like hashtags, give clues to the emotional state of a user underlying an LE. In this paper, we provide some statistics to give insight into the dataset, and some directions in which the dataset can be used for music recommendation. We also provide standardized training and test sets for experimentation, and some baseline results obtained by using factorization machines.
The dataset contains three files:
Please also find the training and test splits for the dataset in this repo. Also, prototypical implementations of a context-aware recommender system based on the dataset can be found at https://github.com/asmitapoddar/nowplaying-RS-Music-Reco-FM.
If you make use of this dataset, please cite the following paper where we describe and experiment with the dataset:
@inproceedings{smc18,
title = {#nowplaying-RS: A New Benchmark Dataset for Building Context-Aware Music Recommender Systems},
author = {Asmita Poddar and Eva Zangerle and Yi-Hsuan Yang},
url = {http://mac.citi.sinica.edu.tw/~yang/pub/poddar18smc.pdf},
year = {2018},
date = {2018-07-04},
booktitle = {Proceedings of the 15th Sound & Music Computing Conference},
address = {Limassol, Cyprus},
note = {code at https://github.com/asmitapoddar/nowplaying-RS-Music-Reco-FM},
tppubtype = {inproceedings}
}
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CqaDupstack
Citation-Prediction: SCIDOCS
Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/trec-news-generated-queries.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
BEIR (Benchmarking IR) is a heterogeneous benchmark containing different information retrieval (IR) tasks. Through BEIR, it is possible to systematically study the zero-shot generalization capabilities of multiple neural retrieval approaches. The benchmark contains a total of 9 information retrieval tasks (Fact Checking, Citation Prediction, Duplicate Question Retrieval, Argument Retrieval, News Retrieval, Question Answering, Tweet Retrieval, Biomedical IR, Entity Retrieval) from 18 different datasets: MS MARCO, TREC-COVID, NFCorpus, BioASQ, Natural Questions, HotpotQA, FiQA-2018, Signal-1M, TREC-News, ArguAna, Touche 2020, CQADupStack, Quora Question Pairs, DBPedia, SciDocs, FEVER, Climate-FEVER, SciFact
NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide large-scale multi-domain benchmark datasets for Personalized Search.
Further information can be found here.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Linked lists represent a countable number of ordered values, and are among the most important abstract data types in computer science. With the advent of RDF as a highly expressive knowledge representation language for the Web, various implementations for RDF lists have been proposed. Yet, there is no benchmark so far dedicated to evaluate the performance of triple stores and SPARQL query engines on dealing with ordered linked data. Moreover, essential tasks for evaluating RDF lists, like generating datasets containing RDF lists of various sizes, or generating the same RDF list using different modelling choices, are cumbersome and unprincipled. In this paper, we propose List.MID, a systematic benchmark for evaluating systems serving RDF lists. List.MID consists of a dataset generator, which creates RDF list data in various models and of different sizes; and a set of SPARQL queries. The RDF list data is coherently generated from a large, community-curated base collection of Web MIDI files, rich in lists of musical events of arbitrary length. We describe the List.MID benchmark, and discuss its impact and adoption, reusability, design, and availability.
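To make the RDF list modelling choices concrete, here is a hedged Python/rdflib sketch that builds a short ordered sequence in the classic rdf:first/rdf:rest (rdf:List) model; the namespace, property name, and note values are illustrative only and are not taken from the List.MID generator.

```python
# Hedged sketch: an ordered sequence modelled as an rdf:first/rdf:rest chain with rdflib.
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.collection import Collection

EX = Namespace("http://example.org/")
g = Graph()

# A short ordered list of MIDI note numbers as an rdf:List.
head = BNode()
Collection(g, head, [Literal(60), Literal(62), Literal(64)])
g.add((EX.track1, EX.hasNoteSequence, head))

print(g.serialize(format="turtle"))
```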
CLIRMatrix is a large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. It includes:
BI-139: a bilingual dataset of queries in one language matched with relevant documents in another language, covering 139×138 = 19,182 language pairs; and MULTI-8: a multilingual dataset of queries and documents jointly aligned in 8 different languages.
In total, 49 million unique queries and 34 billion (query, document, label) triplets were mined, making CLIRMatrix the largest and most comprehensive CLIR dataset to date.
Multi-EuP is a new multilingual benchmark dataset comprising 22K multilingual documents collected from the European Parliament and spanning 24 languages. It is designed to investigate fairness in a multilingual information retrieval (IR) setting and to analyze both language and demographic bias in a ranking context. It boasts an authentic multilingual corpus, featuring topics translated into all 24 languages, as well as cross-lingual relevance judgments. Furthermore, it offers rich demographic information associated with its documents, facilitating the study of demographic bias.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CqaDupstack
Citation-Prediction: SCIDOCS
Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/arguana-qrels.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Personalization in Information Retrieval is a topic that has been studied for a long time. Nevertheless, there is still a lack of high-quality, real-world datasets to conduct large-scale experiments and evaluate models for personalized search. This paper contributes to filling this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new resource to design and evaluate personalized models for two community Question Answering (cQA) tasks. The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with both questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The results of the preliminary experiments conducted show the appropriateness of SE-PQA for training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the benefits, in terms of robustness and generalization, of combining data from multiple communities for personalization purposes.
Performance on all communities separately:
| Community | Model (BM25 +) | P@1 | NDCG@3 | NDCG@10 | R@100 | MAP@100 | λ |
|---|---|---|---|---|---|---|---|
| Academia | MiniLM | 0.438 | 0.382 | 0.395 | 0.489 | 0.344 | (.1,.9) |
| | MiniLM + TAG | 0.453 | 0.392 | 0.403 | 0.489 | 0.352 | (.1,.8,.1) |
| Anime | MiniLM + TAG | 0.650 | 0.682 | 0.714 | 0.856 | 0.683 | (.1,.9,.0) |
| Apple | MiniLM | 0.327 | 0.351 | 0.381 | 0.514 | 0.349 | (.1,.9) |
| | MiniLM + TAG | 0.335 | 0.361 | 0.389 | 0.514 | 0.357 | (.1,.8,.1) |
| Bicycles | MiniLM | 0.405 | 0.380 | 0.421 | 0.600 | 0.365 | (.1,.9) |
| | MiniLM + TAG | 0.436 | 0.405 | 0.441 | 0.600 | 0.386 | (.1,.8,.1) |
| Boardgames | MiniLM | 0.681 | 0.694 | 0.728 | 0.866 | 0.692 | (.1,.9) |
| | MiniLM + TAG | 0.696 | 0.702 | 0.736 | 0.866 | 0.699 | (.1,.8,.1) |
| Buddhism | MiniLM + TAG | 0.490 | 0.387 | 0.397 | 0.544 | 0.334 | (.3,.7,.0) |
| Christianity | MiniLM | 0.534 | 0.505 | 0.555 | 0.783 | 0.497 | (.2,.8) |
| | MiniLM + TAG | 0.549 | 0.521 | 0.564 | 0.783 | 0.507 | (.1,.8,.1) |
| Cooking | MiniLM | 0.600 | 0.567 | 0.600 | 0.719 | 0.553 | (.1,.9) |
| | MiniLM + TAG | 0.619 | 0.583 | 0.614 | 0.719 | 0.568 | (.1,.8,.1) |
| DIY | MiniLM | 0.323 | 0.313 | 0.346 | 0.501 | 0.302 | (.1,.9) |
| | MiniLM + TAG | 0.335 | 0.324 | 0.356 | 0.501 | 0.312 | (.1,.8,.1) |
| Expatriates | MiniLM + TAG | 0.596 | 0.653 | 0.682 | 0.832 | 0.645 | (.1,.9,.0) |
| Fitness | MiniLM + TAG | 0.568 | 0.575 | 0.613 | 0.760 | 0.567 | (.2,.8,.0) |
| Freelancing | MiniLM + TAG | 0.513 | 0.472 | 0.506 | 0.654 | 0.457 | (.1,.9,.0) |
| Gaming | MiniLM | 0.510 | 0.534 | 0.562 | 0.686 | 0.532 | (.1,.9) |
| | MiniLM + TAG | 0.519 | 0.547 | 0.571 | 0.686 | 0.541 | (.1,.8,.1) |
| Gardening | MiniLM | 0.344 | 0.362 | 0.396 | 0.520 | 0.359 | (.1,.9) |
| | MiniLM + TAG | 0.345 | 0.369 | 0.399 | 0.520 | 0.363 | (.1,.8,.1) |
| Genealogy | MiniLM + TAG | 0.592 | 0.605 | 0.631 | 0.779 | 0.594 | (.3,.7,.0) |
| Health | MiniLM + TAG | 0.718 | 0.765 | 0.797 | 0.934 | 0.765 | (.2,.8,.0) |
| Hermeneutics | MiniLM | 0.589 | 0.538 | 0.593 | 0.828 | 0.526 | (.2,.8) |
| | MiniLM + TAG | 0.632 | 0.570 | 0.617 | 0.828 | 0.552 | (.1,.8,.1) |
| Hinduism | MiniLM | 0.388 | 0.415 | 0.459 | 0.686 | 0.416 | (.2,.8) |
| | MiniLM + TAG | 0.382 | 0.410 | 0.457 | 0.686 | 0.412 | (.1,.8,.1) |
| History | MiniLM + TAG | 0.740 | 0.735 | 0.764 | 0.862 | 0.730 | (.2,.8,.0) |
| Hsm | MiniLM + TAG | 0.666 | 0.707 | 0.737 | 0.870 | 0.690 | (.2,.8,.0) |
| Interpersonal | MiniLM + TAG | 0.663 | 0.617 | 0.653 | 0.739 | 0.604 | (.2,.8,.0) |
| Islam | MiniLM | 0.382 | 0.412 | 0.453 | 0.642 | 0.410 | (.1,.9) |
| | MiniLM + TAG | 0.395 | 0.427 | 0.464 | 0.642 | 0.421 | (.1,.8,.1) |
| Judaism | MiniLM + TAG | 0.363 | 0.387 | 0.432 | 0.649 | 0.388 | (.2,.8,.0) |
| Law | MiniLM | 0.663 | 0.647 | 0.678 | 0.803 | 0.639 | (.2,.8) |
| | MiniLM + TAG | 0.677 | 0.657 | 0.687 | 0.803 | 0.649 | (.1,.8,.1) |
| Lifehacks | MiniLM | 0.714 | 0.601 | 0.617 | 0.703 | 0.553 | (.1,.9) |
| | MiniLM + TAG | 0.714 | 0.621 | 0.631 | 0.703 | 0.568 | (.1,.8,.1) |
| Linguistics | MiniLM + TAG | 0.584 | 0.588 | 0.630 | 0.794 | 0.587 | (.2,.8,.0) |
| Literature | MiniLM + TAG | 0.871 | 0.878 | 0.889 | 0.934 | 0.876 | (.3,.7,.0) |
| Martialarts | MiniLM | 0.630 | 0.599 | 0.645 | 0.796 | 0.596 | (.1,.9) |
| | MiniLM + TAG | 0.640 | 0.628 | 0.660 | 0.796 | 0.612 | (.1,.8,.1) |
| Money | MiniLM | 0.545 | 0.535 | 0.563 | 0.706 | 0.515 | (.2,.8) |
| | MiniLM + TAG | 0.559 | 0.542 | 0.571 | 0.706 | 0.523 | (.1,.8,.1) |
| Movies | MiniLM | 0.713 | 0.722 | 0.753 | 0.865 | 0.724 | (.1,.9) |
| | MiniLM + TAG | 0.728 | 0.735 | 0.762 | 0.865 | 0.735 | (.1,.8,.1) |
| Music | MiniLM | 0.508 | 0.447 | 0.476 | 0.602 | 0.418 | (.2,.8) |
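The λ column above suggests a weighted combination of ranking signals. Below is a hedged sketch of that idea, reading tuples such as (.1,.9) or (.1,.8,.1) as convex weights over min-max-normalized BM25, MiniLM, and TAG-based personalization scores; this is an illustration of score fusion in general, not the authors' exact implementation, and the document/score values are made up.

```python
# Hedged sketch: linear interpolation of normalized retrieval scores.

def normalize(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def fuse(weights, *score_dicts):
    """Combine several score dicts with the given interpolation weights."""
    normed = [normalize(s) for s in score_dicts]
    docs = set().union(*score_dicts)
    return {d: sum(w * s.get(d, 0.0) for w, s in zip(weights, normed)) for d in docs}

# Toy scores for three candidate answers (illustrative values only).
bm25 = {"a1": 12.3, "a2": 9.8, "a3": 4.1}
minilm = {"a1": 0.61, "a2": 0.72, "a3": 0.40}
tag = {"a1": 0.20, "a2": 0.55, "a3": 0.10}

ranked = sorted(fuse((0.1, 0.8, 0.1), bm25, minilm, tag).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```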
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The image queries were used in the following studies:
* Y. Bitirim, S. Bitirim, D. Ç. Ertuğrul and Ö. Toygar, “An Evaluation of Reverse Image Search Performance of Google”, 2020 IEEE 44th Annual Computer Software and Applications Conference (COMPSAC), pp. 1368-1372, IEEE, Madrid, Spain, July 2020. (DOI: 10.1109/COMPSAC48688.2020.00-65)
* Y. Bitirim, “Retrieval Effectiveness of Google on Reverse Image Search”, Journal of Imaging Science and Technology, Vol. 66, No. 1, pp. 010505-1-010505-6, January 2022. (DOI: 10.2352/J.ImagingSci.Technol.2022.66.1.010505)
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Dataset Card for BEIR Benchmark
Dataset Summary
BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:
Fact-checking: FEVER, Climate-FEVER, SciFact
Question-Answering: NQ, HotpotQA, FiQA-2018
Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
News Retrieval: TREC-NEWS, Robust04
Argument Retrieval: Touche-2020, ArguAna
Duplicate Question Retrieval: Quora, CqaDupstack
Citation-Prediction: SCIDOCS
Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/scifact.