Dataset Card for msmarco-document-v2/trec-dl-2019/judged
The msmarco-document-v2/trec-dl-2019/judged dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
queries (i.e., topics); count=43
For docs, use irds/msmarco-document-v2
For qrels, use irds/msmarco-document-v2_trec-dl-2019
Usage
from datasets import load_dataset
queries = … (the usage snippet is truncated in this listing). See the full description on the dataset page: https://huggingface.co/datasets/irds/msmarco-document-v2_trec-dl-2019_judged.
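The truncated snippet presumably follows the usual ir-datasets pattern on Hugging Face. A minimal sketch, assuming the configuration name 'queries' and the query_id/text fields (check the dataset page for the exact schema):

from datasets import load_dataset

# Assumption: the 'queries' configuration exposes the 43 judged TREC DL 2019 topics
# with 'query_id' and 'text' fields; recent versions of the datasets library may
# additionally require trust_remote_code=True for script-based datasets.
queries = load_dataset("irds/msmarco-document-v2_trec-dl-2019_judged", "queries")
for split in queries.values():  # load_dataset without a split returns a DatasetDict
    for record in split:
        print(record["query_id"], record["text"])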
The jordane95/trec-dl-2019-query dataset is hosted on Hugging Face and was contributed by the HF Datasets community.
Large-scale passage retrieval aims to fetch relevant passages from a million- or billion-scale collection for a given query to meet users' information needs, and plays an important role in many downstream applications, including open-domain question answering, search engines, and recommendation systems.
Dataset Card for clinicaltrials/2019/trec-pm-2019
The clinicaltrials/2019/trec-pm-2019 dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.
Data
This dataset provides:
queries (i.e., topics); count=40
qrels: (relevance assessments); count=12,996
For docs, use irds/clinicaltrials_2019
Usage
from datasets import load_dataset
queries = load_dataset('irds/clinicaltrials_2019_trec-pm-2019', … (the usage snippet is truncated in this listing). See the full description on the dataset page: https://huggingface.co/datasets/irds/clinicaltrials_2019_trec-pm-2019.
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
This dataset accompanies our work introducing a metadata schema for TREC run files based on the PRIMAD model. PRIMAD describes, at a conceptual level, the essential components of computational experiments that can affect reproducibility. We propose aligning the metadata annotations with the PRIMAD components. To demonstrate the potential of metadata annotations, we curated a dataset of run files derived from experiments with different instantiations of the PRIMAD components and annotated them with the corresponding metadata. With this work, we hope to encourage IR researchers to annotate run files and further improve the reuse value of experimental artifacts.
This archive contains the following data:
demo.tar.xz : Selected annotated run files that are used in the Colab demonstration.
metadata.zip : YAML files containing only the metadata annotations for each run.
runs.zip : The entire set of run files with annotations.
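As an illustration of how the annotations could be inspected, the following sketch unpacks metadata.zip and lists the top-level keys of each YAML file; the exact key names are an assumption here, not part of this card:

import io
import zipfile

import yaml  # PyYAML

# Iterate over the per-run YAML annotation files inside metadata.zip and print the
# top-level keys of each annotation (assumed to mirror the PRIMAD components:
# Platform, Research Goal, Implementation, Method, Actor, Data).
with zipfile.ZipFile("metadata.zip") as archive:
    for name in archive.namelist():
        if name.endswith((".yml", ".yaml")):
            with archive.open(name) as handle:
                annotation = yaml.safe_load(io.TextIOWrapper(handle, encoding="utf-8"))
            print(name, sorted(annotation or {}))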
The annotated runs result from the following experiments:
Grossman and Cormack @ TREC Common Core 2017
Grossman and Cormack @ TREC Common Core 2018
Yu et al. @ TREC Common Core 2018
Yu et al. @ ECIR 2019
Breuer et al. @ SIGIR 2020
Breuer et al. @ CLEF 2021
License: Attribution 4.0 International (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/ (license information was derived automatically).
The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) and can be accessed here: https://arxiv.org/abs/2205.02596. Please cite this publication when using the dataset.
This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.
The claims were obtained from online fact-checking sources, existing datasets, and research challenges. The dataset combines sources with different foci, enabling a comprehensive approach that covers different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims), and applications (information retrieval, veracity evaluation).
The processing of the claims included an extensive de-duplication step that eliminated repeated or very similar claims. The dataset is released in a LARGE and a SMALL version, which retain claims at different degrees of similarity (excluding, respectively, claims with a 90% and a 99% probability of being similar, as estimated with the MonoT5 model). Claim similarity was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).
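A rough sketch of such a similarity-based de-duplication pass is shown below. It pairs a BM25 candidate search with a BERTScore check, leaves out the MonoT5 re-ranking step, and uses an illustrative threshold rather than the paper's exact settings (the rank_bm25 and bert_score packages are assumed to be installed):

from bert_score import score as bert_score
from rank_bm25 import BM25Okapi

claims = ["placeholder claim one", "placeholder claim two"]  # claim texts to de-duplicate

tokenised = [claim.lower().split() for claim in claims]
bm25 = BM25Okapi(tokenised)

THRESHOLD = 0.95  # illustrative cut-off, not the value behind the LARGE/SMALL split
duplicates = set()
for i, claim in enumerate(claims):
    # Lexical candidates: the five most similar other claims according to BM25.
    ranked = bm25.get_scores(tokenised[i]).argsort()[::-1]
    candidates = [int(j) for j in ranked if int(j) != i][:5]
    if not candidates:
        continue
    # Semantic check: BERTScore F1 between the claim and each candidate.
    _, _, f1 = bert_score([claim] * len(candidates), [claims[j] for j in candidates], lang="en")
    if float(f1.max()) >= THRESHOLD:
        duplicates.add(i)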
The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.
The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including: PERSON − People, including fictional; ORGANIZATION − Companies, agencies, institutions, etc.; GPE − Countries, cities, states; FACILITY − Buildings, highways, etc. These entities have been detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) using Spacy.
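For reference, named entities of these types can be extracted with spaCy's RoBERTa-based pipeline trained on OntoNotes 5.0; whether this matches the exact model configuration used for the dataset is an assumption:

import spacy

# en_core_web_trf is spaCy's transformer (RoBERTa-base) pipeline trained on OntoNotes 5.0;
# install it first with: python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")

doc = nlp("The WHO announced in Geneva that masks reduce COVID-19 transmission.")
for ent in doc.ents:
    # spaCy's OntoNotes label set includes PERSON, ORG, GPE and FAC, which the card
    # refers to as PERSON, ORGANIZATION, GPE and FACILITY.
    print(ent.text, ent.label_)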
The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.
The data sources used are:
The CoronaVirusFacts/DatosCoronaVirus Alliance Database. https://www.poynter.org/ifcn-covid-19-misinformation/
CoAID dataset (Cui and Lee, 2020) https://github.com/cuilimeng/CoAID
MM-COVID (Li et al., 2020) https://github.com/bigheiniu/MM-COVID
CovidLies (Hossain et al., 2020) https://github.com/ucinlp/covid19-data
TREC Health Misinformation track https://trec-health-misinfo.github.io/
TREC COVID challenge (Voorhees et al., 2021; Roberts et al., 2020) https://ir.nist.gov/covidSubmit/data.html
The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).
The entries in the dataset contain the following information:
Claim. Text of the claim.
Claim label. The labels are True and False.
Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.
Original information source. Information about which general information source was used to obtain the claim.
Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
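A short sketch of filtering the entries by these fields is given below; the file name and column headers are hypothetical and may differ in the released files:

import pandas as pd

# Hypothetical file name and column headers.
claims = pd.read_csv("covid_claims_small.csv")

# Example: keep False claims whose type includes named entities.
subset = claims[
    (claims["Claim label"] == "False")
    & (claims["Claim type"].str.contains("Named Entities", na=False))
]
print(len(subset), "claims selected")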
Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant nos. EP/V048597/1 and EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by UK Research and Innovation (grant nos. EP/V030302/1 and EP/V020579/1).
References
M. Arana-Catania, E. Kochkina, A. Zubiaga, M. Liakata, R. Procter, and Y. He. 2022. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022. https://arxiv.org/abs/2205.02596
Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.
Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. "Is this document relevant? ... Probably": A survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.
Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23.
Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.
Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.
Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.
Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM, New York, NY, USA.
BEIR (Benchmarking IR) is a heterogeneous benchmark containing different information retrieval (IR) tasks. Through BEIR, it is possible to systematically study the zero-shot generalization capabilities of multiple neural retrieval approaches.
The benchmark contains a total of 9 information retrieval tasks (Fact Checking, Citation Prediction, Duplicate Question Retrieval, Argument Retrieval, News Retrieval, Question Answering, Tweet Retrieval, Biomedical IR, Entity Retrieval) from 19 different datasets:
MS MARCO, TREC-COVID, NFCorpus, BioASQ, Natural Questions, HotpotQA, FiQA-2018, Signal-1M, TREC-News, ArguAna, Touché-2020, CQADupStack, Quora Question Pairs, DBpedia, SciDocs, FEVER, Climate-FEVER, SciFact, and Robust04.
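A typical way to load one of these datasets is via the beir package's GenericDataLoader, roughly as follows (the download URL pattern and split name follow the BEIR repository's README and are assumptions here):

from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Download and unpack a single BEIR dataset (SciFact as an example), then load its
# corpus, queries and relevance judgments for zero-shot evaluation.
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
print(len(corpus), "documents,", len(queries), "queries")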