7 datasets found
  1. msmarco-document-v2_trec-dl-2019_judged

    • huggingface.co
    Cite: msmarco-document-v2_trec-dl-2019_judged [Dataset]. https://huggingface.co/datasets/irds/msmarco-document-v2_trec-dl-2019_judged
    Dataset authored and provided by: ir-datasets
    Description

    Dataset Card for msmarco-document-v2/trec-dl-2019/judged

    The msmarco-document-v2/trec-dl-2019/judged dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.

      Data
    

    This dataset provides:

    queries (i.e., topics); count=43

    For docs, use irds/msmarco-document-v2

    For qrels, use irds/msmarco-document-v2_trec-dl-2019

      Usage
    

    from datasets import load_dataset

    queries =… See the full description on the dataset page: https://huggingface.co/datasets/irds/msmarco-document-v2_trec-dl-2019_judged.
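
    The snippet above is truncated on the search page. A minimal sketch of what the full call likely looks like, assuming the 'queries' configuration name exposed by other ir-datasets mirrors on Hugging Face (verify against the dataset card before relying on it):

    from datasets import load_dataset

    # Assumption: 'queries' is the configuration name; the loaded records are
    # expected to hold the 43 judged TREC DL 2019 topics (query_id, text).
    queries = load_dataset('irds/msmarco-document-v2_trec-dl-2019_judged', 'queries')
    print(queries)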

  2. trec-dl-2019-query

    • huggingface.co
    Cite: Zehan Li, trec-dl-2019-query [Dataset]. https://huggingface.co/datasets/jordane95/trec-dl-2019-query
    Authors: Zehan Li
    Description

    The jordane95/trec-dl-2019-query dataset, hosted on Hugging Face and contributed by the HF Datasets community.

  3. TREC Deep Learning 2019 - Dataset - LDM

    • service.tib.eu
    Updated Dec 3, 2024
    Cite: (2024). TREC Deep Learning 2019 - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/trec-deep-learning-2019
    Description

    Large-scale passage retrieval aims to fetch relevant passages from a million- or billion-scale collection for a given query to meet users’ information needs, playing an important role in many downstream applications including open-domain question answering, search engines, and recommendation systems.

  4. clinicaltrials_2019_trec-pm-2019

    • huggingface.co
    Cite: clinicaltrials_2019_trec-pm-2019 [Dataset]. https://huggingface.co/datasets/irds/clinicaltrials_2019_trec-pm-2019
    Dataset authored and provided by: ir-datasets
    Description

    Dataset Card for clinicaltrials/2019/trec-pm-2019

    The clinicaltrials/2019/trec-pm-2019 dataset, provided by the ir-datasets package. For more information about the dataset, see the documentation.

      Data
    

    This dataset provides:

    queries (i.e., topics); count=40

    qrels: (relevance assessments); count=12,996

    For docs, use irds/clinicaltrials_2019

      Usage
    

    from datasets import load_dataset

    queries = load_dataset('irds/clinicaltrials_2019_trec-pm-2019'… See the full description on the dataset page: https://huggingface.co/datasets/irds/clinicaltrials_2019_trec-pm-2019.
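
    As above, the snippet is truncated. A minimal sketch of the likely full calls, assuming the 'queries' and 'qrels' configuration names used by other ir-datasets mirrors (configuration and field names should be checked on the dataset card):

    from datasets import load_dataset

    # Assumption: configuration names 'queries' (40 topics) and
    # 'qrels' (12,996 relevance assessments) mirror the counts listed above.
    queries = load_dataset('irds/clinicaltrials_2019_trec-pm-2019', 'queries')
    qrels = load_dataset('irds/clinicaltrials_2019_trec-pm-2019', 'qrels')
    print(queries, qrels)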

  5. ir_metadata: An Extensible Metadata Schema for Information Retrieval Experiments

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 21, 2022
    Cite: Breuer, Timo (2022). ir_metadata: An Extensible Metadata Schema for Information Retrieval Experiments [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5997490
    Dataset provided by: Breuer, Timo; Schaer, Philipp; Keller, Jüri
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset accompanies our work that introduces a metadata schema for TREC run files based on the PRIMAD model. PRIMAD considers essential components of computational experiments that possibly can affect reproducibility on a conceptual level. We propose to align the metadata annotations to the PRIMAD components. In order to demonstrate the potential of metadata annotations, we curated a dataset with run files derived from experiments with different instantiations of PRIMAD components and annotated these with the corresponding metadata. With this work, we hope to stimulate IR researchers to annotate run files and improve the reuse value of experimental artifacts even further.

    This archive contains the following data:

    demo.tar.xz : Selected annotated run files that are used in the Colab demonstration.

    metadata.zip : YAML files containing only the metadata annotations for each run.

    runs.zip : The entire set of run files with annotations.
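
    The per-run annotations in metadata.zip are plain YAML files, so they can be inspected with PyYAML. A minimal sketch, using a hypothetical file name (the actual file names and keys follow the ir_metadata schema described in the accompanying paper):

    import yaml

    # Hypothetical path; substitute any file extracted from metadata.zip.
    with open('metadata/example_run.yml') as f:
        annotation = yaml.safe_load(f)

    # List the top-level, PRIMAD-aligned sections present in this annotation.
    for key, value in annotation.items():
        print(key, '->', type(value).__name__)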

    The annotated runs result from the following experiments:

    Grossman and Cormack @ TREC Common Core 2017

    Grossman and Cormack @ TREC Common Core 2018

    Yu et al. @ TREC Common Core 2018

    Yu et al. @ ECIR 2019

    Breuer et al. @ SIGIR 2020

    Breuer et al. @ CLEF 2021

  6. Data from: PANACEA dataset - Heterogeneous COVID-19 Claims

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 15, 2022
    Cite: Procter, Rob (2022). PANACEA dataset - Heterogeneous COVID-19 Claims [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6493846
    Dataset provided by: He, Yulan; Kochkina, Elena; Arana-Catania, Miguel; Procter, Rob; Liakata, Maria; Zubiaga, Arkaitz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The peer-reviewed publication for this dataset was presented at the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) and can be accessed at https://arxiv.org/abs/2205.02596. Please cite it when using the dataset.

    This dataset contains a heterogeneous set of True and False COVID claims and online sources of information for each claim.

    The claims were obtained from online fact-checking sources, existing datasets, and research challenges. The dataset combines data sources with different foci, enabling a comprehensive approach that spans different media (Twitter, Facebook, general websites, academia), information domains (health, scholarly, media), information types (news, claims), and applications (information retrieval, veracity evaluation).

    The processing of the claims included an extensive de-duplication process eliminating repeated or very similar claims. The dataset is presented in a LARGE and a SMALL version, accounting for different degrees of similarity between the remaining claims (excluding respectively claims with a 90% and 99% probability of being similar, as obtained through the MonoT5 model). The similarity of claims was analysed using BM25 (Robertson et al., 1995; Crestani et al., 1998; Robertson and Zaragoza, 2009) with MonoT5 re-ranking (Nogueira et al., 2020), and BERTScore (Zhang et al., 2019).
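
    As an illustration of the first stage of such a similarity analysis (not the authors' exact pipeline), pairwise BM25 scores between claims can be computed with the rank_bm25 package; the MonoT5 re-ranking and BERTScore comparison described above would then be applied to the top-scoring candidates:

    from rank_bm25 import BM25Okapi

    claims = [
        "Vitamin C cures COVID-19.",
        "Vitamin C can cure a COVID-19 infection.",
        "Masks reduce the spread of respiratory viruses.",
    ]

    # Whitespace tokenisation keeps the sketch dependency-free; the published
    # pipeline uses stronger retrieval and re-ranking models.
    tokenized = [claim.lower().split() for claim in claims]
    bm25 = BM25Okapi(tokenized)

    # Score every claim against the first one to surface near-duplicates.
    for claim, score in zip(claims, bm25.get_scores(tokenized[0])):
        print(f"{score:6.2f}  {claim}")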

    The processing of the content also involved removing claims making only a direct reference to existing content in other media (audio, video, photos); automatically obtained content not representing claims; and entries with claims or fact-checking sources in languages other than English.

    The claims were analysed to identify types of claims that may be of particular interest, either for inclusion or exclusion depending on the type of analysis. The following types were identified: (1) Multimodal; (2) Social media references; (3) Claims including questions; (4) Claims including numerical content; (5) Named entities, including PERSON (people, including fictional), ORGANIZATION (companies, agencies, institutions, etc.), GPE (countries, cities, states), and FACILITY (buildings, highways, etc.). These entities were detected using a RoBERTa base English model (Liu et al., 2019) trained on the OntoNotes Release 5.0 dataset (Weischedel et al., 2013) via spaCy.
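
    A minimal sketch of this kind of entity detection with spaCy, assuming the en_core_web_trf pipeline (a RoBERTa-based English model trained on OntoNotes 5.0); the exact model and version used by the authors may differ:

    import spacy

    # Assumption: en_core_web_trf is the RoBERTa-based English pipeline
    # (install with: python -m spacy download en_core_web_trf).
    nlp = spacy.load("en_core_web_trf")

    doc = nlp("The WHO issued new COVID-19 guidance for hospitals in Geneva.")

    # spaCy's OntoNotes label names: PERSON, ORG, GPE, FAC.
    wanted = {"PERSON", "ORG", "GPE", "FAC"}
    for ent in doc.ents:
        if ent.label_ in wanted:
            print(ent.text, ent.label_)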

    The original labels for the claims have been reviewed and homogenised from the different criteria used by each original fact-checker into the final True and False labels.

    The data sources used are:

    The LARGE dataset contains 5,143 claims (1,810 False and 3,333 True), and the SMALL version 1,709 claims (477 False and 1,232 True).

    The entries in the dataset contain the following information (a minimal record sketch follows this list):

    • Claim. Text of the claim.

    • Claim label. The labels are: True and False.

    • Claim source. The sources include mostly fact-checking websites, health information websites, health clinics, public institutions sites, and peer-reviewed scientific journals.

    • Original information source. Information about which general information source was used to obtain the claim.

    • Claim type. The different types, previously explained, are: Multimodal, Social Media, Questions, Numerical, and Named Entities.
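
    A minimal sketch of how one entry could be represented in code; the field names below are illustrative and do not necessarily match the column names in the released files:

    from dataclasses import dataclass

    @dataclass
    class ClaimEntry:
        claim: str            # text of the claim
        label: str            # "True" or "False"
        claim_source: str     # e.g. a fact-checking website or scientific journal
        original_source: str  # general information source the claim was obtained from
        claim_types: list     # e.g. ["Numerical", "Named Entities"]

    example = ClaimEntry(
        claim="Example claim text.",
        label="False",
        claim_source="example fact-checker",
        original_source="example source",
        claim_types=["Questions"],
    )
    print(example)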

    Funding. This work was supported by the UK Engineering and Physical Sciences Research Council (grant no. EP/V048597/1, EP/T017112/1). ML and YH are supported by Turing AI Fellowships funded by the UK Research and Innovation (grant no. EP/V030302/1, EP/V020579/1).

    References

    • Arana-Catania M., Kochkina E., Zubiaga A., Liakata M., Procter R., He Y.. Natural Language Inference with Self-Attention for Veracity Assessment of Pandemic Claims. NAACL 2022 https://arxiv.org/abs/2205.02596

    • Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109:109.

    • Fabio Crestani, Mounia Lalmas, Cornelis J. Van Rijsbergen, and Iain Campbell. 1998. “Is this document relevant? ... Probably”: a survey of probabilistic models in information retrieval. ACM Computing Surveys (CSUR), 30(4):528–552.

    • Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.

    • Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pre-trained sequence-to-sequence model. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 708–718.

    • Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

    • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

    • Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. Ontonotes release 5.0 ldc2013t19. Linguistic Data Consortium, Philadelphia, PA, 23.

    • Limeng Cui and Dongwon Lee. 2020. CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint arXiv:2006.00885.

    • Yichuan Li, Bohan Jiang, Kai Shu, and Huan Liu. 2020. MM-COVID: A multilingual and multimodal data repository for combating COVID-19 disinformation.

    • Tamanna Hossain, Robert L. Logan IV, Arjuna Ugarte, Yoshitomo Matsubara, Sean Young, and Sameer Singh. 2020. COVIDLies: Detecting COVID-19 misinformation on social media. In Proceedings of the 1st Workshop on NLP for COVID-19 (Part 2) at EMNLP 2020, Online. Association for Computational Linguistics.

    • Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM, New York, NY, USA.

  7. BEIR Dataset

    • paperswithcode.com
    • library.toponeai.link
    Updated Dec 7, 2023
    Cite: Nandan Thakur; Nils Reimers; Andreas Rücklé; Abhishek Srivastava; Iryna Gurevych (2023). BEIR Dataset [Dataset]. https://paperswithcode.com/dataset/beir
    Authors: Nandan Thakur; Nils Reimers; Andreas Rücklé; Abhishek Srivastava; Iryna Gurevych
    Description

    BEIR (Benchmarking IR) is a heterogeneous benchmark containing different information retrieval (IR) tasks. Through BEIR, it is possible to systematically study the zero-shot generalization capabilities of multiple neural retrieval approaches.

    The benchmark contains a total of 9 information retrieval tasks (Fact Checking, Citation Prediction, Duplicate Question Retrieval, Argument Retrieval, News Retrieval, Question Answering, Tweet Retrieval, Biomedical IR, Entity Retrieval) from 19 different datasets:

    MS MARCO, TREC-COVID, NFCorpus, BioASQ, Natural Questions, HotpotQA, FiQA-2018, Signal-1M, TREC-News, ArguAna, Touché 2020, CQADupStack, Quora Question Pairs, DBPedia, SciDocs, FEVER, Climate-FEVER, SciFact, and Robust04.
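
    A minimal sketch of loading one of these datasets with the beir Python package, assuming the standard GenericDataLoader quickstart (the download URL pattern and dataset names should be checked against the BEIR repository):

    from beir import util
    from beir.datasets.data_loader import GenericDataLoader

    # Assumption: the documented BEIR download URL pattern; SciFact is one of
    # the benchmark datasets and small enough for a quick test.
    url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
    data_path = util.download_and_unzip(url, "datasets")

    # corpus: doc_id -> {"title", "text"}; queries: query_id -> text;
    # qrels: query_id -> {doc_id: relevance}
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    print(len(corpus), len(queries), len(qrels))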
