19 datasets found
  1. M-BEIR

    • huggingface.co
    Updated Dec 7, 2023
    Cite
    TIGER-Lab (2023). M-BEIR [Dataset]. https://huggingface.co/datasets/TIGER-Lab/M-BEIR
    Dataset updated
    Dec 7, 2023
    Dataset authored and provided by
    TIGER-Lab
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    UniIR: Training and Benchmarking Universal Multimodal Information Retrievers (ECCV 2024)

    🌐 Homepage | 🤗 Model(UniIR Checkpoints) | 🤗 Paper | 📖 arXiv | GitHub How to download the M-BEIR Dataset

      🔔News
    

    🔥[2023-12-21]: Our M-BEIR Benchmark is now available for use.

      Dataset Summary
    

    M-BEIR, the Multimodal BEnchmark for Instructed Retrieval, is a comprehensive large-scale retrieval benchmark designed to train and evaluate unified multimodal retrieval… See the full description on the dataset page: https://huggingface.co/datasets/TIGER-Lab/M-BEIR.

  2. M-BEIR_DEV

    • huggingface.co
    Updated Jun 24, 2022
    Cite
    Multimodal Benchmarking IR (2022). M-BEIR_DEV [Dataset]. https://huggingface.co/datasets/MBEIR/M-BEIR_DEV
    Dataset updated
    Jun 24, 2022
    Dataset authored and provided by
    Multimodal Benchmarking IR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    MBEIR/M-BEIR_DEV dataset hosted on Hugging Face and contributed by the HF Datasets community

  3. hotpotqa

    • huggingface.co
    Updated Aug 24, 2022
    Cite
    BEIR (2022). hotpotqa [Dataset]. https://huggingface.co/datasets/BeIR/hotpotqa
    Dataset updated
    Aug 24, 2022
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/hotpotqa.

  4. prebuilt-indexes-mbeir

    • huggingface.co
    Updated Aug 14, 2025
    Cite
    Castorini (2025). prebuilt-indexes-mbeir [Dataset]. https://huggingface.co/datasets/castorini/prebuilt-indexes-mbeir
    Dataset updated
    Aug 14, 2025
    Dataset authored and provided by
    Castorini
    Description

    castorini/prebuilt-indexes-mbeir dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. trec-news-generated-queries

    • huggingface.co
    Updated Aug 20, 2022
    Cite
    BEIR (2022). trec-news-generated-queries [Dataset]. https://huggingface.co/datasets/BeIR/trec-news-generated-queries
    Dataset updated
    Aug 20, 2022
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/trec-news-generated-queries.

  6. fever

    • huggingface.co
    Updated Aug 16, 2023
    Cite
    BEIR (2023). fever [Dataset]. https://huggingface.co/datasets/BeIR/fever
    Dataset updated
    Aug 16, 2023
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/fever.

  7. climate-fever

    • huggingface.co
    Updated Aug 16, 2023
    Cite
    BEIR (2023). climate-fever [Dataset]. https://huggingface.co/datasets/BeIR/climate-fever
    Dataset updated
    Aug 16, 2023
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/climate-fever.

  8. cqadupstack-generated-queries

    • huggingface.co
    Updated Aug 11, 2022
    Cite
    BEIR (2022). cqadupstack-generated-queries [Dataset]. https://huggingface.co/datasets/BeIR/cqadupstack-generated-queries
    Dataset updated
    Aug 11, 2022
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/cqadupstack-generated-queries.

  9. dbpedia-entity

    • huggingface.co
    • opendatalab.com
    Updated Aug 16, 2023
    Cite
    BEIR (2023). dbpedia-entity [Dataset]. https://huggingface.co/datasets/BeIR/dbpedia-entity
    Dataset updated
    Aug 16, 2023
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/dbpedia-entity.

  10. beir-nl-nq

    • huggingface.co
    Updated Feb 10, 2025
    Cite
    CLiPS (2025). beir-nl-nq [Dataset]. https://huggingface.co/datasets/clips/beir-nl-nq
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    CLiPS
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR-NL Benchmark

      Dataset Summary
    

    BEIR-NL is a Dutch-translated version of the BEIR benchmark, a diverse and heterogeneous collection of datasets covering various domains from biomedical and financial texts to general web content. Our benchmark is integrated into the Massive Multilingual Text Embedding Benchmark (MMTEB). BEIR-NL contains the following tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018… See the full description on the dataset page: https://huggingface.co/datasets/clips/beir-nl-nq.

  11. mbeir-fashion-passage

    • huggingface.co
    Cite
    Michael Norman, mbeir-fashion-passage [Dataset]. https://huggingface.co/datasets/michael-norman/mbeir-fashion-passage
    Authors
    Michael Norman
    Description

    michael-norman/mbeir-fashion-passage dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. beir-embed-english-v3

    • huggingface.co
    Cite
    Cohere, beir-embed-english-v3 [Dataset]. https://huggingface.co/datasets/Cohere/beir-embed-english-v3
    Dataset authored and provided by
    Cohere (https://cohere.com/)
    Description

    BEIR embeddings with Cohere embed-english-v3.0 model

    This dataset contains all query and document embeddings for BEIR, embedded with the Cohere embed-english-v3.0 embedding model.

      Overview of datasets
    

    This repository hosts all 18 datasets from BEIR, including query and document embeddings. The following table gives an overview of the available datasets. See the next section for how to load the individual datasets.

    Dataset    nDCG@10    Documents
    arguana    53.98      8,674… See the full description on the dataset page: https://huggingface.co/datasets/Cohere/beir-embed-english-v3.
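
    The nDCG@10 column above reports a ranking metric. As a rough illustration of how such a score is usually computed, here is a minimal sketch of the linear-gain form of nDCG@k. The function names and the toy relevance lists are hypothetical, and the listing does not say which evaluation tool produced the table's scores, so treat this as one common formulation rather than the exact procedure used.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance grade of each retrieved document, in rank order.
    judged_relevances: relevance grades of all judged documents for the query.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Toy query with a single relevant document (hypothetical data).
perfect = ndcg_at_k([1, 0, 0], [1])   # relevant document ranked first
second = ndcg_at_k([0, 1, 0], [1])    # relevant document ranked second
```

    Scores from this sketch lie between 0 and 1; benchmark tables such as the one above typically show them multiplied by 100.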

  13. webis-touche2020-generated-queries

    • huggingface.co
    Cite
    BEIR, webis-touche2020-generated-queries [Dataset]. https://huggingface.co/datasets/BeIR/webis-touche2020-generated-queries
    Dataset authored and provided by
    BEIR
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR Benchmark

      Dataset Summary
    

    BEIR is a heterogeneous benchmark that has been built from 18 diverse datasets representing 9 information retrieval tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018
    Bio-Medical IR: TREC-COVID, BioASQ, NFCorpus
    News Retrieval: TREC-NEWS, Robust04
    Argument Retrieval: Touche-2020, ArguAna
    Duplicate Question Retrieval: Quora, CqaDupstack
    Citation-Prediction: SCIDOCS
    Tweet… See the full description on the dataset page: https://huggingface.co/datasets/BeIR/webis-touche2020-generated-queries.

  14. beir-nl-hotpotqa

    • huggingface.co
    Updated Feb 10, 2025
    Cite
    CLiPS (2025). beir-nl-hotpotqa [Dataset]. https://huggingface.co/datasets/clips/beir-nl-hotpotqa
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    CLiPS
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR-NL Benchmark

      Dataset Summary
    

    BEIR-NL is a Dutch-translated version of the BEIR benchmark, a diverse and heterogeneous collection of datasets covering various domains from biomedical and financial texts to general web content. Our benchmark is integrated into the Massive Multilingual Text Embedding Benchmark (MMTEB). BEIR-NL contains the following tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018… See the full description on the dataset page: https://huggingface.co/datasets/clips/beir-nl-hotpotqa.

  15. beir-nl-climate-fever

    • huggingface.co
    Updated Feb 10, 2025
    Cite
    CLiPS (2025). beir-nl-climate-fever [Dataset]. https://huggingface.co/datasets/clips/beir-nl-climate-fever
    Dataset updated
    Feb 10, 2025
    Dataset authored and provided by
    CLiPS
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Dataset Card for BEIR-NL Benchmark

      Dataset Summary
    

    BEIR-NL is a Dutch-translated version of the BEIR benchmark, a diverse and heterogeneous collection of datasets covering various domains from biomedical and financial texts to general web content. Our benchmark is integrated into the Massive Multilingual Text Embedding Benchmark (MMTEB). BEIR-NL contains the following tasks:

    Fact-checking: FEVER, Climate-FEVER, SciFact
    Question-Answering: NQ, HotpotQA, FiQA-2018… See the full description on the dataset page: https://huggingface.co/datasets/clips/beir-nl-climate-fever.

  16. arxiv-beir-500k-generated-queries

    • huggingface.co
    Updated Sep 5, 2024
    Cite
    Algorithmic Research Group (2024). arxiv-beir-500k-generated-queries [Dataset]. https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv-beir-500k-generated-queries
    Dataset updated
    Sep 5, 2024
    Dataset authored and provided by
    Algorithmic Research Group
    Description

    Dataset Summary

    A BEIR style dataset derived from ArXiv

      Languages
    

    All tasks are in English (en).

      Dataset Structure
    

    The dataset contains a corpus, queries and qrels (relevance judgments file). They must be in the following format:

    corpus file: a .jsonl file (jsonlines) that contains a list of dictionaries, each with three fields: _id (unique document identifier), title (document title, optional), and text (document paragraph or passage). For example:… See the full description on the dataset page: https://huggingface.co/datasets/AlgorithmicResearchGroup/arxiv-beir-500k-generated-queries.
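
    The original example is cut off in this listing, so the following is only a minimal sketch of reading a corpus file in the format just described. The sample records and the parse_corpus helper are hypothetical; only the field names _id, title, and text come from the description.

```python
import json

# Hypothetical sample lines; one JSON object per line, as in a .jsonl file.
SAMPLE_CORPUS_JSONL = "\n".join([
    json.dumps({"_id": "doc1", "title": "Example title", "text": "First example passage."}),
    json.dumps({"_id": "doc2", "text": "Passage without a title."}),
])

def parse_corpus(jsonl_text):
    """Map each document _id to its title (optional, defaults to "") and text."""
    corpus = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        corpus[record["_id"]] = {
            "title": record.get("title", ""),
            "text": record["text"],
        }
    return corpus

corpus = parse_corpus(SAMPLE_CORPUS_JSONL)
```

    Keying the result by _id makes it straightforward to join documents against the qrels file, which refers to documents by identifier.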

  17. bioasq-top-20-gen-queries

    • huggingface.co
    Updated Mar 7, 2023
    Cite
    INCOME (2023). bioasq-top-20-gen-queries [Dataset]. https://huggingface.co/datasets/income/bioasq-top-20-gen-queries
    Dataset updated
    Mar 7, 2023
    Dataset authored and provided by
    INCOME
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    NFCorpus: 20 generated queries (BEIR Benchmark)

    This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.

    DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
    id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
    Questions generated: 20
    Code used for generation: evaluate_anserini_docT5query_parallel.py

    The old dataset card for the BEIR benchmark follows.

      Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/bioasq-top-20-gen-queries.
    
  18. fever-top-20-gen-queries

    • huggingface.co
    Updated Mar 6, 2023
    Cite
    INCOME (2023). fever-top-20-gen-queries [Dataset]. https://huggingface.co/datasets/income/fever-top-20-gen-queries
    Dataset updated
    Mar 6, 2023
    Dataset authored and provided by
    INCOME
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    NFCorpus: 20 generated queries (BEIR Benchmark)

    This HF dataset contains the top-20 synthetic queries generated for each passage in the above BEIR benchmark dataset.

    DocT5query model used: BeIR/query-gen-msmarco-t5-base-v1
    id (str): unique document id in NFCorpus in the BEIR benchmark (corpus.jsonl).
    Questions generated: 20
    Code used for generation: evaluate_anserini_docT5query_parallel.py

    The old dataset card for the BEIR benchmark follows.

      Dataset Card for BEIR… See the full description on the dataset page: https://huggingface.co/datasets/income/fever-top-20-gen-queries.
    
  19. fever_ft

    • huggingface.co
    Cite
    Sep Zeighami, fever_ft [Dataset]. https://huggingface.co/datasets/sepz/fever_ft
    Authors
    Sep Zeighami
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The dataset contains a random 0.7/0.1/0.2 train/dev/test split of the FEVER dataset from BEIR (https://github.com/beir-cellar/beir), intended for benchmarking embedding model fine-tuning.
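
    A random split in these proportions could be produced along the following lines. This is a minimal sketch, not the author's procedure: the random_split helper, the seed, and the integer ids are hypothetical, and the listing does not state which items (queries, claims, or documents) were split.

```python
import random

def random_split(items, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle items with a fixed seed and cut them into train/dev/test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_dev = round(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder goes to test
    return train, dev, test

# Hypothetical ids standing in for the items being split.
train, dev, test = random_split(range(100))
```

    Fixing the seed keeps the split reproducible, which matters when several fine-tuned models are compared on the same test portion.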
