17 datasets found
  1. msmarco

    • huggingface.co
    Updated Feb 13, 2025
    Cite
    Sentence Transformers (2025). msmarco [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco
    Dataset updated
    Feb 13, 2025
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO Training Dataset

    This dataset consists of four separate datasets, each using the MS MARCO queries and passages:

    triplets: This subset contains triplets of query-id, positive-id, negative-id as provided in qidpidtriples.train.full.2.tsv.gz from the MS MARCO website. The only change is that this dataset has been reshuffled. This dataset can easily be used with a MultipleNegativesRankingLoss, a.k.a. InfoNCE loss.

    labeled-list: This subset contains triplets of query-id, doc-ids… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco.
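    The InfoNCE idea mentioned for the triplets subset can be illustrated with a minimal pure-Python sketch. It is not the Sentence Transformers implementation: the function name, the toy similarity values, and the scale of 20.0 are assumptions chosen for illustration. Each query's positive passage is assumed to sit at the column matching the query's row index, and every other column in the row acts as a negative.

```python
import math


def info_nce_loss(sim, scale=20.0):
    """Mean cross-entropy over the rows of a query-by-passage similarity matrix.

    sim[i][j] is the similarity of query i to passage j. The positive passage
    for query i is assumed to be at column i; all other columns are negatives.
    The scale factor is an assumed hyperparameter, not taken from the listing.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        m = max(logits)  # subtract the max before exponentiating, for stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the positive
    return total / len(sim)


# Hypothetical cosine similarities for a batch of two queries and two passages.
aligned = [[0.9, 0.1], [0.2, 0.8]]     # positives score highest
misaligned = [[0.1, 0.9], [0.8, 0.2]]  # negatives score highest

loss_aligned = info_nce_loss(aligned)
loss_misaligned = info_nce_loss(misaligned)
```

    The loss is small when each query is most similar to its own positive and grows when a negative outranks it. Rows may be wider than the batch size, which is one way explicit hard negatives from a triplet can be appended as extra columns.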

  2. Webis MS MARCO Anchor Text 2022

    • webis.de
    • anthology.aicmu.ac.cn
    Updated 2022
    + more versions
    Cite
    Maik Fröbe; Maximilian Probst; Martin Potthast; Matthias Hagen (2022). Webis MS MARCO Anchor Text 2022 [Dataset]. http://doi.org/10.5281/zenodo.5883456
    Dataset updated
    2022
    Dataset provided by
    University of Kassel, hessian.AI, and ScaDS.AI
    The Web Technology & Information Systems Network
    Friedrich Schiller University Jena
    Authors
    Maik Fröbe; Maximilian Probst; Martin Potthast; Matthias Hagen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The snapshots cover the years 2016 to 2021 and contain between 1.7 and 3.4 billion documents each. Overall, the dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.

  3. ms-marco-en-bge-gemma

    • huggingface.co
    Updated Apr 28, 2025
    + more versions
    Cite
    LightOn AI (2025). ms-marco-en-bge-gemma [Dataset]. https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma
    Dataset updated
    Apr 28, 2025
    Dataset authored and provided by
    LightOn AI
    Description

    ms-marco-en-bge

    This dataset contains the MS MARCO dataset with negatives mined using ColBERT and then scored by bge-reranker-v2-gemma. It can be used to train a retrieval model using knowledge distillation, for example using PyLate.

      knowledge distillation
    

    To fine-tune a model using a knowledge distillation loss, we need three distinct files:

    Datasets

    from datasets import load_dataset

    train = load_dataset( "lightonai/ms-marco-en-gemma", "train"… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma.
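    As a rough illustration of what a knowledge distillation loss over reranker scores can look like, the sketch below computes a KL divergence between the teacher's and the student's softmax distributions over one query's candidate passages. This is one common formulation, not necessarily the one PyLate uses; the function names and score values are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(teacher_scores, student_scores):
    """KL(teacher || student) over the candidate passages of one query."""
    log_p = log_softmax(teacher_scores)
    log_q = log_softmax(student_scores)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical scores for one query with three candidate passages.
teacher = [4.0, 1.0, -2.0]        # e.g. reranker scores
student_close = [3.5, 1.2, -1.5]  # ranks candidates like the teacher
student_far = [-2.0, 1.0, 4.0]    # ranks candidates in reverse

loss_close = kl_distillation_loss(teacher, student_close)
loss_far = kl_distillation_loss(teacher, student_far)
```

    The loss is zero when the student reproduces the teacher's distribution and increases as the student's ranking diverges from it.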

  4. lighton-ms-marco-mini

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    LightOn AI (2024). lighton-ms-marco-mini [Dataset]. https://huggingface.co/datasets/lightonai/lighton-ms-marco-mini
    Dataset updated
    Sep 12, 2024
    Dataset authored and provided by
    LightOn AI
    Description

    ms-marco-mini

    This dataset gathers very few samples from MS MARCO to provide an example of triplet-based / knowledge distillation dataset formatting.

      triplet subset
    

    The triplet file is all we need to fine-tune a model based on contrastive loss.

    Columns: "query", "positive", "negative"
    Column types: str, str, str
    Examples: { "query": "what are the liberal arts?", "positive": 'liberal arts. 1. the academic course of instruction at a college intended to provide general… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/lighton-ms-marco-mini.
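    A small sketch of how a row in this triplet layout might be checked before training. The helper name and the placeholder passage texts are hypothetical, and requiring non-empty strings is an assumption that goes beyond the column types listed above.

```python
REQUIRED_COLUMNS = ("query", "positive", "negative")


def is_valid_triplet(row):
    """Return True if every required column is present as a non-empty str."""
    return all(
        isinstance(row.get(col), str) and row[col] != ""
        for col in REQUIRED_COLUMNS
    )


# Placeholder passage texts; the listing's own example is truncated.
row = {
    "query": "what are the liberal arts?",
    "positive": "<positive passage text>",
    "negative": "<negative passage text>",
}
row_ok = is_valid_triplet(row)
```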

  5. msmarco-tr

    • huggingface.co
    Updated Apr 30, 2025
    + more versions
    Cite
    Parsa Kazerooni (2025). msmarco-tr [Dataset]. https://huggingface.co/datasets/parsak/msmarco-tr
    Dataset updated
    Apr 30, 2025
    Authors
    Parsa Kazerooni
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "msmarco-tr"

    More Information needed

  6. msmarco-scores-ms-marco-MiniLM-L6-v2

    • huggingface.co
    Updated Jun 24, 2025
    Cite
    Sentence Transformers (2025). msmarco-scores-ms-marco-MiniLM-L6-v2 [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2
    Dataset updated
    Jun 24, 2025
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO query-passage scores using cross-encoder/ms-marco-MiniLM-L6-v2

    MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. This dataset contains 160 million CrossEncoder scores on the MS MARCO dataset, computed with the cross-encoder/ms-marco-MiniLM-L6-v2 model. The scores are unprocessed logits, i.e. they are not bounded to the range 0 to 1, and they can be used for fine-tuning search models via distillation. … See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2.
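    Because the scores are raw logits, two things are worth sketching: mapping a logit into the range 0 to 1 with a sigmoid, and using the logits directly as teacher targets. The example below uses a margin-based squared error (often called MarginMSE) as one common distillation objective; the listing does not say which loss is intended, and the function names and numbers are hypothetical.

```python
import math


def sigmoid(logit):
    """Map a raw logit to the open interval (0, 1) without overflow."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def margin_mse(teacher_pos, teacher_neg, student_pos, student_neg):
    """Squared error between the teacher's and the student's score margins."""
    teacher_margin = teacher_pos - teacher_neg
    student_margin = student_pos - student_neg
    return (teacher_margin - student_margin) ** 2


# Hypothetical teacher logits and student similarity scores for one triplet.
probability = sigmoid(8.0)                # teacher logit mapped into (0, 1)
loss = margin_mse(8.0, -2.0, 0.7, 0.1)    # teacher margin 10.0, student margin 0.6
```

    Only the difference between positive and negative scores enters the loss, so the student does not need to reproduce the teacher's absolute score scale.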

  7. msmarco-msmarco-MiniLM-L6-v3

    • huggingface.co
    Updated Jun 15, 2016
    Cite
    Sentence Transformers (2016). msmarco-msmarco-MiniLM-L6-v3 [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L6-v3
    Dataset updated
    Jun 15, 2016
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO with hard negatives from msmarco-MiniLM-L6-v3

    MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.

      Related Datasets
    

    These are the datasets generated using the 13 different models:

    msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L6-v3.
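    The mining step described above (taking the passages most similar to a query while leaving out the gold positive) can be sketched in pure Python as follows. This is an illustration under assumptions, not the procedure used to build these datasets: the function names, the use of cosine similarity, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, passage_vecs, positive_id, k=50):
    """Return ids of the k passages most similar to the query, excluding the positive."""
    scored = [
        (cosine(query_vec, vec), pid)
        for pid, vec in passage_vecs.items()
        if pid != positive_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


# Toy embeddings; a real setup would use model-produced vectors.
passages = {
    "p1": [1.0, 0.0],  # gold positive
    "p2": [0.9, 0.1],
    "p3": [0.0, 1.0],
    "p4": [0.7, 0.7],
}
negatives = mine_hard_negatives([1.0, 0.0], passages, positive_id="p1", k=2)
```

    Passages that rank highly for the query without being the labeled positive are "hard" in the sense that a retriever is likely to confuse them with the correct answer.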

  8. MS-Marco-Prompt-generation

    • huggingface.co
    Updated Feb 3, 2024
    Cite
    Satyanshu Kumar (2024). MS-Marco-Prompt-generation [Dataset]. https://huggingface.co/datasets/satyanshu404/MS-Marco-Prompt-generation
    Dataset updated
    Feb 3, 2024
    Authors
    Satyanshu Kumar
    License

    https://choosealicense.com/licenses/unknown/

    Description

    satyanshu404/MS-Marco-Prompt-generation dataset hosted on Hugging Face and contributed by the HF Datasets community

  9. msmarco-mpnet-margin-mse-mean-v1

    • huggingface.co
    Cite
    Sentence Transformers, msmarco-mpnet-margin-mse-mean-v1 [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO with hard negatives from mpnet-margin-mse-mean-v1

    MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.

      Related Datasets
    

    These are the datasets generated using the 13 different models:

    msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1.

  10. msmarco

    • huggingface.co
    Updated Mar 2, 2024
    + more versions
    Cite
    Massive Text Embedding Benchmark (2024). msmarco [Dataset]. https://huggingface.co/datasets/mteb/msmarco
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    https://choosealicense.com/licenses/other/

    Description

    MSMARCO
    An MTEB dataset (Massive Text Embedding Benchmark)

    MS MARCO is a collection of datasets focused on deep learning in search.

    Task category: t2t
    Domains: Encyclopaedic, Academic, Blog, News, Medical, Government, Reviews, Non-fiction, Social, Web
    Reference: https://microsoft.github.io/msmarco/

      How to evaluate on this task
    

    You can evaluate an embedding model on this dataset using the following code:

    import mteb

    task = mteb.get_tasks(["MSMARCO"])
    evaluator… See the full description on the dataset page: https://huggingface.co/datasets/mteb/msmarco.

  11. ms-marco-n-tuple-scores-mxbai-embed-large-v1-20000-candidates

    • huggingface.co
    + more versions
    Cite
    Tom Aarsen, ms-marco-n-tuple-scores-mxbai-embed-large-v1-20000-candidates [Dataset]. https://huggingface.co/datasets/tomaarsen/ms-marco-n-tuple-scores-mxbai-embed-large-v1-20000-candidates
    Authors
    Tom Aarsen
    Description

    tomaarsen/ms-marco-n-tuple-scores-mxbai-embed-large-v1-20000-candidates dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. msmarco-hard-negatives

    • huggingface.co
    Updated Nov 26, 2021
    Cite
    Sentence Transformers (2021). msmarco-hard-negatives [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives
    Dataset updated
    Nov 26, 2021
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO Passages Hard Negatives

    This repository contains raw datasets, all of which have also been formatted for easy training in the MS MARCO Mined Triplets collection. We recommend looking there first.

    MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. This dataset repository contains files that are helpful for training bi-encoder models, e.g. using sentence-transformers.

      Training Code
    

    You… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives.

  13. msmarco-msmarco-distilbert-base-v3

    • huggingface.co
    Updated Nov 19, 2014
    + more versions
    Cite
    Sentence Transformers (2014). msmarco-msmarco-distilbert-base-v3 [Dataset]. https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3
    Dataset updated
    Nov 19, 2014
    Dataset authored and provided by
    Sentence Transformers
    Description

    MS MARCO with hard negatives from msmarco-distilbert-base-v3

    MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.

      Related Datasets
    

    These are the datasets generated using the 13 different models:

    msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3.

  14. msmarco-w-instructions

    • huggingface.co
    Updated Sep 18, 2024
    Cite
    Samaya AI (2024). msmarco-w-instructions [Dataset]. https://huggingface.co/datasets/samaya-ai/msmarco-w-instructions
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    Samaya AI
    Description

    Augmented MS MARCO dataset with Instructions

      Dataset Summary
    

    This dataset was used to train the Promptriever family of models. It contains the original MS MARCO training data along with instructions to go with each query. It also includes instruction-negatives, up to three per query. The dataset is designed to enable retrieval models that can be controlled via natural language prompts, similar to language models.

      Languages
    

    The dataset is primarily in English.… See the full description on the dataset page: https://huggingface.co/datasets/samaya-ai/msmarco-w-instructions.

  15. ms-marco-triplets-train

    • huggingface.co
    Updated Apr 26, 2025
    Cite
    Amy Freear (2025). ms-marco-triplets-train [Dataset]. https://huggingface.co/datasets/amyf/ms-marco-triplets-train
    Dataset updated
    Apr 26, 2025
    Authors
    Amy Freear
    Description

    amyf/ms-marco-triplets-train dataset hosted on Hugging Face and contributed by the HF Datasets community

  16. msmarco-nl

    • huggingface.co
    Updated Jul 24, 2011
    Cite
    Netherlands Forensic Institute (2011). msmarco-nl [Dataset]. https://huggingface.co/datasets/NetherlandsForensicInstitute/msmarco-nl
    Dataset updated
    Jul 24, 2011
    Dataset authored and provided by
    Netherlands Forensic Institute
    Description

    MS MARCO NL

    This is a machine translation of the MS MARCO dataset. This dataset can be used to train sentence embedding models. In contrast to our previous translation, an LLM (GPT-4o mini) was used for the translation. This results in generally higher translation quality.

      Source dataset
    

    The dataset is based on the MS MARCO dataset.

      Model
    

    We used a deployment of GPT-4o mini using the Microsoft Azure OpenAI APIs.

      Prompt
    

    The following… See the full description on the dataset page: https://huggingface.co/datasets/NetherlandsForensicInstitute/msmarco-nl.

  17. ms_marco_sr

    • huggingface.co
    Updated Oct 10, 2024
    Cite
    SmartCat (2024). ms_marco_sr [Dataset]. https://huggingface.co/datasets/smartcat/ms_marco_sr
    Dataset updated
    Oct 10, 2024
    Dataset authored and provided by
    SmartCat
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Serbian MS MARCO (Subset)

      Dataset Summary
    

    This dataset is a Serbian translation of the first 8,000 examples from Microsoft's MS MARCO (Machine Reading Comprehension) dataset. It contains pairs of questions and human-generated answers, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language. The original MS MARCO dataset… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/ms_marco_sr.
