MS MARCO Training Dataset
This dataset consists of 4 separate datasets, each using the MS MARCO Queries and passages:
triplets: This subset contains triplets of query-id, positive-id, negative-id as provided in qidpidtriples.train.full.2.tsv.gz from the MS MARCO website. The only change is that the dataset has been reshuffled. This subset can easily be used with a MultipleNegativesRankingLoss (a.k.a. InfoNCE) loss. labeled-list: This subset contains triplets of query-id, doc-ids… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco.
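As a rough illustration, a minimal sketch of loading the triplets subset with the datasets library (assuming the subset name above matches the Hugging Face config name and that it exposes a train split; the ids still need to be resolved against the MS MARCO queries and passages):
from datasets import load_dataset
# Load the reshuffled (query-id, positive-id, negative-id) triplets; exact column names may differ from the card's wording.
triplets = load_dataset("sentence-transformers/msmarco", "triplets", split="train")
print(triplets[0])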
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The six snapshots cover the years 2016 to 2021 (1.7 to 3.4 billion documents each). Overall, the MS MARCO Anchor Text 2022 dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.
ms-marco-en-bge
This dataset contains the MS MARCO dataset with negatives mined using ColBERT and then scored by bge-reranker-v2-gemma. It can be used to train a retrieval model using knowledge distillation, for example using PyLate.
knowledge distillation
To fine-tune a model using a knowledge distillation loss, we will need three distinct files:
Datasets
from datasets import load_dataset
train = load_dataset( "lightonai/ms-marco-en-gemma", "train"… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma.
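A hedged sketch of loading the three files, assuming they are exposed as the configs train, queries, and documents under the repo id shown in the card URL (only train is confirmed by the snippet above; check the card for the exact config and column names):
from datasets import load_dataset
repo_id = "lightonai/ms-marco-en-bge-gemma"  # taken from the card URL; the snippet above shows a slightly different id
train = load_dataset(repo_id, "train", split="train")          # per-query candidate documents with reranker scores (assumed contents)
queries = load_dataset(repo_id, "queries", split="train")      # query-id to query text mapping (assumed config)
documents = load_dataset(repo_id, "documents", split="train")  # document-id to document text mapping (assumed config)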
ms-marco-mini
This dataset gathers a small number of samples from MS MARCO to provide an example of triplet-based and knowledge-distillation dataset formatting.
triplet subset
The triplet file is all we need to fine-tune a model with a contrastive loss.
Columns: "query", "positive", "negative" Column types: str, str, str Examples:{ "query": "what are the liberal arts?", "positive": 'liberal arts. 1. the academic course of instruction at a college intended to provide general… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/lighton-ms-marco-mini.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "msmarco-tr"
More Information needed
MS MARCO query-passage scores using cross-encoder/ms-marco-MiniLM-L6-v2
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. This dataset contains 160 million CrossEncoder scores on the MS MARCO dataset, computed with the cross-encoder/ms-marco-MiniLM-L6-v2 model. The scores are unprocessed logits, i.e. they are not bounded between 0 and 1, and they can be used for fine-tuning search models via distillation. See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2.
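For context, a minimal sketch of how such scores are produced with the model named above, using Sentence Transformers' CrossEncoder class (the query/passage pair is a placeholder):
from sentence_transformers import CrossEncoder
# The scoring model named in the card; it returns raw logits rather than probabilities.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
pairs = [("what is the capital of france", "Paris is the capital and largest city of France.")]
scores = model.predict(pairs)  # unbounded logit scores, not restricted to 0...1
print(scores)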
MS MARCO with hard negatives from msmarco-MiniLM-L6-v3
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
Related Datasets
These are the datasets generated using the 13 different models:
msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L6-v3.
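A hedged sketch of the general mining procedure behind these datasets: embed queries and passages with a bi-encoder, then keep the most similar passages per query as hard-negative candidates. The model id is inferred from the dataset name, and the query/passage lists are placeholders:
from sentence_transformers import SentenceTransformer, util
# Bi-encoder assumed from the dataset name; one of the 13 mining models.
model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L6-v3")
queries = ["what are the liberal arts?"]                 # placeholder query
passages = ["liberal arts. 1. the academic course ..."]  # placeholder passage corpus
query_emb = model.encode(queries, convert_to_tensor=True)
passage_emb = model.encode(passages, convert_to_tensor=True)
# Retrieve the 50 most similar passages per query, as described above; gold positives would then be filtered out.
hits = util.semantic_search(query_emb, passage_emb, top_k=50)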
https://choosealicense.com/licenses/unknown/
satyanshu404/MS-Marco-Prompt-generation dataset hosted on Hugging Face and contributed by the HF Datasets community
MS MARCO with hard negatives from mpnet-margin-mse-mean-v1
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
Related Datasets
These are the datasets generated using the 13 different models:
msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1.
https://choosealicense.com/licenses/other/
MSMARCO: an MTEB (Massive Text Embedding Benchmark) dataset
MS MARCO is a collection of datasets focused on deep learning in search
Task category: t2t
Domains: Encyclopaedic, Academic, Blog, News, Medical, Government, Reviews, Non-fiction, Social, Web
Reference: https://microsoft.github.io/msmarco/
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:
import mteb
task = mteb.get_tasks(["MSMARCO"])
evaluator… See the full description on the dataset page: https://huggingface.co/datasets/mteb/msmarco.
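A hedged completion of the truncated snippet, following the usual MTEB evaluation pattern (the model name is a placeholder, and the exact mteb API can vary between versions):
import mteb
from sentence_transformers import SentenceTransformer
# Placeholder embedding model; substitute the model you want to evaluate.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["MSMARCO"])
evaluator = mteb.MTEB(tasks=tasks)
results = evaluator.run(model, output_folder="results")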
tomaarsen/ms-marco-n-tuple-scores-mxbai-embed-large-v1-20000-candidates dataset hosted on Hugging Face and contributed by the HF Datasets community
MS MARCO Passages Hard Negatives
This repository contains raw datasets, all of which have also been formatted for easy training in the MS MARCO Mined Triplets collection. We recommend looking there first.
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. This dataset repository contains files that are helpful for training bi-encoder models, e.g. using Sentence Transformers.
Training Code
You… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives.
MS MARCO with hard negatives from msmarco-distilbert-base-v3
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
Related Datasets
These are the datasets generated using the 13 different models:
msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3.
Augmented MS MARCO dataset with Instructions
Dataset Summary
This dataset was used to train the Promptriever family of models. It contains the original MS MARCO training data along with instructions to go with each query. It also includes instruction-negatives, up to three per query. The dataset is designed to enable retrieval models that can be controlled via natural language prompts, similar to language models.
Languages
The dataset is primarily in English.… See the full description on the dataset page: https://huggingface.co/datasets/samaya-ai/msmarco-w-instructions.
amyf/ms-marco-triplets-train dataset hosted on Hugging Face and contributed by the HF Datasets community
MS MARCO NL
This is a machine translation of the MS MARCO dataset into Dutch. The dataset can be used to train sentence embedding models. In contrast to our previous translation, an LLM (GPT-4o mini) was used for the translation, which results in generally higher translation quality.
Source dataset
The dataset is based on the MS MARCO dataset.
Model
We used a GPT-4o mini deployment via the Microsoft Azure OpenAI APIs.
Prompt
The following… See the full description on the dataset page: https://huggingface.co/datasets/NetherlandsForensicInstitute/msmarco-nl.
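The prompt itself is given on the dataset page; as a hedged illustration of the Azure OpenAI setup described above, with a placeholder deployment name, API version, and prompt rather than the actual values used:
import os
from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # placeholder API version
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder Azure deployment name
    messages=[
        {"role": "system", "content": "Translate the following passage into Dutch."},  # placeholder prompt, not the one used for the dataset
        {"role": "user", "content": "MS MARCO is a large-scale information retrieval corpus."},
    ],
)
print(response.choices[0].message.content)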
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Serbian MS MARCO (Subset)
Dataset Summary
This dataset is a Serbian translation of the first 8,000 examples from Microsoft's MS MARCO (Machine Reading Comprehension) dataset. It contains pairs of questions and human-generated answers, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language. The original MS MARCO dataset… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/ms_marco_sr.