MS MARCO Training Dataset
This dataset consists of 4 separate datasets, each using the MS MARCO Queries and passages:
triplets: This subset contains triplets of query-id, positive-id, negative-id as provided in qidpidtriples.train.full.2.tsv.gz from the MS MARCO website. The only change is that the dataset has been reshuffled. This subset can easily be used with a MultipleNegativesRankingLoss (a.k.a. InfoNCE) loss. labeled-list: This subset contains triplets of query-id, doc-ids… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco.
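As a rough illustration, a minimal sketch of loading the triplets subset with the datasets library (assuming the subset name above matches the Hugging Face config name and that it exposes a train split; the ids still need to be resolved against the MS MARCO queries and passages):
from datasets import load_dataset
# Load the reshuffled (query-id, positive-id, negative-id) triplets; exact column names may differ from the card's wording.
triplets = load_dataset("sentence-transformers/msmarco", "triplets", split="train")
print(triplets[0])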
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Webis MS MARCO Anchor Text 2022 dataset enriches Versions 1 and 2 of the MS MARCO document collection with anchor text extracted from six Common Crawl snapshots. The six snapshots cover the years 2016 to 2021 (1.7 to 3.4 billion documents each). Overall, the MS MARCO Anchor Text 2022 dataset enriches 1,703,834 documents for Version 1 and 4,821,244 documents for Version 2 with up to 1,000 anchor texts each.
ms-marco-en-bge
This dataset contains the MS MARCO dataset with negatives mined using ColBERT and then scored by bge-reranker-v2-gemma. It can be used to train a retrieval model using knowledge distillation, for example using PyLate.
knowledge distillation
To fine-tune a model using a knowledge distillation loss, we will need three distinct files:
Datasets
from datasets import load_dataset
train = load_dataset( "lightonai/ms-marco-en-gemma", "train"… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma.
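A hedged sketch of loading the three files, assuming they are exposed as the configs train, queries, and documents under the repo id shown in the card URL (only train is confirmed by the snippet above; check the card for the exact config and column names):
from datasets import load_dataset
repo_id = "lightonai/ms-marco-en-bge-gemma"  # taken from the card URL; the snippet above shows a slightly different id
train = load_dataset(repo_id, "train", split="train")          # per-query candidate documents with reranker scores (assumed contents)
queries = load_dataset(repo_id, "queries", split="train")      # query-id to query text mapping (assumed config)
documents = load_dataset(repo_id, "documents", split="train")  # document-id to document text mapping (assumed config)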
ms-marco-mini
This dataset gathers a small number of samples from MS MARCO to provide an example of triplet-based and knowledge-distillation dataset formatting.
triplet subset
The triplet file is all we need to fine-tune a model with a contrastive loss.
Columns: "query", "positive", "negative" Column types: str, str, str Examples:{ "query": "what are the liberal arts?", "positive": 'liberal arts. 1. the academic course of instruction at a college intended to provide general… See the full description on the dataset page: https://huggingface.co/datasets/lightonai/lighton-ms-marco-mini.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for "msmarco-tr"
More Information needed
MS MARCO query-passage scores using cross-encoder/ms-marco-MiniLM-L6-v2
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. This dataset contains 160 million CrossEncoder scores on the MS MARCO dataset, computed with the cross-encoder/ms-marco-MiniLM-L6-v2 model. The scores are unprocessed logits, i.e. they are not bounded between 0 and 1, and they can be used for fine-tuning search models via distillation. See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-scores-ms-marco-MiniLM-L6-v2.
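For context, a minimal sketch of how such scores are produced with the model named above, using Sentence Transformers' CrossEncoder class (the query/passage pair is a placeholder):
from sentence_transformers import CrossEncoder
# The scoring model named in the card; it returns raw logits rather than probabilities.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L6-v2")
pairs = [("what is the capital of france", "Paris is the capital and largest city of France.")]
scores = model.predict(pairs)  # unbounded logit scores, not restricted to 0...1
print(scores)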
MS MARCO with hard negatives from msmarco-MiniLM-L6-v3
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
Related Datasets
These are the datasets generated using the 13 different models:
msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-MiniLM-L6-v3.
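A hedged sketch of the general mining procedure behind these datasets: embed queries and passages with a bi-encoder, then keep the most similar passages per query as hard-negative candidates. The model id is inferred from the dataset name, and the query/passage lists are placeholders:
from sentence_transformers import SentenceTransformer, util
# Bi-encoder assumed from the dataset name; one of the 13 mining models.
model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L6-v3")
queries = ["what are the liberal arts?"]                 # placeholder query
passages = ["liberal arts. 1. the academic course ..."]  # placeholder passage corpus
query_emb = model.encode(queries, convert_to_tensor=True)
passage_emb = model.encode(passages, convert_to_tensor=True)
# Retrieve the 50 most similar passages per query, as described above; gold positives would then be filtered out.
hits = util.semantic_search(query_emb, passage_emb, top_k=50)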
https://choosealicense.com/licenses/unknown/
satyanshu404/MS-Marco-Prompt-generation dataset hosted on Hugging Face and contributed by the HF Datasets community
MS MARCO with hard negatives from mpnet-margin-mse-mean-v1
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
Related Datasets
These are the datasets generated using the 13 different models:
msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-mpnet-margin-mse-mean-v1.
https://choosealicense.com/licenses/other/
MSMARCO: an MTEB (Massive Text Embedding Benchmark) dataset
MS MARCO is a collection of datasets focused on deep learning in search
Task category: t2t
Domains: Encyclopaedic, Academic, Blog, News, Medical, Government, Reviews, Non-fiction, Social, Web
Reference: https://microsoft.github.io/msmarco/
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:
import mteb
task = mteb.get_tasks(["MSMARCO"])
evaluator… See the full description on the dataset page: https://huggingface.co/datasets/mteb/msmarco.
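A hedged completion of the truncated snippet, following the usual MTEB evaluation pattern (the model name is a placeholder, and the exact mteb API can vary between versions):
import mteb
from sentence_transformers import SentenceTransformer
# Placeholder embedding model; substitute the model you want to evaluate.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["MSMARCO"])
evaluator = mteb.MTEB(tasks=tasks)
results = evaluator.run(model, output_folder="results")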
tomaarsen/ms-marco-n-tuple-scores-mxbai-embed-large-v1-20000-candidates dataset hosted on Hugging Face and contributed by the HF Datasets community
MS MARCO Passages Hard Negatives
This repository contains raw datasets, all of which have also been formatted for easy training in the MS MARCO Mined Triplets collection. We recommend looking there first.
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. This dataset repository contains files that are helpful for training bi-encoder models, e.g. using Sentence Transformers.
Training Code
You… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives.
MS MARCO with hard negatives from msmarco-distilbert-base-v3
MS MARCO is a large-scale information retrieval corpus that was created based on real user search queries using the Bing search engine. For each query and gold positive passage, the 50 most similar paragraphs were mined using 13 different models. The resulting data can be used to train Sentence Transformer models.
Related Datasets
These are the datasets generated using the 13 different models:
msmarco-bm25… See the full description on the dataset page: https://huggingface.co/datasets/sentence-transformers/msmarco-msmarco-distilbert-base-v3.
Augmented MS MARCO dataset with Instructions
Dataset Summary
This dataset was used to train the Promptriever family of models. It contains the original MS MARCO training data along with instructions to go with each query. It also includes instruction-negatives, up to three per query. The dataset is designed to enable retrieval models that can be controlled via natural language prompts, similar to language models.
Languages
The dataset is primarily in English.… See the full description on the dataset page: https://huggingface.co/datasets/samaya-ai/msmarco-w-instructions.
amyf/ms-marco-triplets-train dataset hosted on Hugging Face and contributed by the HF Datasets community
MS MARCO NL
This is a machine translation of the MS MARCO dataset into Dutch. The dataset can be used to train sentence embedding models. In contrast to our previous translation, an LLM (GPT-4o mini) was used for the translation, which results in generally higher translation quality.
Source dataset
The dataset is based on the MS MARCO dataset.
Model
We used a GPT-4o mini deployment via the Microsoft Azure OpenAI APIs.
Prompt
The following… See the full description on the dataset page: https://huggingface.co/datasets/NetherlandsForensicInstitute/msmarco-nl.
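The prompt itself is given on the dataset page; as a hedged illustration of the Azure OpenAI setup described above, with a placeholder deployment name, API version, and prompt rather than the actual values used:
import os
from openai import AzureOpenAI
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # placeholder API version
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder Azure deployment name
    messages=[
        {"role": "system", "content": "Translate the following passage into Dutch."},  # placeholder prompt, not the one used for the dataset
        {"role": "user", "content": "MS MARCO is a large-scale information retrieval corpus."},
    ],
)
print(response.choices[0].message.content)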
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for Serbian MS MARCO (Subset)
Dataset Summary
This dataset is a Serbian translation of the first 8,000 examples from Microsoft's MS MARCO (Machine Reading Comprehension) dataset. It contains pairs of questions and human-generated answers, automatically translated from English to Serbian. The dataset is designed for evaluating embedding models on Question Answering (QA) and Information Retrieval (IR) tasks in the Serbian language. The original MS MARCO dataset… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/ms_marco_sr.