Using the MTEB evaluation framework's version of this dataset, run the code below for assessment:

import logging

import mteb
from mteb import MTEB
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

model_name = 'intfloat/e5-base-v2'
model = SentenceTransformer(model_name)
tasks = mteb.get_tasks(
    tasks=[
        "AppsRetrieval",
        "CodeFeedbackMT",
        "CodeFeedbackST",
        "CodeTransOceanContest",
        "CodeTransOceanDL"… See the full description on the dataset page: https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet.
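The snippet above is cut off by the page; a minimal runnable completion, assuming the evaluation follows the standard mteb flow and using only the tasks visible before the truncation:

import logging

import mteb
from sentence_transformers import SentenceTransformer

logging.basicConfig(level=logging.INFO)

model = SentenceTransformer("intfloat/e5-base-v2")
# Only the tasks named before the truncation; the full snippet lists more.
tasks = mteb.get_tasks(
    tasks=[
        "AppsRetrieval",
        "CodeFeedbackMT",
        "CodeFeedbackST",
        "CodeTransOceanContest",
        "CodeTransOceanDL",
    ]
)
evaluator = mteb.MTEB(tasks=tasks)
results = evaluator.run(model, output_folder="results")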
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
MLRSNet is a multi-label, high spatial resolution remote sensing dataset for semantic scene understanding, composed of high spatial resolution optical satellite images that capture different perspectives of the world. MLRSNet contains 109,161 remote sensing images annotated into 46 categories, with between 1,500 and 3,000 sample images per category. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10 m to 0.1 m). Moreover, each image in the dataset is tagged with several of the 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label image classification, multi-label image retrieval, and image segmentation.
The entire dataset is available as a huggingface dataset. In the form of Splits - https://huggingface.co/datasets/vigneshwar472/MLRS-Net-for-modelling In the form of Categories - https://huggingface.co/datasets/vigneshwar472/MLRS-Net
The 60 predefined class labels are
airplane, airport, bare soil, baseball diamond, basketball court, beach, bridge, buildings, cars, chaparral, cloud, containers, crosswalk, dense residential area, desert, dock, factory, field, football field, forest, freeway, golf course, grass, greenhouse, gully, harbor, intersection, island, lake, mobile home, mountain, overpass, park, parking lot, parkway, pavement, railway, railway station, river, road, roundabout, runway, sand, sea, ships, snow, snowberg, sparse residential area, stadium, swimming pool, tanks, tennis court, terrace, track, trail, transmission tower, trees, water, wetland, wind turbine
The explanation of each directory can be found in the data explorer.
Column descriptors of the metadata files: the metadata files are available in the labels directory and the splits directory. Every metadata file has two columns:
1. image_id: the id of an image, with which a user can fetch the corresponding .jpg file from the corresponding folder
2. labels: all the labels associated with the image
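A minimal sketch of reading one of these metadata files with pandas; the file name, folder layout, and labels encoding below are illustrative assumptions, not the dataset's documented paths:

import pandas as pd

# Hypothetical metadata file from the labels directory.
df = pd.read_csv("labels/airport.csv")
for _, row in df.head(3).iterrows():
    # Assumed layout: images grouped per category, named by image_id.
    image_path = f"Images/airport/{row['image_id']}.jpg"
    print(image_path, row["labels"])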
Qi, Xiaoman; Zhu, Panpan; Wang, Yuebin; Zhang, Liqiang; Peng, Junhuan; Wu, Mengfan; Chen, Jialong; Zhao, Xudong; Zang, Ning; Mathiopoulos, P.Takis (2021),
“MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding”, Mendeley Data, V3, doi: 10.17632/7j9bv9vwsx.3
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NanoArguAnaRetrieval: an MTEB (Massive Text Embedding Benchmark) dataset
NanoArguAna is a smaller subset of ArguAna, a dataset for argument retrieval in debate contexts.
Task category: t2t
Domains: Medical, Written
Reference: http://argumentation.bplaced.net/arguana/data
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:

import mteb

tasks = mteb.get_tasks(tasks=["NanoArguAnaRetrieval"])
evaluator =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/NanoArguAnaRetrieval.
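A complete sketch of the evaluation flow the truncated snippet implies; the embedding model below is an illustrative placeholder, not one the dataset page prescribes:

import mteb
from sentence_transformers import SentenceTransformer

tasks = mteb.get_tasks(tasks=["NanoArguAnaRetrieval"])
evaluator = mteb.MTEB(tasks=tasks)
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
results = evaluator.run(model, output_folder="results")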
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CodeSearchNetRetrieval: an MTEB (Massive Text Embedding Benchmark) dataset
The dataset is a collection of code snippets and their corresponding natural language queries. The task is to retrieve the most relevant code snippet for a given query.
Task category: t2t
Domains: Programming, Written
Reference: https://huggingface.co/datasets/code_search_net/
Source datasets:
code-search-net/code_search_net
How to evaluate on this task
You can evaluate an embedding… See the full description on the dataset page: https://huggingface.co/datasets/mteb/CodeSearchNetRetrieval.
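The retrieval splits can also be inspected directly with the datasets library. The config and split names below follow the usual corpus/queries/qrels layout of mteb retrieval datasets and are an assumption here, not taken from the page:

from datasets import load_dataset

# Assumed mteb retrieval layout: separate corpus, queries, and qrels configs.
corpus = load_dataset("mteb/CodeSearchNetRetrieval", "corpus", split="corpus")
queries = load_dataset("mteb/CodeSearchNetRetrieval", "queries", split="queries")
qrels = load_dataset("mteb/CodeSearchNetRetrieval", "default", split="test")

print(corpus[0], queries[0], qrels[0])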
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ArguAna: an MTEB (Massive Text Embedding Benchmark) dataset
ArguAna is a dataset for argument retrieval in debate contexts: given an argument, the task is to retrieve the best counterargument.
Task category: t2t
Domains: Medical, Written
Reference: http://argumentation.bplaced.net/arguana/data
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:

import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluator = mteb.MTEB(tasks=tasks)
model = mteb.get_model(YOUR_MODEL)… See the full description on the dataset page: https://huggingface.co/datasets/mteb/arguana.
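A hedged completion of the truncated lines, with a concrete model name substituted for the YOUR_MODEL placeholder purely for illustration:

import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluator = mteb.MTEB(tasks=tasks)
model = mteb.get_model("intfloat/e5-base-v2")  # any embedding model known to mteb
results = evaluator.run(model, output_folder="results")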
SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in our paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity.
You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.
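A minimal example of that workflow; the checkpoint name is one of the framework's pretrained models, chosen here only for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative pretrained checkpoint
sentences = ["A man is eating food.", "Someone is having a meal.", "The sky is blue."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity: semantically similar sentences score close to 1.
scores = util.cos_sim(embeddings, embeddings)
print(scores[0, 1].item(), scores[0, 2].item())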
We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.
Further, this framework makes it easy to fine-tune custom embedding models, to achieve maximal performance on your specific task.
For the full documentation, see www.SBERT.net.
https://huggingface.co/sentence-transformers
The following publications are integrated in this framework:
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
- Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020)
- Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021)
- The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020)
- TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021)
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ICLR 2025 Submission 9148
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
This is an anonymous repository for the OpenReview submission: https://openreview.net/forum?id=Usklli4gMc
Dataset Description
The dataset contains the following fields:
Field name: Description
id: Unique identifier for the example
aspect: Aspect type for the example
scenario: The type of scenario associated with the entry
image: Contains image data in byte… See the full description on the dataset page: https://huggingface.co/datasets/mragbenchanonymous/MRAG-Bench.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
https://huggingface.co/datasets/Metacreation/GigaMIDI
We present the extended GigaMIDI dataset [https://huggingface.co/datasets/Metacreation/GigaMIDI/viewer/v2.0.0], a large-scale symbolic music collection comprising over 2.1 million unique MIDI files with detailed annotations for music loop detection. Expanding on its predecessor, this release introduces a novel expressive loop detection method that captures performance nuances such as microtiming and dynamic variation, essential for advanced generative music modelling. Our method extends previous approaches, which were limited to strictly quantized, non-expressive tracks, by employing the Note Onset Median Metric Level (NOMML) heuristic to distinguish expressive from non-expressive material. This enables robust loop detection across a broader spectrum of MIDI data. Our loop detection method reveals more than 9.2 million non-expressive loops spanning all General MIDI instruments, alongside 2.3 million expressive loops identified through our new method. As the largest resource of its kind, the extended GigaMIDI dataset provides a strong foundation for developing models that synthesize structurally coherent and expressively rich musical loops. As a use case, we leverage this dataset to train an expressive multitrack symbolic music loop generation model using the MIDI-GPT system, resulting in the creation of a synthetic loop dataset.
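A loading sketch; the v2.0.0 config name is inferred from the viewer URL above rather than documented here, and streaming is used because of the collection's size:

from datasets import load_dataset

# Config name taken from the viewer URL; an assumption for illustration.
gigamidi = load_dataset("Metacreation/GigaMIDI", "v2.0.0", split="train", streaming=True)
sample = next(iter(gigamidi))
print(sample.keys())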
Main curator: Keon Ju Maverick Lee
Assistance: Jeff Ens, Sara Adkins, Nathan Fradet, Pedro Sarmento, Mathieu Barthet, Phillip Long, Paul Triana
Research Director: Philippe Pasquier
Note: The GigaMIDI dataset is designed for continuous growth, with new subsets added and updated over time to ensure its ongoing expansion and relevance.
The dataset is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license permits users to share, adapt, and utilize the dataset exclusively for non-commercial purposes, including research and educational applications, provided that proper attribution is given to the original creators. By adhering to the terms of CC BY-NC 4.0, users ensure the dataset's responsible use while fostering its accessibility for academic and non-commercial endeavors.
You agree to use the GigaMIDI dataset only for non-commercial research or education without infringing copyright laws or causing harm to the creative rights of artists, creators, or musicians.
Currently, the extended GigaMIDI dataset is under review at the NeurIPS 2025 Dataset Track for Creative AI.
If you use the GigaMIDI dataset or any part of this project, please cite the following paper: https://transactions.ismir.net/articles/10.5334/tismir.203
@article{lee2025gigamidi,
title={The GigaMIDI Dataset with Features for Expressive Music Performance Detection},
author={Lee, Keon Ju Maverick and Ens, Jeff and Adkins, Sara and Sarmento, Pedro and Barthet, Mathieu and Pasquier, Philippe},
journal={Transactions of the International Society for Music Information Retrieval (TISMIR)},
volume={8},
number={1},
pages={1--19},
year={2025}
}
License: https://choosealicense.com/licenses/other/
Dataset Information
This repository contains augmented versions of several datasets:
Synthetic-ConvQA, NarrativeQA, RAG-TGE
For more information, refer to our blogpost. We used these datasets for long instruction-following training. The maximal sequence length of the examples is 32,768.
Synthetic-ConvQA with RAFT-style augmentation. Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K.
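The page truncates the recipe details; as a general illustration of RAFT-style context construction (Zhang et al., 2024), not this dataset's exact procedure, each question is paired with distractor documents plus, most of the time, the oracle document:

import random

def build_raft_context(question, golden_doc, distractor_pool, k=4, p_golden=0.8):
    """Assemble a RAFT-style training context from one QA pair."""
    docs = random.sample(distractor_pool, k)
    if random.random() < p_golden:
        # Include the oracle document in most examples; omit it in the rest
        # so the model learns not to rely on retrieval always succeeding.
        docs[random.randrange(k)] = golden_doc
    context = "\n\n".join(docs)
    return f"{context}\n\nQuestion: {question}"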
Unknown license: https://choosealicense.com/licenses/unknown/
PatternNet
The PatternNet dataset is a dataset for remote sensing scene classification and image retrieval.
Paper: https://arxiv.org/abs/1703.06339
Homepage: https://sites.google.com/view/zhouwx/dataset
Description
PatternNet is a large-scale high-resolution remote sensing dataset collected for remote sensing image retrieval. There are 38 classes and each class has 800 images of size 256×256 pixels. The images in PatternNet are collected from Google Earth… See the full description on the dataset page: https://huggingface.co/datasets/blanchon/PatternNet.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
By Arto (from Hugging Face) [source]
The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.
Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.
Each entry in these CSV files includes a filename string that indicates the name or identifier of an image file stored in another location or directory. Additionally, each entry provides a list (or multiple rows) of strings representing written descriptions or captions of the corresponding image.
Given this structure, the dataset can be immensely valuable to researchers, developers, and enthusiasts working on computer vision algorithms such as automatic caption generation from visual content, whether for training machine learning models to generate relevant captions for new, unseen images or for evaluating existing systems against diverse criteria.
Stay updated with cutting-edge research trends by leveraging this comprehensive dataset, which contains not only captions but also corresponding images across sets designed for varied computer vision tasks.
Overview of the Dataset
The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions per image to support diverse training techniques.
Understanding the Files
- train.csv: contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
- test.csv: the test set, with the same structure as train.csv; its purpose is to evaluate your trained models on unseen data.
- valid.csv: the validation set, providing images with their respective filenames (filename) and captions (captions); it allows you to fine-tune your models based on performance during evaluation.
Getting Started
To begin utilizing this dataset effectively, follow these steps:
- Extract the zip file containing all relevant data files onto your local machine or cloud environment.
- Familiarize yourself with the structure of each CSV file: train.csv, test.csv, and valid.csv. Understand how each filename (filename) corresponds with its caption(s) (captions).
- Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train, or train+validation).
- Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed (a minimal loading sketch follows this list).
- Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
- Split the data into training, validation, and test sets according to your experimental design requirements.
- Use appropriate algorithms and techniques to train your image captioning models on the provided data.
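A minimal loading sketch for the steps above; the filename and captions column names are as described, while the file path and the assumption that captions are serialized as Python-list strings are illustrative:

import pandas as pd
from ast import literal_eval

train_df = pd.read_csv("train.csv")  # path assumed relative to the extracted archive
# If each captions cell holds a serialized list of strings, parse it back into a list.
train_df["captions"] = train_df["captions"].apply(literal_eval)
print(train_df[["filename", "captions"]].head())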
Enhancing Model Performance
To optimize model performance using this dataset, consider these tips:
- Explore different architectures and pre-trained models specifically designed for image captioning tasks.
- Experiment with various natural language
- Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
- Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
- Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MathOverflow Duplicate Question Retrieval
The task of Duplicate Question Retrieval involves retrieving questions that are duplicates of a given input question. We construct our dataset from the MathOverflow portion of the Stack Exchange Data Dump (2024-09-30): https://archive.org/download/stackexchange_20240930/stackexchange_20240930/mathoverflow.net.7z
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
For research use only. The original data comes from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We use the nq-distilbert-base-v1 model to encode all the data into PyTorch tensors, and normalize the embeddings using sentence_transformers.util.normalize_embeddings.
How to use
See notebook Wikipedia Q&A Retrieval-Semantic Search
Installing the package
!pip install sentence-transformers==2.3.1
The converting process
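The page truncates here; the conversion it describes amounts to roughly the following sketch (the JSONL field names follow the SBERT semantic-search example for this dump, and the output path is illustrative):

import gzip
import json

import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nq-distilbert-base-v1")

# Each line of the dump holds a Wikipedia article with a title and its paragraphs.
passages = []
with gzip.open("simplewiki-2020-11-01.jsonl.gz", "rt", encoding="utf8") as f:
    for line in f:
        data = json.loads(line.strip())
        for paragraph in data["paragraphs"]:
            passages.append(data["title"] + " " + paragraph)

embeddings = model.encode(passages, convert_to_tensor=True, show_progress_bar=True)
embeddings = util.normalize_embeddings(embeddings)
torch.save(embeddings, "simplewiki-embeddings.pt")  # illustrative output path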
Dataset Card for English quotes
I-Dataset Summary
english_quotes is a dataset of all the quotes retrieved from Goodreads Quotes. This dataset can be used for multi-label text classification and text generation. The content of each quote is in English and concerns the domain of datasets for NLP and beyond.
II-Supported Tasks and Leaderboards
Multi-label text classification : The dataset can be used to train a model for text-classification, which consists of… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/english_quotes.
PhoneDB Device Specifications Dataset
Overview
This dataset contains structured information about smartphones and mobile devices, parsed from PhoneDB.net. Each record provides a detailed set of specifications for a single device, including hardware, software, dimensions, cameras, sensors, connectivity, and more. The dataset is provided in JSON format for easy use in data analysis, machine learning, and information retrieval tasks. Each entry contains: title → Full device name… See the full description on the dataset page: https://huggingface.co/datasets/Nadirova/Phone_SpecsDataset_25K.
About
ActivityNet Captions contains 20K long-form videos (180s average length) from YouTube and 100K captions. Most of the videos contain over 3 annotated events. Following existing work, we concatenate multiple short temporal descriptions into long sentences and evaluate 'paragraph-to-video' retrieval on this benchmark. We adopt the official split:
Train: 10,009 videos, 10,009 captions (concatenate from 37,421 short captions)
Test (Val1): 4,917 videos, 4,917 captions… See the full description on the dataset page: https://huggingface.co/datasets/friedrichor/ActivityNet_Captions.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
[!CAUTION] This dataset will not be updated. It corresponds to the last available public snapshot of the data, retrieved on July 28th, 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How To Load from HuggingFace Hub
Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:
from the_well.benchmark.data import WellDataModule

datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "active_matter_cloud_optimized",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:
    # Process training batch
… See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/active_matter.