Using the MTEB evaluation framework's version of this dataset, run the code below for assessment:

import logging

import mteb
from mteb import MTEB
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

model_name = 'intfloat/e5-base-v2'
model = SentenceTransformer(model_name)
tasks = mteb.get_tasks(
    tasks=[
        "AppsRetrieval",
        "CodeFeedbackMT",
        "CodeFeedbackST",
        "CodeTransOceanContest",
        "CodeTransOceanDL"… See the full description on the dataset page: https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet.
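The snippet above is cut off by the page; a minimal runnable completion, assuming the evaluation follows the standard mteb flow and using only the tasks visible before the truncation:

import logging

import mteb
from sentence_transformers import SentenceTransformer

logging.basicConfig(level=logging.INFO)

model = SentenceTransformer("intfloat/e5-base-v2")
# Only the tasks named before the truncation; the full snippet lists more.
tasks = mteb.get_tasks(
    tasks=[
        "AppsRetrieval",
        "CodeFeedbackMT",
        "CodeFeedbackST",
        "CodeTransOceanContest",
        "CodeTransOceanDL",
    ]
)
evaluator = mteb.MTEB(tasks=tasks)
results = evaluator.run(model, output_folder="results")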
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
MLRSNet is a multi-label, high spatial resolution remote sensing dataset for semantic scene understanding, composed of high spatial resolution optical satellite images that capture different perspectives of the world. MLRSNet contains 109,161 remote sensing images annotated into 46 categories, with between 1,500 and 3,000 sample images per category. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10 m to 0.1 m). Moreover, each image in the dataset is tagged with several of the 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label image classification, multi-label image retrieval, and image segmentation.
The entire dataset is available as a huggingface dataset. In the form of Splits - https://huggingface.co/datasets/vigneshwar472/MLRS-Net-for-modelling In the form of Categories - https://huggingface.co/datasets/vigneshwar472/MLRS-Net
The 60 predefined class labels are
airplane, airport, bare soil, baseball diamond, basketball court, beach, bridge, buildings, cars, chaparral, cloud, containers, crosswalk, dense residential area, desert, dock, factory, field, football field, forest, freeway, golf course, grass, greenhouse, gully, harbor, intersection, island, lake, mobile home, mountain, overpass, park, parking lot, parkway, pavement, railway, railway station, river, road, roundabout, runway, sand, sea, ships, snow, snowberg, sparse residential area, stadium, swimming pool, tanks, tennis court, terrace, track, trail, transmission tower, trees, water, wetland, wind turbine
The explanation of each directory can be found in the data explorer.
Column descriptors of the metadata files: the metadata files are available in the labels directory and the splits directory. Every metadata file has two columns:
1. image_id: the id of an image, with which a user can fetch the corresponding .jpg file from the corresponding folder
2. labels: all the labels associated with the image
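A minimal sketch of reading one of these metadata files with pandas; the file name, folder layout, and labels encoding below are illustrative assumptions, not the dataset's documented paths:

import pandas as pd

# Hypothetical metadata file from the labels directory.
df = pd.read_csv("labels/airport.csv")
for _, row in df.head(3).iterrows():
    # Assumed layout: images grouped per category, named by image_id.
    image_path = f"Images/airport/{row['image_id']}.jpg"
    print(image_path, row["labels"])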
Qi, Xiaoman; Zhu, Panpan; Wang, Yuebin; Zhang, Liqiang; Peng, Junhuan; Wu, Mengfan; Chen, Jialong; Zhao, Xudong; Zang, Ning; Mathiopoulos, P.Takis (2021),
“MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding”, Mendeley Data, V3, doi: 10.17632/7j9bv9vwsx.3
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NanoArguAnaRetrieval: an MTEB (Massive Text Embedding Benchmark) dataset
NanoArguAna is a smaller subset of ArguAna, a dataset for argument retrieval in debate contexts.
Task category: t2t
Domains: Medical, Written
Reference: http://argumentation.bplaced.net/arguana/data
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:

import mteb

tasks = mteb.get_tasks(tasks=["NanoArguAnaRetrieval"])
evaluator =… See the full description on the dataset page: https://huggingface.co/datasets/mteb/NanoArguAnaRetrieval.
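A complete sketch of the evaluation flow the truncated snippet implies; the embedding model below is an illustrative placeholder, not one the dataset page prescribes:

import mteb
from sentence_transformers import SentenceTransformer

tasks = mteb.get_tasks(tasks=["NanoArguAnaRetrieval"])
evaluator = mteb.MTEB(tasks=tasks)
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
results = evaluator.run(model, output_folder="results")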
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
CodeSearchNetRetrieval: an MTEB (Massive Text Embedding Benchmark) dataset
The dataset is a collection of code snippets and their corresponding natural language queries. The task is to retrieve the most relevant code snippet for a given query.
Task category: t2t
Domains: Programming, Written
Reference: https://huggingface.co/datasets/code_search_net/
Source datasets:
code-search-net/code_search_net
How to evaluate on this task
You can evaluate an embedding… See the full description on the dataset page: https://huggingface.co/datasets/mteb/CodeSearchNetRetrieval.
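The retrieval splits can also be inspected directly with the datasets library. The config and split names below follow the usual corpus/queries/qrels layout of mteb retrieval datasets and are an assumption here, not taken from the page:

from datasets import load_dataset

# Assumed mteb retrieval layout: separate corpus, queries, and qrels configs.
corpus = load_dataset("mteb/CodeSearchNetRetrieval", "corpus", split="corpus")
queries = load_dataset("mteb/CodeSearchNetRetrieval", "queries", split="queries")
qrels = load_dataset("mteb/CodeSearchNetRetrieval", "default", split="test")

print(corpus[0], queries[0], qrels[0])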
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
ArguAna: an MTEB (Massive Text Embedding Benchmark) dataset
ArguAna is a dataset for argument retrieval in debate contexts: given an argument, the task is to retrieve the best counterargument.
Task category: t2t
Domains: Medical, Written
Reference: http://argumentation.bplaced.net/arguana/data
How to evaluate on this task
You can evaluate an embedding model on this dataset using the following code:

import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluator = mteb.MTEB(tasks=tasks)
model = mteb.get_model(YOUR_MODEL)… See the full description on the dataset page: https://huggingface.co/datasets/mteb/arguana.
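A hedged completion of the truncated lines, with a concrete model name substituted for the YOUR_MODEL placeholder purely for illustration:

import mteb

tasks = mteb.get_tasks(tasks=["ArguAna"])
evaluator = mteb.MTEB(tasks=tasks)
model = mteb.get_model("intfloat/e5-base-v2")  # any embedding model known to mteb
results = evaluator.run(model, output_folder="results")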
SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. The initial work is described in our paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa and achieve state-of-the-art performance on various tasks. Text is embedded in a vector space such that similar texts are close together and can be found efficiently using cosine similarity.
You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.
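A minimal example of that workflow; the checkpoint name is one of the framework's pretrained models, chosen here only for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative pretrained checkpoint
sentences = ["A man is eating food.", "Someone is having a meal.", "The sky is blue."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity: semantically similar sentences score close to 1.
scores = util.cos_sim(embeddings, embeddings)
print(scores[0, 1].item(), scores[0, 2].item())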
We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.
Further, this framework makes it easy to fine-tune custom embedding models, to achieve maximal performance on your specific task.
For the full documentation, see www.SBERT.net.
https://huggingface.co/sentence-transformers
The following publications are integrated in this framework:
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
- Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020)
- Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021)
- The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020)
- TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021)
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021)
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ICLR 2025 Submission 9148
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
This is an anonymous repository for the OpenReview submission: https://openreview.net/forum?id=Usklli4gMc
Dataset Description
The dataset contains the following fields:
Field name: Description
id: Unique identifier for the example
aspect: Aspect type for the example
scenario: The type of scenario associated with the entry
image: Contains image data in byte… See the full description on the dataset page: https://huggingface.co/datasets/mragbenchanonymous/MRAG-Bench.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
https://huggingface.co/datasets/Metacreation/GigaMIDI
We present the extended GigaMIDI dataset [https://huggingface.co/datasets/Metacreation/GigaMIDI/viewer/v2.0.0], a large-scale symbolic music collection comprising over 2.1 million unique MIDI files with detailed annotations for music loop detection. Expanding on its predecessor, this release introduces a novel expressive loop detection method that captures performance nuances such as microtiming and dynamic variation, essential for advanced generative music modelling. Our method extends previous approaches, which were limited to strictly quantized, non-expressive tracks, by employing the Note Onset Median Metric Level (NOMML) heuristic to distinguish expressive from non-expressive material. This enables robust loop detection across a broader spectrum of MIDI data. Our loop detection method reveals more than 9.2 million non-expressive loops spanning all General MIDI instruments, alongside 2.3 million expressive loops identified through our new method. As the largest resource of its kind, the extended GigaMIDI dataset provides a strong foundation for developing models that synthesize structurally coherent and expressively rich musical loops. As a use case, we leverage this dataset to train an expressive multitrack symbolic music loop generation model using the MIDI-GPT system, resulting in the creation of a synthetic loop dataset.
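A loading sketch; the v2.0.0 config name is inferred from the viewer URL above rather than documented here, and streaming is used because of the collection's size:

from datasets import load_dataset

# Config name taken from the viewer URL; an assumption for illustration.
gigamidi = load_dataset("Metacreation/GigaMIDI", "v2.0.0", split="train", streaming=True)
sample = next(iter(gigamidi))
print(sample.keys())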
Main curator: Keon Ju Maverick Lee
Assistance: Jeff Ens, Sara Adkins, Nathan Fradet, Pedro Sarmento, Mathieu Barthet, Phillip Long, Paul Triana
Research Director: Philippe Pasquier
Note: The GigaMIDI dataset is designed for continuous growth, with new subsets added and updated over time to ensure its ongoing expansion and relevance.
The dataset is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license permits users to share, adapt, and utilize the dataset exclusively for non-commercial purposes, including research and educational applications, provided that proper attribution is given to the original creators. By adhering to the terms of CC BY-NC 4.0, users ensure the dataset's responsible use while fostering its accessibility for academic and non-commercial endeavors.
You agree to use the GigaMIDI dataset only for non-commercial research or education without infringing copyright laws or causing harm to the creative rights of artists, creators, or musicians.
Currently, the extended GigaMIDI dataset is under review at the NeurIPS 2025 Dataset Track for Creative AI.
If you use the GigaMIDI dataset or any part of this project, please cite the following paper: https://transactions.ismir.net/articles/10.5334/tismir.203
@article{lee2025gigamidi,
title={The GigaMIDI Dataset with Features for Expressive Music Performance Detection},
author={Lee, Keon Ju Maverick and Ens, Jeff and Adkins, Sara and Sarmento, Pedro and Barthet, Mathieu and Pasquier, Philippe},
journal={Transactions of the International Society for Music Information Retrieval (TISMIR)},
volume={8},
number={1},
pages={1--19},
year={2025}
}
License: https://choosealicense.com/licenses/other/
Dataset Information
This repository contains augmented versions of several datasets:
Synthetic-ConvQA, NarrativeQA, RAG-TGE
For more information, refer to our blogpost. We used these datasets for long instruction-following training. The maximal sequence length of the examples is 32,768.
Synthetic-ConvQA with RAFT-style augmentation. Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K.
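The page truncates the recipe details; as a general illustration of RAFT-style context construction (Zhang et al., 2024), not this dataset's exact procedure, each question is paired with distractor documents plus, most of the time, the oracle document:

import random

def build_raft_context(question, golden_doc, distractor_pool, k=4, p_golden=0.8):
    """Assemble a RAFT-style training context from one QA pair."""
    docs = random.sample(distractor_pool, k)
    if random.random() < p_golden:
        # Include the oracle document in most examples; omit it in the rest
        # so the model learns not to rely on retrieval always succeeding.
        docs[random.randrange(k)] = golden_doc
    context = "\n\n".join(docs)
    return f"{context}\n\nQuestion: {question}"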
Unknown license: https://choosealicense.com/licenses/unknown/
PatternNet
The PatternNet dataset is a dataset for remote sensing scene classification and image retrieval.
Paper: https://arxiv.org/abs/1703.06339
Homepage: https://sites.google.com/view/zhouwx/dataset
Description
PatternNet is a large-scale high-resolution remote sensing dataset collected for remote sensing image retrieval. There are 38 classes and each class has 800 images of size 256×256 pixels. The images in PatternNet are collected from Google Earth… See the full description on the dataset page: https://huggingface.co/datasets/blanchon/PatternNet.
CC0 1.0 Public Domain: https://creativecommons.org/publicdomain/zero/1.0/
By Arto (from Hugging Face) [source]
The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.
Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.
Each entry in these CSV files includes a filename string that indicates the name or identifier of an image file stored in another location or directory. Additionally, each entry provides a list (or multiple rows) of strings representing written descriptions or captions of the corresponding image.
Given this structure, the dataset can be immensely valuable to researchers, developers, and enthusiasts working on computer vision algorithms such as automatic caption generation from visual content, whether for training machine learning models to generate relevant captions for new, unseen images or for evaluating existing systems against diverse criteria.
Stay updated with cutting-edge research trends by leveraging this comprehensive dataset, which contains not only captions but also corresponding images across sets designed for varied computer vision tasks.
Overview of the Dataset
The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions per image to support diverse training techniques.
Understanding the Files
- train.csv: contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
- test.csv: the test set, with the same structure as train.csv; its purpose is to evaluate your trained models on unseen data.
- valid.csv: the validation set, providing images with their respective filenames (filename) and captions (captions); it allows you to fine-tune your models based on performance during evaluation.
Getting Started
To begin utilizing this dataset effectively, follow these steps:
- Extract the zip file containing all relevant data files onto your local machine or cloud environment.
- Familiarize yourself with the structure of each CSV file: train.csv, test.csv, and valid.csv. Understand how each filename (filename) corresponds with its caption(s) (captions).
- Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train, or train+validation).
- Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed (a minimal loading sketch follows this list).
- Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
- Split the data into training, validation, and test sets according to your experimental design requirements.
- Use appropriate algorithms and techniques to train your image captioning models on the provided data.
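A minimal loading sketch for the steps above; the filename and captions column names are as described, while the file path and the assumption that captions are serialized as Python-list strings are illustrative:

import pandas as pd
from ast import literal_eval

train_df = pd.read_csv("train.csv")  # path assumed relative to the extracted archive
# If each captions cell holds a serialized list of strings, parse it back into a list.
train_df["captions"] = train_df["captions"].apply(literal_eval)
print(train_df[["filename", "captions"]].head())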
Enhancing Model Performance
To optimize model performance using this dataset, consider these tips:
- Explore different architectures and pre-trained models specifically designed for image captioning tasks.
- Experiment with various natural language
- Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
- Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
- Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MathOverflow Duplicate Question Retrieval
The task of Duplicate Question Retrieval involves retrieving questions that are duplicates of a given input question. We construct our dataset from the MathOverflow portion of the Stack Exchange Data Dump (2024-09-30): https://archive.org/download/stackexchange_20240930/stackexchange_20240930/mathoverflow.net.7z
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
For research use only. The original data comes from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We use the nq-distilbert-base-v1 model to encode all the data into PyTorch tensors, and normalize the embeddings using sentence_transformers.util.normalize_embeddings.
How to use
See notebook Wikipedia Q&A Retrieval-Semantic Search
Installing the package
!pip install sentence-transformers==2.3.1
The converting process
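The page truncates here; the conversion it describes amounts to roughly the following sketch (the JSONL field names follow the SBERT semantic-search example for this dump, and the output path is illustrative):

import gzip
import json

import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nq-distilbert-base-v1")

# Each line of the dump holds a Wikipedia article with a title and its paragraphs.
passages = []
with gzip.open("simplewiki-2020-11-01.jsonl.gz", "rt", encoding="utf8") as f:
    for line in f:
        data = json.loads(line.strip())
        for paragraph in data["paragraphs"]:
            passages.append(data["title"] + " " + paragraph)

embeddings = model.encode(passages, convert_to_tensor=True, show_progress_bar=True)
embeddings = util.normalize_embeddings(embeddings)
torch.save(embeddings, "simplewiki-embeddings.pt")  # illustrative output path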
Dataset Card for English quotes
I-Dataset Summary
english_quotes is a dataset of all the quotes retrieved from Goodreads Quotes. This dataset can be used for multi-label text classification and text generation. The content of each quote is in English and concerns the domain of datasets for NLP and beyond.
II-Supported Tasks and Leaderboards
Multi-label text classification : The dataset can be used to train a model for text-classification, which consists of… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/english_quotes.
PhoneDB Device Specifications Dataset
Overview
This dataset contains structured information about smartphones and mobile devices, parsed from PhoneDB.net. Each record provides a detailed set of specifications for a single device, including hardware, software, dimensions, cameras, sensors, connectivity, and more. The dataset is provided in JSON format for easy use in data analysis, machine learning, and information retrieval tasks. Each entry contains: title → Full device name… See the full description on the dataset page: https://huggingface.co/datasets/Nadirova/Phone_SpecsDataset_25K.
About
ActivityNet Captions contains 20K long-form videos (180s average length) from YouTube and 100K captions. Most of the videos contain over 3 annotated events. Following existing work, we concatenate multiple short temporal descriptions into long sentences and evaluate 'paragraph-to-video' retrieval on this benchmark. We adopt the official split:
Train: 10,009 videos, 10,009 captions (concatenate from 37,421 short captions)
Test (Val1): 4,917 videos, 4,917 captions… See the full description on the dataset page: https://huggingface.co/datasets/friedrichor/ActivityNet_Captions.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
[!CAUTION] This dataset will not be updated. It corresponds to the last available public snapshot of the data, retrieved on July 28th, 2025.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How To Load from HuggingFace Hub
Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:
from the_well.benchmark.data import WellDataModule

datamodule = WellDataModule(
    "hf://datasets/polymathic-ai/",
    "active_matter_cloud_optimized",
)
train_dataloader = datamodule.train_dataloader()

for batch in train_dataloader:
    # Process training batch
… See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/active_matter.