18 datasets found
  1. CodeSearchNet

    • huggingface.co
    Updated Nov 9, 2025
    Cite
    CoIR (2025). CodeSearchNet [Dataset]. https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet
    Dataset updated
    Nov 9, 2025
    Dataset authored and provided by
    CoIR
    Description

    To evaluate with the version of this dataset used by the MTEB evaluation framework, use the code below:

    import mteb
    import logging
    from sentence_transformers import SentenceTransformer
    from mteb import MTEB

    logger = logging.getLogger(__name__)

    model_name = 'intfloat/e5-base-v2'
    model = SentenceTransformer(model_name)
    tasks = mteb.get_tasks(
        tasks=[
            "AppsRetrieval",
            "CodeFeedbackMT",
            "CodeFeedbackST",
            "CodeTransOceanContest",
            "CodeTransOceanDL"…

    See the full description on the dataset page: https://huggingface.co/datasets/CoIR-Retrieval/CodeSearchNet.

  2. MLRS Net

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    Keesari Vigneshwar Reddy (2024). MLRS Net [Dataset]. https://www.kaggle.com/datasets/vigneshwar472/mlrs-net
    Download format: zip (2650144873 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    Keesari Vigneshwar Reddy
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding. It provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.

    The entire dataset is available as a Hugging Face dataset:

    • In the form of splits: https://huggingface.co/datasets/vigneshwar472/MLRS-Net-for-modelling
    • In the form of categories: https://huggingface.co/datasets/vigneshwar472/MLRS-Net

    The 60 predefined class labels are airplane, airport, bare soil, baseball diamond, basketball court, beach, bridge, buildings, cars, chaparral, cloud, containers, crosswalk, dense residential area, desert, dock, factory, field, football field, forest, freeway, golf course, grass, greenhouse, gully, harbor, intersection, island, lake, mobile home, mountain, overpass, park, parking lot, parkway, pavement, railway, railway station, river, road, roundabout, runway, sand, sea, ships, snow, snowberg, sparse residential area, stadium, swimming pool, tanks, tennis court, terrace, track, trail, transmission tower, trees, water, wetland, wind turbine.

    The explanation of each directory can be found on data explorer.

    Column descriptor of metadata files: The metadata files are available in the labels directory and the splits directory. Every metadata file has two columns:

    1. image_id: ID of the image, with which a user can fetch the corresponding .jpg file from the corresponding folder
    2. labels: all the labels associated with the image
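    The two-column layout described above could be read roughly as follows. This is only a sketch: the sample rows, the image IDs, and the comma-separated format inside the labels column are assumptions made for illustration, since the listing does not show the actual file contents.

```python
import csv
import io

# Hypothetical sample of a metadata file. The column names image_id and
# labels come from the description; the rows and the comma-separated
# label format are assumptions for illustration only.
SAMPLE_CSV = '''image_id,labels
airport_00001,"airport, airplane, runway, pavement"
beach_00042,"beach, sea, sand"
'''


def read_metadata(text):
    """Return a dict mapping image_id to a list of label strings."""
    rows = csv.DictReader(io.StringIO(text))
    return {
        row["image_id"]: [label.strip() for label in row["labels"].split(",")]
        for row in rows
    }


def multi_hot(labels, vocabulary):
    """Encode a label list as a 0/1 vector over a fixed vocabulary."""
    present = set(labels)
    return [1 if name in present else 0 for name in vocabulary]


metadata = read_metadata(SAMPLE_CSV)
# Build the vocabulary from the labels seen in the sample.
vocabulary = sorted({label for labels in metadata.values() for label in labels})
vectors = {
    image_id: multi_hot(labels, vocabulary)
    for image_id, labels in metadata.items()
}
```

    With the real files, the vocabulary would be the 60 predefined class labels rather than the labels seen in a sample.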

    Citation

    Qi, Xiaoman; Zhu, Panpan; Wang, Yuebin; Zhang, Liqiang; Peng, Junhuan; Wu, Mengfan; Chen, Jialong; Zhao, Xudong; Zang, Ning; Mathiopoulos, P.Takis (2021), 
    “MLRSNet: A Multi-label High Spatial Resolution Remote Sensing Dataset for Semantic Scene Understanding”, Mendeley Data, V3, doi: 10.17632/7j9bv9vwsx.3
    
  3. NanoArguAnaRetrieval

    • huggingface.co
    Cite
    Massive Text Embedding Benchmark, NanoArguAnaRetrieval [Dataset]. https://huggingface.co/datasets/mteb/NanoArguAnaRetrieval
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NanoArguAnaRetrieval An MTEB dataset Massive Text Embedding Benchmark

    NanoArguAna is a smaller subset of ArguAna, a dataset for argument retrieval in debate contexts.

    Task category t2t

    Domains Medical, Written

    Reference http://argumentation.bplaced.net/arguana/data

    How to evaluate on this task

    You can evaluate an embedding model on this dataset using the following code:

    import mteb

    task = mteb.get_tasks(["NanoArguAnaRetrieval"])
    evaluator =…

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/NanoArguAnaRetrieval.

  4. CodeSearchNetRetrieval

    • huggingface.co
    Updated May 11, 2025
    Cite
    Massive Text Embedding Benchmark (2025). CodeSearchNetRetrieval [Dataset]. https://huggingface.co/datasets/mteb/CodeSearchNetRetrieval
    Dataset updated
    May 11, 2025
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    CodeSearchNetRetrieval An MTEB dataset Massive Text Embedding Benchmark

    The dataset is a collection of code snippets and their corresponding natural language queries. The task is to retrieve the most relevant code snippet for a given query.

    Task category t2t

    Domains Programming, Written

    Reference https://huggingface.co/datasets/code_search_net/

    Source datasets:

    code-search-net/code_search_net

      How to evaluate on this task
    

    You can evaluate an embedding… See the full description on the dataset page: https://huggingface.co/datasets/mteb/CodeSearchNetRetrieval.

  5. arguana

    • huggingface.co
    Updated Mar 2, 2024
    Cite
    Massive Text Embedding Benchmark (2024). arguana [Dataset]. https://huggingface.co/datasets/mteb/arguana
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    Massive Text Embedding Benchmark
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    ArguAna An MTEB dataset Massive Text Embedding Benchmark

    NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval

    Task category t2t

    Domains Medical, Written

    Reference http://argumentation.bplaced.net/arguana/data

    How to evaluate on this task

    You can evaluate an embedding model on this dataset using the following code:

    import mteb

    task = mteb.get_tasks(["ArguAna"])
    evaluator = mteb.MTEB(task)

    model = mteb.get_model(YOUR_MODEL)…

    See the full description on the dataset page: https://huggingface.co/datasets/mteb/arguana.

  6. sentence-transformers v1.20

    • kaggle.com
    zip
    Updated May 19, 2022
    Cite
    Levent Serinol (2022). sentence-transformers v1.20 [Dataset]. https://www.kaggle.com/landfallmotto/sentencetransformers-v120
    Download format: zip (15131732 bytes)
    Dataset updated
    May 19, 2022
    Authors
    Levent Serinol
    Description

    Sentence Transformers: Multilingual Sentence, Paragraph, and Image Embeddings using BERT & Co.

    SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in our paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

    This framework provides an easy method to compute dense vector representations for sentences, paragraphs, and images. The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and achieve state-of-the-art performance in various tasks. Text is embedded in vector space such that similar text is close and can efficiently be found using cosine similarity.

    You can use this framework to compute sentence / text embeddings for more than 100 languages. These embeddings can then be compared, e.g. with cosine similarity, to find sentences with a similar meaning. This can be useful for semantic textual similarity, semantic search, or paraphrase mining.
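    As a rough illustration of the comparison step, the sketch below ranks sentences by cosine similarity to a query. The three-dimensional vectors are invented toy values standing in for real embeddings, which in practice would come from a pretrained model; only the ranking logic is shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" invented for illustration; real sentence embeddings
# have hundreds of dimensions and are produced by a model.
corpus = {
    "A man is eating food.": [0.9, 0.1, 0.0],
    "A man is eating pasta.": [0.8, 0.2, 0.1],
    "A cheetah chases its prey.": [0.0, 0.1, 0.9],
}
query_embedding = [0.85, 0.15, 0.05]

# Sort corpus sentences from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda sentence: cosine_similarity(query_embedding, corpus[sentence]),
    reverse=True,
)
```

    The same ranking idea underlies semantic search: embed the query, score it against every corpus embedding, and return the top results.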

    We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

    Further, this framework allows an easy fine-tuning of custom embeddings models, to achieve maximal performance on your specific task.

    For the full documentation, see www.SBERT.net.

    https://huggingface.co/sentence-transformers

    The following publications are integrated in this framework:

    • Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
    • Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation (EMNLP 2020)
    • Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks (NAACL 2021)
    • The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes (arXiv 2020)
    • TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning (arXiv 2021)
    • BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2021)

  7. MRAG-Bench

    • huggingface.co
    Updated Nov 21, 2024
    Cite
    mragbenchanonymous (2024). MRAG-Bench [Dataset]. https://huggingface.co/datasets/mragbenchanonymous/MRAG-Bench
    Dataset updated
    Nov 21, 2024
    Authors
    mragbenchanonymous
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ICLR 2025 Submission 9148

    MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

    This is an anonymous repo for openreview https://openreview.net/forum?id=Usklli4gMc

    Dataset Description

    The dataset contains the following fields:

    Field Name   Description
    id           Unique identifier for the example
    aspect       Aspect type for the example
    scenario     The type of scenario associated with the entry
    image        Contains image data in byte…

    See the full description on the dataset page: https://huggingface.co/datasets/mragbenchanonymous/MRAG-Bench.

  8. Extended GigaMIDI v2.0.0

    • kaggle.com
    zip
    Updated Oct 25, 2025
    Cite
    Jason Myers (2025). Extended GigaMIDI v2.0.0 [Dataset]. https://www.kaggle.com/datasets/json2007or8/extended-gigamidi-v2-0-0
    Download format: zip (6074087264 bytes)
    Dataset updated
    Oct 25, 2025
    Authors
    Jason Myers
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    - From Metacreation Lab at HuggingFace -

    https://huggingface.co/datasets/Metacreation/GigaMIDI

    The Extended GigaMIDI Dataset Summary

    We present the extended GigaMIDI dataset [https://huggingface.co/datasets/Metacreation/GigaMIDI/viewer/v2.0.0], a large-scale symbolic music collection comprising over 2.1 million unique MIDI files with detailed annotations for music loop detection. Expanding on its predecessor, this release introduces a novel expressive loop detection method that captures performance nuances such as microtiming and dynamic variation, essential for advanced generative music modelling. Our method extends previous approaches, which were limited to strictly quantized, non-expressive tracks, by employing the Note Onset Median Metric Level (NOMML) heuristic to distinguish expressive from non-expressive material. This enables robust loop detection across a broader spectrum of MIDI data. Our loop detection method reveals more than 9.2 million non-expressive loops spanning all General MIDI instruments, alongside 2.3 million expressive loops identified through our new method. As the largest resource of its kind, the extended GigaMIDI dataset provides a strong foundation for developing models that synthesize structurally coherent and expressively rich musical loops. As a use case, we leverage this dataset to train an expressive multitrack symbolic music loop generation model using the MIDI-GPT system, resulting in the creation of a synthetic loop dataset.

    Dataset Description

    Dataset Curators

    Main curator: Keon Ju Maverick Lee

    Assistance: Jeff Ens, Sara Adkins, Nathan Fradet, Pedro Sarmento, Mathieu Barthet, Phillip Long, Paul Triana

    Research Director: Philippe Pasquier

    Note: The GigaMIDI dataset is designed for continuous growth, with new subsets added and updated over time to ensure its ongoing expansion and relevance.

    Licensing Information

    The dataset is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. This license permits users to share, adapt, and utilize the dataset exclusively for non-commercial purposes, including research and educational applications, provided that proper attribution is given to the original creators. By adhering to the terms of CC BY-NC 4.0, users ensure the dataset's responsible use while fostering its accessibility for academic and non-commercial endeavors.

    Citation/Reference

    You agree to use the GigaMIDI dataset only for non-commercial research or education without infringing copyright laws or causing harm to the creative rights of artists, creators, or musicians.

    Currently, the extended GigaMIDI dataset is under review at the NeurIPS 2025 Dataset Track for Creative AI.

    If you use the GigaMIDI dataset or any part of this project, please cite the following paper: https://transactions.ismir.net/articles/10.5334/tismir.203

    @article{lee2025gigamidi,
     title={The GigaMIDI Dataset with Features for Expressive Music Performance Detection},
     author={Lee, Keon Ju Maverick and Ens, Jeff and Adkins, Sara and Sarmento, Pedro and Barthet, Mathieu and Pasquier, Philippe},
     journal={Transactions of the International Society for Music Information Retrieval (TISMIR)},
     volume={8},
     number={1},
     pages={1--19},
     year={2025}
    }
    
  9. Synth-Long-SFT32K

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Cerebras (2025). Synth-Long-SFT32K [Dataset]. https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K
    Dataset updated
    Jun 1, 2025
    Dataset authored and provided by
    Cerebras (http://cerebras.ai/)
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Information

    This repository contains augmented versions of several datasets:

    • Synthetic-ConvQA
    • NarrativeQA
    • RAG-TGE

    For more information, refer to our blogpost. We used these datasets for long instruction-following training. The maximal sequence length of the examples is 32,768.

    Synthetic-ConvQA with RAFT-style augmentation. Our synthetic long-context data is based on an approach introduced by [Zhang et al., 2024] called Retrieval Augmented Fine-Tuning (RAFT). For each… See the full description on the dataset page: https://huggingface.co/datasets/cerebras/Synth-Long-SFT32K.

  10. PatternNet

    • huggingface.co
    • opendatalab.com
    Updated Nov 18, 2025
    Cite
    Julien BLANCHON (2025). PatternNet [Dataset]. https://huggingface.co/datasets/blanchon/PatternNet
    Dataset updated
    Nov 18, 2025
    Authors
    Julien BLANCHON
    License

    https://choosealicense.com/licenses/unknown/

    Description

    PatternNet

    The PatternNet dataset is a dataset for remote sensing scene classification and image retrieval.

    Paper: https://arxiv.org/abs/1703.06339 Homepage: https://sites.google.com/view/zhouwx/dataset

      Description
    

    PatternNet is a large-scale high-resolution remote sensing dataset collected for remote sensing image retrieval. There are 38 classes and each class has 800 images of size 256×256 pixels. The images in PatternNet are collected from Google Earth… See the full description on the dataset page: https://huggingface.co/datasets/blanchon/PatternNet.

  11. RSICD Image Caption Dataset

    • kaggle.com
    Updated Dec 6, 2023
    Cite
    The Devastator (2023). RSICD Image Caption Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/rsicd-image-caption-dataset
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    RSICD Image Caption Dataset

    By Arto (From Huggingface) [source]

    About this dataset

    The train.csv file contains a list of image filenames, captions, and the actual images used for training the image captioning models. Similarly, the test.csv file includes a separate set of image filenames, captions, and images specifically designated for testing the accuracy and performance of the trained models.

    Furthermore, the valid.csv file contains a unique collection of image filenames with their respective captions and images that serve as an independent validation set to evaluate the models' capabilities accurately.

    Each entry in these CSV files includes a filename string that indicates the name or identifier of an image file stored in another location or directory. Additionally, each entry provides a list (or multiple rows) of strings representing written captions that describe the image with that filename.

    Considering these details about this dataset's structure, it can be immensely valuable to researchers, developers, and enthusiasts working on developing innovative computer vision algorithms such as automatic text generation based on visual content analysis. Whether it's training machine learning models to automatically generate relevant captions based on new unseen images or evaluating existing systems' performance against diverse criteria.

    Stay updated with cutting-edge research trends by leveraging this comprehensive dataset containing not only captions but also corresponding images across different sets specifically designed to cater to varied purposes within computer vision tasks.

    How to use the dataset

    Overview of the Dataset

    The dataset consists of three primary files: train.csv, test.csv, and valid.csv. These files contain information about image filenames and their respective captions. Each file includes multiple captions for each image to support diverse training techniques.

    Understanding the Files

    • train.csv: This file contains filenames (filename column) and their corresponding captions (captions column) for training your image captioning model.
    • test.csv: The test set is included in this file, which contains a similar structure as that of train.csv. The purpose of this file is to evaluate your trained models on unseen data.
    • valid.csv: This validation set provides images with their respective filenames (filename) and captions (captions). It allows you to fine-tune your models based on performance during evaluation.

    Getting Started

    To begin utilizing this dataset effectively, follow these steps:

    • Extract the zip file containing all relevant data files onto your local machine or cloud environment.
    • Familiarize yourself with each CSV file's structure: train.csv, test.csv, and valid.csv. Understand how information like filename(s) (filename) corresponds with its respective caption(s) (captions).
    • Depending on your specific use case or research goals, determine which portion(s) of the dataset you wish to work with (e.g., only train or train+validation).
    • Load the dataset into your preferred programming environment or machine learning framework, ensuring you have the necessary dependencies installed.
    • Preprocess the dataset as needed, such as resizing images to a specific dimension or encoding captions for model training purposes.
    • Split the data into training, validation, and test sets according to your experimental design requirements.
    • Use appropriate algorithms and techniques to train your image captioning models on the provided data.
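    The loading step above could look roughly like this. It is a minimal sketch: the column names filename and captions come from the description, while the sample rows and the one-caption-per-row layout are assumptions, since the description allows either a list or multiple rows per image.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt standing in for train.csv. The filenames and
# captions are invented, and one caption per row is an assumed layout.
SAMPLE_TRAIN_CSV = '''filename,captions
airport_1.jpg,many planes are parked next to a long building in an airport .
airport_1.jpg,several planes are parked near a terminal .
beach_7.jpg,a beach with white sand and blue water .
'''


def group_captions(text):
    """Return a dict mapping each filename to the list of its captions."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["filename"]].append(row["captions"].strip())
    return dict(grouped)


captions_by_image = group_captions(SAMPLE_TRAIN_CSV)
```

    For the real files, `io.StringIO(text)` would be replaced by an open file handle for train.csv, test.csv, or valid.csv.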

    Enhancing Model Performance

    To optimize model performance using this dataset, consider these tips:

    • Explore different architectures and pre-trained models specifically designed for image captioning tasks.
    • Experiment with various natural language

    Research Ideas

    • Image Captioning: This dataset can be used to train and evaluate image captioning models. The captions can be used as target labels for training, and the images can be paired with the captions to generate descriptive captions for test images.
    • Image Retrieval: The dataset can be used for image retrieval tasks where given a query caption, the model needs to retrieve the images that best match the description. This can be useful in applications such as content-based image search.
    • Natural Language Processing: The dataset can also be used for natural language processing tasks such as text generation or machine translation. The captions in this dataset are descriptive ...
  12. modup

    • huggingface.co
    Updated May 16, 2025
    Cite
    Haocheng Ju (2025). modup [Dataset]. https://huggingface.co/datasets/hcju/modup
    Dataset updated
    May 16, 2025
    Authors
    Haocheng Ju
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MathOverflow Duplicate Question Retrieval

    The task of Duplicate Question Retrieval involves retrieving questions that are duplicates of a given input question. We construct our dataset using the Mathematics Stack Exchange Data Dump (2024-09-30) https://archive.org/download/stackexchange_20240930/stackexchange_20240930/mathoverflow.net.7z

  13. simple_english_wikipedia

    • huggingface.co
    Updated Feb 9, 2024
    Cite
    Bowen Li (2024). simple_english_wikipedia [Dataset]. https://huggingface.co/datasets/aisuko/simple_english_wikipedia
    Dataset updated
    Feb 9, 2024
    Authors
    Bowen Li
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    For research use only. The original data is from http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz. We use the nq_distilbert-base-v1 model to encode all the data into PyTorch tensors, and normalize the embeddings using sentence_transformers.util.normalize_embeddings.
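    To make the normalization step concrete, here is a plain-Python sketch of L2 normalization, which scales each embedding to unit length. It uses nested lists and invented values instead of PyTorch tensors so that it stays self-contained, and the helper name normalize_rows is hypothetical; it is not the library's implementation.

```python
import math


def normalize_rows(embeddings):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    normalized = []
    for row in embeddings:
        norm = math.sqrt(sum(x * x for x in row))
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    return normalized


# Toy two-dimensional embeddings invented for illustration.
embeddings = [[3.0, 4.0], [1.0, 0.0]]
unit = normalize_rows(embeddings)

# After normalization, a plain dot product equals the cosine similarity.
dot = sum(a * b for a, b in zip(unit[0], unit[1]))
```

    Normalizing once up front means a later search can score candidates with dot products alone.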

      How to use
    

    See notebook Wikipedia Q&A Retrieval-Semantic Search

      Installing the package
    

    !pip install sentence-transformers==2.3.1

      The converting process
    

    the whole process takes… See the full description on the dataset page: https://huggingface.co/datasets/aisuko/simple_english_wikipedia.

  14. english_quotes

    • huggingface.co
    • opendatalab.com
    Updated Dec 19, 2021
    Cite
    Abir ELTAIEF (2021). english_quotes [Dataset]. http://doi.org/10.57967/hf/1053
    Dataset updated
    Dec 19, 2021
    Authors
    Abir ELTAIEF
    Description

    Dataset Card for English quotes

      I-Dataset Summary
    

    english_quotes is a dataset of all the quotes retrieved from goodreads quotes. This dataset can be used for multi-label text classification and text generation. The content of each quote is in English and concerns the domain of datasets for NLP and beyond.

      II-Supported Tasks and Leaderboards
    

    Multi-label text classification : The dataset can be used to train a model for text-classification, which consists of… See the full description on the dataset page: https://huggingface.co/datasets/Abirate/english_quotes.

  15. Phone_SpecsDataset_25K

    • huggingface.co
    Cite
    Mehriban, Phone_SpecsDataset_25K [Dataset]. https://huggingface.co/datasets/Nadirova/Phone_SpecsDataset_25K
    Authors
    Mehriban
    Description

    PhoneDB Device Specifications Dataset

    Overview

    This dataset contains structured information about smartphones and mobile devices, parsed from PhoneDB.net. Each record provides a detailed set of specifications for a single device, including hardware, software, dimensions, cameras, sensors, connectivity, and more. The dataset is provided in JSON format for easy use in data analysis, machine learning, and information retrieval tasks.

    Each entry contains:

    title → Full device name…

    See the full description on the dataset page: https://huggingface.co/datasets/Nadirova/Phone_SpecsDataset_25K.

  16. ActivityNet_Captions

    • huggingface.co
    Updated Mar 4, 2025
    Cite
    Fanheng Kong (2025). ActivityNet_Captions [Dataset]. https://huggingface.co/datasets/friedrichor/ActivityNet_Captions
    Dataset updated
    Mar 4, 2025
    Authors
    Fanheng Kong
    Description

    About

    ActivityNet Captions contains 20K long-form videos (180s as average length) from YouTube and 100K captions. Most of the videos contain over 3 annotated events. We follow the existing works to concatenate multiple short temporal descriptions into long sentences and evaluate ‘paragraph-to-video’ retrieval on this benchmark. We adopt the official split:

    Train: 10,009 videos, 10,009 captions (concatenated from 37,421 short captions)
    Test (Val1): 4,917 videos, 4,917 captions… See the full description on the dataset page: https://huggingface.co/datasets/friedrichor/ActivityNet_Captions.
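    The concatenation of short event captions into one paragraph per video could be sketched as below. The tuple layout (video ID, start time in seconds, caption), the sample events, and the choice to order by start time are all assumptions for illustration; the listing does not show the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical event annotations: (video_id, start_seconds, caption).
events = [
    ("v_001", 12.5, "A man picks up a guitar."),
    ("v_001", 0.0, "A man walks onto a stage."),
    ("v_002", 3.0, "A dog runs across a field."),
    ("v_001", 40.0, "He starts to play."),
]


def build_paragraphs(events):
    """Join each video's captions, ordered by start time, into one string."""
    by_video = defaultdict(list)
    for video_id, start, caption in events:
        by_video[video_id].append((start, caption))
    return {
        video_id: " ".join(caption for _, caption in sorted(items))
        for video_id, items in by_video.items()
    }


paragraphs = build_paragraphs(events)
```

    Each resulting paragraph then serves as one query in 'paragraph-to-video' retrieval, giving one caption per video as in the split counts above.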

  17. links-between-paper-and-code

    • huggingface.co
    Updated Aug 13, 2025
    Cite
    paperswithcode Archive (2025). links-between-paper-and-code [Dataset]. https://huggingface.co/datasets/pwc-archive/links-between-paper-and-code
    Dataset updated
    Aug 13, 2025
    Dataset authored and provided by
    paperswithcode Archive
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    Caution: This dataset will not be updated. It corresponds to the last available public snapshot of the data, retrieved on July 28th, 2025.

  18. active_matter

    • huggingface.co
    Updated Oct 25, 2024
    Cite
    Polymathic AI (2024). active_matter [Dataset]. https://huggingface.co/datasets/polymathic-ai/active_matter
    Dataset updated
    Oct 25, 2024
    Dataset authored and provided by
    Polymathic AI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How To Load from HuggingFace Hub

    Be sure to have the_well installed (pip install the_well). Use the WellDataModule to retrieve data as follows:

    from the_well.benchmark.data import WellDataModule

    # The following line may take a couple of minutes to instantiate the datamodule
    datamodule = WellDataModule(
        "hf://datasets/polymathic-ai/",
        "active_matter_cloud_optimized",
    )
    train_dataloader = datamodule.train_dataloader()

    for batch in train_dataloader:
        # Process training batch…

    See the full description on the dataset page: https://huggingface.co/datasets/polymathic-ai/active_matter.
