100+ datasets found
  1. MAP-CC

    • huggingface.co
    Updated Apr 5, 2024
    Cite
    Multimodal Art Projection (2024). MAP-CC [Dataset]. https://huggingface.co/datasets/m-a-p/MAP-CC
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    MAP-CC

    🌐 Homepage | 🤗 MAP-CC | 🤗 CHC-Bench | 🤗 CT-LLM | 📖 arXiv | GitHub

    An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.

      Disclaimer
    

    This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.

  2. II-Bench

    • huggingface.co
    Updated Jun 21, 2024
    Cite
    Multimodal Art Projection (2024). II-Bench [Dataset]. https://huggingface.co/datasets/m-a-p/II-Bench
    Dataset updated
    Jun 21, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    II-Bench

    🌐 Homepage | 🤗 Paper | 📖 arXiv | 🤗 Dataset | GitHub

      Introduction
    

    II-Bench comprises 1,222 images, each accompanied by 1 to 3 multiple-choice questions, totaling 1,434 questions. II-Bench encompasses images from six distinct domains: Life, Art, Society, Psychology, Environment and Others. It also features a diverse array of image types, including Illustrations, Memes, Posters, Multi-panel Comics, Single-panel Comics, Logos and Paintings. The detailed… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/II-Bench.

  3. SuperGPQA

    • huggingface.co
    Updated May 16, 2013
    Cite
    Multimodal Art Projection (2013). SuperGPQA [Dataset]. https://huggingface.co/datasets/m-a-p/SuperGPQA
    Dataset updated
    May 16, 2013
    Dataset authored and provided by
    Multimodal Art Projection
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This repository contains the data presented in SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines.

      Tutorials for submitting to the official leaderboard
    

    coming soon

      📜 License
    

    SuperGPQA is a composite dataset that includes both original content and portions of data derived from other sources. The dataset is made available under the Open Data Commons Attribution License (ODC-BY), which asserts no copyright over the underlying content. This means that while the… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SuperGPQA.

  4. CodeFeedback-Filtered-Instruction

    • huggingface.co
    Updated Mar 2, 2024
    Cite
    Multimodal Art Projection (2024). CodeFeedback-Filtered-Instruction [Dataset]. https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction
    Dataset updated
    Mar 2, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

    [🏠Homepage] | [🛠️Code]

      OpenCodeInterpreter
    

    OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities. For further information and… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction.

  5. neo_sft_phase2

    • huggingface.co
    Updated Jun 12, 2024
    Cite
    Multimodal Art Projection (2024). neo_sft_phase2 [Dataset]. https://huggingface.co/datasets/m-a-p/neo_sft_phase2
    Dataset updated
    Jun 12, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    m-a-p/neo_sft_phase2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. map-test

    • huggingface.co
    Updated Mar 1, 2023
    Cite
    Hugging Face (2023). map-test [Dataset]. https://huggingface.co/datasets/huggingface/map-test
    Dataset updated
    Mar 1, 2023
    Dataset authored and provided by
    Hugging Face (https://huggingface.co/)
    Description

    huggingface/map-test dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. FineFineWeb-sample

    • huggingface.co
    Updated Mar 31, 2025
    Cite
    Multimodal Art Projection (2025). FineFineWeb-sample [Dataset]. https://huggingface.co/datasets/m-a-p/FineFineWeb-sample
    Dataset updated
    Mar 31, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus

    arXiv: Coming Soon Project Page: Coming Soon Blog: Coming Soon

      Data Statistics
    

    Domain (#tokens/#samples)  Iteration 1 Tokens  Iteration 2 Tokens  Iteration 3 Tokens  Total Tokens  Iteration 1 Count  Iteration 2 Count  Iteration 3 Count  Total Count
    aerospace                  5.77B               261.63M             309.33M             6.34B         9100000            688505             611034             10399539
    agronomy                   13.08B              947.41M             229.04M             14.26B        15752828           2711790            649404             19114022

    artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb-sample.
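
    The two complete rows of the statistics above can be sanity-checked by summing the three iterations. The following is a minimal sketch, assuming that the M and B suffixes mean million and billion and that token totals are rounded to two decimals in billions. The names parse_tokens, check_row, and rows are hypothetical helpers for illustration, not part of the dataset's tooling.

```python
# Hypothetical helper: convert strings such as "261.63M" or "5.77B" to a float.
# Assumption: K/M/B mean thousand/million/billion.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_tokens(text: str) -> float:
    return float(text[:-1]) * SUFFIXES[text[-1]]


# Values copied from the two complete rows of the statistics listing:
# (per-iteration tokens, total tokens, per-iteration counts, total count).
rows = {
    "aerospace": (["5.77B", "261.63M", "309.33M"], "6.34B",
                  [9100000, 688505, 611034], 10399539),
    "agronomy": (["13.08B", "947.41M", "229.04M"], "14.26B",
                 [15752828, 2711790, 649404], 19114022),
}


def check_row(tokens, total_tokens, counts, total_count) -> bool:
    token_sum = sum(parse_tokens(t) for t in tokens)
    # Compare token totals after rounding to two decimals in billions.
    tokens_ok = round(token_sum / 1e9, 2) == round(parse_tokens(total_tokens) / 1e9, 2)
    # Sample counts are exact integers, so they must match exactly.
    counts_ok = sum(counts) == total_count
    return tokens_ok and counts_ok


for domain, row in rows.items():
    print(domain, check_row(*row))
```

    The truncated "artistic" row is left out because its values are not shown in the listing.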

  8. COIG-CQIA

    • huggingface.co
    • opendatalab.com
    Updated Feb 2, 2024
    Cite
    Multimodal Art Projection (2024). COIG-CQIA [Dataset]. https://huggingface.co/datasets/m-a-p/COIG-CQIA
    Dataset updated
    Feb 2, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    Description

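    COIG-CQIA: Quality Is All You Need for Chinese Instruction Fine-tuning

      Dataset Details

      Dataset Description

    Welcome to COIG-CQIA. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need. It is an open-source, high-quality instruction fine-tuning dataset that aims to give the Chinese NLP community high-quality instruction fine-tuning data consistent with human interaction behavior. COIG-CQIA uses Q&A content and articles collected from the Chinese internet as raw data, which were then deeply cleaned, restructured, and manually reviewed. The project is inspired by research such as LIMA: Less Is More for Alignment, which suggests that a small amount of high-quality data is enough for a large language model to learn human interaction behavior. In building the data we therefore paid close attention to its source, quality, and diversity. For details, see the data introduction and our forthcoming paper. Welcome to the COIG-CQIA… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-CQIA.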

  9. MusicPile

    • huggingface.co
    Updated Mar 7, 2024
    Cite
    Multimodal Art Projection (2024). MusicPile [Dataset]. https://huggingface.co/datasets/m-a-p/MusicPile
    Dataset updated
    Mar 7, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    https://choosealicense.com/licenses/cc/

    Description

    🌐 DemoPage | 🤗SFT Dataset | 🤗 Benchmark | 📖 arXiv | 💻 Code | 🤖 Chat Model | 🤖 Base Model

      Dataset Card for MusicPile
    

    MusicPile is the first pretraining corpus for developing musical abilities in large language models. It has 5.17M samples and approximately 4.16B tokens, including web-crawled corpora, encyclopedias, music books, YouTube music captions, musical pieces in ABC notation, math content, and code. You can easily load it: from datasets import load_dataset ds =… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MusicPile.

  10. emoji-map

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Omar Kamali (2024). emoji-map [Dataset]. https://huggingface.co/datasets/omarkamali/emoji-map
    Dataset updated
    Sep 12, 2024
    Authors
    Omar Kamali
    Description

    📊 Dataset Overview

    The emoji-map dataset, created by omarkamali, contains text data in parquet format. It consists of 10K-100K entries, specifically 5.03k rows. The dataset is available in the train split.

      📁 Data Structure
    

    The dataset includes two main columns: emoji and unicode_description. The emoji column contains various emoji characters, while the unicode_description column provides a textual description of each emoji.

      🔍 Sample Data
    

    Examples from the… See the full description on the dataset page: https://huggingface.co/datasets/omarkamali/emoji-map.

  11. SimpleVQA

    • huggingface.co
    Updated Apr 7, 2025
    Cite
    Multimodal Art Projection (2025). SimpleVQA [Dataset]. https://huggingface.co/datasets/m-a-p/SimpleVQA
    Dataset updated
    Apr 7, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SimpleVQA

      SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
    

    Dataset: https://huggingface.co/datasets/m-a-p/SimpleVQA

      Abstract
    

    The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SimpleVQA.

  12. tcd

    • huggingface.co
    Updated Jun 15, 2020
    Cite
    Restor (2020). tcd [Dataset]. https://huggingface.co/datasets/restor/tcd
    Dataset updated
    Jun 15, 2020
    Dataset authored and provided by
    Restor
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset Card for OAM-TCD: A globally diverse dataset of high-resolution tree cover maps

    Annotation example in OAM-TCD (ID 1445), RGB image licensed CC BY-4.0, attribution contributors of OIN. Left: RGB aerial image, Middle: annotations shown, distinguished by instance ID, Right: annotations identified by class (blue = tree, orange = canopy)

      Dataset Details
    

    OAM-TCD is a dataset of high-resolution (10 cm/px) tree cover maps with instance-level masks for 280k trees and… See the full description on the dataset page: https://huggingface.co/datasets/restor/tcd.

  13. OpenSatMap

    • huggingface.co
    Updated Oct 31, 2024
    Cite
    Hongbo Zhao (2024). OpenSatMap [Dataset]. https://huggingface.co/datasets/z-hb/OpenSatMap
    Dataset updated
    Oct 31, 2024
    Authors
    Hongbo Zhao
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    OpenSatMap Dataset Card

      Description
    

    The dataset contains 3,787 high-resolution satellite images with fine-grained annotations, covering diverse geographic locations and popular driving datasets. It can be used for large-scale map construction and downstream tasks like autonomous driving. The images are collected from Google Maps at level 19 resolution (0.3m/pixel) and level 20 resolution (0.15m/pixel); we denote them as OpenSatMap19 and OpenSatMap20, respectively.… See the full description on the dataset page: https://huggingface.co/datasets/z-hb/OpenSatMap.

  14. OmniInstruct

    • huggingface.co
    Updated Oct 18, 2024
    Cite
    Multimodal Art Projection (2024). OmniInstruct [Dataset]. https://huggingface.co/datasets/m-a-p/OmniInstruct
    Dataset updated
    Oct 18, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    m-a-p/OmniInstruct dataset hosted on Hugging Face and contributed by the HF Datasets community

  15. mia_dataset

    • huggingface.co
    Updated Jun 19, 2024
    Cite
    Cherie Ho (2024). mia_dataset [Dataset]. https://huggingface.co/datasets/cherieho/mia_dataset
    Dataset updated
    Jun 19, 2024
    Authors
    Cherie Ho
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for Map It Anywhere (MIA)

    The Map It Anywhere (MIA) dataset contains map-prediction-ready data curated from public datasets.

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    The Map It Anywhere (MIA) dataset contains 1.2 million high quality first-person-view (FPV) and bird's eye view (BEV) map pairs covering 470 square km, thereby facilitating future map prediction research on generalizability and robustness. The dataset is curated using the MIA data engine… See the full description on the dataset page: https://huggingface.co/datasets/cherieho/mia_dataset.

  16. csgo-maps

    • huggingface.co
    Updated Jul 14, 2023
    Cite
    Umit Canbolat (2023). csgo-maps [Dataset]. https://huggingface.co/datasets/HOXSEC/csgo-maps
    Dataset updated
    Jul 14, 2023
    Authors
    Umit Canbolat
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Counter Strike Map Dataset

    This dataset consists of Counter Strike map images along with their corresponding labels and x-y coordinates. The dataset is suitable for image classification tasks and includes the necessary information for each image.

      Dataset Details
    

    Total Images: [1424]
    Classes: [5]
    Image Size: [1920x1080]
    Format: [png]

      Files
    

    The dataset includes the following files:

    maps/train/: This folder contains the Counter Strike map images. The images are… See the full description on the dataset page: https://huggingface.co/datasets/HOXSEC/csgo-maps.

  17. gimp-predator-map-dataset

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    David Fu (2024). gimp-predator-map-dataset [Dataset]. https://huggingface.co/datasets/debisoft/gimp-predator-map-dataset
    Dataset updated
    Sep 12, 2024
    Authors
    David Fu
    Description

    debisoft/gimp-predator-map-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community

  18. MTT

    • huggingface.co
    Updated Jul 1, 2025
    Cite
    Multimodal Art Projection (2025). MTT [Dataset]. https://huggingface.co/datasets/m-a-p/MTT
    Dataset updated
    Jul 1, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    m-a-p/MTT dataset hosted on Hugging Face and contributed by the HF Datasets community

  19. Data from: OS-Map

    • huggingface.co
    Updated May 28, 2025
    Cite
    OS-Map (2025). OS-Map [Dataset]. https://huggingface.co/datasets/os-map/OS-Map
    Dataset updated
    May 28, 2025
    Authors
    OS-Map
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    os-map/OS-Map dataset hosted on Hugging Face and contributed by the HF Datasets community

  20. OpenO1_SFT_ultra_BoN_positvie_reward_v3_N-sample

    • huggingface.co
    Updated Feb 28, 2025
    Cite
    Multimodal Art Projection (2025). OpenO1_SFT_ultra_BoN_positvie_reward_v3_N-sample [Dataset]. https://huggingface.co/datasets/m-a-p/OpenO1_SFT_ultra_BoN_positvie_reward_v3_N-sample
    Dataset updated
    Feb 28, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    m-a-p/OpenO1_SFT_ultra_BoN_positvie_reward_v3_N-sample dataset hosted on Hugging Face and contributed by the HF Datasets community
