100+ datasets found
  1. h

    neo_sft_phase2

    • huggingface.co
    Updated Jun 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). neo_sft_phase2 [Dataset]. https://huggingface.co/datasets/m-a-p/neo_sft_phase2
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 12, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    m-a-p/neo_sft_phase2 dataset hosted on Hugging Face and contributed by the HF Datasets community

  2. h

    FineFineWeb

    • huggingface.co
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection, FineFineWeb [Dataset]. https://huggingface.co/datasets/m-a-p/FineFineWeb
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus

    arXiv: Coming Soon Project Page: Coming Soon Blog: Coming Soon

      Data Statistics
    

    Domain (#tokens/#samples) Iteration 1 Tokens Iteration 2 Tokens Iteration 3 Tokens Total Tokens Iteration 1 Count Iteration 2 Count Iteration 3 Count Total Count

    aerospace 5.77B 261.63M 309.33M 6.34B 9100000 688505 611034 10399539

    agronomy 13.08B 947.41M 229.04M 14.26B 15752828 2711790 649404 19114022

    artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.

  3. map-test

    • huggingface.co
    Updated Mar 1, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Hugging Face (2023). map-test [Dataset]. https://huggingface.co/datasets/huggingface/map-test
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 1, 2023
    Dataset authored and provided by
    Hugging Facehttps://huggingface.co/
    Description

    huggingface/map-test dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. h

    II-Bench

    • huggingface.co
    Updated Jun 21, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). II-Bench [Dataset]. https://huggingface.co/datasets/m-a-p/II-Bench
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jun 21, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    II-Bench

    🌐 Homepage | 🤗 Paper | 📖 arXiv | 🤗 Dataset | GitHub

      Introduction
    

    II-Bench comprises 1,222 images, each accompanied by 1 to 3 multiple-choice questions, totaling 1,434 questions. II-Bench encompasses images from six distinct domains: Life, Art, Society, Psychology, Environment and Others. It also features a diverse array of image types, including Illustrations, Memes, Posters, Multi-panel Comics, Single-panel Comics, Logos and Paintings. The detailed… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/II-Bench.

  5. h

    map-anything

    • huggingface.co
    Updated Sep 16, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    AI at Meta (2025). map-anything [Dataset]. https://huggingface.co/datasets/facebook/map-anything
    Explore at:
    Dataset updated
    Sep 16, 2025
    Dataset authored and provided by
    AI at Meta
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    MapAnything Dataset

      Dataset Description
    

    This dataset contains pre-computed metadata and covisibility matrices for supporting the MapAnything codebase. This metadata enables easy reproducible training and benchmarking for feed-forward 3D reconstruction tasks. Please see our Data Processing README for more details.

      Citation
    

    If you use this dataset in your research, please cite our paper: @inproceedings{keetha2025mapanything, title={{MapAnything}: Universal… See the full description on the dataset page: https://huggingface.co/datasets/facebook/map-anything.

  6. h

    OO1-Chat-747K

    • huggingface.co
    Updated Nov 11, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2025). OO1-Chat-747K [Dataset]. https://huggingface.co/datasets/m-a-p/OO1-Chat-747K
    Explore at:
    Dataset updated
    Nov 11, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    m-a-p/OO1-Chat-747K dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. h

    CodeFeedback-Filtered-Instruction

    • huggingface.co
    Updated Mar 3, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). CodeFeedback-Filtered-Instruction [Dataset]. https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 3, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

    [🏠Homepage] | [🛠️Code]

      OpenCodeInterpreter
    

    OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities. For further information and… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction.

  8. h

    MAP-CC

    • huggingface.co
    Updated Apr 5, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). MAP-CC [Dataset]. https://huggingface.co/datasets/m-a-p/MAP-CC
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Apr 5, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0)https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    MAP-CC

    🌐 Homepage | 🤗 MAP-CC | 🤗 CHC-Bench | 🤗 CT-LLM | 📖 arXiv | GitHub An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.

      Disclaimer
    

    This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.

  9. h

    map-image-captions

    • huggingface.co
    Updated Jan 22, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Alexander Österberg (2025). map-image-captions [Dataset]. https://huggingface.co/datasets/ItzCornflakez/map-image-captions
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 22, 2025
    Authors
    Alexander Österberg
    License

    MIT Licensehttps://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ItzCornflakez/map-image-captions dataset hosted on Hugging Face and contributed by the HF Datasets community

  10. h

    FineLeanCorpus

    • huggingface.co
    Updated Jul 26, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2025). FineLeanCorpus [Dataset]. https://huggingface.co/datasets/m-a-p/FineLeanCorpus
    Explore at:
    Dataset updated
    Jul 26, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    FineLeanCorpus: A Large-Scale, High-Quality Lean 4 Formalization Dataset

      🔍 Overview
    

    FineLeanCorpus is the largest high-quality dataset of natural language mathematical statements paired with their formalizations in Lean 4, comprising 509,356 entries. This dataset is designed to advance research in mathematical autoformalization—the translation of natural language mathematics into formal, machine-verifiable code. The corpus is distinguished by its:

    Scale: 509K entries… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineLeanCorpus.

  11. data

    • huggingface.co
    Updated Jul 27, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kaggle MAP (2025). data [Dataset]. https://huggingface.co/datasets/kaggle-map/data
    Explore at:
    Dataset updated
    Jul 27, 2025
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Kaggle MAP
    Description

    kaggle-map/data dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. h

    emoji-map

    • huggingface.co
    Updated Sep 12, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Omar Kamali (2024). emoji-map [Dataset]. https://huggingface.co/datasets/omarkamali/emoji-map
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 12, 2024
    Authors
    Omar Kamali
    Description

    📊 Dataset Overview

    The emoji-map dataset, created by omarkamali, contains text data in parquet format. It consists of 10K-100K entries, specifically 5.03k rows. The dataset is available in the train split.

      📁 Data Structure
    

    The dataset includes two main columns: emoji and unicode_description. The emoji column contains various emoji characters, while the unicode_description column provides a textual description of each emoji.

      🔍 Sample Data
    

    Examples from the… See the full description on the dataset page: https://huggingface.co/datasets/omarkamali/emoji-map.

  13. h

    Chords1217

    • huggingface.co
    Updated Aug 4, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2025). Chords1217 [Dataset]. https://huggingface.co/datasets/m-a-p/Chords1217
    Explore at:
    Dataset updated
    Aug 4, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    m-a-p/Chords1217 dataset hosted on Hugging Face and contributed by the HF Datasets community

  14. h

    COIG-CQIA

    • huggingface.co
    • opendatalab.com
    Updated Feb 2, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). COIG-CQIA [Dataset]. https://huggingface.co/datasets/m-a-p/COIG-CQIA
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 2, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    COIG-CQIA:Quality is All you need for Chinese Instruction Fine-tuning

      Dataset Details
    
    
    
    
    
      Dataset Description
    

    欢迎来到COIG-CQIA,COIG-CQIA全称为Chinese Open Instruction Generalist - Quality is All You Need, 是一个开源的高质量指令微调数据集,旨在为中文NLP社区提供高质量且符合人类交互行为的指令微调数据。COIG-CQIA以中文互联网获取到的问答及文章作为原始数据,经过深度清洗、重构及人工审核构建而成。本项目受LIMA: Less Is More for Alignment等研究启发,使用少量高质量的数据即可让大语言模型学习到人类交互行为,因此在数据构建中我们十分注重数据的来源、质量与多样性,数据集详情请见数据介绍以及我们接下来的论文。 Welcome to the COIG-CQIA… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-CQIA.

  15. h

    CodeCriticBench

    • huggingface.co
    Updated Feb 25, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2025). CodeCriticBench [Dataset]. https://huggingface.co/datasets/m-a-p/CodeCriticBench
    Explore at:
    Dataset updated
    Feb 25, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    CodeCriticBench: A Holistic Benchmark for Code Critique in LLMs

      💥 Introduction
    

    CodeCriticBench is a comprehensive benchmark designed to systematically evaluate the critique capabilities of large language models (LLMs) in both code generation and code-question answering tasks. Beyond focusing on code generation, this benchmark extends to code-related questions, offering multidimensional and fine-grained evaluation criteria to rigorously assess LLMs' reasoning and code… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeCriticBench.

  16. h

    COIG-Writer

    • huggingface.co
    Updated Oct 18, 2025
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2025). COIG-Writer [Dataset]. https://huggingface.co/datasets/m-a-p/COIG-Writer
    Explore at:
    Dataset updated
    Oct 18, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This repository contains the dataset and supplementary materials for the paper COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes.

      🔔 Introduction
    

    COIG-Writer is a large-scale Chinese creative writing dataset that connects final literary works with their underlying reasoning processes.Each sample includes a reverse-engineered writing prompt, a step-by-step reasoning trace, and the final article.This design allows researchers to explore… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-Writer.

  17. h

    Security-TTP-Mapping

    • huggingface.co
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tu Nguyen, Security-TTP-Mapping [Dataset]. http://doi.org/10.57967/hf/1811
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Authors
    Tu Nguyen
    License

    https://choosealicense.com/licenses/cc/https://choosealicense.com/licenses/cc/

    Description

    The Security Attack Pattern (TTP) Recognition or Mapping Task

    We share in this repo the MITRE ATT&CK mapping datasets, with training, validation and test splits. The datasets can be considered as an emerging and challenging multilabel classification NLP task, with over 600 hierarchical classes. NOTE: due to their security nature, these datasets contain textual information about malware and other security aspects.

      Datasets
    
    
    
    
    
    
      TRAM
    

    This dataset belongs to CTID… See the full description on the dataset page: https://huggingface.co/datasets/tumeteor/Security-TTP-Mapping.

  18. h

    OmniBench

    • huggingface.co
    Updated Sep 24, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). OmniBench [Dataset]. https://huggingface.co/datasets/m-a-p/OmniBench
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Sep 24, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    Description

    OmniBench

    🌐 Homepage | 🏆 Leaderboard | 📖 Arxiv Paper | 🤗 Paper | 🤗 OmniBench Dataset | | 🤗 OmniInstruct_V1 Dataset | 🦜 Tweets The project introduces OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs).

      Mini Leaderboard
    

    This table shows the omni-language models in… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/OmniBench.

  19. h

    SimpleVQA

    • huggingface.co
    Updated Apr 7, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2025). SimpleVQA [Dataset]. https://huggingface.co/datasets/m-a-p/SimpleVQA
    Explore at:
    Dataset updated
    Apr 7, 2025
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    SimpleVQA

      SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
    

    Dataset: https://huggingface.co/datasets/m-a-p/SimpleVQA

      Abstract
    

    The increasing application of multi-modal large language models (MLLMs) across various sectors have spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SimpleVQA.

  20. h

    PIN-14M

    • huggingface.co
    Updated Dec 14, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Multimodal Art Projection (2024). PIN-14M [Dataset]. https://huggingface.co/datasets/m-a-p/PIN-14M
    Explore at:
    Dataset updated
    Dec 14, 2024
    Dataset authored and provided by
    Multimodal Art Projection
    License

    Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    PIN-14M

    A mini version of "PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents" Paper: https://arxiv.org/abs/2406.13923 This dataset contains 14M samples in PIN format, with around 18.79 TB storage. 🚀 News [ 2025.09.04 ] !NEW! 🔥 We have completed the final version of the PIN-14M dataset and conducted some simple statistics on it. [ 2024.12.12 ] !NEW! 🔥 We have updated the quality signals for all subsets, with the dataset now containing 7.33B tokens… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/PIN-14M.

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Multimodal Art Projection (2024). neo_sft_phase2 [Dataset]. https://huggingface.co/datasets/m-a-p/neo_sft_phase2

neo_sft_phase2

m-a-p/neo_sft_phase2

Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 12, 2024
Dataset authored and provided by
Multimodal Art Projection
License

Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically

Description

m-a-p/neo_sft_phase2 dataset hosted on Hugging Face and contributed by the HF Datasets community

Search
Clear search
Close search
Google apps
Main menu