100+ datasets found

neo_sft_phase2
huggingface.co
Updated Jun 12, 2024
Cite
Multimodal Art Projection (2024). neo_sft_phase2 [Dataset]. https://huggingface.co/datasets/m-a-p/neo_sft_phase2
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 12, 2024
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
m-a-p/neo_sft_phase2 dataset hosted on Hugging Face and contributed by the HF Datasets community
FineFineWeb
huggingface.co
Cite
Multimodal Art Projection, FineFineWeb [Dataset]. https://huggingface.co/datasets/m-a-p/FineFineWeb
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus

arXiv: Coming Soon | Project Page: Coming Soon | Blog: Coming Soon

Data Statistics

| Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count |
|---|---|---|---|---|---|---|---|---|
| aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539 |
| agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022 |

artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.
map-anything
huggingface.co
Updated Sep 16, 2025
Cite
AI at Meta (2025). map-anything [Dataset]. https://huggingface.co/datasets/facebook/map-anything
Dataset updated
Sep 16, 2025
Dataset authored and provided by
AI at Meta
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
MapAnything Dataset

Dataset Description

This dataset contains pre-computed metadata and covisibility matrices for supporting the MapAnything codebase. This metadata enables easy reproducible training and benchmarking for feed-forward 3D reconstruction tasks. Please see our Data Processing README for more details.

Citation

If you use this dataset in your research, please cite our paper: @inproceedings{keetha2025mapanything, title={{MapAnything}: Universal… See the full description on the dataset page: https://huggingface.co/datasets/facebook/map-anything.
map-test
huggingface.co
Updated Mar 1, 2023
Cite
Hugging Face (2023). map-test [Dataset]. https://huggingface.co/datasets/huggingface/map-test
Dataset updated
Mar 1, 2023
Dataset authored and provided by
Hugging Face: https://huggingface.co/
Description
huggingface/map-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Code-Feedback
huggingface.co
Updated Mar 2, 2024
Cite
Multimodal Art Projection (2024). Code-Feedback [Dataset]. https://huggingface.co/datasets/m-a-p/Code-Feedback
Dataset updated
Mar 2, 2024
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

[🏠Homepage] | [🛠️Code]

Introduction

OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities. For further information and related… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/Code-Feedback.
OO1-Chat-747K
huggingface.co
Updated Nov 11, 2025
Cite
Multimodal Art Projection (2025). OO1-Chat-747K [Dataset]. https://huggingface.co/datasets/m-a-p/OO1-Chat-747K
Dataset updated
Nov 11, 2025
Dataset authored and provided by
Multimodal Art Projection
Description
m-a-p/OO1-Chat-747K dataset hosted on Hugging Face and contributed by the HF Datasets community
II-Bench
huggingface.co
Updated Jun 21, 2024
Cite
Multimodal Art Projection (2024). II-Bench [Dataset]. https://huggingface.co/datasets/m-a-p/II-Bench
Dataset updated
Jun 21, 2024
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
II-Bench

🌐 Homepage | 🤗 Paper | 📖 arXiv | 🤗 Dataset | GitHub

Introduction

II-Bench comprises 1,222 images, each accompanied by 1 to 3 multiple-choice questions, totaling 1,434 questions. II-Bench encompasses images from six distinct domains: Life, Art, Society, Psychology, Environment and Others. It also features a diverse array of image types, including Illustrations, Memes, Posters, Multi-panel Comics, Single-panel Comics, Logos and Paintings. The detailed… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/II-Bench.
MAP-CC
huggingface.co
Updated Apr 5, 2024
Cite
Multimodal Art Projection (2024). MAP-CC [Dataset]. https://huggingface.co/datasets/m-a-p/MAP-CC
Dataset updated
Apr 5, 2024
Dataset authored and provided by
Multimodal Art Projection
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
MAP-CC

🌐 Homepage | 🤗 MAP-CC | 🤗 CHC-Bench | 🤗 CT-LLM | 📖 arXiv | GitHub

An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.

Disclaimer

This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.
FineLeanCorpus
huggingface.co
Updated Jul 26, 2025
Cite
Multimodal Art Projection (2025). FineLeanCorpus [Dataset]. https://huggingface.co/datasets/m-a-p/FineLeanCorpus
Dataset updated
Jul 26, 2025
Dataset authored and provided by
Multimodal Art Projection
Description
FineLeanCorpus: A Large-Scale, High-Quality Lean 4 Formalization Dataset

🔍 Overview

FineLeanCorpus is the largest high-quality dataset of natural language mathematical statements paired with their formalizations in Lean 4, comprising 509,356 entries. This dataset is designed to advance research in mathematical autoformalization—the translation of natural language mathematics into formal, machine-verifiable code. The corpus is distinguished by its:

Scale: 509K entries… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineLeanCorpus.
emoji-map
huggingface.co
Updated Sep 12, 2024
Cite
Omar Kamali (2024). emoji-map [Dataset]. https://huggingface.co/datasets/omarkamali/emoji-map
Dataset updated
Sep 12, 2024
Authors
Omar Kamali
Description
📊 Dataset Overview

The emoji-map dataset, created by omarkamali, contains text data in parquet format. It consists of 10K-100K entries, specifically 5.03k rows. The dataset is available in the train split.

📁 Data Structure

The dataset includes two main columns: emoji and unicode_description. The emoji column contains various emoji characters, while the unicode_description column provides a textual description of each emoji.
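As a rough illustration of the two-column layout described above, the sketch below builds a few rows with the same column names and turns them into an emoji-to-description lookup. The rows are hypothetical stand-ins generated with Python's standard `unicodedata` module, not records from the dataset, and the lowercase formatting of the descriptions is an assumption.

```python
import unicodedata


def make_row(emoji: str) -> dict:
    """Build one hypothetical row with the two column names from the dataset card."""
    # Assumption: descriptions resemble lowercase Unicode character names.
    return {
        "emoji": emoji,
        "unicode_description": unicodedata.name(emoji).lower(),
    }


# Hypothetical sample rows; the real dataset's values may differ.
rows = [make_row(ch) for ch in ["😀", "🚀", "🌍"]]

# Map each emoji character to its textual description.
emoji_to_description = {row["emoji"]: row["unicode_description"] for row in rows}

for emoji, description in emoji_to_description.items():
    print(emoji, description)
```

With the Hugging Face `datasets` library installed, the real rows could presumably be obtained with `load_dataset("omarkamali/emoji-map", split="train")` and passed through the same dictionary comprehension.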

🔍 Sample Data

Examples from the… See the full description on the dataset page: https://huggingface.co/datasets/omarkamali/emoji-map.
map-image-captions
huggingface.co
Updated Jan 22, 2025
Cite
Alexander Österberg (2025). map-image-captions [Dataset]. https://huggingface.co/datasets/ItzCornflakez/map-image-captions
Dataset updated
Jan 22, 2025
Authors
Alexander Österberg
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
ItzCornflakez/map-image-captions dataset hosted on Hugging Face and contributed by the HF Datasets community
data
huggingface.co
Updated Jul 27, 2025
Cite
Kaggle MAP (2025). data [Dataset]. https://huggingface.co/datasets/kaggle-map/data
Dataset updated
Jul 27, 2025
Dataset provided by
Kaggle: http://kaggle.com/
Authors
Kaggle MAP
Description
kaggle-map/data dataset hosted on Hugging Face and contributed by the HF Datasets community
Chords1217
huggingface.co
Updated Aug 4, 2025
Cite
Multimodal Art Projection (2025). Chords1217 [Dataset]. https://huggingface.co/datasets/m-a-p/Chords1217
Dataset updated
Aug 4, 2025
Dataset authored and provided by
Multimodal Art Projection
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
m-a-p/Chords1217 dataset hosted on Hugging Face and contributed by the HF Datasets community
COIG-Writer
huggingface.co
Updated Oct 18, 2025
Cite
Multimodal Art Projection (2025). COIG-Writer [Dataset]. https://huggingface.co/datasets/m-a-p/COIG-Writer
Dataset updated
Oct 18, 2025
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This repository contains the dataset and supplementary materials for the paper COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes.

🔔 Introduction

COIG-Writer is a large-scale Chinese creative writing dataset that connects final literary works with their underlying reasoning processes. Each sample includes a reverse-engineered writing prompt, a step-by-step reasoning trace, and the final article. This design allows researchers to explore… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-Writer.
OmniBench
huggingface.co
Updated Sep 24, 2024
Cite
Multimodal Art Projection (2024). OmniBench [Dataset]. https://huggingface.co/datasets/m-a-p/OmniBench
Dataset updated
Sep 24, 2024
Dataset authored and provided by
Multimodal Art Projection
Description
OmniBench

🌐 Homepage | 🏆 Leaderboard | 📖 Arxiv Paper | 🤗 Paper | 🤗 OmniBench Dataset | 🤗 OmniInstruct_V1 Dataset | 🦜 Tweets

The project introduces OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs).

Mini Leaderboard

This table shows the omni-language models in… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/OmniBench.
COIG-CQIA
huggingface.co
opendatalab.com
Updated Feb 2, 2024
Cite
Multimodal Art Projection (2024). COIG-CQIA [Dataset]. https://huggingface.co/datasets/m-a-p/COIG-CQIA
Dataset updated
Feb 2, 2024
Dataset authored and provided by
Multimodal Art Projection
Description
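COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning

Dataset Details

Dataset Description

Welcome to COIG-CQIA. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need. It is an open-source, high-quality instruction fine-tuning dataset that aims to give the Chinese NLP community high-quality instruction fine-tuning data consistent with human interaction behavior. COIG-CQIA uses questions, answers, and articles collected from the Chinese internet as raw data, and was built through deep cleaning, restructuring, and manual review. The project is inspired by research such as LIMA: Less Is More for Alignment, which suggests that a small amount of high-quality data is enough for a large language model to learn human interaction behavior. For that reason, the data construction placed great emphasis on the source, quality, and diversity of the data. For dataset details, see the data introduction and our upcoming paper… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-CQIA.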
PIN-14M
huggingface.co
Updated Dec 14, 2024
Cite
Multimodal Art Projection (2024). PIN-14M [Dataset]. https://huggingface.co/datasets/m-a-p/PIN-14M
Dataset updated
Dec 14, 2024
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
PIN-14M

A mini version of "PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents"

Paper: https://arxiv.org/abs/2406.13923

This dataset contains 14M samples in PIN format, with around 18.79 TB storage.

🚀 News

[ 2025.09.04 ] !NEW! 🔥 We have completed the final version of the PIN-14M dataset and conducted some simple statistics on it.
[ 2024.12.12 ] !NEW! 🔥 We have updated the quality signals for all subsets, with the dataset now containing 7.33B tokens… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/PIN-14M.
CodeCriticBench
huggingface.co
Updated Feb 25, 2025
Cite
Multimodal Art Projection (2025). CodeCriticBench [Dataset]. https://huggingface.co/datasets/m-a-p/CodeCriticBench
Dataset updated
Feb 25, 2025
Dataset authored and provided by
Multimodal Art Projection
Description
CodeCriticBench: A Holistic Benchmark for Code Critique in LLMs

💥 Introduction

CodeCriticBench is a comprehensive benchmark designed to systematically evaluate the critique capabilities of large language models (LLMs) in both code generation and code-question answering tasks. Beyond focusing on code generation, this benchmark extends to code-related questions, offering multidimensional and fine-grained evaluation criteria to rigorously assess LLMs' reasoning and code… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeCriticBench.
SimpleVQA
huggingface.co
Updated Apr 7, 2025
Cite
Multimodal Art Projection (2025). SimpleVQA [Dataset]. https://huggingface.co/datasets/m-a-p/SimpleVQA
Dataset updated
Apr 7, 2025
Dataset authored and provided by
Multimodal Art Projection
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
SimpleVQA

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Dataset: https://huggingface.co/datasets/m-a-p/SimpleVQA

Abstract

The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SimpleVQA.
Security-TTP-Mapping
huggingface.co
Cite
Tu Nguyen, Security-TTP-Mapping [Dataset]. http://doi.org/10.57967/hf/1811
Unique identifier
https://doi.org/10.57967/hf/1811
Authors
Tu Nguyen
License
https://choosealicense.com/licenses/cc/
Description
The Security Attack Pattern (TTP) Recognition or Mapping Task

We share in this repo the MITRE ATT&CK mapping datasets, with training, validation and test splits. The datasets can be considered as an emerging and challenging multilabel classification NLP task, with over 600 hierarchical classes. NOTE: due to their security nature, these datasets contain textual information about malware and other security aspects.

Datasets TRAM

This dataset belongs to CTID… See the full description on the dataset page: https://huggingface.co/datasets/tumeteor/Security-TTP-Mapping.
