Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
m-a-p/neo_sft_phase2 dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus
arXiv: Coming Soon | Project Page: Coming Soon | Blog: Coming Soon
Data Statistics
Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count
aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539
agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022
artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb.
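In the two complete rows above, the Total columns appear to be the sum of the three iteration columns. The following is a minimal sketch that checks this. The helper names `to_number` and `check_row` and the rounding tolerance are assumptions made for illustration; they are not part of any FineFineWeb tooling.

```python
# Values copied from the two complete rows of the Data Statistics table.
ROWS = {
    "aerospace": {
        "tokens": ["5.77B", "261.63M", "309.33M"],
        "total_tokens": "6.34B",
        "counts": [9100000, 688505, 611034],
        "total_count": 10399539,
    },
    "agronomy": {
        "tokens": ["13.08B", "947.41M", "229.04M"],
        "total_tokens": "14.26B",
        "counts": [15752828, 2711790, 649404],
        "total_count": 19114022,
    },
}

UNITS = {"M": 1e6, "B": 1e9}


def to_number(text):
    """Convert a string such as '261.63M' or '5.77B' to a float."""
    return float(text[:-1]) * UNITS[text[-1]]


def check_row(row, tolerance=0.01e9):
    """Return (tokens_ok, counts_ok) for one table row.

    Token figures are rounded in the table, so the token check allows
    a tolerance (assumed here to be 0.01B); sample counts must match exactly.
    """
    token_sum = sum(to_number(t) for t in row["tokens"])
    tokens_ok = abs(token_sum - to_number(row["total_tokens"])) <= tolerance
    counts_ok = sum(row["counts"]) == row["total_count"]
    return tokens_ok, counts_ok


for domain, row in ROWS.items():
    print(domain, check_row(row))
```

Both rows pass the check: the sample counts add up exactly, and the token totals agree within rounding.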
huggingface/map-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
II-Bench
🌐 Homepage | 🤗 Paper | 📖 arXiv | 🤗 Dataset | GitHub
Introduction
II-Bench comprises 1,222 images, each accompanied by 1 to 3 multiple-choice questions, totaling 1,434 questions. II-Bench encompasses images from six distinct domains: Life, Art, Society, Psychology, Environment and Others. It also features a diverse array of image types, including Illustrations, Memes, Posters, Multi-panel Comics, Single-panel Comics, Logos and Paintings. The detailed… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/II-Bench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
MapAnything Dataset
Dataset Description
This dataset contains pre-computed metadata and covisibility matrices that support the MapAnything codebase. This metadata enables easy, reproducible training and benchmarking for feed-forward 3D reconstruction tasks. Please see our Data Processing README for more details.
Citation
If you use this dataset in your research, please cite our paper: @inproceedings{keetha2025mapanything, title={{MapAnything}: Universal… See the full description on the dataset page: https://huggingface.co/datasets/facebook/map-anything.
m-a-p/OO1-Chat-747K dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
[🏠Homepage] | [🛠️Code]
OpenCodeInterpreter
OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities. For further information and… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
MAP-CC
🌐 Homepage | 🤗 MAP-CC | 🤗 CHC-Bench | 🤗 CT-LLM | 📖 arXiv | GitHub
An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.
Disclaimer
This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
ItzCornflakez/map-image-captions dataset hosted on Hugging Face and contributed by the HF Datasets community
FineLeanCorpus: A Large-Scale, High-Quality Lean 4 Formalization Dataset
🔍 Overview
FineLeanCorpus is the largest high-quality dataset of natural language mathematical statements paired with their formalizations in Lean 4, comprising 509,356 entries. This dataset is designed to advance research in mathematical autoformalization—the translation of natural language mathematics into formal, machine-verifiable code. The corpus is distinguished by its:
Scale: 509K entries… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineLeanCorpus.
kaggle-map/data dataset hosted on Hugging Face and contributed by the HF Datasets community
📊 Dataset Overview
The emoji-map dataset, created by omarkamali, contains text data in parquet format. It consists of 10K-100K entries, specifically 5.03k rows. The dataset is available in the train split.
📁 Data Structure
The dataset includes two main columns: emoji and unicode_description. The emoji column contains various emoji characters, while the unicode_description column provides a textual description of each emoji.
🔍 Sample Data
Examples from the… See the full description on the dataset page: https://huggingface.co/datasets/omarkamali/emoji-map.
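The two-column layout described above can be illustrated with a small sketch. It builds rows with the column names `emoji` and `unicode_description` from the description, but fills the second column with the name returned by Python's standard `unicodedata` module. Whether the dataset's descriptions match these Unicode names is an assumption, and the helper `build_rows` is hypothetical.

```python
import unicodedata


def build_rows(emojis):
    """Build rows shaped like the two columns described above.

    Assumption: the description is approximated by the Unicode character
    name. Only single-code-point emoji are handled, because
    unicodedata.name() accepts exactly one character.
    """
    rows = []
    for char in emojis:
        rows.append({
            "emoji": char,
            "unicode_description": unicodedata.name(char),
        })
    return rows


rows = build_rows(["😀", "🚀"])
for row in rows:
    print(row["emoji"], row["unicode_description"])
```

Emoji composed of several code points (for example, flags or skin-tone variants) would need a different lookup than the one sketched here.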
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
m-a-p/Chords1217 dataset hosted on Hugging Face and contributed by the HF Datasets community
COIG-CQIA: Quality is All you need for Chinese Instruction Fine-tuning
Dataset Details
Dataset Description
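Welcome to COIG-CQIA. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need. It is an open-source, high-quality instruction fine-tuning dataset that aims to give the Chinese NLP community instruction fine-tuning data that is both high in quality and consistent with human interaction behavior. COIG-CQIA takes question-answer pairs and articles collected from the Chinese internet as raw data, then builds the dataset through deep cleaning, restructuring, and manual review. The project is inspired by research such as LIMA: Less Is More for Alignment, which shows that a small amount of high-quality data is enough for a large language model to learn human interaction behavior. For this reason we pay close attention to the source, quality, and diversity of the data during construction. For dataset details, see the data introduction and our forthcoming paper… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-CQIA.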
CodeCriticBench: A Holistic Benchmark for Code Critique in LLMs
💥 Introduction
CodeCriticBench is a comprehensive benchmark designed to systematically evaluate the critique capabilities of large language models (LLMs) in both code generation and code-question answering tasks. Beyond focusing on code generation, this benchmark extends to code-related questions, offering multidimensional and fine-grained evaluation criteria to rigorously assess LLMs' reasoning and code… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeCriticBench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This repository contains the dataset and supplementary materials for the paper COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes.
🔔 Introduction
COIG-Writer is a large-scale Chinese creative writing dataset that connects final literary works with their underlying reasoning processes. Each sample includes a reverse-engineered writing prompt, a step-by-step reasoning trace, and the final article. This design allows researchers to explore… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-Writer.
https://choosealicense.com/licenses/cc/
The Security Attack Pattern (TTP) Recognition or Mapping Task
We share in this repo the MITRE ATT&CK mapping datasets, with training, validation and test splits. The datasets can be considered as an emerging and challenging multilabel classification NLP task, with over 600 hierarchical classes. NOTE: due to their security nature, these datasets contain textual information about malware and other security aspects.
Datasets
TRAM
This dataset belongs to CTID… See the full description on the dataset page: https://huggingface.co/datasets/tumeteor/Security-TTP-Mapping.
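The multilabel, hierarchical framing described above can be sketched as a multi-hot encoding over technique IDs. The example below is only an illustration under stated assumptions: the four-label vocabulary is a tiny hypothetical subset (the IDs are real MITRE ATT&CK identifiers, but the dataset's actual label format and class list are not shown here), and the helpers `with_parents` and `multi_hot` are invented for this sketch.

```python
# Hypothetical label vocabulary. The IDs are real ATT&CK identifiers, but
# this selection and ordering are assumptions made for illustration.
LABELS = ["T1059", "T1059.001", "T1566", "T1566.001"]
INDEX = {label: i for i, label in enumerate(LABELS)}


def with_parents(technique_ids):
    """Add the parent technique of each sub-technique (T1059.001 -> T1059)."""
    expanded = set(technique_ids)
    for tid in technique_ids:
        if "." in tid:
            expanded.add(tid.split(".", 1)[0])
    return expanded


def multi_hot(technique_ids):
    """Encode a set of technique IDs as a 0/1 vector over LABELS."""
    vector = [0] * len(LABELS)
    for tid in with_parents(technique_ids):
        vector[INDEX[tid]] = 1
    return vector


# A text labeled with one sub-technique and one top-level technique.
example = multi_hot(["T1059.001", "T1566"])
print(example)
```

Propagating labels to parent techniques is one possible way to respect the hierarchy; whether the datasets themselves do this is not stated in the description.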
OmniBench
🌐 Homepage | 🏆 Leaderboard | 📖 Arxiv Paper | 🤗 Paper | 🤗 OmniBench Dataset | 🤗 OmniInstruct_V1 Dataset | 🦜 Tweets
The project introduces OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. We define models capable of such tri-modal processing as omni-language models (OLMs).
Mini Leaderboard
This table shows the omni-language models in… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/OmniBench.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SimpleVQA
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Dataset: https://huggingface.co/datasets/m-a-p/SimpleVQA
Abstract
The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SimpleVQA.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PIN-14M
A mini version of "PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents"
Paper: https://arxiv.org/abs/2406.13923
This dataset contains 14M samples in PIN format, with around 18.79 TB storage.
🚀 News
[ 2025.09.04 ] !NEW! 🔥 We have completed the final version of the PIN-14M dataset and conducted some simple statistics on it.
[ 2024.12.12 ] !NEW! 🔥 We have updated the quality signals for all subsets, with the dataset now containing 7.33B tokens… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/PIN-14M.