Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
MAP-CC
An open-source Chinese pretraining dataset with a scale of 800 billion tokens, offering the NLP community high-quality Chinese pretraining data.
Disclaimer
This model, developed for academic purposes, employs rigorously compliance-checked training data to uphold the highest standards of integrity and compliance. Despite our efforts, the inherent complexities of data and the broad spectrum of… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MAP-CC.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
II-Bench
Introduction
II-Bench comprises 1,222 images, each accompanied by 1 to 3 multiple-choice questions, totaling 1,434 questions. II-Bench encompasses images from six distinct domains: Life, Art, Society, Psychology, Environment and Others. It also features a diverse array of image types, including Illustrations, Memes, Posters, Multi-panel Comics, Single-panel Comics, Logos and Paintings. The detailed… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/II-Bench.
https://choosealicense.com/licenses/odc-by/
This repository contains the data presented in SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines.
Tutorials for submitting to the official leaderboard
coming soon
📜 License
SuperGPQA is a composite dataset that includes both original content and portions of data derived from other sources. The dataset is made available under the Open Data Commons Attribution License (ODC-BY), which asserts no copyright over the underlying content. This means that while the… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SuperGPQA.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement
OpenCodeInterpreter
OpenCodeInterpreter is a family of open-source code generation systems designed to bridge the gap between large language models and advanced proprietary systems like the GPT-4 Code Interpreter. It significantly advances code generation capabilities by integrating execution and iterative refinement functionalities. For further information and… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
m-a-p/neo_sft_phase2 dataset hosted on Hugging Face and contributed by the HF Datasets community
huggingface/map-test dataset hosted on Hugging Face and contributed by the HF Datasets community
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
FineFineWeb: A Comprehensive Study on Fine-Grained Domain Web Corpus
arXiv: Coming Soon Project Page: Coming Soon Blog: Coming Soon
Data Statistics
| Domain (#tokens/#samples) | Iteration 1 Tokens | Iteration 2 Tokens | Iteration 3 Tokens | Total Tokens | Iteration 1 Count | Iteration 2 Count | Iteration 3 Count | Total Count |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| aerospace | 5.77B | 261.63M | 309.33M | 6.34B | 9100000 | 688505 | 611034 | 10399539 |
| agronomy | 13.08B | 947.41M | 229.04M | 14.26B | 15752828 | 2711790 | 649404 | 19114022 |
artistic… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/FineFineWeb-sample.
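The totals in the two complete rows shown above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the dataset card: the helper names `parse_tokens` and `check_row` are hypothetical, and the tolerance is an assumption chosen to absorb the two-decimal rounding of the published token figures.

```python
# Values copied from the aerospace and agronomy rows of the Data Statistics table.
ROWS = {
    "aerospace": (["5.77B", "261.63M", "309.33M"], "6.34B",
                  [9100000, 688505, 611034], 10399539),
    "agronomy": (["13.08B", "947.41M", "229.04M"], "14.26B",
                 [15752828, 2711790, 649404], 19114022),
}

UNITS = {"M": 1e6, "B": 1e9}

# Assumed slack: token figures are rounded to two decimals, so sums can drift slightly.
TOKEN_TOLERANCE = 0.02e9


def parse_tokens(text):
    """Convert a string such as '5.77B' or '261.63M' into a float token count."""
    return float(text[:-1]) * UNITS[text[-1]]


def check_row(tokens, total_tokens, counts, total_count):
    """Return (tokens_ok, counts_ok) for one table row."""
    token_sum = sum(parse_tokens(t) for t in tokens)
    tokens_ok = abs(token_sum - parse_tokens(total_tokens)) <= TOKEN_TOLERANCE
    counts_ok = sum(counts) == total_count  # sample counts are exact integers
    return tokens_ok, counts_ok


for domain, row in ROWS.items():
    print(domain, check_row(*row))
```

Sample counts are compared exactly because they are integers, while token counts are compared within a tolerance because the table reports them in rounded form.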
COIG-CQIA: Quality Is All You Need for Chinese Instruction Fine-tuning
Dataset Details
Dataset Description
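Welcome to COIG-CQIA. COIG-CQIA stands for Chinese Open Instruction Generalist - Quality is All You Need. It is an open-source, high-quality instruction fine-tuning dataset that aims to give the Chinese NLP community high-quality instruction fine-tuning data consistent with human interaction behavior. COIG-CQIA takes question-answer pairs and articles collected from the Chinese internet as raw data and is built through deep cleaning, restructuring, and manual review. The project is inspired by research such as LIMA: Less Is More for Alignment, which shows that a small amount of high-quality data is enough for a large language model to learn human interaction behavior. For that reason, we pay close attention to the source, quality, and diversity of the data during construction. For dataset details, see the data introduction and our forthcoming paper.… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/COIG-CQIA.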
https://choosealicense.com/licenses/cc/
Dataset Card for MusicPile
MusicPile is the first pretraining corpus for developing musical abilities in large language models. It has 5.17M samples and approximately 4.16B tokens, including web-crawled corpora, encyclopedias, music books, YouTube music captions, musical pieces in ABC notation, math content, and code. You can easily load it: from datasets import load_dataset ds =… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/MusicPile.
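The loading snippet in the description above is truncated in this listing, so the following is only a sketch of one plausible way to open the corpus with the `datasets` library. The repository id is taken from the dataset page URL. The split name `"train"`, the choice of streaming mode, and the helper names `first_records` and `load_musicpile_stream` are assumptions, not the card's own code.

```python
from itertools import islice

DATASET_ID = "m-a-p/MusicPile"  # repository id taken from the dataset page URL


def first_records(records, n=3):
    """Return the first n records from any iterable of records."""
    return list(islice(records, n))


def load_musicpile_stream(split="train"):
    """Open the corpus in streaming mode.

    Requires `pip install datasets` and network access.
    The split name "train" is an assumption; check the dataset page.
    """
    # Imported lazily so that first_records() stays usable without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)
```

Calling `first_records(load_musicpile_stream())` would then fetch a handful of records without downloading the whole corpus, which is a sensible first step for a dataset of roughly 4.16B tokens.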
📊 Dataset Overview
The emoji-map dataset, created by omarkamali, contains text data in parquet format. It consists of 10K-100K entries, specifically 5.03k rows. The dataset is available in the train split.
📁 Data Structure
The dataset includes two main columns: emoji and unicode_description. The emoji column contains various emoji characters, while the unicode_description column provides a textual description of each emoji.
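To make the two-column layout concrete, here is a small sketch that builds rows of the same shape locally with Python's standard `unicodedata` module. This is an illustration only: the listing does not say how the dataset's descriptions were produced, so the actual strings may differ in wording or case, and the helper names `describe`, `build_rows`, and `to_lookup` are hypothetical.

```python
import unicodedata


def describe(emoji_char):
    """Return the Unicode name of a single code point, or None if it has none.

    unicodedata.name() accepts exactly one character, so multi-code-point
    emoji sequences (flags, skin-tone variants) are not handled here.
    """
    return unicodedata.name(emoji_char, None)


def build_rows(chars):
    """Build rows shaped like the two columns described above."""
    return [{"emoji": c, "unicode_description": describe(c)} for c in chars]


def to_lookup(rows):
    """Turn rows into a dict mapping each emoji to its description."""
    return {row["emoji"]: row["unicode_description"] for row in rows}


rows = build_rows(["😀"])
print(to_lookup(rows))
```

A dictionary keyed by the emoji character is a convenient form for lookups once the rows are loaded, whatever their source.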
🔍 Sample Data
Examples from the… See the full description on the dataset page: https://huggingface.co/datasets/omarkamali/emoji-map.
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
SimpleVQA
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Dataset: https://huggingface.co/datasets/m-a-p/SimpleVQA
Abstract
The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the essence of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g. common and domain-specific knowledge). In this work, we… See the full description on the dataset page: https://huggingface.co/datasets/m-a-p/SimpleVQA.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for OAM-TCD: A globally diverse dataset of high-resolution tree cover maps
Annotation example in OAM-TCD (ID 1445), RGB image licensed CC BY-4.0, attribution contributors of OIN. Left: RGB aerial image, Middle: annotations shown, distinguished by instance ID, Right: annotations identified by class (blue = tree, orange = canopy)
Dataset Details
OAM-TCD is a dataset of high-resolution (10 cm/px) tree cover maps with instance-level masks for 280k trees and… See the full description on the dataset page: https://huggingface.co/datasets/restor/tcd.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
OpenSatMap Dataset Card
Description
The dataset contains 3,787 high-resolution satellite images with fine-grained annotations, covering diverse geographic locations and popular driving datasets. It can be used for large-scale map construction and downstream tasks like autonomous driving. The images are collected from Google Maps at level 19 resolution (0.3 m/pixel) and level 20 resolution (0.15 m/pixel); we denote them as OpenSatMap19 and OpenSatMap20, respectively.… See the full description on the dataset page: https://huggingface.co/datasets/z-hb/OpenSatMap.
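The two quoted resolutions are consistent with the standard Web Mercator ground-resolution formula, assuming that "level" refers to a Web Mercator zoom level with 256-pixel tiles. That reading is an assumption, as the description does not spell it out. The sketch below computes metres per pixel under that assumption; the function name `ground_resolution` is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS 84 equatorial radius used by Web Mercator
TILE_SIZE_PX = 256          # assumed tile size


def ground_resolution(zoom, latitude_deg=0.0):
    """Metres per pixel at a given zoom level and latitude (Web Mercator)."""
    equator_circumference = 2 * math.pi * EARTH_RADIUS_M
    metres_per_px_zoom0 = equator_circumference / TILE_SIZE_PX
    # Each zoom level halves the pixel size; the cosine term accounts for
    # the projection stretching away from the equator.
    return metres_per_px_zoom0 * math.cos(math.radians(latitude_deg)) / (2 ** zoom)


for level in (19, 20):
    print(level, round(ground_resolution(level), 4))
```

At the equator this gives roughly 0.30 m/pixel at level 19 and 0.15 m/pixel at level 20. At higher latitudes the ground resolution is finer, so the quoted figures are best read as nominal values.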
m-a-p/OmniInstruct dataset hosted on Hugging Face and contributed by the HF Datasets community
https://choosealicense.com/licenses/other/
Dataset Card for Map It Anywhere (MIA)
The Map It Anywhere (MIA) dataset contains map-prediction-ready data curated from public datasets.
Dataset Details
Dataset Description
The Map It Anywhere (MIA) dataset contains 1.2 million high-quality first-person-view (FPV) and bird's-eye-view (BEV) map pairs covering 470 square km, thereby facilitating future map prediction research on generalizability and robustness. The dataset is curated using the MIA data engine… See the full description on the dataset page: https://huggingface.co/datasets/cherieho/mia_dataset.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Counter Strike Map Dataset
This dataset consists of Counter Strike map images along with their corresponding labels and x-y coordinates. The dataset is suitable for image classification tasks and includes the necessary information for each image.
Dataset Details
Total Images: 1424
Classes: 5
Image Size: 1920x1080
Format: png
Files
The dataset includes the following files:
maps/train/: This folder contains the Counter Strike map images. The images are… See the full description on the dataset page: https://huggingface.co/datasets/HOXSEC/csgo-maps.
debisoft/gimp-predator-map-dataset dataset hosted on Hugging Face and contributed by the HF Datasets community
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
m-a-p/MTT dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
os-map/OS-Map dataset hosted on Hugging Face and contributed by the HF Datasets community
m-a-p/OpenO1_SFT_ultra_BoN_positvie_reward_v3_N-sample dataset hosted on Hugging Face and contributed by the HF Datasets community