jacobmorrison/sciriff-license-filtered-final-commercial dataset hosted on Hugging Face and contributed by the HF Datasets community
jacobmorrison/sciriff-filtered-licenses dataset hosted on Hugging Face and contributed by the HF Datasets community
Dataset Card for Data Provenance Initiative - Commercial-Licenses
Legal Disclaimer / Notice
Collected License Information is NOT Legal Advice. It is important to note that we collect self-reported licenses from the papers and repositories that released these datasets, and categorize them according to our best efforts, as a volunteer research and transparency initiative. The information provided by any of our works and any outputs of the Data Provenance Initiative do not, and… See the full description on the dataset page: https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.
A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.
File Format: natural_reasoning.parquet
License: CC-BY-NC-4.0 · Tasks: Text Generation, Reasoning · Language: English (en) · Size: 1M < n < 10M · Hosted on Hugging Face

You can load the dataset directly from Hugging Face as follows:
from datasets import load_dataset
ds = load_dataset("facebook/natural_reasoning")
The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.
In the 1.1 million subset:
- 18.29% of the questions do not have a reference answer.
- 9.71% of the questions have a single-word answer.
- 21.58% of the questions have a short answer.
- 50.42% of the questions have a long-form reference answer.
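A minimal sketch of reproducing this breakdown, assuming the subset exposes a reference_answer field (the field name follows the card's description but should be checked against the actual schema):

```python
from datasets import load_dataset

# Field name "reference_answer" is assumed from the card's description; adjust if the schema differs.
ds = load_dataset("facebook/natural_reasoning", split="train")

no_reference = ds.filter(lambda x: not x["reference_answer"])
single_word = ds.filter(lambda x: x["reference_answer"] and len(x["reference_answer"].split()) == 1)

print(f"No reference answer: {len(no_reference) / len(ds):.2%}")
print(f"Single-word answers: {len(single_word) / len(ds):.2%}")
```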
Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When used to train Llama3.1-8B-Instruct, it achieved better average performance across three key benchmarks: MATH, GPQA, and MMLU-Pro.
![Scaling Curve](https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg)
If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:
@misc{yuan2025naturalreasoningreasoningwild28m,
title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
year={2025},
eprint={2502.13124},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.13124}
}
Source: Hugging Face
Stack V2 Edu
Description
We filter the Stack V2 to only include code from openly licensed repositories, based on the license detection performed by the creators of Stack V2. When multiple licenses are detected in a single repository, we ensure that all of the licenses are on the Blue Oak Council certified license list. Per-document license information is available in the license entry of the metadata field of each example. Code for collecting, processing, and preparing… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/stackv2_edu_filtered.
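A minimal sketch of reading the per-document license entry; streaming avoids downloading the full corpus, and the exact layout of the metadata field is an assumption based on the card's description:

```python
import json
from datasets import load_dataset

# Stream so we can peek at examples without downloading everything.
ds = load_dataset("common-pile/stackv2_edu_filtered", split="train", streaming=True)

for example in ds:
    meta = example["metadata"]
    if isinstance(meta, str):  # metadata may be stored as a JSON string; parse if so
        meta = json.loads(meta)
    print(meta.get("license"))  # per-document license entry described in the card
    break
```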
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Megalith-CC0
A CC0-filtered version of the Megalith-10m dataset. The images have also been persisted to an independent public S3 bucket, supported by the AWS Open Data Registry program, for durability.
Why filter by CC0?
The images in Megalith-10m, having been gathered from Flickr, carry CC0 or public-domain licenses. However, it is not always clear whether users who assign a public-domain license to their works understand its implications.… See the full description on the dataset page: https://huggingface.co/datasets/Spawning/megalith-cc0.
https://huggingface.co/datasets/JeanKaddour/minipile
The MiniPile Challenge for Data-Efficient Language Models
MiniPile is a 6GB subset of the deduplicated Pile corpus. To curate MiniPile, we perform a simple three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters.
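A toy illustration of this cluster-and-filter idea (not the actual MiniPile pipeline; the embedding model and the clusters dropped here are placeholders, with the real choices described in the paper):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = ["a high-quality document", "another useful document", "spammy boilerplate text", "more boilerplate"]

# (1) Infer embeddings for all documents (model choice is illustrative only).
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

# (2) Cluster the embedding space with k-means.
kmeans = KMeans(n_clusters=2, random_state=0).fit(embeddings)

# (3) Filter out clusters judged low quality (hand-picked here for illustration).
low_quality_clusters = {1}
kept = [doc for doc, cluster in zip(docs, kmeans.labels_) if cluster not in low_quality_clusters]
print(f"Kept {len(kept)} of {len(docs)} documents")
```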
The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of the data used to train contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.
More details on the MiniPile curation procedure and some pre-training results can be found in the MiniPile paper.
For more details on the Pile corpus, we refer the reader to the Pile datasheet.
English (EN)
MiniPile is a subset of the Pile, curated by Jean Kaddour. The Pile was created by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy.
Since MiniPile is a subset of the Pile, the same MIT License holds.
@article{kaddour2023minipile,
title={The MiniPile Challenge for Data-Efficient Language Models},
author={Kaddour, Jean},
journal={arXiv preprint arXiv:2304.08442},
year={2023}
}
@article{gao2020pile,
title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
journal={arXiv preprint arXiv:2101.00027},
year={2020}
}
~10,000 professional retail scene photographs from UK grocery stores for computer vision research
| Attribute | Details |
|---|---|
| Total Images | ~10,000 high-resolution photos |
| Markets | United Kingdom |
| Collections | 2014 archive, Full store surveys, Halloween 2024 |
| Privacy | All faces automatically blurred |
| License | Evaluation & Research Only |
| Format | JPEG with comprehensive metadata |
This dataset is perfect for the applications listed later in this card.
👉 Request access: https://huggingface.co/datasets/dresserman/kanops-open-access-imagery
This dataset is gated - request access on HuggingFace. By requesting access, you agree to the evaluation-only license terms.
from datasets import load_dataset
# Load the dataset (after getting HuggingFace access)
ds = load_dataset(
"imagefolder",
data_dir="hf://datasets/dresserman/kanops-open-access-imagery/train",
split="train",
)
# Access first image
img = ds[0]["image"] # PIL.Image
img.show()
import pandas as pd
meta = pd.read_csv(
"hf://datasets/dresserman/kanops-open-access-imagery/metadata.csv"
)
print(meta.head())
train/
├── 2014/
│ ├── Aldi/
│ ├── Tesco/
│ ├── Sainsburys/
│ └── ... (22 UK retailers)
├── FullStores/
│ ├── Tesco_Lincoln_2014/
│ ├── Tesco_Express_2015/
│ └── Asda_Leeds_2016/
└── Halloween2024/
└── Various_Retailers/
Root files:
├── MANIFEST.csv # File listing + basic attributes
├── metadata.csv # Enriched metadata (retailer, dims, collection)
├── checksums.sha256 # Integrity verification
├── blur_log.csv # Face-blur verification log
└── LICENSE # Evaluation-only terms
Each image includes comprehensive metadata in metadata.csv:
| Field | Description |
|---|---|
| file_name | Path relative to dataset root |
| bytes | File size in bytes |
| width, height | Image dimensions |
| sha256 | Content hash for integrity verification |
| collection | One of: 2014, FullStores, Halloween2024 |
| retailer | Inferred from file path |
| year | Inferred from file path |
License: Evaluation & Research Only
For commercial licensing: Contact happytohelp@groceryinsight.com
This free sample is part of Kanops Archive - a much larger commercial dataset used by AI companies and research institutions.
Applications:
- Training production computer vision models
- Autonomous checkout systems
- Retail robotics and automation
- Seasonal demand forecasting
- Market research and competitive intelligence
Learn more: [groceryinsight.com/retail-image-dataset](...
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
trpfrog-icons Dataset
This is a dataset of TrpFrog's icons. By the way, what do you use this for? 🤔
How to use
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("TrpFrog/trpfrog-icons")

# Iterate over the training split
for data in dataset["train"]:
    print(data)

# Keep only the examples with label 0
dataset = dataset.filter(lambda x: x["label"] == 0)
License
MIT License
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset, TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding, provides a comprehensive collection of 122K Alpaca-style instructions with their associated input, text and output for word-level classification. It enables natural language understanding research to be done conveniently, as it contains entries from diverse areas such as programming code instructions and gaming instructions written at varying levels of complexity. With the help of this dataset, developers aiming to apply natural language processing techniques for machines may gain insight into how to improve the accuracy and facilitate the comprehension of human language commands. Using this dataset, one may develop advanced algorithms such as neural networks or decision trees that can quickly understand commands in foreign languages and bridge the gap between machines and humans for different practical purposes.
This dataset contains 122k Alpaca-Style Instructions with their corresponding input, text, and output for word-level classification. It is a valuable resource to those who wish to gain insight into natural language understanding through data science approaches. This guide will provide some tips on how to use this dataset in order to maximize accuracy and gain a better understanding of natural language.
Preprocessing: Cleaning the data is an essential step when dealing with any sort of text data which includes the Alpaca instructions dataset. This involves removing stopwords like articles, pronouns, etc., normalizing words such as capitalization or lemmatization, filtering for relevant terms based on context or key problems you are trying to solve; and finally tokenizing the remaining text into appropriate individual pieces that can be provided as input features for different models – SentencePiece is perfect for this sort of task.
Feature extraction: After preprocessing your text data it’s time to extract insightful features from it using techniques like Bag-of-Words (BoW) or a Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which can help you better understand the context behind each instruction sentence/word within the corpus. Additionally, embedding techniques such as word2vec/GloVe can prove useful for extracting semantic information from these instructions and for building classifiers that predict word-level categories (semantic segmentation). A small sketch follows these steps.
Model selection: Depending on your problem setup, architectures like Support Vector Machines (SVMs), Conditional Random Fields (CRFs), or attention-based models should work well for these kinds of NLP tasks at both the sentence and shallow-representation level (e.g., part-of-speech tagging). If learning which words are used together matters more than anything else, an RNN model such as an LSTM or GRU might do wonders; its recurrent structure lets it store context information more effectively than the separate BoW or TF-IDF vector spaces built up during feature engineering for each individual supervised training task.
Evaluating results: After choosing the best-fitting model, performance measures such as F1 scores make it easier to track results against the end goal and to make adjustments if precision/recall drops significantly below acceptable thresholds. Holding out an uncategorized sample of documents, alongside the larger train/test splits, helps confirm that results generalize beyond the data the model was trained on.
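A small sketch of the feature-extraction-to-evaluation flow described above, using scikit-learn on toy data that stands in for the dataset's instruction text and word-level labels (all names and labels here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy instructions and labels standing in for the real columns.
texts = ["sort this list of numbers", "translate the sentence into French",
         "write a python function to reverse a string", "summarise the following paragraph"]
labels = ["code", "language", "code", "language"]

# Feature extraction: TF-IDF vectors over the instruction text.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, random_state=0, stratify=labels
)

# A simple baseline; swap in SVMs, CRFs, or attention-based models as discussed above.
clf = LogisticRegression().fit(X_train, y_train)
print("F1:", f1_score(y_test, clf.predict(X_test), average="macro"))
```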
- Developing an AI-based algorithm capable of accurately understanding the meaning of natural language instructions.
- Using this dataset for training and testing machine learning models to classify specific words and phrases within natural language instructions.
- Training a deep learning model to generate visual components based on the given input, text, and output values from this dataset
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Univer...
https://choosealicense.com/licenses/cc/
FineVideo
Contents:
- FineVideo Description
- Dataset Explorer
- Revisions
- Dataset Distribution
- How to download and use FineVideo (using datasets, using huggingface_hub, loading a subset of the dataset)
- Dataset Structure (data instances, data fields)
- Dataset Creation, License (CC-By), Considerations for Using the Data (social impact of dataset, discussion of biases)
- Additional Information (credits, future work, opting out of FineVideo, citation information)
Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.
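A minimal sketch of loading FineVideo with streaming; the dataset is gated, so the terms on its page may need to be accepted first, and the split name train is assumed:

```python
from datasets import load_dataset

# Streaming avoids downloading the full video dataset up front.
ds = load_dataset("HuggingFaceFV/finevideo", split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # inspect the available fields rather than assuming their names
```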
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This Synthia-v1.3 dataset provides insight into the complexities of human-machine communication through its collection of dialogue interactions between humans and machines. Contained within this dataset are details on how conversations develop between the two, detailing behavioural changes in both humans and machines towards one another over time. With information provided on both user instructions to machines, as well as the system, machine responses and other related data points, this dataset offers a detailed overview of machine learning concepts, examining how systems utilise dialogue to interact with people in various scenarios. This can offer valuable insight into how predictive intelligence is applied by these systems in conversational settings, better informing developers seeking to build their own human-machine interfaces for effective two-way communication. Looking at this dataset as a whole builds an understanding of the way connections form between humans and machines, providing a deeper appreciation of the ongoing challenges faced when working on projects with these technological components at play.
The dataset consists of a collection of dialogue interactions between humans and machines, providing insight into human-machine communication. It includes information about the system being used, instructions given by humans to machines and responses from machines.
To start using this data set:
- Download the CSV file containing all of the dialogue interactions from the Kaggle datasets page.
- Open up your favourite spreadsheet software like Excel or Google Sheets and load up the CSV file.
- Take a look at each of the columns listed in order to familiarize yourself with what they contain: the ‘system’ column contains details about what system was used for role play between human and machine; the ‘instruction’ column contains instructions given by humans to machines; the ‘response’ column contains responses from machines back to humans based on their instructions.
- Start exploring how conversations progress between humans and machines over time by examining information in each of these columns separately or together as required. You can also filter for specific conditions within your data set, such as searching for conversations driven entirely by particular systems or involving certain instruction types. In addition, you can conduct various kinds of analysis, such as statistical analysis (e.g., descriptive statistics or correlation analysis). With so many possibilities for exploration, you are sure to find something interesting!
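A small pandas sketch of this kind of exploration, assuming the CSV has been saved locally as train.csv with the system/instruction/response columns described above:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# How often does each system prompt appear?
print(df["system"].value_counts().head())

# Filter to conversations driven by a particular kind of system (the keyword is illustrative).
subset = df[df["system"].str.contains("assistant", case=False, na=False)]
print(f"{len(subset)} rows match")
```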
- Utilizing the dataset to understand how various types of instruction styles can influence conversation order and flow between humans and machines.
- Using the data to predict potential responses in a given dialogue interaction from varying sources, such as robots or virtual assistants.
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:------------|
| system | The type of system used in the dialogue interaction. (String) |
| instruction | The instruction given by the human to the machine. (String) |
| response | The response given by the machine to the human. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
https://creativecommons.org/publicdomain/zero/1.0/
Huggingface Hub: link
We introduce Social IQa: Social Interaction QA, a new question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like "Jesse saw a concert" and a question like "Why did Jesse do this?", humans can easily infer that Jesse wanted "to see their favorite performer" or "to enjoy the music", and not "to see what's happening inside" or "to see if it works". The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations.
This dataset can be used to train and test models for social inquiry question answering. The questions and answers in the dataset have been annotated by experts, and the dataset has been verified for accuracy.
- The dataset can be used to train a model to answer questions about social topics.
- The dataset can be used to improve question-answering systems for social inquiry.
- The dataset can be used to generate new questions about social topics
Huggingface Hub: link
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: validation.csv

| Column name | Description |
|:------------|:-------------------------------------------------------|
| context | The context of the question. (String) |
| answerA | One of the possible answers to the question. (String) |
| answerB | One of the possible answers to the question. (String) |
| answerC | One of the possible answers to the question. (String) |
| label | The correct answer to the question. (String) |
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------------------------|
| context | The context of the question. (String) |
| answerA | One of the possible answers to the question. (String) |
| answerB | One of the possible answers to the question. (String) |
| answerC | One of the possible answers to the question. (String) |
| label | The correct answer to the question. (String) |
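A minimal sketch of working with the validation split, assuming it has been downloaded as validation.csv; the label encoding (1/2/3 for answerA/B/C) is an assumption and should be verified against the data:

```python
import pandas as pd

val = pd.read_csv("validation.csv")

# Trivial baseline: always predict answerA and score it against the label column.
preds = pd.Series(["1"] * len(val))
accuracy = (preds == val["label"].astype(str)).mean()
print(f"Always-answerA baseline accuracy: {accuracy:.2%}")
```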
https://creativecommons.org/publicdomain/zero/1.0/
By ccdv (From Huggingface) [source]
The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.
The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.
Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.
With columns labeled article and abstract (each potentially repeated to allow detailed analysis or multiple variations, e.g., different proposed summaries, if required by users), this dataset provides significant flexibility for developing robust models that summarize complex scientific documents.
Introduction:
File Description:
validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.
train.csv: The purpose of this file is to provide training data for summarizing scientific articles.
test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.
Dataset Structure: Each file consists of two columns, article and abstract, where article holds the full text of a scientific paper and abstract contains its summary.
Usage Examples: This dataset can be utilized in various ways:
a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.
b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.
c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.
- Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures overlap between generated summaries and reference summaries based on n-gram co-occurrence statistics.
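A minimal sketch of computing ROUGE with the Hugging Face evaluate package (the rouge_score backend must be installed; the prediction and reference strings are placeholders):

```python
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the paper proposes a new attention mechanism for long documents"]
references = ["this paper introduces a novel attention mechanism for summarizing long scientific documents"]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum overlap scores
```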
- Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
- Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
- Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation
If you use this dataset in your research, please credit the original authors. Data Source
**License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
The Amazon Reviews Polarity Dataset discloses eighteen years of customers' ratings and reviews from Amazon.com, offering an unparalleled trove of insight and knowledge. Drawing from the immense pool of over 35 million customer reviews, this dataset presents a broad spectrum of customer opinions on products they have bought or used. This invaluable data is a gold mine for improving products and services as it contains comprehensive information regarding customers' experiences with a product including ratings, titles, and plaintext content. At the same time, this dataset contains both customer-specific data along with product information which encourages deep analytics that could lead to great advances in providing tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want - explore now!
1. Analyze customer ratings to identify trends: Take a look at how many customers have rated the same product or service with the same score (e.g., 4 stars). You can use this information to identify what customers like or don’t like by examining common sentiment throughout the reviews. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates.
2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product/service. Natural language processing tools such as Word2Vec, Latent Dirichlet Allocation (LDA), or even simple keyword-search algorithms can quickly reveal the general topics discussed in relation to your product/service across multiple reviews, letting you quickly pinpoint areas that may need improvement for particular items within your lines of business.
3. Track associated scores over time: By tracking customer ratings over time, you may be able to better understand when there has been an issue with something specific related to your product/service, such as a negative response toward a feature that was introduced, proved unpopular, and was removed shortly after introduction. This can save time and money by identifying issues before they become widespread concerns among the larger set of consumers who spend money on your company's items.
4. Visualize sentiment data over time: Visualizations such as bar graphs can help identify trends across categories more quickly than raw numbers alone; combining numeric values with colour coding for different scores makes it easier to spot anomalies and to work out why certain spikes occurred while other data points stayed stable (or vice versa) when comparing similar data points in time-series visualizations.
- Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
- Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
- Training a machine learning model to accurately predict customers’ ratings on new products they have not yet tried and leverage this for further product development optimization initiatives
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:--------------------------------------------------------------------|
| label | The sentiment of the review, either positive or negative. (String) |
| title | The title of the review. (String) ...
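A small pandas sketch of a first look at train.csv; the label and title columns follow the table above, while the review-body column name (text) is an assumption because the column listing is truncated:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Distribution of positive vs. negative reviews.
print(df["label"].value_counts(normalize=True))

# Average review length per sentiment, as a quick content check (column name "text" is assumed).
df["review_length"] = df["text"].astype(str).str.split().str.len()
print(df.groupby("label")["review_length"].mean())
```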
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!
Data Dictionary
The MetaMathQA dataset contains three columns: response, type, and query.
- Response: the response to the query given by the question answering system. (String)
- Type: the type of query provided as input to the system. (String)
- Query: the question posed to the system for which a response is required. (String)
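A small pandas sketch of exploring these columns, assuming the data has been exported locally as train.csv:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# How many examples of each query type are there?
print(df["type"].value_counts())

# Peek at one query/response pair.
print(df.loc[0, "query"])
print(df.loc[0, "response"])
```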
Preparing data for analysis
Before you dive into analysis, it’s important to first familiarize yourself with the kind of data values present in each column and to check whether any preprocessing needs to be done on them, such as removing unwanted characters or filling in missing values, so that the data can be used without any issue while training or testing your model further down in your process flow.
##### Training Models using Mistral 7B
Mistral 7B is an open-source framework designed for building machine learning models quickly and easily from tabular (CSV) datasets such as this 'MetaMathQA' dataset. After collecting and preprocessing your dataset accordingly, Mistral 7B provides support for various machine learning algorithms such as Support Vector Machines (SVM), logistic regression, and decision trees, allowing you to select these algorithms from various popular libraries along with powerful hyperparameter optimization techniques. After selecting an algorithm configuration, it is good practice to use GridSearchCV and RandomSearchCV to tune it further during the model-building stage. Once models are selected, you can validate their performance through metrics such as accuracy, F1, precision, and recall.
##### Testing phase:
After successfully completing the building phase, robustly test the selected models on the evaluation metrics mentioned above. At this stage you make predictions with the trained model on new test cases, including ones presented by domain experts, run quality-assurance checks against the baseline metrics, and assess confidence in the results, updating baseline scores and running further experiments as needed. The core advantage of this workflow is keeping the impact of relevance and inexactness errors low.
- Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
- Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
- Optimizing search algorithms that surface relevant answer results based on types of queries
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:------------|:-------------------------------------|
| response | The response to the query. (String) |
| type | The type of query. (String) |
If you use this dataset in your research, please credit the original authors and Huggingface Hub.
https://choosealicense.com/licenses/odc-by/
📀 Falcon RefinedWeb
Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in-line or better than models trained on curated datasets, while only relying on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
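A minimal sketch of sampling RefinedWeb with streaming, since the full dataset is far too large to download casually; the text field name content is taken from the dataset card and should be verified:

```python
from datasets import load_dataset

# Stream rather than download the multi-terabyte dataset.
ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for example in ds:
    print(example["content"][:200])  # "content" is assumed to be the text field
    break
```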
https://creativecommons.org/publicdomain/zero/1.0/
By Huggingface Hub [source]
This dataset contains 100,000 feedback responses from GPT-4 AI models along with rubrics designed to evaluate both absolute and ranking scores. Each response is collected through a comprehensive evaluation process that takes into account the model's feedback, instruction, criteria for scoring, referenced answer and input given. This data provides researchers and developers with valuable insights into the performance of their AI models on various tasks as well as the ability to compare them against one another using precise and accurate measures. Each response is accompanied by five descriptive scores that give a detailed overview of its quality in terms of relevance to the input given, accuracy in reference to the reference answer provided, coherence between different parts of the output such as grammar and organization, fluency in expression of ideas without errors or unnecessary repetitions, and overall productivity accounting for all other factors combined. With this dataset at your disposal, you will be able to evaluate each output qualitatively without having to manually inspect every single response
This dataset contains feedback from GPT-4 models, along with associated rubrics for absolute and ranking scoring. It can be used to evaluate the performance of GPT-4 models on different challenging tasks.
In order to use this dataset effectively, it is important to understand the data provided in each column:
- orig_feedback – Feedback given by the original GPT-4 model
- orig_score2_description – Description of the second score given to the original GPT-4 model
- orig_reference_answer – Reference answer used to evaluate the original GPT-4 model
- output – Output from the fine-grained evaluation
- orig_response – Response from the original GPT-4 model
- orig_criteria – Criteria used to evaluate the original GPT-4 model
- orig_instruction – Instruction given to the original GPT-4 model
- orig_score3_description – Description of the third score given to
- Data-driven evaluation of GPT-4 models using the absolute and ranking scores collected from this dataset.
- Training a deep learning model to automate the assessment of GPT-4 responses based on the rubrics provided in this dataset.
- Building a semantic search engine using GPT-4 that is able to identify relevant responses more accurately with the help of this dataset's data collection metrics and rubrics for scoring
If you use this dataset in your research, please credit the original authors. Data Source
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: train.csv

| Column name | Description |
|:----------------------------|:-----------------------------------------------------------------|
| orig_feedback | Feedback from the evaluator. (Text) |
| orig_score2_description | Description of the second score given by the evaluator. (Text) |
| orig_reference_answer | Reference answer used to evaluate the model response. (Text) |
| output | Output from the GPT-4 model. (Text) |
| orig_response | Original response from the GPT-4 model. (Text) |
| orig_criteria | Criteria used by the evaluator to rate the response. (Text) |
| orig_instruction | Instructions provided by the evaluator. (Text) |
| orig_score3_description | Description of the third score given by the evaluator. (Text) |
| orig_score5_description | Description of the fifth score given by the evaluator. (Text) |
| orig_score1_description | Description of the first score given by the evaluator. (Text) |
| input | Input given to the evaluation. (Text) |
| orig_score4_description | Description of the fourth score given by the evalua...
https://choosealicense.com/licenses/other/
Dataset Card for The Stack
Changelog
| Release | Description |
|---|---|
| v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size. |
| v1.1 | The three copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses extended to 193 licenses in total. The list of programming… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup. |
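A minimal sketch of loading a single-language slice of the Stack with streaming; the data_dir="data/python" layout follows the card's convention, and the dataset's terms of use may need to be accepted on its page first:

```python
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/python",   # one language subdirectory; adjust as needed
    split="train",
    streaming=True,           # avoid downloading the multi-terabyte dataset
)

for example in ds:
    print(example["content"][:200])
    break
```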
Llama 3.1 Tulu 3 Ultrafeedback (Cleaned) (on-policy 8B)
Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture. It contains prompts from Ai2's cleaned version of Ultrafeedback which removes instances of TruthfulQA. We further filtered this dataset to remove… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-ultrafeedback-cleaned-on-policy-8b.
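A minimal sketch of loading this preference mixture; the split name train is assumed, and field names should be inspected rather than assumed:

```python
from datasets import load_dataset

ds = load_dataset("allenai/tulu-3-ultrafeedback-cleaned-on-policy-8b", split="train")
print(ds)
print(ds[0].keys())  # typically a prompt plus chosen/rejected responses for preference tuning
```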