56 datasets found
  1. sciriff-license-filtered-final-commercial

    • huggingface.co
    Updated Oct 20, 2025
    Cite
    Jacob Morrison (2025). sciriff-license-filtered-final-commercial [Dataset]. https://huggingface.co/datasets/jacobmorrison/sciriff-license-filtered-final-commercial
    Authors
    Jacob Morrison
    Description

    The jacobmorrison/sciriff-license-filtered-final-commercial dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  2. sciriff-filtered-licenses

    • huggingface.co
    Updated Oct 13, 2025
    Cite
    Jacob Morrison (2025). sciriff-filtered-licenses [Dataset]. https://huggingface.co/datasets/jacobmorrison/sciriff-filtered-licenses
    Authors
    Jacob Morrison
    Description

    The jacobmorrison/sciriff-filtered-licenses dataset is hosted on Hugging Face and contributed by the HF Datasets community.

  3. Commercially-Verified-Licenses

    • huggingface.co
    + more versions
    Cite
    Data Provenance Initiative, Commercially-Verified-Licenses [Dataset]. https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Data Provenance Initiative
    Description

    Dataset Card for Data Provenance Initiative - Commercial-Licenses

      Legal Disclaimer / Notice
    

    Collected license information is NOT legal advice. It is important to note that we collect self-reported licenses from the papers and repositories that released these datasets and categorize them according to our best efforts, as a volunteer research and transparency initiative. The information provided by any of our works and any outputs of the Data Provenance Initiative do not, and… See the full description on the dataset page: https://huggingface.co/datasets/DataProvenanceInitiative/Commercially-Verified-Licenses.

  4. facebook/natural_reasoning

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    Zehra Korkusuz (2025). facebook/natural_reasoning [Dataset]. https://www.kaggle.com/datasets/zehrakorkusuz/natural-reasoning
    Explore at:
    Available download format: zip (1694591016 bytes)
    Authors
    Zehra Korkusuz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Natural Reasoning Dataset

    Source: Hugging Face

    Dataset Overview

    Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.

    A 1.1-million-question subset of the Natural Reasoning dataset has been released to the research community to foster the development of strong large language model (LLM) reasoners.

    Dataset Information

    File: natural_reasoning.parquet (Parquet format)


    How to Use

    You can load the dataset directly from Hugging Face as follows:

    from datasets import load_dataset
    
    ds = load_dataset("facebook/natural_reasoning")
    

    Data Collection and Quality

    The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.

    Reference Answer Statistics

    In the 1.1 million subset:

    • 18.29% of the questions do not have a reference answer.
    • 9.71% of the questions have a single-word answer.
    • 21.58% of the questions have a short answer.
    • 50.42% of the questions have a long-form reference answer.
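
    As a rough illustration, one could split the subset by reference-answer availability. This is only a sketch: it assumes a reference_answer field that is empty or None when absent, so check ds.features for the actual schema before relying on it.

    from datasets import load_dataset

    # Hedged sketch: `reference_answer` is assumed to be empty/None when no
    # reference answer exists; verify the field name against the dataset schema.
    ds = load_dataset("facebook/natural_reasoning", split="train")
    with_ref = ds.filter(lambda x: bool(x.get("reference_answer")))
    print(f"{len(with_ref) / len(ds):.2%} of questions have a reference answer")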

    Scaling Curve Performance

    Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When training the Llama3.1-8B-Instruct model, the dataset achieved better performance on average across three key benchmarks: MATH, GPQA, and MMLU-Pro.

    [Figure: scaling-curve image, https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg (alt text: Scaling Curve)]

    Citation

    If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:

    @misc{yuan2025naturalreasoningreasoningwild28m,
       title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
       author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
       year={2025},
       eprint={2502.13124},
       archivePrefix={arXiv},
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2502.13124}
    }
    

    Source: Hugging Face

  5. stackv2_edu_filtered

    • huggingface.co
    + more versions
    Cite
    Common Pile, stackv2_edu_filtered [Dataset]. https://huggingface.co/datasets/common-pile/stackv2_edu_filtered
    Dataset authored and provided by
    Common Pile
    Description

    Stack V2 Edu

      Description
    

    We filter the Stack V2 to only include code from openly licensed repositories, based on the license detection performed by the creators of Stack V2. When multiple licenses are detected in a single repository, we ensure that all of the licenses are on the Blue Oak Council certified license list. Per-document license information is available in the license entry of the metadata field of each example. Code for collecting, processing, and preparing… See the full description on the dataset page: https://huggingface.co/datasets/common-pile/stackv2_edu_filtered.
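
    The per-document license entry can be inspected without downloading the full dataset. A minimal sketch, assuming the metadata field holds a license entry as the card describes (depending on the release, metadata may be a dict or a JSON string, so handle both):

    import json

    from datasets import load_dataset

    # Stream one example and read its per-document license from `metadata`.
    ds = load_dataset("common-pile/stackv2_edu_filtered", split="train", streaming=True)
    example = next(iter(ds))
    meta = example["metadata"]
    if isinstance(meta, str):
        meta = json.loads(meta)  # some releases store metadata as a JSON string
    print(meta.get("license"))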

  6. megalith-cc0

    • huggingface.co
    Updated Jun 22, 2025
    Cite
    Spawning (2025). megalith-cc0 [Dataset]. https://huggingface.co/datasets/Spawning/megalith-cc0
    Dataset authored and provided by
    Spawning
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Megalith-CC0

    A CC0-filtered version of the Megalith-10m dataset. The images have also been persisted to an independent public S3 bucket, supported by the AWS Open Data Registry program, for durability.

      Why filter by CC0?
    

    The images in Megalith-10m, having been gathered from Flickr, come with CC0 or public-domain licenses attached. However, it is not clear whether users who assign the public-domain license to their works understand its implications.… See the full description on the dataset page: https://huggingface.co/datasets/Spawning/megalith-cc0.

  7. MiniPile

    • kaggle.com
    • opendatalab.com
    • +1 more
    zip
    Updated Aug 21, 2023
    Cite
    sebamenabar (2023). MiniPile [Dataset]. https://www.kaggle.com/datasets/sebamenabar/minipile-hf
    Explore at:
    Available download format: zip (2790187812 bytes)
    Authors
    sebamenabar
    Description

    https://huggingface.co/datasets/JeanKaddour/minipile

    Dataset Card for MiniPile

    Table of Contents

    Dataset Description

    The MiniPile Challenge for Data-Efficient Language Models

    Dataset Summary

    MiniPile is a 6GB subset of the deduplicated The Pile corpus. To curate MiniPile, we perform a simple, three-step data filtering process: we (1) infer embeddings for all documents of the Pile, (2) cluster the embedding space using k-means, and (3) filter out low-quality clusters.
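
    The three steps can be sketched with off-the-shelf tools. This is an illustration of the idea only, not the authors' pipeline; the paper's embedding model, cluster count, and filtering criteria differ.

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Toy documents standing in for Pile examples.
    docs = [
        "def quicksort(arr): ...",
        "BUY CHEAP PILLS NOW!!! click here click here",
        "The mitochondrion is the powerhouse of the cell.",
    ]

    # (1) infer embeddings for all documents
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

    # (2) cluster the embedding space using k-means
    kmeans = KMeans(n_clusters=2, random_state=0, n_init="auto").fit(embeddings)

    # (3) drop documents in clusters judged low-quality (cluster ids would be
    # chosen by manually inspecting samples from each cluster)
    low_quality = {1}
    kept = [doc for doc, c in zip(docs, kmeans.labels_) if c not in low_quality]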

    The primary motivation for curating MiniPile is that (i) diverse pre-training datasets (like the Pile) are often too large for academic budgets and (ii) most smaller-scale datasets are fairly homogeneous and thereby unrepresentative of contemporary general-purpose language models. MiniPile aims to fill this gap and thereby facilitate data-efficient research on model architectures, training procedures, optimizers, etc.

    More details on the MiniPile curation procedure and some pre-training results can be found in the MiniPile paper.

    For more details on the Pile corpus, we refer the reader to the Pile datasheet.

    Languages

    English (EN)

    Additional Information

    Dataset Curators

    MiniPile is a subset of the Pile, curated by Jean Kaddour. The Pile was created by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy.

    Licensing Information

    Since MiniPile is a subset of the Pile, the same MIT License holds.

    Citation Information

    @article{kaddour2023minipile,
     title={The MiniPile Challenge for Data-Efficient Language Models},
     author={Kaddour, Jean},
     journal={arXiv preprint arXiv:2304.08442},
     year={2023}
    }
    @article{gao2020pile,
     title={The {P}ile: An 800{GB} dataset of diverse text for language modeling},
     author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and others},
     journal={arXiv preprint arXiv:2101.00027},
     year={2020}
    }
    
  8. Kanops Open Retail Imagery - Grocery Dataset

    • kaggle.com
    zip
    Updated Oct 24, 2025
    Cite
    Steve Dresser (2025). Kanops Open Retail Imagery - Grocery Dataset [Dataset]. https://www.kaggle.com/datasets/stevedresser/kanops-open-retail-imagery-grocery-dataset
    Explore at:
    Available download format: zip (98224876 bytes)
    Authors
    Steve Dresser
    Description

    🛒 Kanops — Open Access · Retail Imagery Dataset (v0)

    ~10,000 professional retail scene photographs from UK grocery stores for computer vision research

    📊 Quick Stats

    | Attribute | Details |
    |:---|:---|
    | Total Images | ~10,000 high-resolution photos |
    | Markets | United Kingdom |
    | Collections | 2014 archive, Full store surveys, Halloween 2024 |
    | Privacy | All faces automatically blurred |
    | License | Evaluation & Research Only |
    | Format | JPEG with comprehensive metadata |

    🎯 What Can You Build?

    This dataset is perfect for:

    • 🏪 Shelf Detection - Train models to identify retail fixtures and layouts
    • 📦 Product Recognition - Object detection in dense retail environments
    • 📊 Planogram Analysis - Compare actual vs. planned merchandising
    • 🎃 Seasonal Merchandising - Understand seasonal retail patterns (Halloween collection included)
    • 🤖 Store Navigation - Spatial understanding for retail robotics
    • 🔍 Visual Search - Build retail product search engines
    • 📈 Competitive Intelligence - Benchmark merchandising strategies

    📥 Access the Dataset

    Primary Source: HuggingFace (Gated)

    👉 Request access: https://huggingface.co/datasets/dresserman/kanops-open-access-imagery

    This dataset is gated - request access on HuggingFace. By requesting access, you agree to the evaluation-only license terms.

    Quick Start Code:

    from datasets import load_dataset
    
    # Load the dataset (after getting HuggingFace access)
    ds = load_dataset(
      "imagefolder",
      data_dir="hf://datasets/dresserman/kanops-open-access-imagery/train",
      split="train",
    )
    
    # Access first image
    img = ds[0]["image"] # PIL.Image
    img.show()
    

    Load Metadata:

    import pandas as pd
    
    meta = pd.read_csv(
      "hf://datasets/dresserman/kanops-open-access-imagery/metadata.csv"
    )
    print(meta.head())
    

    📁 Dataset Structure

    train/
    ├── 2014/
    │  ├── Aldi/
    │  ├── Tesco/
    │  ├── Sainsburys/
    │  └── ... (22 UK retailers)
    ├── FullStores/
    │  ├── Tesco_Lincoln_2014/
    │  ├── Tesco_Express_2015/
    │  └── Asda_Leeds_2016/
    └── Halloween2024/
      └── Various_Retailers/
    
    Root files:
    ├── MANIFEST.csv     # File listing + basic attributes
    ├── metadata.csv     # Enriched metadata (retailer, dims, collection)
    ├── checksums.sha256   # Integrity verification
    ├── blur_log.csv     # Face-blur verification log
    └── LICENSE        # Evaluation-only terms
    

    📋 Metadata Schema

    Each image includes comprehensive metadata in metadata.csv:

    | Field | Description |
    |:---|:---|
    | file_name | Path relative to dataset root |
    | bytes | File size in bytes |
    | width, height | Image dimensions |
    | sha256 | Content hash for integrity verification |
    | collection | One of: 2014, FullStores, Halloween2024 |
    | retailer | Inferred from file path |
    | year | Inferred from file path |

    🔒 Privacy & Data Integrity

    • All faces automatically blurred via automated detection + manual review
    • SHA-256 checksums for every image (data integrity; a verification sketch follows this list)
    • Provenance tracking embedded in EXIF/IPTC/XMP metadata
    • Gated access to ensure license compliance
    • Takedown process available if needed
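
    As an example of the integrity check, a short script can re-hash each file and compare it against checksums.sha256. This sketch assumes the common "digest, two spaces, relative path" line format; confirm against the actual file before use.

    import hashlib
    from pathlib import Path

    def verify(root: Path) -> list[str]:
        """Return relative paths whose SHA-256 digest does not match checksums.sha256."""
        failures = []
        for line in (root / "checksums.sha256").read_text().splitlines():
            digest, _, rel_path = line.strip().partition("  ")
            actual = hashlib.sha256((root / rel_path).read_bytes()).hexdigest()
            if actual != digest:
                failures.append(rel_path)
        return failures

    print(verify(Path("kanops-open-access-imagery")))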

    📜 License & Usage Terms

    License: Evaluation & Research Only

    What You CAN Do:

    • Use for academic research and publications
    • Train and evaluate computer vision models (non-commercial)
    • Benchmark algorithm performance
    • Educational and learning purposes
    • Prototype development under evaluation terms

    What You CANNOT Do:

    • Redistribute the dataset or derivatives
    • Use for commercial production systems
    • Publicly release model weights trained on this data (without commercial license)
    • Marketing or brand endorsement

    For commercial licensing: Contact happytohelp@groceryinsight.com

    🏢 About This Sample Dataset

    This free sample is part of Kanops Archive - a much larger commercial dataset used by AI companies and research institutions.

    This Free Sample (v0):

    • ~10,000 images from UK retailers only
    • 2014-2024 timeframe
    • Evaluation and research use only

    Full Commercial Dataset (RetailVision Archive):

    • 1M+ images spanning 2011-2025
    • 5 geographic markets: UK, Ireland, Netherlands, Germany, USA
    • 280K+ seasonal images with granular categorization
    • 15 years of retail evolution and trends
    • Professional curation by retail industry experts

    Applications:
    • Training production computer vision models
    • Autonomous checkout systems
    • Retail robotics and automation
    • Seasonal demand forecasting
    • Market research and competitive intelligence

    Learn more: [groceryinsight.com/retail-image-dataset](...

  9. trpfrog-icons

    • huggingface.co
    Updated Feb 10, 2023
    Cite
    Kasana Orikawa (2023). trpfrog-icons [Dataset]. https://huggingface.co/datasets/trpfrog/trpfrog-icons
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Kasana Orikawa
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    trpfrog-icons Dataset

    This is a dataset of TrpFrog's icons. By the way, what do you use this for? 🤔

      How to use
    

    from datasets import load_dataset

    dataset = load_dataset("TrpFrog/trpfrog-icons")

    # print all data
    for data in dataset["train"]:
        print(data)

    # keep only the green icons (label == 0)
    dataset = dataset.filter(lambda x: x["label"] == 0)

      License
    

    MIT License

  10. Alpaca

    • kaggle.com
    zip
    Updated Nov 24, 2023
    Cite
    The Devastator (2023). Alpaca [Dataset]. https://www.kaggle.com/datasets/thedevastator/alpaca-instructions-word-level-classification
    Explore at:
    Available download format: zip (26297842 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Alpaca

    Alpaca - Training LLMs to follow instructions

    By Huggingface Hub [source]

    About this dataset

    This dataset, TokenBender: 122k Alpaca-Style Instructions Word-Level Classification Towards Accurate Natural Language Understanding, provides a comprehensive collection of 122k Alpaca-style instructions with their associated input, text, and output for word-level classification. It contains entries from diverse areas, such as programming and gaming instructions, written at varying levels of complexity, making natural language understanding research convenient. Developers applying natural language processing techniques can use it to study how to improve the accuracy and comprehension of human-language commands, and to build algorithms such as neural networks or decision trees that understand commands quickly and bridge the gap between machines and humans for practical purposes.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains 122k Alpaca-Style Instructions with their corresponding input, text, and output for word-level classification. It is a valuable resource to those who wish to gain insight into natural language understanding through data science approaches. This guide will provide some tips on how to use this dataset in order to maximize accuracy and gain a better understanding of natural language.

    • Preprocessing: Cleaning the data is an essential step for any text data, including the Alpaca instructions dataset. This involves removing stopwords (articles, pronouns, etc.); normalizing words via capitalization handling or lemmatization; filtering for relevant terms based on context or the key problems you are trying to solve; and finally tokenizing the remaining text into individual pieces that can be provided as input features to different models. SentencePiece is well suited to this task.

    • Feature extraction: After preprocessing your text data, extract insightful features using techniques such as Bag-of-Words (BoW) or a TF-IDF vectorizer, which can help you better understand the context behind each instruction sentence or word within the corpus (a sketch follows this list). Embedding techniques such as word2vec or GloVe can also extract semantic information from the instructions and help build classifiers that predict word-level categories (semantic segmentation).

    • Model selection: Depending on your problem setup, architectures such as Support Vector Machines (SVMs), Conditional Random Fields (CRFs), or attention-based models work well for NLP tasks at the sentence or shallow-representation level (e.g., part-of-speech tagging). If modeling which words are used together matters most, a recurrent model such as an LSTM or GRU can be effective: its recurrent structure stores context more efficiently than the separate BoW or TF-IDF feature spaces built during feature engineering.

    • Evaluating results: After choosing the best-fitting model, track performance with measures such as F1 score, and watch for significant declines in precision or recall past set thresholds. Validate on held-out, uncategorized sample documents as well as larger train/test splits to confirm the model generalizes.
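
    A rough sketch of the TF-IDF route described above; the column names (instruction, label) are illustrative and should be matched to the actual CSV before running.

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("train.csv")  # adjust path and column names to the actual files
    X_train, X_test, y_train, y_test = train_test_split(
        df["instruction"], df["label"], test_size=0.2, random_state=0
    )

    # TF-IDF features feeding a simple linear classifier as a baseline.
    vec = TfidfVectorizer(max_features=20_000, stop_words="english")
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X_train), y_train)
    print(f1_score(y_test, clf.predict(vec.transform(X_test)), average="macro"))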

    Research Ideas

    • Developing an AI-based algorithm capable of accurately understanding the meaning of natural language instructions.
    • Using this dataset for training and testing machine learning models to classify specific words and phrases within natural language instructions.
    • Training a deep learning model to generate visual components based on the given input, text, and output values from this dataset

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Univer...

  11. finevideo

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Hugging Face FineVideo (2024). finevideo [Dataset]. https://huggingface.co/datasets/HuggingFaceFV/finevideo
    Dataset provided by
    Hugging Face (https://huggingface.co/)
    Authors
    Hugging Face FineVideo
    License

    https://choosealicense.com/licenses/cc/

    Description

    FineVideo

    Table of contents (from the dataset card):

    • FineVideo: Description, Dataset Explorer, Revisions, Dataset Distribution
    • How to download and use FineVideo: Using datasets, Using huggingface_hub, Load a subset of the dataset
    • Dataset Structure: Data Instances, Data Fields
    • Dataset Creation: License (CC-BY), Considerations for Using the Data, Social Impact of Dataset, Discussion of Biases
    • Additional Information: Credits, Future Work, Opting out of FineVideo, Citation Information

    Terms of use for FineVideo… See the full description on the dataset page: https://huggingface.co/datasets/HuggingFaceFV/finevideo.

  12. Synthia-v1.3

    • kaggle.com
    • huggingface.co
    zip
    Updated Nov 22, 2023
    Cite
    The Devastator (2023). Synthia-v1.3 [Dataset]. https://www.kaggle.com/datasets/thedevastator/human-machine-dialogue-interactions
    Explore at:
    Available download format: zip (79056480 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Human-Machine Dialogue Interactions

    Exploring Communication Models for Machine Learning

    By Huggingface Hub [source]

    About this dataset

    This Synthia-v1.3 dataset provides insight into the complexities of human-machine communication through its collection of dialogue interactions between humans and machines. It details how conversations develop between the two, including behavioural changes in both humans and machines toward one another over time. With information on user instructions to machines, the system used, and the machines' responses, the dataset offers a detailed overview of how systems use dialogue to interact with people in various scenarios. This can offer valuable insight into how predictive intelligence is applied in conversational settings, informing developers who want to build their own human-machine interfaces for effective two-way communication and fostering a deeper appreciation of the challenges these technologies pose.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    The dataset consists of a collection of dialogue interactions between humans and machines, providing insight into human-machine communication. It includes information about the system being used, instructions given by humans to machines and responses from machines.

    To start using this dataset:
    • Download the CSV file containing all of the dialogue interactions from the Kaggle datasets page.
    • Open your favourite spreadsheet software, such as Excel or Google Sheets, and load the CSV file.
    • Familiarize yourself with the columns: the system column contains details about the system used for role play between human and machine; the instruction column contains instructions given by humans to machines; the response column contains the machines' responses to those instructions.
    • Explore how conversations progress between humans and machines over time by examining these columns separately or together, as required.

    You can also filter for specific conditions within the dataset, such as conversations driven entirely by particular systems or involving certain instruction types, and conduct various kinds of analysis, such as descriptive statistics or correlation analysis (see the sketch below). With so many possibilities for exploration, you are sure to find something interesting!
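
    A minimal sketch of that kind of filtering with pandas, using the documented system, instruction, and response columns (the substring filter is only an example):

    import pandas as pd

    df = pd.read_csv("train.csv")

    # Keep conversations driven by a particular kind of system prompt.
    subset = df[df["system"].str.contains("assistant", case=False, na=False)]
    print(subset[["instruction", "response"]].head())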

    Research Ideas

    • Utilizing the dataset to understand how various types of instruction styles can influence conversation order and flow between humans and machines.
    • Using the data to predict potential responses in a given dialogue interaction from varying sources, such as robots or virtual assistants.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:---|:---|
    | system | The type of system used in the dialogue interaction. (String) |
    | instruction | The instruction given by the human to the machine. (String) |
    | response | The response given by the machine to the human. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  13. Social IQa (Social Interaction Q&A)

    • kaggle.com
    zip
    Updated Nov 20, 2022
    Cite
    The Devastator (2022). Social IQa (Social Interaction Q&A) [Dataset]. https://www.kaggle.com/datasets/thedevastator/social-i-qa-a-dataset-for-social-inquiry-questio/discussion
    Explore at:
    Available download format: zip (2024126 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Social IQa (Social Interaction Q&A)

    Question-answering benchmark for testing commonsense social intelligence

    Source

    Huggingface Hub: link

    About this dataset

    We introduce Social IQa: Social Interaction QA, a new question-answering benchmark for testing social commonsense intelligence. Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications. For example, given an action like "Jesse saw a concert" and a question like "Why did Jesse do this?", humans can easily infer that Jesse wanted "to see their favorite performer" or "to enjoy the music", and not "to see what's happening inside" or "to see if it works". The actions in Social IQa span a wide variety of social situations, and answer candidates contain both human-curated answers and adversarially-filtered machine-generated candidates. Social IQa contains over 37,000 QA pairs for evaluating models’ abilities to reason about the social implications of everyday events and situations.

    How to use the dataset

    This dataset can be used to train and test models for social inquiry question answering. The questions and answers in the dataset have been annotated by experts, and the dataset has been verified for accuracy.

    Research Ideas

    • The dataset can be used to train a model to answer questions about social topics.
    • The dataset can be used to improve question-answering systems for social inquiry.
    • The dataset can be used to generate new questions about social topics

    Acknowledgements

    Huggingface Hub: link

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: validation.csv

    | Column name | Description |
    |:---|:---|
    | context | The context of the question. (String) |
    | answerA | One of the possible answers to the question. (String) |
    | answerB | One of the possible answers to the question. (String) |
    | answerC | One of the possible answers to the question. (String) |
    | label | The correct answer to the question. (String) |

    File: train.csv

    | Column name | Description |
    |:---|:---|
    | context | The context of the question. (String) |
    | answerA | One of the possible answers to the question. (String) |
    | answerB | One of the possible answers to the question. (String) |
    | answerC | One of the possible answers to the question. (String) |
    | label | The correct answer to the question. (String) |
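
    A small sketch pairing each row's label with the chosen answer text; it assumes the labels 1-3 index answerA-answerC, which should be verified against the files.

    import pandas as pd

    df = pd.read_csv("train.csv")
    print(df["label"].value_counts())

    # Assumed mapping: label 1 -> answerA, 2 -> answerB, 3 -> answerC.
    answer_cols = {1: "answerA", 2: "answerB", 3: "answerC"}
    df["chosen"] = df.apply(lambda r: r[answer_cols[int(r["label"])]], axis=1)
    print(df[["context", "chosen"]].head())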

  14. CCDV Arxiv Summarization Dataset

    • kaggle.com
    zip
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). CCDV Arxiv Summarization Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/ccdv-arxiv-summarization-dataset
    Explore at:
    Available download format: zip (2219742528 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    CCDV Arxiv Summarization Dataset

    Arxiv Summarization Dataset for CCDV

    By ccdv (From Huggingface) [source]

    About this dataset

    The validation.csv file contains a set of articles along with their respective abstracts that can be used for validating the performance of summarization models. This subset allows researchers to fine-tune their models and measure how well they can summarize scientific texts.

    The train.csv file serves as the primary training data for building summarization models. It consists of numerous articles extracted from the Arxiv database, paired with their corresponding abstracts. By utilizing this file, researchers can develop and train various machine learning algorithms to generate accurate summaries of scientific papers.

    Lastly, the test.csv file provides a separate set of articles with accompanying abstracts specifically intended for evaluating the performance and effectiveness of summarization models developed using this dataset. Researchers can utilize this test set to conduct rigorous evaluations and benchmark different approaches in automatic document summarization.

    With columns labeled article and abstract, this dataset provides significant flexibility for developing robust models that summarize complex scientific documents, and it supports detailed analysis or multiple proposed summary variations if required.

    How to use the dataset

    • File Description:

    • validation.csv: This file contains articles and their respective abstracts that can be used for validation purposes.

    • train.csv: The purpose of this file is to provide training data for summarizing scientific articles.

    • test.csv: This file includes a set of articles and their corresponding abstracts that can be used to evaluate the performance of summarization models.

    • Dataset Structure: The dataset consists of two columns, article and abstract, repeated across the validation, train, and test files.

    • Usage Examples: This dataset can be utilized in various ways:

    a) Training Models: You can use the train.csv file to train your own model for summarizing scientific articles from the Arxiv database. The article column provides the full text of each scientific paper, while the abstract column contains its summary.

    b) Validation: The validation.csv file allows you to validate your trained models by comparing their generated summaries with the provided reference summaries in order to assess their performance.

    c) Evaluation: Utilize the test.csv file as a benchmark for evaluating different summarization models. Generate summaries using your selected model and compare them with reference summaries.

    • Evaluating Performance: To measure how well your summarization model performs on this dataset, you can use metrics like ROUGE (Recall-Oriented Understudy for Gisting Evaluation), which measures overlap between generated and reference summaries based on n-gram co-occurrence statistics (a minimal sketch follows).

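
    To make the ROUGE step concrete, here is a minimal sketch using the rouge-score package (pip install rouge-score); the texts are placeholders.

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    reference = "We propose a new method for abstractive summarization of scientific articles."
    generated = "A new abstractive summarization method for scientific papers is proposed."

    # score(target, prediction) returns precision/recall/F-measure per ROUGE variant.
    print(scorer.score(reference, generated))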

    Research Ideas

    • Summarizing scientific articles: This dataset can be used to train and evaluate summarization models for the task of generating concise summaries of scientific articles from the Arxiv database. Researchers can utilize this dataset to develop novel techniques and approaches for automatic summarization in the scientific domain.
    • Information retrieval: The dataset can be used to enhance search engines or information retrieval systems by providing concise summaries along with the full text of scientific articles. This would enable users to quickly grasp key information without having to read the entire article, improving accessibility and efficiency.
    • Text generation research: Researchers interested in natural language processing and text generation can use this dataset as a benchmark for developing new models and algorithms that generate coherent, informative, and concise summaries from lengthy scientific texts. The dataset provides a diverse range of articles across various domains, allowing researchers to explore different challenges in summary generation

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **License: [CC0 1.0 Universal (CC0 1.0) - Public Domain...

  15. Amazon Product Reviews

    • kaggle.com
    zip
    Updated Nov 26, 2023
    Cite
    The Devastator (2023). Amazon Product Reviews [Dataset]. https://www.kaggle.com/datasets/thedevastator/amazon-product-reviews
    Explore at:
    Available download format: zip (699806296 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Amazon Product Reviews

    18 Years of Customer Ratings and Experiences

    By Huggingface Hub [source]

    About this dataset

    The Amazon Reviews Polarity Dataset discloses eighteen years of customers' ratings and reviews from Amazon.com, offering an unparalleled trove of insight and knowledge. Drawing from the immense pool of over 35 million customer reviews, this dataset presents a broad spectrum of customer opinions on products they have bought or used. This invaluable data is a gold mine for improving products and services as it contains comprehensive information regarding customers' experiences with a product including ratings, titles, and plaintext content. At the same time, this dataset contains both customer-specific data along with product information which encourages deep analytics that could lead to great advances in providing tailored solutions for customers. Has your product been favored by the majority? Are there any aspects that need extra care? Use Amazon Reviews Polarity to gain deeper insights into what your customers want - explore now!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    1. Analyze customer ratings to identify trends: Look at how many customers have rated the same product or service with the same score (e.g., 4 stars). You can use this information to identify what customers like or dislike by examining common sentiment across the reviews. Identifying these patterns can help you decide which features of your products or services to emphasize in order to boost sales and satisfaction rates (a minimal sketch follows this list).

    2. Review content analysis: Analyzing review content is one of the best ways to gauge customer sentiment toward specific features or aspects of a product or service. Natural language processing tools such as word2vec, Latent Dirichlet Allocation (LDA), or even simple keyword-search algorithms can quickly reveal the general topics discussed in relation to your product or service across multiple reviews, letting you pinpoint areas that may need improvement.

    3. Track associated scores over time: By tracking customer ratings over time, you can spot when an issue arose with something specific, such as a negative response to a newly introduced feature that was removed shortly after introduction. This can save time and money by identifying issues before they become widespread concerns among the larger customer base.

    4. Visualize sentiment data over time: Visualizations such as bar graphs can reveal trends across categories faster than raw numbers alone. Combining numeric values with color coding for different scores makes anomalies easier to spot, allowing faster resolution when investigating why certain spikes occurred in time-series views while others stayed stable.
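
    A minimal starting point for step 1, using the documented label column to get the share of positive vs. negative reviews:

    import pandas as pd

    df = pd.read_csv("train.csv")
    print(df["label"].value_counts(normalize=True))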

    Research Ideas

    • Developing a customer sentiment analysis system that can be used to quickly analyze the sentiment of reviews and identify any potential areas of improvement.
    • Building a product recommendation service that takes into account the ratings and reviews of customers when recommending similar products they may be interested in purchasing.
    • Training a machine learning model to accurately predict customers’ ratings on new products they have not yet tried and leverage this for further product development optimization initiatives

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:---|:---|
    | label | The sentiment of the review, either positive or negative. (String) |
    | title | The title of the review. (String) ...

  16. MetaMath QA

    • kaggle.com
    zip
    Updated Nov 23, 2023
    Cite
    The Devastator (2023). MetaMath QA [Dataset]. https://www.kaggle.com/datasets/thedevastator/metamathqa-performance-with-mistral-7b
    Explore at:
    Available download format: zip (78629842 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    MetaMath QA

    Mathematical Questions for Large Language Models

    By Huggingface Hub [source]

    About this dataset

    This dataset contains meta-mathematics questions and answers collected from the Mistral-7B question-answering system. The responses, types, and queries are all provided in order to help boost the performance of MetaMathQA while maintaining high accuracy. With its well-structured design, this dataset provides users with an efficient way to investigate various aspects of question answering models and further understand how they function. Whether you are a professional or beginner, this dataset is sure to offer invaluable insights into the development of more powerful QA systems!

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    Data Dictionary

    The MetaMathQA dataset contains three columns:
    • response: the response to the query given by the question-answering system. (String)
    • type: the type of query provided as input to the system. (String)
    • query: the question posed to the system for which a response is required. (String)
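
    A quick sketch for loading the CSV and confirming the documented columns:

    import pandas as pd

    df = pd.read_csv("train.csv")
    print(df.columns.tolist())               # expected per the data dictionary: response, type, query
    print(df["type"].value_counts().head())  # distribution of query types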

    Preparing data for analysis

    Before diving into analysis, familiarize yourself with the data values present in each column, and check whether any preprocessing is needed, such as removing unwanted characters or filling in missing values, so the data can be used without issues during training and testing further down your process flow.

    ##### Training models

    After collecting and preprocessing the dataset, you can build models quickly with standard machine-learning tooling for tabular (CSV) data. Typical libraries offer algorithms such as Support Vector Machines (SVMs), logistic regression, and decision trees, together with hyperparameter-optimization utilities such as GridSearchCV and RandomSearchCV for tuning the selected configuration during model building. After model selection, validate performance with metrics such as accuracy, F1, precision, and recall.

    ##### Testing models

    After the building phase, test models robustly on the evaluation metrics mentioned above: make predictions with the trained model on held-out test cases (including new cases supplied by domain experts), run quality-assurance checks against the baseline metric scores, and assess confidence in the results before updating baselines and running further experiments.

    Research Ideas

    • Generating natural language processing (NLP) models to better identify patterns and connections between questions, answers, and types.
    • Developing understandings on the efficiency of certain language features in producing successful question-answering results for different types of queries.
    • Optimizing search algorithms that surface relevant answer results based on types of queries

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:---|:---|
    | response | The response to the query. (String) |
    | type | The type of query. (String) |

    Acknowledgements

    If you use this dataset in your research, please credit the original authors and Huggingface Hub.

  17. falcon-refinedweb

    • huggingface.co
    • opendatalab.com
    Cite
    Technology Innovation Institute, falcon-refinedweb [Dataset]. http://doi.org/10.57967/hf/0737
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Technology Innovation Institute
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    📀 Falcon RefinedWeb

    Falcon RefinedWeb is a massive English web dataset built by TII and released under an ODC-By 1.0 license. See the 📓 paper on arXiv for more details. RefinedWeb is built through stringent filtering and large-scale deduplication of CommonCrawl; we found models trained on RefinedWeb to achieve performance in-line or better than models trained on curated datasets, while only relying on web data. RefinedWeb is also "multimodal-friendly": it contains links and alt… See the full description on the dataset page: https://huggingface.co/datasets/tiiuae/falcon-refinedweb.
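
    A minimal sketch for sampling the dataset without downloading it in full, via streaming:

    from datasets import load_dataset

    # RefinedWeb is web-scale, so stream rather than materializing it locally.
    ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
    print(next(iter(ds)))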

  18. LLM Feedback Collection

    • kaggle.com
    zip
    Updated Nov 23, 2023
    Cite
    The Devastator (2023). LLM Feedback Collection [Dataset]. https://www.kaggle.com/datasets/thedevastator/fine-grained-gpt-4-evaluation
    Explore at:
    Available download format: zip (159502027 bytes)
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    LLM Feedback Collection

    Induce fine-grained evaluation capabilities into language models

    By Huggingface Hub [source]

    About this dataset

    This dataset contains 100,000 feedback responses from GPT-4 models, along with rubrics designed to evaluate both absolute and ranking scores. Each response is collected through a comprehensive evaluation process that takes into account the model's feedback, instruction, scoring criteria, reference answer, and input. The data gives researchers and developers valuable insight into how their models perform on various tasks, plus precise and accurate measures for comparing models against one another. Each response is accompanied by five descriptive scores that detail its quality: relevance to the given input, accuracy with respect to the reference answer, coherence between different parts of the output (such as grammar and organization), fluency of expression (free of errors and unnecessary repetition), and overall quality accounting for all other factors combined. With this dataset, you can evaluate each output qualitatively without manually inspecting every single response.

    More Datasets

    For more datasets, click here.

    Featured Notebooks

    • 🚨 Your notebook can be here! 🚨!

    How to use the dataset

    This dataset contains feedback from GPT-4 models, along with associated rubrics for absolute and ranking scoring. It can be used to evaluate the performance of GPT-4 models on different challenging tasks.

    In order to use this dataset effectively, it is important to understand the data provided in each column:

    • orig_feedback – feedback given by the original GPT-4 model
    • orig_score2_description – description of the second score given to the original GPT-4 model
    • orig_reference_answer – reference answer used to evaluate the original GPT-4 model
    • output – output from the fine-grained evaluation
    • orig_response – response from the original GPT-4 model
    • orig_criteria – criteria used to evaluate the original GPT-4 model
    • orig_instruction – instruction given to the original GPT-4 model
    • orig_score3_description – description of the third score given to

    Research Ideas

    • Data-driven evaluation of GPT-4 models using the absolute and ranking scores collected from this dataset.
    • Training a deep learning model to automate the assessment of GPT-4 responses based on the rubrics provided in this dataset.
    • Building a semantic search engine using GPT-4 that is able to identify relevant responses more accurately with the help of this dataset's data collection metrics and rubrics for scoring

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: train.csv

    | Column name | Description |
    |:---|:---|
    | orig_feedback | Feedback from the evaluator. (Text) |
    | orig_score2_description | Description of the second score given by the evaluator. (Text) |
    | orig_reference_answer | Reference answer used to evaluate the model response. (Text) |
    | output | Output from the GPT-4 model. (Text) |
    | orig_response | Original response from the GPT-4 model. (Text) |
    | orig_criteria | Criteria used by the evaluator to rate the response. (Text) |
    | orig_instruction | Instructions provided by the evaluator. (Text) |
    | orig_score3_description | Description of the third score given by the evaluator. (Text) |
    | orig_score5_description | Description of the fifth score given by the evaluator. (Text) |
    | orig_score1_description | Description of the first score given by the evaluator. (Text) |
    | input | Input given to the evaluation. (Text) |
    | orig_score4_description | Description of the fourth score given by the evalua...

  19. the-stack-dedup

    • huggingface.co
    • opendatalab.com
    Updated Oct 27, 2022
    + more versions
    Cite
    BigCode (2022). the-stack-dedup [Dataset]. https://huggingface.co/datasets/bigcode/the-stack-dedup
    Dataset authored and provided by
    BigCode
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for The Stack

      Changelog
    

    | Release | Description |
    |:---|:---|
    | v1.0 | Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 1.5TB in size. |
    | v1.1 | The three weak-copyleft licenses (MPL/EPL/LGPL) were excluded and the list of permissive licenses was extended to 193 licenses in total. The list of programming… See the full description on the dataset page: https://huggingface.co/datasets/bigcode/the-stack-dedup. |

  20. tulu-3-ultrafeedback-cleaned-on-policy-8b

    • huggingface.co
    Updated Nov 29, 2025
    + more versions
    Cite
    Ai2 (2025). tulu-3-ultrafeedback-cleaned-on-policy-8b [Dataset]. https://huggingface.co/datasets/allenai/tulu-3-ultrafeedback-cleaned-on-policy-8b
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset provided by
    Allen Institute for AI (http://allenai.org/)
    Authors
    Ai2
    Description

    Llama 3.1 Tulu 3 Ultrafeedback (Cleaned) (on-policy 8B)

    Note that this collection is licensed under ODC-BY-1.0 license; different licenses apply to subsets of the data. Some portions of the dataset are non-commercial. We present the mixture as a research artifact. This preference dataset is part of our Tulu 3 preference mixture. It contains prompts from Ai2's cleaned version of Ultrafeedback which removes instances of TruthfulQA. We further filtered this dataset to remove… See the full description on the dataset page: https://huggingface.co/datasets/allenai/tulu-3-ultrafeedback-cleaned-on-policy-8b.
