100+ datasets found
  1. dataset-preferences-llm-course-full-dataset

    • huggingface.co
    Updated May 31, 2024
    + more versions
    Cite
    Daniel van Strien (2024). dataset-preferences-llm-course-full-dataset [Dataset]. https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 31, 2024
    Authors
    Daniel van Strien
    Description

    Dataset Card for dataset-preferences-llm-course-full-dataset

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset.
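As a rough illustration, the raw `pipeline.yaml` URL and the `distilabel pipeline run --config` invocation quoted above can be assembled from a dataset repository id. The helper names below are hypothetical, and the URL pattern is simply the one that appears in this listing; nothing here calls distilabel itself.

```python
import shlex


def pipeline_config_url(repo_id: str) -> str:
    """Build the raw pipeline.yaml URL, following the pattern shown in the dataset card."""
    return f"https://huggingface.co/datasets/{repo_id}/raw/main/pipeline.yaml"


def distilabel_run_command(repo_id: str) -> list[str]:
    """Return the CLI command from the dataset card as an argument list (hypothetical helper)."""
    return ["distilabel", "pipeline", "run", "--config", pipeline_config_url(repo_id)]


if __name__ == "__main__":
    cmd = distilabel_run_command("davanstrien/dataset-preferences-llm-course-full-dataset")
    # Print a shell-safe version of the command instead of executing it.
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.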

  2. Training_Data_FineTuning_LLM_PEFT_LORA

    • kaggle.com
    zip
    Updated Aug 8, 2024
    Cite
    Rupak Roy/ Bob (2024). Training_Data_FineTuning_LLM_PEFT_LORA [Dataset]. https://www.kaggle.com/datasets/rupakroy/training-dataset-peft-lora
    Explore at:
    Available download formats: zip (29562174 bytes)
    Dataset updated
    Aug 8, 2024
    Authors
    Rupak Roy/ Bob
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and LoRA (Low-Rank Adaptation of Large Language Models), a popular and lightweight training technique that significantly reduces the number of trainable parameters.

    The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum
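To make the "fewer trainable parameters" claim concrete, here is a minimal arithmetic sketch. It assumes the standard LoRA formulation, in which a frozen weight matrix of shape d_out x d_in is adapted by two small matrices of shapes d_out x r and r x d_in. The layer size and rank are illustrative values, not taken from this dataset.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` LoRA adapter (B: d_out x r, A: r x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out = d_in = 4096  # illustrative projection size (assumption)
    rank = 8             # illustrative LoRA rank (assumption)
    full = full_finetune_params(d_out, d_in)
    lora = lora_params(d_out, d_in, rank)
    print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
```

For this one layer the adapter trains 65,536 parameters instead of 16,777,216, a 256-fold reduction.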

  3. LLM Prompt Recovery - Synthetic Datastore

    • kaggle.com
    zip
    Updated Feb 29, 2024
    Cite
    Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
    Explore at:
    Available download formats: zip (988448 bytes)
    Dataset updated
    Feb 29, 2024
    Authors
    Darien Schettler
    License

    https://www.licenses.ai/ai-licenses

    Description

    High Level Description

    This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.

    Contributors

    Please go upvote these other datasets, as my work would not be possible without them.

    First Dataset - 1000 Examples From @thedrcat

    Update 1 - February 29, 2024

    The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

    The file below is the file Darek created, with two additional columns appended. The first is the raw output of Gemma 7B-IT, based on the instructions below (Darek used 2B-IT). The second is the same output with the 'Sure... blah blah' sentence removed.

    I generated things using the following setup:

    # I used a vLLM server to host Gemma 7B on paperspace (A100)
    
    # Step 1 - Install vLLM
    >>> pip install vllm
    
    # Step 2 - Authenticate HuggingFace CLI (for model weights)
    >>> huggingface-cli login --token
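The removal of the leading 'Sure...' sentence mentioned above could be sketched roughly as follows. This is not the author's code: it assumes the preamble is a first line that starts with the word "Sure" and is followed by a line break, and it leaves everything else untouched.

```python
import re

# Matches an optional run of leading whitespace, a first line starting with
# the word "Sure", and the line break(s) that follow it.
_PREAMBLE = re.compile(r"^\s*Sure\b[^\n]*\n+", re.IGNORECASE)


def strip_sure_preamble(text: str) -> str:
    """Drop a leading 'Sure, ...' line if present; otherwise return the text unchanged."""
    return _PREAMBLE.sub("", text, count=1)


if __name__ == "__main__":
    raw = "Sure, here is the rewritten text:\n\nOnce upon a time."
    print(strip_sure_preamble(raw))
```

The pattern is deliberately conservative: a response that consists only of a single "Sure ..." line, with no line break after it, is returned as-is rather than emptied.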
    
  4. resume-screening-llm-training-dataset

    • kaggle.com
    zip
    Updated Sep 11, 2025
    Cite
    Syncora_ai (2025). resume-screening-llm-training-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/resume-screening-llm-training-dataset
    Explore at:
    Available download formats: zip (60353 bytes)
    Dataset updated
    Sep 11, 2025
    Authors
    Syncora_ai
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Resume Screening & HR Conversations Dataset for LLM Training

    Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.

    It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.

    By Syncora.ai, enabling privacy-safe, high-quality synthetic data for smarter AI.

    ✅ Why This Dataset?

    • Accelerate HR AI development – No need to source or anonymize real resumes.
    • Ready for LLM Training – Structured in OpenAI-compatible JSONL format.
    • Privacy-Safe & Scalable – 100% synthetic, zero PII.

    📂 Dataset Description

    This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.

    • Format: JSONL (OpenAI fine-tuning compatible)
    • Schema:
      • messages: An array of chat messages with roles (system, user, assistant) and their respective content.

    Example:

    {
     "messages": [
      {"role": "system", "content": "You are an informative assistant."},
      {"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
      {"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
     ]
    }
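A minimal sketch of how one line of such a JSONL file might be checked against the schema described above (a `messages` array whose items carry a `role` and a `content`). The function name and the exact checks are assumptions made for illustration, not part of the dataset or of any OpenAI tooling.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if it looks valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems


if __name__ == "__main__":
    sample = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello"}]}'
    print(validate_record(sample))
```

Running a check like this over every line before fine-tuning surfaces malformed records early, before a training job rejects the file.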
    

    🔍 What's Inside

    • Resume Screening Q&A
      • Job titles, experience, role responsibilities.
    • Career Guidance Questions
      • Tech trends, upskilling, and career planning.

    👥 Who Should Use This Dataset?

    • Recruitment Tech Startups – Build automated candidate screeners.
    • HR Teams – Deploy internal career chatbots.
    • AI Engineers & Researchers – Fine-tune LLMs for HR applications.
    • EdTech Platforms – Power career advisory assistants.

    🚀 How To Use

    Fine-tune with OpenAI:

    openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
    openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"

    ✅ Why Synthetic?

    • No compliance headaches (GDPR, PII safe)
    • Faster experimentation for LLM applications
    • High-fidelity HR scenarios for realistic model behavior

    🔗 Start Generating Your Own Synthetic Data

    → Generate your own synthetic datasets now

  5. clinical-synthetic-text-llm

    • huggingface.co
    Updated Jul 5, 2024
    + more versions
    Cite
    Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 5, 2024
    Authors
    Ran Xu
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Data Description

    We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

      Generated Datasets
    

    The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.

  6. HarmfulQA

    • huggingface.co
    Updated Aug 20, 2023
    Cite
    Deep Cognition and Language Research (DeCLaRe) Lab (2023). HarmfulQA [Dataset]. https://huggingface.co/datasets/declare-lab/HarmfulQA
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 20, 2023
    Dataset authored and provided by
    Deep Cognition and Language Research (DeCLaRe) Lab
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Paper | Github | Dataset| Model 📣📣📣: Do check our new multilingual dataset CatQA here used in Safety Vectors:📣📣📣

    As part of our research efforts toward making LLMs safer for public use, we create HarmfulQA, a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. More details are in our paper, Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. HarmfulQA serves as both a new LLM safety benchmark and an alignment dataset… See the full description on the dataset page: https://huggingface.co/datasets/declare-lab/HarmfulQA.

  7. deepseek-llm-7b-base

    • kaggle.com
    zip
    Updated Jan 30, 2025
    Cite
    Younus_Mohamed (2025). deepseek-llm-7b-base [Dataset]. https://www.kaggle.com/datasets/younusmohamed/deepseek-llm-7b-base
    Explore at:
    Available download formats: zip (21941318019 bytes)
    Dataset updated
    Jan 30, 2025
    Authors
    Younus_Mohamed
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    DeepSeek Model Dataset

    Overview

    This dataset contains the DeepSeek model, a [brief description of the model, e.g., "state-of-the-art language model for natural language processing tasks"]. The model is designed for [specific use cases, e.g., "text generation, sentiment analysis, etc."].

    Contents

    • model_weights/: Directory containing the model weights.
    • config.json: Configuration file for the model.
    • inference_example.ipynb: Jupyter Notebook demonstrating how to load and use the model.
    • requirements.txt: List of Python dependencies.

    Usage

    1. Download the dataset from Kaggle.
    2. Install the required dependencies using pip install -r requirements.txt.
    3. Open the inference_example.ipynb notebook to see how to load the model and perform inference.

    License

    This dataset is licensed under [license name, e.g., "MIT License"]. See the LICENSE file for more details.

    Acknowledgments

    • Deepseek

    Contact

    For questions or issues, please contact

  8. Customer Service Call Dataset [Multisector] – Annotated support transcripts...

    • datarade.ai
    Updated Apr 11, 2025
    Cite
    WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
    Explore at:
    Available download formats: .json, .csv, .xls, .txt
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    WiserBrand
    Area covered
    United States of America
    Description

    "This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

    Included in each record:

    • Full call transcription with labeled speakers (system, agent, customer)
    • Concise human-written summary of the conversation
    • Sentiment tag for the overall interaction: positive, neutral, or negative
    • Company name, duration, and geographic location of the caller
    • Call context includes industries such as eCommerce, banking, telecom, and streaming services
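The record layout listed above could be modelled roughly like this. The class and field names are hypothetical, chosen only to mirror the bullet points; the vendor's actual column names are not given in this listing.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    """One support call, with fields mirroring the description (names are assumptions)."""
    company: str
    duration_seconds: int
    caller_location: str
    summary: str
    sentiment: str  # expected to be "positive", "neutral", or "negative"
    transcript: list[tuple[str, str]] = field(default_factory=list)  # (speaker, utterance)


def sentiment_counts(records: list[CallRecord]) -> Counter:
    """Count how many calls carry each overall sentiment tag."""
    return Counter(r.sentiment for r in records)


if __name__ == "__main__":
    calls = [
        CallRecord("ExampleCo", 310, "Ohio", "Refund request resolved.", "positive",
                   [("customer", "I was charged twice."), ("agent", "I have refunded one charge.")]),
        CallRecord("ExampleCo", 95, "Texas", "Caller asked about plan options.", "neutral"),
    ]
    print(sentiment_counts(calls))
```

A quick tally like this helps check class balance before training a sentiment or churn-risk model on the transcripts.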

    Common use cases:

    • Train NLP models to understand support calls and detect churn risk
    • Power complaint detection engines for customer success and support teams
    • Create high-quality LLM training sets with real support narratives
    • Build summarization and topic tagging pipelines for CX dashboards
    • Analyze tone shifts and resolution language in customer-agent interaction

    This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."

    The more you purchase, the lower the price will be.

  9. Main Data and Code

    • figshare.com
    zip
    Updated Oct 5, 2025
    Cite
    Momo (2025). Main Data and Code [Dataset]. http://doi.org/10.6084/m9.figshare.29929412.v1
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 5, 2025
    Dataset provided by
    figshare
    Authors
    Momo
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Important Notice: Ethical Use Only

    This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.

    Prohibited Use

    Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.

    Disclaimer

    The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.

    Project Structure

    The project is organized into three main directories, each corresponding to a major section of the paper's experiments:

    main_data_and_code/
    ├── rumor_generation/
    ├── rumor_detection/
    └── rumor_debunking/

    How to Get Started

    Prerequisites

    To successfully run the code and reproduce the results, you will need to:

    • Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.
    • For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.

    Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.

    Detailed Directory Breakdown

    1. rumor_generation/

    This directory contains all the code and data related to the rumor generation experiments.

    • rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
    • rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
    • rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
    • token_distribution.py: Script to analyze token distribution in the generated text.
    • label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
    • extract_reasons.py: Script to extract reasons for rumor generation and rejection.
    • visualization.py: Utility script for generating figures.
    • LDA.py: Code for performing LDA topic modeling on the generated data.
    • rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
    • generation_reasons_extracted.json: The extracted reasons for generated rumors.
    • rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.

    2. rumor_detection/

    This directory contains the code and data used for the rumor detection experiments.

    • nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.
    • nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
    • nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
    • reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.
    • reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.
    • reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.
    • traditional_model.py: Code for the traditional models used as baselines.
    • preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
    • preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
    • generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
    • select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
    • twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
    • twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
    • fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
    • twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
    • twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
    • fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
    • visualization.py: Utility script for generating figures.

    3. rumor_debunking/

    This directory contains all the code and data for the rumor debunking experiments.

    • analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
    • calculate_readability.py: Script for calculating the readability score of the debunking texts.
    • plot_readability.py: Utility script for generating figures related to readability.
    • fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
    • debunking_results.json: The dataset containing the debunking results for this experimental section.
    • debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
    • sentiment_analysis/: This directory contains the result file from the sentiment analysis.
      • debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.

    Please contact the repository owner if you encounter any problems or have questions about the code or data.
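For readers unfamiliar with the classification metrics that a summary script of this kind typically reports, here is a small self-contained sketch. It is not the repository's `generate_summary_table.py`; it assumes a binary setting where label 1 means "rumor" and label 0 means "non-rumor", which is a simplification chosen for illustration.

```python
def binary_metrics(y_true: list[int], y_pred: list[int], positive: int = 1) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    # Guard every division so empty or degenerate inputs return 0.0 instead of failing.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gold = [1, 1, 0, 0]       # hypothetical gold labels
    predicted = [1, 0, 0, 1]  # hypothetical LLM predictions
    print(binary_metrics(gold, predicted))
```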

  10. fine-tuning-llm

    • huggingface.co
    Updated Oct 31, 2024
    + more versions
    Cite
    H (2024). fine-tuning-llm [Dataset]. https://huggingface.co/datasets/KarinaH/fine-tuning-llm
    Explore at:
    Dataset updated
    Oct 31, 2024
    Authors
    H
    Description

    Dataset Card for fine-tuning-llm

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/KarinaH/fine-tuning-llm/raw/main/pipeline.yaml"

    or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/KarinaH/fine-tuning-llm.

  11. Hermes-3-Llama-3.1-8B-details

    • huggingface.co
    Updated Jul 30, 2025
    + more versions
    Cite
    Open LLM Leaderboard (2025). Hermes-3-Llama-3.1-8B-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/Hermes-3-Llama-3.1-8B-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of NousResearch/Hermes-3-Llama-3.1-8B

    Dataset automatically created during the evaluation run of model NousResearch/Hermes-3-Llama-3.1-8B. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Hermes-3-Llama-3.1-8B-details.
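Since each run is stored as a split named after its timestamp, the most recent run can be selected by name. The sketch below assumes the timestamps are zero-padded and ordered from most to least significant field, so that lexicographic order matches chronological order. That naming format is an assumption; the split names in the example are made up.

```python
def latest_run_split(split_names: list[str]) -> str:
    """Return the newest timestamp-named split, ignoring the 'train' alias."""
    timestamped = [name for name in split_names if name != "train"]
    if not timestamped:
        raise ValueError("no timestamp-named splits found")
    # With zero-padded, most-significant-first timestamps, max() picks the latest run.
    return max(timestamped)


if __name__ == "__main__":
    splits = ["2025_07_29T10_00_00", "train", "2025_07_30T08_15_00"]  # hypothetical names
    print(latest_run_split(splits))
```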

  12. AI Safety Verification Dataset

    • kaggle.com
    zip
    Updated Aug 25, 2025
    Cite
    Priyam Saha (2025). AI Safety Verification Dataset [Dataset]. https://www.kaggle.com/datasets/priyamsaha17/ai-safety-verification-dataset
    Explore at:
    Available download formats: zip (147706374 bytes)
    Dataset updated
    Aug 25, 2025
    Authors
    Priyam Saha
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Overview

    An aggregated, cleaned, and unified dataset assembled from the AI Verify Foundation’s Project Moonshot resources on Kaggle. It combines: (a) prompt templates and prompt-engineering cookbooks, (b) pre-built recipes used to configure benchmark runs (input/target pairs, evaluation metric, grading scales), and (c) metric definitions/outputs for automated evaluation. The material is intended to support reproducible LLM benchmarking, bias/fairness analysis, and prompt-engineering experiments.

    Project Moonshot

    Project Moonshot is an open-source LLM evaluation toolkit produced by the AI Verify Foundation; it brings benchmarking and red-teaming workflows together and publishes prompt templates, recipes and metrics on GitHub and the Moonshot docs site. Link: https://aiverifyfoundation.sg/project-moonshot/

    Recipe

    Recipes (in Moonshot) are pre-built benchmark configurations: JSON files that define the dataset (input / target pairs), the prompt template to use, the evaluation metric, and any grading thresholds, enabling reproducible, repeatable test runs. The Moonshot project publishes many such recipes for different evaluation categories (e.g., prompt injection, cybersecurity).

    Cookbook

    A cookbook (in the ML/prompting context) is a curated collection of patterns, examples and “how-to” snippets for solving common tasks with LLMs (templates, best practices, and worked examples). Think of a cookbook as a higher-level collection that organizes recipes and templates for reuse.

    Intended uses

    • Reproducible LLM benchmarking and regression testing.
    • Bias and fairness audits (compare performance across social attribute groups).
    • Prompt engineering research (compare prompt templates / recipe variants).
    • Building evaluation pipelines that combine semantic and factual checks.

    Credits

    This dataset aggregates content published by the AI Verify Foundation / Project Moonshot. Please follow the original project’s license and attribution requirements when redistributing. See the Moonshot repository for license details. URL: https://aiverifyfoundation.sg/project-moonshot/ GitHub: https://github.com/aiverify-foundation/moonshot
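One of the intended uses named above, comparing performance across social attribute groups, can be sketched in a few lines. This is a generic illustration and does not use the Moonshot toolkit or its file formats; the group labels and results are invented.

```python
from collections import defaultdict


def accuracy_by_group(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute accuracy per group from (group, is_correct) pairs."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for group, is_correct in results:
        totals[group] += 1
        hits[group] += int(bool(is_correct))
    return {group: hits[group] / totals[group] for group in totals}


def max_accuracy_gap(per_group: dict[str, float]) -> float:
    """Largest difference in accuracy between any two groups (0.0 if there are none)."""
    if not per_group:
        return 0.0
    return max(per_group.values()) - min(per_group.values())


if __name__ == "__main__":
    graded = [("group_a", True), ("group_a", False), ("group_b", True), ("group_b", True)]
    per_group = accuracy_by_group(graded)
    print(per_group, max_accuracy_gap(per_group))
```

A large gap between groups is only a starting signal for a fairness audit; it does not by itself explain the cause.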

  13. Bitext-retail-banking-llm-chatbot-training-dataset

    • huggingface.co
    Updated Jul 16, 2024
    + more versions
    Cite
    Bitext (2024). Bitext-retail-banking-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 16, 2024
    Dataset authored and provided by
    Bitext
    License

    https://choosealicense.com/licenses/cdla-sharing-1.0/

    Description

    Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants

      Overview
    

    This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail Banking] sector can be easily achieved using our two-step approach to LLM Fine-Tuning.… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.

  14. Data from: AstroChat

    • kaggle.com
    • huggingface.co
    zip
    Updated Jun 9, 2024
    Cite
    astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
    Explore at:
    Available download formats: zip (1214166 bytes)
    Dataset updated
    Jun 9, 2024
    Authors
    astro_pat
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Purpose and Scope

    The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

    Intended Use

    The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering, and Math).

    Quickstart

    To be completed

    DATASET DESCRIPTION

    Access

    Structure

    901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

    • id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing tasks or merging with other datasets.
    • topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
    • subtopic: a subtopic of the topic. For instance, in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
    • persona: a description of the persona used to simulate a user.
    • opening_question: the first question asked by the user to start a conversation with the AI assistant.
    • messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
      • role: the role of the speaker, either user or assistant.
      • content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
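The topic-based split mentioned for the `topic` field could look roughly like this. The sketch works on plain dictionaries shaped like the fields described above; the sample records and the helper name are made up for illustration and are not taken from the dataset.

```python
def split_by_topic(records: list[dict], held_out_topics: list[str]) -> tuple[list[dict], list[dict]]:
    """Split records into (train, test), sending held-out topics to the test set."""
    held_out = set(held_out_topics)
    train = [r for r in records if r["topic"] not in held_out]
    test = [r for r in records if r["topic"] in held_out]
    return train, test


if __name__ == "__main__":
    records = [  # hypothetical records following the described schema
        {"id": "c1", "topic": "Propulsion", "subtopic": "Electric Propulsion",
         "persona": "student", "opening_question": "How do ion thrusters work?",
         "messages": [{"role": "user", "content": "How do ion thrusters work?"},
                      {"role": "assistant", "content": "They accelerate ions with electric fields."}]},
        {"id": "c2", "topic": "Thermal Control", "subtopic": "Radiators",
         "persona": "engineer", "opening_question": "How are radiators sized?",
         "messages": []},
    ]
    train, test = split_by_topic(records, ["Thermal Control"])
    print(len(train), len(test))
```

Holding out whole topics, rather than random rows, gives a stricter test of whether a fine-tuned model generalizes to unseen subject areas.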

    Important: See the full list of topics and subtopics covered below.

    Metadata

    The dataset is version controlled and the commit history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

    Generation Method

    We used a method inspired by the UltraChat dataset. In particular, we implemented our own version of the Human-Model interaction from "Sector I: Questions about the World" of their paper:

    Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

    Step-by-step description

    • Defined a set of user personas
    • Defined a set of topics / disciplines within the domain of Astronautics / Space Mission Engineering
    • For each topic, we defined a set of subtopics to narrow down the conversation to more specific and niche conversations (see below the full list)
    • For each subtopic, we generated a set of opening questions that the user could ask to start a conversation (see below the full list)
    • We then distilled the knowledge of a strong chat model (in our case ChatGPT, through the API with the gpt-4-turbo model) to generate the answers to the opening questions
    • We simulated follow-up questions from the user to the assistant, and the assistant's answers to these questions, which builds up the messages.
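The last two steps above (answering the opening question, then alternating simulated follow-ups and answers) can be sketched as a small loop. The two callables stand in for model calls and are purely hypothetical; the original work used an API-hosted chat model, which is replaced here by stubs so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


def build_conversation(
    opening_question: str,
    answer_fn: Callable[[list[Message]], str],
    followup_fn: Callable[[list[Message]], str],
    n_followups: int = 1,
) -> list[Message]:
    """Build a messages list by alternating user turns and assistant turns."""
    messages: list[Message] = [{"role": "user", "content": opening_question}]
    messages.append({"role": "assistant", "content": answer_fn(messages)})
    for _ in range(n_followups):
        # The simulated user sees the conversation so far and asks a follow-up.
        messages.append({"role": "user", "content": followup_fn(messages)})
        messages.append({"role": "assistant", "content": answer_fn(messages)})
    return messages


if __name__ == "__main__":
    # Stub functions in place of real model calls (assumption for illustration).
    stub_answer = lambda msgs: f"(answer to: {msgs[-1]['content']})"
    stub_followup = lambda msgs: "Can you give an example?"
    for message in build_conversation("What is specific impulse?", stub_answer, stub_followup):
        print(message["role"], ":", message["content"])
```

Passing the model calls in as functions keeps the loop independent of any particular provider, which also suits the stated goal of distilling from more models later.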

    Future work and contributions appreciated

    • Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc...)
    • Implement more creativity in the opening questions and follow-up questions
    • Filter-out questions and conversations which are too similar
    • Ask topic and subtopic experts to validate the generated conversations, to get a sense of how reliable the overall dataset is

    Languages

    All instances in the dataset are in English.

    Size

    901 synthetically generated dialogues

    USAGE AND GUIDELINES

    License

    AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

    Restrictions

    No restriction. Please provide the correct attribution following the license terms.

    Citation

    Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

    Update Frequency

    Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

    Have feedback or spotted an error?

    Use the ...

  15. deepseek-ai_deepseek-llm-7b-base-details

    • huggingface.co
    Updated Jul 30, 2025
    + more versions
    Cite
    Open LLM Leaderboard (2025). deepseek-ai_deepseek-llm-7b-base-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/deepseek-ai_deepseek-llm-7b-base-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of deepseek-ai/deepseek-llm-7b-base

    Dataset automatically created during the evaluation run of model deepseek-ai/deepseek-llm-7b-base. The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/deepseek-ai_deepseek-llm-7b-base-details.
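
    Since each run is stored as a split named after its timestamp, the newest run can be found by sorting the split names. The sketch below assumes zero-padded, fixed-width timestamps (so alphabetical order matches chronological order); the example split names and the helper latest_run_split are made up for illustration.

```python
from typing import Iterable


def latest_run_split(split_names: Iterable[str], alias: str = "train") -> str:
    """Return the newest timestamp-named split, ignoring the alias split."""
    # Assumption: names are zero-padded timestamps, so string order is time order.
    runs = sorted(name for name in split_names if name != alias)
    if not runs:
        raise ValueError("no timestamped run splits found")
    return runs[-1]


# Hypothetical split names, not taken from the actual dataset.
splits = ["2024_01_05T10_00_00", "train", "2024_03_11T08_30_00"]
newest = latest_run_split(splits)
```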

  16. Article Dataset (Mini)

    • kaggle.com
    zip
    Updated Oct 18, 2024
    Cite
    Sani Kamal (2024). Article Dataset (Mini) [Dataset]. https://www.kaggle.com/datasets/sanikamal/article-50/code
    Explore at:
    zip (3563613 bytes)
    Dataset updated
    Oct 18, 2024
    Authors
    Sani Kamal
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    This dataset contains 50 articles sourced from Medium, focusing on AI-related content. It is designed for business owners, content creators, and AI developers looking to analyze successful articles, improve engagement, and fine-tune AI language models (LLMs). The data can be used to explore what makes articles perform well, including sentiment analysis, follower counts, and headline effectiveness.

    Dataset Contents

    • articles_50.db - Sample database with 50 articles (Free Version)

    The database includes pre-analyzed data such as sentiment scores, follower counts, and headline metadata, helping users gain insights into high-performing content.
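
    The card does not list the tables or columns in articles_50.db, so a first step is to inspect the schema. The sketch below assumes the .db file is a SQLite database (not confirmed by the description); the helper describe_tables and the demo table are hypothetical and only illustrate the approach.

```python
import sqlite3
from typing import Dict, List


def describe_tables(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Map each user table name to its list of column names."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' "
            "AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    schema: Dict[str, List[str]] = {}
    for table in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        quoted = table.replace('"', '""')
        columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
        schema[table] = [col[1] for col in columns]
    return schema


# Demo on an in-memory database with a made-up table (not the real schema).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE articles (title TEXT, sentiment REAL, followers INTEGER)")
demo_schema = describe_tables(demo)
```

    For the downloaded file, the same function could be called on sqlite3.connect("articles_50.db"), provided the file really is SQLite.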

    Use Cases

    • Content Strategy Optimization: Identify trends in successful AI-related articles to enhance your content approach.
    • Headline Crafting: Study patterns in top-performing headlines to create more compelling article titles.
    • LLM Fine-Tuning: Utilize the dataset to fine-tune AI models with real-world data on content performance.
    • Sentiment-Driven Content: Create content that resonates with readers by aligning with sentiment insights.

    This dataset is a valuable tool for anyone aiming to harness the power of data-driven insights to enhance their content or AI models.

  17. h

    experiment-llm_exp-3-q-r-details

    • huggingface.co
    Updated Jul 30, 2025
    Cite
    Open LLM Leaderboard (2025). experiment-llm_exp-3-q-r-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/experiment-llm_exp-3-q-r-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of experiment-llm/exp-3-q-r

    Dataset automatically created during the evaluation run of model experiment-llm/exp-3-q-r. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/experiment-llm_exp-3-q-r-details.

  18. Logical Reasoning Improvement Dataset

    • kaggle.com
    zip
    Updated Nov 30, 2023
    Cite
    The Devastator (2023). Logical Reasoning Improvement Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/logical-reasoning-improvement-dataset
    Explore at:
    zip (9336513 bytes)
    Dataset updated
    Nov 30, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Logical Reasoning Improvement Dataset

    Enhancing LLM Logical Reasoning Skills with Platypus2 Models

    By garage-bAInd (From Huggingface) [source]

    About this dataset

    The garage-bAInd/Open-Platypus dataset is a curated collection of data specifically designed to enhance logical reasoning skills in large language models (LLMs). It serves as a training resource for improving the ability of these models to reason logically and provide accurate solutions or answers to various logical reasoning questions.

    This dataset, which has been utilized in training the Platypus2 models, consists of multiple datasets that have undergone a meticulous filtering process. Through keyword search and the application of Sentence Transformers technique, questions with a similarity score above 80% have been eliminated, ensuring that only unique and diverse logical reasoning questions are included.
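
    The similarity filter described above can be illustrated with a small greedy pass over embedding vectors. This is a sketch under stated assumptions, not the Open-Platypus code: it takes precomputed vectors (in practice these would come from a sentence-embedding model), uses cosine similarity, and treats 0.8 as the cut-off corresponding to "above 80%".

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(
    embeddings: List[Sequence[float]], threshold: float = 0.8
) -> List[int]:
    """Return indices of items to keep, scanning in order (greedy)."""
    kept: List[int] = []
    for i, vec in enumerate(embeddings):
        # Keep an item only if it is not too similar to anything already kept.
        if all(cosine(vec, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D vectors standing in for question embeddings.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
kept_indices = drop_near_duplicates(toy)
```

    The greedy scan compares every candidate with every kept item, which is quadratic in the worst case; for large collections an approximate nearest-neighbour index would be the usual substitute.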

    The columns in this dataset include:

    • input: The input text or question that requires logical reasoning.
    • output: The correct answer or solution to the logical reasoning question.
    • instruction: Additional instructions or guidelines for solving the logical reasoning question.
    • data_source: The source or origin of the logical reasoning question.

    By utilizing this comprehensive and carefully curated dataset, LLMs can be trained more effectively to improve their logical reasoning capabilities.

    How to use the dataset

    How to Use This Dataset: Logical Reasoning Improvement

    Dataset Overview

    Columns

    The dataset is organized into several columns, each serving a specific purpose:

    • input: The input text or question that requires logical reasoning. This column provides the initial statement or problem that needs solving.
    • output: The correct answer or solution to the logical reasoning question. This column contains the expected outcome or response.
    • instruction: Additional instructions or guidelines for solving the logical reasoning question. This column provides any specific guidance or steps required to arrive at the correct answer.
    • data_source: The source or origin of the logical reasoning question. This column specifies where the question was obtained from.
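
    For fine-tuning, the columns above are typically combined into a prompt and a target. The sketch below shows one possible layout; the function format_example, the blank-line separator, and the sample row are assumptions for illustration, not a template prescribed by the dataset.

```python
from typing import Dict


def format_example(row: Dict[str, str]) -> Dict[str, str]:
    """Turn one row into a prompt/completion pair."""
    parts = [row["instruction"].strip()]
    # The input column may be empty for some rows, so add it only when present.
    if row.get("input", "").strip():
        parts.append(row["input"].strip())
    return {"prompt": "\n\n".join(parts), "completion": row["output"].strip()}


# Made-up row using the documented column names.
row = {
    "instruction": "Answer the question.",
    "input": "Is 7 prime?",
    "output": "Yes.",
    "data_source": "made-up example",
}
pair = format_example(row)
```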

    Usage Guidelines

    To make effective use of this dataset, follow these guidelines:

    • Familiarize Yourself: Take time to understand and familiarize yourself with each entry in the dataset.
    • Analyze Inputs: Carefully read and analyze each input text/question provided in the input column.
    • Solve Using Logic: Apply logical thinking and reasoning strategies based on your understanding of each problem.
    • Confirm Answers: Compare your solutions with those provided in the output column to check their accuracy.
    • Follow Instructions: Always consider any additional instructions given in the instruction column while solving a problem.
    • Explore Data Sources: Utilize information from different data sources mentioned in the data_source column if needed.

    Remember, practice makes perfect! Continuously work through the dataset to improve your logical reasoning skills.

    Please note that this guide aims to help you utilize the dataset effectively. It does not provide direct solutions or explanations for specific entries in the dataset.

    Contributing and Feedback

    We believe in continuous improvement! If you have any feedback or would like to contribute additional logical reasoning questions, please feel free to do so. Together, we can enhance this dataset further and promote logical reasoning skills across LLM models.

    Let's get started and embark on a journey of logical reasoning improvement with this curated dataset!

    Research Ideas

    • Training and evaluating logical reasoning models: The dataset can be used to train and evaluate logical reasoning models, such as Platypus2, to enhance their performance in solving a variety of logical reasoning questions.
    • Benchmarking logical reasoning algorithms: Researchers and developers can use this dataset as a benchmark for testing and comparing different logical reasoning algorithms and techniques.
    • Creating educational resources: The dataset can be utilized to create educational resources or platforms that focus on improving logical reasoning skills. It can serve as a valuable source of practice questions for learners looking to enhance their abilities in this area.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors. Data Source

    License

    **Licen...

  19. h

    Replete-AI_Replete-LLM-Qwen2-7b-details

    • huggingface.co
    Updated Jul 30, 2025
    + more versions
    Cite
    Open LLM Leaderboard (2025). Replete-AI_Replete-LLM-Qwen2-7b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/Replete-AI_Replete-LLM-Qwen2-7b-details
    Explore at:
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Open LLM Leaderboard
    Description

    Dataset Card for Evaluation run of Replete-AI/Replete-LLM-Qwen2-7b

    Dataset automatically created during the evaluation run of model Replete-AI/Replete-LLM-Qwen2-7b. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 9 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Replete-AI_Replete-LLM-Qwen2-7b-details.

  20. Human vs. Machine-Generated Short News

    • kaggle.com
    zip
    Updated Jul 29, 2025
    Cite
    Kian Jazayeri (2025). Human vs. Machine-Generated Short News [Dataset]. https://www.kaggle.com/datasets/kianjazayeri/human-vs-machine-generated-short-news
    Explore at:
    zip (3715981 bytes)
    Dataset updated
    Jul 29, 2025
    Authors
    Kian Jazayeri
    License

    http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    This dataset provides a valuable benchmark for researchers working on the identification of machine-generated versus human-written text. It features 3,983 short news summaries written by humans, sourced from the News Summary dataset (Inshorts news scraped from The Hindu, Indian Times, and The Guardian, available at: https://www.kaggle.com/datasets/sunnysai12345/news-summary), alongside machine-generated continuations produced using the LLaMA-7b language model under three distinct decoding configurations.

    The goal is to aid the development of robust detection models, facilitate studies on the stylistic differences between human and LLM-authored text, and support further research in Natural Language Processing (NLP), particularly in the domains of fake news detection, AI explainability, and text authenticity.

    Dataset Composition

    Human-Generated News: Original news summaries collected from the News Summary dataset (Inshorts), licensed under GPL v2.0. These summaries span from February to August 2017. The original dataset is available at: https://www.kaggle.com/datasets/sunnysai12345/news-summary

    Machine-Generated News (Setting 1): Continuations generated using LLaMA-7b with balanced generation parameters: (Temperature: 1.0, Top-K: 50, Top-p: 0.9)

    Machine-Generated News (Setting 2): Continuations generated using LLaMA-7b with high creativity and diversity: (Temperature: 1.5, Top-K: 100, Top-p: 0.95)

    Machine-Generated News (Setting 3): Continuations generated using LLaMA-7b with conservative, more deterministic settings: (Temperature: 0.7, Top-K: 20, Top-p: 0.8)
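
    To see how the three settings shape the candidate pool, the sketch below applies temperature scaling, then top-k, then top-p filtering to a toy set of logits. It is an illustration only: the toy logits are made up, the helper filtered_distribution is hypothetical, and real generation libraries may order or renormalize these filters differently.

```python
import math
from typing import Dict, List

# The three decoding configurations listed above.
SETTINGS = {
    "setting_1": {"temperature": 1.0, "top_k": 50, "top_p": 0.9},
    "setting_2": {"temperature": 1.5, "top_k": 100, "top_p": 0.95},
    "setting_3": {"temperature": 0.7, "top_k": 20, "top_p": 0.8},
}


def filtered_distribution(
    logits: List[float], temperature: float, top_k: int, top_p: float
) -> Dict[int, float]:
    """Return {token_id: probability} after temperature, top-k and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-k: keep the k most likely tokens, then renormalize over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept: List[int] = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
support_sizes = {
    name: len(filtered_distribution(toy_logits, **cfg))
    for name, cfg in SETTINGS.items()
}
```

    On these toy logits the lower-temperature, lower-top-p setting keeps the fewest candidate tokens and the high-temperature setting keeps the most, which is the intuition behind the "conservative" and "creative" labels.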

    Columns

    Human_News: Human-written summary from the original dataset.

    Machine_Generated_Setting_1: LLaMA-7b output using balanced decoding.

    Machine_Generated_Setting_2: LLaMA-7b output using creative decoding.

    Machine_Generated_Setting_3: LLaMA-7b output using conservative decoding.
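
    For detector training, the wide four-column layout can be reshaped into (text, label) pairs. The sketch below assumes rows are available as dictionaries keyed by the column names above; the label convention (0 = human, 1 = machine) and the sample row are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

MACHINE_COLUMNS = [
    "Machine_Generated_Setting_1",
    "Machine_Generated_Setting_2",
    "Machine_Generated_Setting_3",
]


def to_labeled_examples(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Flatten wide rows into (text, label) pairs: 0 = human, 1 = machine."""
    examples: List[Tuple[str, int]] = []
    for row in rows:
        examples.append((row["Human_News"], 0))
        for column in MACHINE_COLUMNS:
            examples.append((row[column], 1))
    return examples


# Made-up row for illustration only.
sample_rows = [
    {
        "Human_News": "human summary",
        "Machine_Generated_Setting_1": "machine text A",
        "Machine_Generated_Setting_2": "machine text B",
        "Machine_Generated_Setting_3": "machine text C",
    }
]
labeled = to_labeled_examples(sample_rows)
```

    Flattening this way yields three machine examples per human example, so class weighting or subsampling may be worth considering before training.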

    Purpose and Use

    This dataset is intended for:

    Training and evaluating classifiers to distinguish between human and machine-generated content

    Studying linguistic patterns and stylistic variation introduced by different LLM sampling configurations

    Analyzing the limits of creativity, coherence, and factuality in large language models

    It is particularly useful for research in fake news detection, LLM evaluation, AI safety, and human-likeness scoring.

    License

    This dataset is licensed under the GNU General Public License v2.0 (GPL-2.0). This means:

    You are free to use, share, and modify this dataset

    You must distribute any derivative datasets under the same GPL v2.0 license

    You must provide proper attribution and include a copy of the license in your distribution

    Attribution and Acknowledgements

    The human-written news summaries are sourced from:

    News Summary dataset by Kondalarao Vonteru, originally scraped from Inshorts, with news from The Hindu, Indian Times, and The Guardian. License: GPL-2.0

    The machine-generated continuations were created using the LLaMA-7b model by Meta, configured with three different sampling settings.

    License File

    Please refer to the included LICENSE.txt file for the full text of the GNU General Public License v2.0.
