100+ datasets found

dataset-preferences-llm-course-full-dataset
huggingface.co
Updated May 31, 2024
Cite
Daniel van Strien (2024). dataset-preferences-llm-course-full-dataset [Dataset]. https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 31, 2024
Authors
Daniel van Strien
Description
Dataset Card for dataset-preferences-llm-course-full-dataset

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset.
Training_Data_FineTuning_LLM_PEFT_LORA
kaggle.com
zip
Updated Aug 8, 2024
Cite
Rupak Roy/ Bob (2024). Training_Data_FineTuning_LLM_PEFT_LORA [Dataset]. https://www.kaggle.com/datasets/rupakroy/training-dataset-peft-lora
Explore at:
Available download formats: zip (29562174 bytes)
Dataset updated
Aug 8, 2024
Authors
Rupak Roy/ Bob
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
The dataset contains conversation summaries, topics, and dialogues used to build a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA) of Large Language Models, a popular and lightweight training technique that significantly reduces the number of trainable parameters.

The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum
LLM Prompt Recovery - Synthetic Datastore
kaggle.com
zip
Updated Feb 29, 2024
Cite
Darien Schettler (2024). LLM Prompt Recovery - Synthetic Datastore [Dataset]. https://www.kaggle.com/datasets/dschettler8845/llm-prompt-recovery-synthetic-datastore
Explore at:
Available download formats: zip (988448 bytes)
Dataset updated
Feb 29, 2024
Authors
Darien Schettler
License
https://www.licenses.ai/ai-licenses
Description
High Level Description

This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.

Contributors

Please go upvote these other datasets as my work is not possible without them

thedrcat's dataset - LLM Prompt Recovery Data

TBD

First Dataset - 1000 Examples From @thedrcat

Update 1 - February 29, 2024

The only file presently found in this dataset is gemma1000_7b.csv which uses the dataset created by @thedrcat found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv

The file below is the file Darek created with two additional columns appended. The first is the output of Gemma 7B-IT (raw, based on the instructions below; Darek used 2B-IT) and the second is the output with the 'Sure... blah blah' sentence removed.
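The second appended column could be produced by a small post-processing step along the following lines. This is only a sketch: the helper name `strip_sure_preamble` is hypothetical, and it assumes the preamble is the first line of the response and begins with "Sure"; the author's actual cleaning code is not shown in this listing.

```python
import re

# Hypothetical helper, not the author's code: drop a leading "Sure, ..." line.
# Assumption: the preamble is the first line and starts with the word "Sure".
_PREAMBLE = re.compile(r"^\s*Sure\b[^\n]*\n+", flags=re.IGNORECASE)


def strip_sure_preamble(text: str) -> str:
    """Return text without a leading 'Sure...' line; other text is unchanged."""
    return _PREAMBLE.sub("", text, count=1)


raw = "Sure, here is the rewritten text:\n\nThe quick brown fox."
cleaned = strip_sure_preamble(raw)
```

Responses that do not open with such a line pass through untouched, so the step is safe to apply to every row.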

I generated things using the following setup:

# I used a vLLM server to host Gemma 7B on paperspace (A100)
# Step 1 - Install vLLM
>>> pip install vllm
# Step 2 - Authenticate HuggingFace CLI (for model weights)
>>> huggingface-cli login --token
resume-screening-llm-training-dataset
kaggle.com
zip
Updated Sep 11, 2025
Cite
Syncora_ai (2025). resume-screening-llm-training-dataset [Dataset]. https://www.kaggle.com/datasets/syncoraai/resume-screening-llm-training-dataset
Explore at:
Available download formats: zip (60353 bytes)
Dataset updated
Sep 11, 2025
Authors
Syncora_ai
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Resume Screening & HR Conversations Dataset for LLM Training

Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.

It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.

By Syncora.ai, enabling privacy-safe, high-quality synthetic data for smarter AI.

✅ Why This Dataset?

Accelerate HR AI development – No need to source or anonymize real resumes.

Ready for LLM Training – Structured in OpenAI-compatible JSONL format.

Privacy-Safe & Scalable – 100% synthetic, zero PII.

📂 Dataset Description

This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.

Format: JSONL (OpenAI fine-tuning compatible)

Schema:

messages: An array of chat messages with roles (system, user, assistant) and their respective content.

Example:

{ "messages": [ {"role": "system", "content": "You are an informative assistant."}, {"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"}, {"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."} ] }
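Before training on a file in this shape, it can help to check each line against the schema described above. The following is a minimal sketch using only the standard library; the function name `validate_record` and the exact set of checks are assumptions for illustration, not part of the dataset or of any fine-tuning API.

```python
import json

# Roles named in the schema description above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            problems.append(f"message {i} has non-string content")
    return problems


example = (
    '{"messages": ['
    '{"role": "system", "content": "You are an informative assistant."}, '
    '{"role": "user", "content": "What is the current job title?"}, '
    '{"role": "assistant", "content": "Associate Data Scientist."}]}'
)
issues = validate_record(example)
```

Running `validate_record` over every line of the file and collecting the non-empty results gives a quick report of malformed rows.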

🔍 What's Inside

Resume Screening Q&A

Job titles, experience, role responsibilities.

Career Guidance Questions

Tech trends, upskilling, and career planning.

👥 Who Should Use This Dataset?

Recruitment Tech Startups – Build automated candidate screeners.

HR Teams – Deploy internal career chatbots.

AI Engineers & Researchers – Fine-tune LLMs for HR applications.

EdTech Platforms – Power career advisory assistants.

🚀 How To Use

Fine-tune with OpenAI:

openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"

✅ Why Synthetic?

No compliance headaches (GDPR, PII safe)

Faster experimentation for LLM applications

High-fidelity HR scenarios for realistic model behavior

🔗 Start Generating Your Own Synthetic Data

→ Generate your own synthetic datasets now
clinical-synthetic-text-llm
huggingface.co
Updated Jul 5, 2024
Cite
Ran Xu (2024). clinical-synthetic-text-llm [Dataset]. https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 5, 2024
Authors
Ran Xu
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Data Description

We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.

Generated Datasets

The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.
HarmfulQA
huggingface.co
Updated Aug 20, 2023
Cite
Deep Cognition and Language Research (DeCLaRe) Lab (2023). HarmfulQA [Dataset]. https://huggingface.co/datasets/declare-lab/HarmfulQA
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 20, 2023
Dataset authored and provided by
Deep Cognition and Language Research (DeCLaRe) Lab
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Paper | Github | Dataset| Model 📣📣📣: Do check our new multilingual dataset CatQA here used in Safety Vectors:📣📣📣

As part of our research efforts toward making LLMs safer for public use, we created HarmfulQA, a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. More details are in our paper Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. HarmfulQA serves as both a new LLM safety benchmark and an alignment dataset… See the full description on the dataset page: https://huggingface.co/datasets/declare-lab/HarmfulQA.
deepseek-llm-7b-base
kaggle.com
zip
Updated Jan 30, 2025
Cite
Younus_Mohamed (2025). deepseek-llm-7b-base [Dataset]. https://www.kaggle.com/datasets/younusmohamed/deepseek-llm-7b-base
Explore at:
Available download formats: zip (21941318019 bytes)
Dataset updated
Jan 30, 2025
Authors
Younus_Mohamed
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
DeepSeek Model Dataset

Overview

This dataset contains the DeepSeek model, a [brief description of the model, e.g., "state-of-the-art language model for natural language processing tasks"]. The model is designed for [specific use cases, e.g., "text generation, sentiment analysis, etc."].

Contents

model_weights/: Directory containing the model weights.

config.json: Configuration file for the model.

inference_example.ipynb: Jupyter Notebook demonstrating how to load and use the model.

requirements.txt: List of Python dependencies.

Usage

Download the dataset from Kaggle.

Install the required dependencies using pip install -r requirements.txt.

Open the inference_example.ipynb notebook to see how to load the model and perform inference.

License

This dataset is licensed under [license name, e.g., "MIT License"]. See the LICENSE file for more details.

Acknowledgments

Deepseek

Contact

For questions or issues, please contact
Customer Service Call Dataset [Multisector] – Annotated support transcripts...
datarade.ai
Updated Apr 11, 2025
Cite
WiserBrand.com (2025). Customer Service Call Dataset [Multisector] – Annotated support transcripts for training AI and improving CX [Dataset]. https://datarade.ai/data-products/customer-service-call-dataset-multisector-annotated-suppo-wiserbrand-com
Explore at:
Available download formats: .json, .csv, .xls, .txt
Dataset updated
Apr 11, 2025
Dataset provided by
WiserBrand
Area covered
United States of America
Description
"This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.

Included in each record:

Full call transcription with labeled speakers (system, agent, customer)

Concise human-written summary of the conversation

Sentiment tag for the overall interaction: positive, neutral, or negative

Company name, duration, and geographic location of the caller

Call context includes industries such as eCommerce, banking, telecom, and streaming services

Common use cases:

Train NLP models to understand support calls and detect churn risk

Power complaint detection engines for customer success and support teams

Create high-quality LLM training sets with real support narratives

Build summarization and topic tagging pipelines for CX dashboards

Analyze tone shifts and resolution language in customer-agent interaction

This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."

The more you purchase, the lower the price will be.
Main Data and Code
figshare.com
zip
Updated Oct 5, 2025
Cite
Momo (2025). Main Data and Code [Dataset]. http://doi.org/10.6084/m9.figshare.29929412.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.29929412.v1
Dataset updated
Oct 5, 2025
Dataset provided by
figshare
Authors
Momo
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Important Notice: Ethical Use Only

This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.

Prohibited Use

Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.

Disclaimer

The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.

Project Structure

The project is organized into three main directories, each corresponding to a major section of the paper's experiments:

main_data_and_code/
├── rumor_generation/
├── rumor_detection/
└── rumor_debunking/

How to Get Started

Prerequisites

To successfully run the code and reproduce the results, you will need to:

Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.

For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The pre-process scripts in the rumor detection folder must be run first to prepare the public datasets.

Please note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets like Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.

Detailed Directory Breakdown

1. rumor_generation/

This directory contains all the code and data related to the rumor generation experiments.

rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
token_distribution.py: Script to analyze token distribution in the generated text.
label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
extract_reasons.py: Script to extract reasons for rumor generation and rejection.
visualization.py: Utility script for generating figures.
LDA.py: Code for performing LDA topic modeling on the generated data.
rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
generation_reasons_extracted.json: The extracted reasons for generated rumors.
rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.

2. rumor_detection/

This directory contains the code and data used for the rumor detection experiments.

nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. Similar experiment scripts below follow the same principle and are not described repeatedly.
nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
reasoning_zeroshot_twitter15.py: Code for the Reasoning LLMs, zero-shot detection on the Twitter15 dataset.
reasoning_fewshot_twitter15.py: Code for the Reasoning LLMs, few-shot detection on the Twitter15 dataset.
reasoning_cot_twitter15.py: Code for the Reasoning LLMs, CoT detection on the Twitter15 dataset.
traditional_model.py: Code for the traditional models used as baselines.
preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
visualization.py: Utility script for generating figures.

3. rumor_debunking/

This directory contains all the code and data for the rumor debunking experiments.

analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
calculate_readability.py: Script for calculating the readability score of the debunking texts.
plot_readability.py: Utility script for generating figures related to readability.
fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
debunking_results.json: The dataset containing the debunking results for this experimental section.
debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
sentiment_analysis/: This directory contains the result file from the sentiment analysis.
debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.

Please contact the repository owner if you encounter any problems or have questions about the code or data.
fine-tuning-llm
huggingface.co
Updated Oct 31, 2024
Cite
H (2024). fine-tuning-llm [Dataset]. https://huggingface.co/datasets/KarinaH/fine-tuning-llm
Explore at:
Dataset updated
Oct 31, 2024
Authors
H
Description
Dataset Card for fine-tuning-llm

This dataset has been created with distilabel.

Dataset Summary

This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/KarinaH/fine-tuning-llm/raw/main/pipeline.yaml"

or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/KarinaH/fine-tuning-llm.
Hermes-3-Llama-3.1-8B-details
huggingface.co
Updated Jul 30, 2025
Cite
Open LLM Leaderboard (2025). Hermes-3-Llama-3.1-8B-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/Hermes-3-Llama-3.1-8B-details
Explore at:
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Open LLM Leaderboard
Description
Dataset Card for Evaluation run of NousResearch/Hermes-3-Llama-3.1-8B

Dataset automatically created during the evaluation run of model NousResearch/Hermes-3-Llama-3.1-8B. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Hermes-3-Llama-3.1-8B-details.
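Because each run is stored as a split named after its timestamp, the most recent run can be picked by sorting the split names. The sketch below is illustrative only: the split names are made up, and it assumes the timestamps are fixed-width and zero-padded so that string order matches chronological order.

```python
def latest_run_split(split_names):
    """Pick the most recent timestamp-named split, ignoring the 'train' alias."""
    runs = [name for name in split_names if name != "train"]
    if not runs:
        raise ValueError("no timestamped splits found")
    # Assumption: zero-padded timestamps, so lexicographic max is the newest run.
    return max(runs)


# Hypothetical split names for illustration; real names come from the dataset.
splits = ["2025_07_01T10_00_00", "2025_07_30T08_15_42", "train"]
latest = latest_run_split(splits)
```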
AI Safety Verification Dataset
kaggle.com
zip
Updated Aug 25, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Priyam Saha (2025). AI Safety Verification Dataset [Dataset]. https://www.kaggle.com/datasets/priyamsaha17/ai-safety-verification-dataset
Explore at:
zip(147706374 bytes)Available download formats
Dataset updated
Aug 25, 2025
Authors
Priyam Saha
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview An aggregated, cleaned, and unified dataset assembled from the AI Verify Foundation’s Project Moonshot resources on Kaggle. It combines: (a) prompt templates and prompt-engineering cookbooks, (b) pre-built recipes used to configure benchmark runs (input/target pairs, evaluation metric, grading scales), and (c) metric definitions/outputs for automated evaluation. The material is intended to support reproducible LLM benchmarking, bias/fairness analysis, and prompt-engineering experiments.

Project Moonshot Project Moonshot is an open-source LLM evaluation toolkit produced by the AI Verify Foundation; it brings benchmarking and red-teaming workflows together and publishes prompt templates, recipes and metrics on GitHub and the Moonshot docs site. Link - https://aiverifyfoundation.sg/project-moonshot/

Recipe Recipes (in Moonshot) are pre-built benchmark configurations: JSON files that define the dataset (input / target pairs), the prompt template to use, the evaluation metric, and any grading thresholds — enabling reproducible, repeatable test runs. The Moonshot project publishes many such recipes for different evaluation categories (e.g., prompt injection, cybersecurity).

Cookbook Cookbook (in ML/prompting context) is a curated collection of patterns, examples and “how-to” snippets for solving common tasks with LLMs (templates, best practices, and worked examples). Think of a cookbook as a higher-level collection that organizes recipes and templates for reuse

Intended uses - Reproducible LLM benchmarking and regression testing. - Bias and fairness audits (compare performance across social attribute groups). - Prompt engineering research (compare prompt templates / recipe variants). - Building evaluation pipelines that combine semantic and factual checks.

Credits: This dataset aggregates content published by the AI Verify Foundation / Project Moonshot. Please follow the original project’s license and attribution requirements when redistributing. See the Moonshot repository for license details. URL: https://aiverifyfoundation.sg/project-moonshot/ GitHub: https://github.com/aiverify-foundation/moonshot
Bitext-retail-banking-llm-chatbot-training-dataset
huggingface.co
Updated Jul 16, 2024
Cite
Bitext (2024). Bitext-retail-banking-llm-chatbot-training-dataset [Dataset]. https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 16, 2024
Dataset authored and provided by
Bitext
License
https://choosealicense.com/licenses/cdla-sharing-1.0/
Description
Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants

Overview

This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail Banking] sector can be easily achieved using our two-step approach to LLM Fine-Tuning.… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.
Data from: AstroChat
kaggle.com
huggingface.co
zip
Updated Jun 9, 2024
Cite
astro_pat (2024). AstroChat [Dataset]. https://www.kaggle.com/datasets/patrickfleith/astrochat
Explore at:
Available download formats: zip (1214166 bytes)
Dataset updated
Jun 9, 2024
Authors
astro_pat
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Purpose and Scope

The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.

Intended Use

The dataset is intended to be used for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the area of STEM (Science, Technology, Engineering and Math).

Quickstart

To be completed

DATASET DESCRIPTION

Access

Manual download from the Hugging Face Hub: https://huggingface.co/datasets/patrickfleith/AstroChat

Or with Python:

from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")

Structure

901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):

- id: a unique identifier for this specific conversation. Useful for traceability, especially for further processing or merging with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance, in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user.
- opening_question: the first question asked by the user to start a conversation with the AI assistant.
- messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library. A list of messages where each message is a dictionary with the following fields:
  - role: the role of the speaker, either user or assistant.
  - content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
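To make the record layout concrete, here is a small sketch of one instance and two typical checks. The field values are invented for illustration, and the helper names `roles_alternate` and `count_by_topic` are hypothetical. That turns strictly alternate between user and assistant is an assumption suggested by the description, not something the dataset card guarantees.

```python
from collections import Counter

# Hypothetical record following the fields described above; values are made up.
record = {
    "id": "conv-0001",
    "topic": "Propulsion",
    "subtopic": "Electric Propulsion",
    "persona": "A curious aerospace engineering student",
    "opening_question": "How does an ion thruster produce thrust?",
    "messages": [
        {"role": "user", "content": "How does an ion thruster produce thrust?"},
        {"role": "assistant", "content": "It accelerates ions with an electric field."},
    ],
}


def roles_alternate(messages):
    """True if the conversation starts with the user and roles strictly alternate."""
    expected = "user"
    for msg in messages:
        if msg["role"] != expected:
            return False
        expected = "assistant" if expected == "user" else "user"
    return True


def count_by_topic(records):
    """Count conversations per topic, e.g. to plan a topic-based split."""
    return Counter(r["topic"] for r in records)


ok = roles_alternate(record["messages"])
topic_counts = count_by_topic([record])
```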

Important See the full list of topics and subtopics covered below.

Metadata

Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main

Generation Method

We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:

Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.

Step-by-step description

Defined a set of user personas

Defined a set of topics/disciplines within the domain of Astronautics / Space Mission Engineering

For each topic, we defined a set of subtopics to narrow down the conversation to more specific and niche conversations (see below the full list)

For each subtopic, we generated a set of opening questions that the user could ask to start a conversation (see below the full list)

We then distilled the knowledge of a strong chat model (in our case ChatGPT through the API with the gpt-4-turbo model) to generate the answers to the opening questions

We simulated follow-up questions from the user to the assistant, and the assistant's answers to these questions, which builds up the messages.
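The last two steps amount to a loop that alternates assistant answers and simulated user follow-ups. The sketch below shows that loop shape only. The function `build_conversation`, its parameters, and the stand-in callables are all hypothetical; a real run would replace the stand-ins with calls to a chat model, which are not reproduced here.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_conversation(
    opening_question: str,
    answer: Callable[[List[Message]], str],
    follow_up: Callable[[List[Message]], str],
    turns: int = 2,
) -> List[Message]:
    """Alternate assistant answers and simulated user follow-ups."""
    messages: List[Message] = [{"role": "user", "content": opening_question}]
    for turn in range(turns):
        messages.append({"role": "assistant", "content": answer(messages)})
        # No follow-up after the final answer, so the assistant speaks last.
        if turn < turns - 1:
            messages.append({"role": "user", "content": follow_up(messages)})
    return messages


# Stand-in callables so the sketch runs without any API access.
def fake_answer(history: List[Message]) -> str:
    return f"answer {len(history)}"


def fake_follow_up(history: List[Message]) -> str:
    return f"follow-up {len(history)}"


conversation = build_conversation("What is delta-v?", fake_answer, fake_follow_up)
```

Passing the model calls in as callables keeps the loop testable offline and makes it easy to swap in a different model later.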

Future work and contributions appreciated

Distil knowledge from more models (Anthropic, Mixtral, GPT-4o, etc...)

Implement more creativity in the opening questions and follow-up questions

Filter-out questions and conversations which are too similar

Ask topic and subtopic experts to validate the generated conversations to get a sense of how reliable the overall dataset is

Languages

All instances in the dataset are in English

Size

901 synthetically generated dialogues

USAGE AND GUIDELINES

License

AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International

Restrictions

No restriction. Please provide the correct attribution following the license terms.

Citation

Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579

Update Frequency

Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)

Have feedback or spotted an error?

Use the ...
deepseek-ai_deepseek-llm-7b-base-details
huggingface.co
Updated Jul 30, 2025
Cite
Open LLM Leaderboard (2025). deepseek-ai_deepseek-llm-7b-base-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/deepseek-ai_deepseek-llm-7b-base-details
Explore at:
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Open LLM Leaderboard
Description
Dataset Card for Evaluation run of deepseek-ai/deepseek-llm-7b-base

Dataset automatically created during the evaluation run of model deepseek-ai/deepseek-llm-7b-base. The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/deepseek-ai_deepseek-llm-7b-base-details.
Article Dataset (Mini)
kaggle.com
zip
Updated Oct 18, 2024
Cite
Sani Kamal (2024). Article Dataset (Mini) [Dataset]. https://www.kaggle.com/datasets/sanikamal/article-50/code
Explore at:
Available download formats: zip (3563613 bytes)
Dataset updated
Oct 18, 2024
Authors
Sani Kamal
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Overview

This dataset contains 50 articles sourced from Medium, focusing on AI-related content. It is designed for business owners, content creators, and AI developers looking to analyze successful articles, improve engagement, and fine-tune AI language models (LLMs). The data can be used to explore what makes articles perform well, including sentiment analysis, follower counts, and headline effectiveness.

Dataset Contents

articles_50.db - Sample database with 50 articles (Free Version)

The database includes pre-analyzed data such as sentiment scores, follower counts, and headline metadata, helping users gain insights into high-performing content.
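The listing does not describe the database schema, so a reasonable first step is to list the tables it contains. The sketch below assumes `articles_50.db` is a SQLite file (suggested by the `.db` extension but not confirmed here); the helper name `list_tables` is hypothetical.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("articles_50.db")` after downloading the file would show which tables are available to query.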

Use Cases

Content Strategy Optimization: Identify trends in successful AI-related articles to enhance your content approach.

Headline Crafting: Study patterns in top-performing headlines to create more compelling article titles.

LLM Fine-Tuning: Utilize the dataset to fine-tune AI models with real-world data on content performance.

Sentiment-Driven Content: Create content that resonates with readers by aligning with sentiment insights.

This dataset is a valuable tool for anyone aiming to harness the power of data-driven insights to enhance their content or AI models.
experiment-llm_exp-3-q-r-details
huggingface.co
Updated Jul 30, 2025
Cite
Open LLM Leaderboard (2025). experiment-llm_exp-3-q-r-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/experiment-llm_exp-3-q-r-details
Explore at:
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Open LLM Leaderboard
Description
Dataset Card for Evaluation run of experiment-llm/exp-3-q-r

Dataset automatically created during the evaluation run of model experiment-llm/exp-3-q-r. The dataset is composed of 38 configurations, each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/experiment-llm_exp-3-q-r-details.
Logical Reasoning Improvement Dataset
kaggle.com
zip
Updated Nov 30, 2023
Cite
The Devastator (2023). Logical Reasoning Improvement Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/logical-reasoning-improvement-dataset
Explore at:
zip (9336513 bytes)
Dataset updated
Nov 30, 2023
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Logical Reasoning Improvement Dataset

Enhancing LLM Logical Reasoning Skills with Platypus2 Models

By garage-bAInd (From Huggingface) [source]

About this dataset

The garage-bAInd/Open-Platypus dataset is a curated collection of data specifically designed to enhance logical reasoning skills in large language models (LLMs). It serves as a training resource for improving the ability of these models to reason logically and provide accurate solutions or answers to various logical reasoning questions.

This dataset, which has been utilized in training the Platypus2 models, consists of multiple datasets that have undergone a meticulous filtering process. Through keyword search and the application of Sentence Transformers technique, questions with a similarity score above 80% have been eliminated, ensuring that only unique and diverse logical reasoning questions are included.
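The similarity filter described above can be sketched as a greedy pass over question embeddings. This is an illustration under assumptions, not the authors' pipeline: the toy vectors stand in for Sentence Transformers embeddings, cosine similarity is assumed as the score, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_near_duplicates(embeddings, threshold=0.8):
    """Keep an item only if it is not too similar to any item already kept.

    Returns the indices of the kept items. The 0.8 default mirrors the
    80% similarity cut-off mentioned in the description.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy 2-D "embeddings": the second vector nearly duplicates the first.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
kept_indices = filter_near_duplicates(toy_embeddings)
print(kept_indices)
```

The greedy pass compares each candidate against every kept item, which is quadratic in the worst case; a larger corpus would likely need an approximate nearest-neighbour index instead.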

The columns in this dataset include:

- input: The input text or question that requires logical reasoning.
- output: The correct answer or solution to the logical reasoning question.
- instruction: Additional instructions or guidelines for solving the logical reasoning question.
- data_source: The source or origin of the logical reasoning question.

By utilizing this comprehensive and carefully curated dataset, LLMs can be trained more effectively to improve their logical reasoning capabilities.

How to use the dataset

How to Use This Dataset: Logical Reasoning Improvement

Dataset Overview

Columns

The dataset is organized into several columns, each serving a specific purpose:

input: The input text or question that requires logical reasoning. This column provides the initial statement or problem that needs solving.

output: The correct answer or solution to the logical reasoning question. This column contains the expected outcome or response.

instruction: Additional instructions or guidelines for solving the logical reasoning question. This column provides any specific guidance or steps required to arrive at the correct answer.

data_source: The source or origin of the logical reasoning question. This column specifies where the question was obtained from.
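For fine-tuning, the four columns are typically combined into a prompt and a target. The sketch below shows one possible layout; the prompt template, the helper name `to_prompt`, and the sample record are all hypothetical and are not taken from the dataset or from the Platypus2 training setup.

```python
def to_prompt(record):
    """Join the instruction and input columns into a single prompt string.

    Empty fields are skipped so the prompt never contains a blank section.
    """
    parts = []
    instruction = record.get("instruction", "").strip()
    context = record.get("input", "").strip()
    if instruction:
        parts.append(f"Instruction: {instruction}")
    if context:
        parts.append(f"Input: {context}")
    parts.append("Response:")
    return "\n".join(parts)


# Hypothetical record using the four documented column names.
sample = {
    "instruction": "Decide whether the conclusion follows.",
    "input": "All cats are mammals. Tom is a cat. Conclusion: Tom is a mammal.",
    "output": "Yes",
    "data_source": "example",
}

prompt = to_prompt(sample)
target = sample["output"]
print(prompt)
```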

Usage Guidelines

To make effective use of this dataset, follow these guidelines:

Familiarize Yourself: Take time to understand and familiarize yourself with each entry in the dataset.

Analyze Inputs: Carefully read and analyze each input text/question provided in the input column.

Solve Using Logic: Apply logical thinking and reasoning strategies based on your understanding of each problem.

Confirm Answers: Compare your solutions with those provided in the output column to check their accuracy.

Follow Instructions: Always consider any additional instructions given in the instruction column while solving a problem.

Explore Data Sources: Utilize information from different data sources mentioned in the data_source column if needed.

Remember, practice makes perfect! Continuously work through the dataset to improve your logical reasoning skills.

Please note that this guide aims to help you utilize the dataset effectively. It does not provide direct solutions or explanations for specific entries in the dataset.

Contributing and Feedback

We believe in continuous improvement! If you have any feedback or would like to contribute additional logical reasoning questions, please feel free to do so. Together, we can enhance this dataset further and promote logical reasoning skills across LLM models.

Let's get started and embark on a journey of logical reasoning improvement with this curated dataset!

Research Ideas

Training and evaluating logical reasoning models: The dataset can be used to train and evaluate logical reasoning models, such as Platypus2, to enhance their performance in solving a variety of logical reasoning questions.

Benchmarking logical reasoning algorithms: Researchers and developers can use this dataset as a benchmark for testing and comparing different logical reasoning algorithms and techniques.

Creating educational resources: The dataset can be utilized to create educational resources or platforms that focus on improving logical reasoning skills. It can serve as a valuable source of practice questions for learners looking to enhance their abilities in this area.

Acknowledgements

If you use this dataset in your research, please credit the original authors. Data Source

License

**Licen...
Replete-AI_Replete-LLM-Qwen2-7b-details
huggingface.co
Updated Jul 30, 2025
Cite
Open LLM Leaderboard (2025). Replete-AI_Replete-LLM-Qwen2-7b-details [Dataset]. https://huggingface.co/datasets/open-llm-leaderboard/Replete-AI_Replete-LLM-Qwen2-7b-details
Explore at:
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Open LLM Leaderboard
Description
Dataset Card for Evaluation run of Replete-AI/Replete-LLM-Qwen2-7b

Dataset automatically created during the evaluation run of model Replete-AI/Replete-LLM-Qwen2-7b. The dataset is composed of 38 configurations, each one corresponding to one of the evaluated tasks. The dataset has been created from 9 runs. Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Replete-AI_Replete-LLM-Qwen2-7b-details.
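With several runs stored as timestamp-named splits, the most recent run can be picked by sorting the split names. The sketch below assumes the names are zero-padded timestamps, so that string order matches time order; the example split names are hypothetical and only illustrate the idea.

```python
def latest_run_split(split_names):
    """Return the most recent timestamp-named split, ignoring 'train'.

    Assumption: names are zero-padded timestamps, so the lexicographic
    maximum is also the latest run.
    """
    timestamps = [name for name in split_names if name != "train"]
    return max(timestamps) if timestamps else None


# Hypothetical split names for a configuration with three runs.
splits = [
    "2024_06_01T08_00_00.000000",
    "2024_09_15T12_30_00.000000",
    "2024_07_20T23_59_59.000000",
    "train",
]
newest = latest_run_split(splits)
print(newest)
```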
Human vs. Machine-Generated Short News
kaggle.com
zip
Updated Jul 29, 2025
Cite
Kian Jazayeri (2025). Human vs. Machine-Generated Short News [Dataset]. https://www.kaggle.com/datasets/kianjazayeri/human-vs-machine-generated-short-news
Explore at:
zip (3715981 bytes)
Dataset updated
Jul 29, 2025
Authors
Kian Jazayeri
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
This dataset provides a valuable benchmark for researchers working on the identification of machine-generated versus human-written text. It features 3,983 short news summaries written by humans, sourced from the News Summary dataset (Inshorts news scraped from The Hindu, Indian Times, and The Guardian, available at: https://www.kaggle.com/datasets/sunnysai12345/news-summary), alongside machine-generated continuations produced using the LLaMA-7b language model under three distinct decoding configurations.

The goal is to aid the development of robust detection models, facilitate studies on the stylistic differences between human and LLM-authored text, and support further research in Natural Language Processing (NLP), particularly in the domains of fake news detection, AI explainability, and text authenticity.

Dataset Composition

Human-Generated News: Original news summaries collected from the News Summary dataset (Inshorts), licensed under GPL v2.0. These summaries span from February to August 2017. The original dataset is available at: https://www.kaggle.com/datasets/sunnysai12345/news-summary

Machine-Generated News (Setting 1): Continuations generated using LLaMA-7b with balanced generation parameters: (Temperature: 1.0, Top-K: 50, Top-p: 0.9)

Machine-Generated News (Setting 2): Continuations generated using LLaMA-7b with high creativity and diversity: (Temperature: 1.5, Top-K: 100, Top-p: 0.95)

Machine-Generated News (Setting 3): Continuations generated using LLaMA-7b with conservative and deterministic settings: (Temperature: 0.7, Top-K: 20, Top-p: 0.8)
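To make the three settings concrete, the sketch below applies temperature scaling, then top-k, then top-p filtering to a toy logit vector and reports which tokens remain eligible for sampling. The order of the filters, the function name `candidate_tokens`, and the toy logits are assumptions for illustration; the card does not say which generation library or filter order was used.

```python
import math

# Decoding parameters as listed for the three settings.
SETTINGS = {
    "Setting 1": {"temperature": 1.0, "top_k": 50, "top_p": 0.9},
    "Setting 2": {"temperature": 1.5, "top_k": 100, "top_p": 0.95},
    "Setting 3": {"temperature": 0.7, "top_k": 20, "top_p": 0.8},
}


def candidate_tokens(logits, temperature, top_k, top_p):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens and renormalise.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical four-token vocabulary
sizes = {name: len(candidate_tokens(toy_logits, **cfg)) for name, cfg in SETTINGS.items()}
print(sizes)
```

On these toy logits the conservative setting leaves fewer candidate tokens than the high-temperature setting, which is the qualitative difference the three configurations are meant to produce.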

Columns

Human_News: Human-written summary from the original dataset.

Machine_Generated_Setting_1: LLaMA-7b output using balanced decoding.

Machine_Generated_Setting_2: LLaMA-7b output using creative decoding.

Machine_Generated_Setting_3: LLaMA-7b output using conservative decoding.
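Because each row pairs one human summary with three machine continuations, the table has to be reshaped before it can train a binary detector. The sketch below is one way to do that; the label convention (0 for human, 1 for machine), the helper name, and the sample row are hypothetical.

```python
MACHINE_COLUMNS = [
    "Machine_Generated_Setting_1",
    "Machine_Generated_Setting_2",
    "Machine_Generated_Setting_3",
]


def to_labeled_examples(rows):
    """Flatten wide rows into (text, label) pairs: 0 = human, 1 = machine."""
    examples = []
    for row in rows:
        examples.append((row["Human_News"], 0))
        for column in MACHINE_COLUMNS:
            text = row.get(column)
            if text:  # skip missing or empty generations
                examples.append((text, 1))
    return examples


# Hypothetical row using the documented column names.
rows = [
    {
        "Human_News": "A human-written summary.",
        "Machine_Generated_Setting_1": "A balanced continuation.",
        "Machine_Generated_Setting_2": "A creative continuation.",
        "Machine_Generated_Setting_3": "A conservative continuation.",
    }
]
examples = to_labeled_examples(rows)
print(len(examples))
```

Flattening this way yields three machine examples per human one, so a classifier trained on the result may need class weighting or subsampling.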

Purpose and Use

This dataset is intended for:

Training and evaluating classifiers to distinguish between human and machine-generated content

Studying linguistic patterns and stylistic variation introduced by different LLM sampling configurations

Analyzing the limits of creativity, coherence, and factuality in large language models

It is particularly useful for research in fake news detection, LLM evaluation, AI safety, and human-likeness scoring.

License

This dataset is licensed under the GNU General Public License v2.0 (GPL-2.0). This means:

You are free to use, share, and modify this dataset

You must distribute any derivative datasets under the same GPL v2.0 license

You must provide proper attribution and include a copy of the license in your distribution

Attribution and Acknowledgements

The human-written news summaries are sourced from:

News Summary dataset by Kondalarao Vonteru, originally scraped from Inshorts, with news from The Hindu, Indian Times, and The Guardian. License: GPL-2.0

The machine-generated continuations were created using the LLaMA-7b model by Meta, configured with three different sampling settings.

License File

Please refer to the included LICENSE.txt file for the full text of the GNU General Public License v2.0.
