Dataset Card for dataset-preferences-llm-course-full-dataset
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline… See the full description on the dataset page: https://huggingface.co/datasets/davanstrien/dataset-preferences-llm-course-full-dataset.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The dataset contains conversation summaries, topics, and dialogues used in a pipeline for fine-tuning an LLM with Parameter-Efficient Fine-Tuning (PEFT) and Low-Rank Adaptation (LoRA), a popular and lightweight training technique that significantly reduces the number of trainable parameters.
The dataset is also available on Hugging Face: https://huggingface.co/datasets/knkarthick/dialogsum
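The card describes PEFT with LoRA; below is a minimal sketch of that setup with the peft library, where the base model (flan-t5-base) and hyperparameters are illustrative assumptions rather than the card's choices:

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

dataset = load_dataset("knkarthick/dialogsum")  # the DialogSum dataset linked above

model_name = "google/flan-t5-base"  # assumed base model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# LoRA injects small trainable low-rank adapters instead of updating all weights
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor applied to the adapter output
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows the sharp drop in trainable parameters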
https://www.licenses.ai/ai-licenses
This dataset uses Gemma 7B-IT to generate a synthetic dataset for the LLM Prompt Recovery competition.
Please go upvote these other datasets, as my work would not be possible without them.
Update 1 - February 29, 2024
The only file presently found in this dataset is gemma1000_7b.csv, which uses the dataset created by @thedrcat, found here: https://www.kaggle.com/datasets/thedrcat/llm-prompt-recovery-data?select=gemma1000.csv
The file below is the file Darek created, with two additional columns appended. The first is the raw output of Gemma 7B-IT (vs. the 2B-IT that Darek used), generated per the instructions below, and the second is the same output with the 'Sure... blah blah' sentence removed.
I generated the outputs using the following setup:

# I used a vLLM server to host Gemma 7B on Paperspace (A100)

# Step 1 - Install vLLM
pip install vllm

# Step 2 - Authenticate the Hugging Face CLI (for gated model weights)
huggingface-cli login --token <HF_TOKEN>  # <HF_TOKEN> is a placeholder; use your own token
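The card's setup stops after authentication. A plausible Step 3 (not shown in the original) would load Gemma 7B-IT through the vLLM Python API and generate the rewrites; in the sketch below the prompt and sampling parameters are illustrative assumptions:

# Hypothetical Step 3 - generate with Gemma 7B-IT via the vLLM Python API
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-7b-it")  # gated weights; requires the login above
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# Gemma's instruction format wraps turns in <start_of_turn>/<end_of_turn> markers
prompt = (
    "<start_of_turn>user\n"
    "Rewrite the following text as a sea shanty: The cat sat on the mat.<end_of_turn>\n"
    "<start_of_turn>model\n"
)
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)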
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Recruitment and career advisory teams in the HR industry often face challenges with sensitive, hard-to-access data. This dataset removes that barrier by providing synthetic HR conversations and resume screening Q&A, structured for LLM training in JSONL format.
It enables HR teams and AI developers to build smarter internal chatbots, automate candidate screening, accelerate onboarding workflows, and create AI-powered career advisory tools, all while keeping data privacy intact. This helps organizations improve efficiency, reduce manual effort, and scale AI-driven HR solutions.
This dataset contains synthetic HR conversations and resume screening Q&A, formatted for LLM fine-tuning. Each record represents a dialogue simulating real-world HR workflows like candidate evaluation and career guidance.
messages: An array of chat messages with roles (system, user, assistant) and their respective content. Example record:

{
  "messages": [
    {"role": "system", "content": "You are an informative assistant."},
    {"role": "user", "content": "What is AbdulMuiz Shaikh's current job title?"},
    {"role": "assistant", "content": "AbdulMuiz Shaikh's current job title is Associate Data Scientist."}
  ]
}
Fine-tune with OpenAI:
openai tools fine_tunes.prepare_data -f hr_resume_qna.jsonl
openai api fine_tunes.create -t "hr_resume_qna.jsonl" -m "gpt-3.5-turbo"
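Note that fine_tunes.* is the legacy OpenAI CLI surface. With the current OpenAI Python SDK (v1.x), the equivalent flow is a file upload followed by a fine-tuning job; the sketch below takes the file name from the card, while everything else is illustrative:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file, then start a fine-tuning job against it
training_file = client.files.create(
    file=open("hr_resume_qna.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)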
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Data Description
We release the synthetic data generated using the method described in the paper Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models (ACL 2024 Findings). The external knowledge we use is based on LLM-generated topics and writing styles.
Generated Datasets
The original train/validation/test data, and the generated synthetic training data are listed as follows. For each dataset, we generate 5000… See the full description on the dataset page: https://huggingface.co/datasets/ritaranx/clinical-synthetic-text-llm.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Paper | GitHub | Dataset | Model. 📣 Do check out our new multilingual dataset CatQA, used in Safety Vectors. 📣
As part of our research efforts toward making LLMs safer for public use, we created HarmfulQA, a ChatGPT-distilled dataset constructed using the Chain of Utterances (CoU) prompt. More details are in our paper, Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. HarmfulQA serves as both a new LLM safety benchmark and an alignment dataset… See the full description on the dataset page: https://huggingface.co/datasets/declare-lab/HarmfulQA.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset contains the DeepSeek model, a [brief description of the model, e.g., "state-of-the-art language model for natural language processing tasks"]. The model is designed for [specific use cases, e.g., "text generation, sentiment analysis, etc."].
- model_weights/: Directory containing the model weights.
- config.json: Configuration file for the model.
- inference_example.ipynb: Jupyter Notebook demonstrating how to load and use the model.
- requirements.txt: List of Python dependencies; install them with pip install -r requirements.txt.

Run the inference_example.ipynb notebook to see how to load the model and perform inference. This dataset is licensed under [license name, e.g., "MIT License"]. See the LICENSE file for more details.
For questions or issues, please contact
"This dataset contains transcribed customer support calls from companies in over 160 industries, offering a high-quality foundation for developing customer-aware AI systems and improving service operations. It captures how real people express concerns, frustrations, and requests — and how support teams respond.
Included in each record:
Common use cases:
This dataset is structured, high-signal, and ready for use in AI pipelines, CX design, and quality assurance systems. It brings full transparency to what actually happens during customer service moments — from routine fixes to emotional escalations."
The more you purchase, the lower the price will be.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Important Notice: Ethical Use Only
This repository provides code and datasets for academic research on misinformation. Please note that the datasets include rumor-related texts. These materials are supplied solely for scholarly analysis and research aimed at understanding and combating misinformation.

Prohibited Use
Do not use this repository, including its code or data, to create or spread false information in any real-world context. Any misuse of these resources for malicious purposes is strictly forbidden.

Disclaimer
The authors bear no responsibility for any unethical or unlawful use of the provided resources. By accessing or using this repository, you acknowledge and agree to comply with these ethical guidelines.

Project Structure
The project is organized into three main directories, each corresponding to a major section of the paper's experiments:

main_data_and_code/
├── rumor_generation/
├── rumor_detection/
└── rumor_debunking/

How to Get Started
Prerequisites
To successfully run the code and reproduce the results, you will need to:
- Obtain and configure your own API key for the large language models (LLMs) used in the experiments. Please replace the placeholder API key in the code with your own.
- For the rumor detection experiments, download the public datasets (Twitter15, Twitter16, FakeNewsNet) from their respective sources. The preprocessing scripts in the rumor_detection folder must be run first to prepare the public datasets.
- Note that many scripts are provided as examples using the Twitter15 dataset. To run experiments on other datasets such as Twitter16 or FakeNewsNet, you will need to modify these scripts or create copies and update the corresponding file paths.

Detailed Directory Breakdown

1. rumor_generation/
This directory contains all the code and data related to the rumor generation experiments.
- rumor_generation_zeroshot.py: Code for the zero-shot rumor generation experiment.
- rumor_generation_fewshot.py: Code for the few-shot rumor generation experiment.
- rumor_generation_cot.py: Code for the chain-of-thought (CoT) rumor generation experiment.
- token_distribution.py: Script to analyze token distribution in the generated text.
- label_rumors.py: Script to label LLM-generated texts based on whether they contain rumor-related content.
- extract_reasons.py: Script to extract reasons for rumor generation and rejection.
- visualization.py: Utility script for generating figures.
- LDA.py: Code for performing LDA topic modeling on the generated data.
- rumor_generation_responses.json: The complete output dataset from the rumor generation experiments.
- generation_reasons_extracted.json: The extracted reasons for generated rumors.
- rejection_reasons_extracted.json: The extracted reasons for rejected rumor generation requests.

2. rumor_detection/
This directory contains the code and data used for the rumor detection experiments.
- nonreasoning_zeroshot_twitter15.py: Code for the non-reasoning, zero-shot detection on the Twitter15 dataset. To run on Twitter16 or FakeNewsNet, update the file paths within the script. The experiment scripts below follow the same principle and are not described repeatedly.
- nonreasoning_fewshot_twitter15.py: Code for the non-reasoning, few-shot detection on the Twitter15 dataset.
- nonreasoning_cot_twitter15.py: Code for the non-reasoning, CoT detection on the Twitter15 dataset.
- reasoning_zeroshot_twitter15.py: Code for the reasoning-LLM, zero-shot detection on the Twitter15 dataset.
- reasoning_fewshot_twitter15.py: Code for the reasoning-LLM, few-shot detection on the Twitter15 dataset.
- reasoning_cot_twitter15.py: Code for the reasoning-LLM, CoT detection on the Twitter15 dataset.
- traditional_model.py: Code for the traditional models used as baselines.
- preprocess_twitter15_and_twitter16.py: Script for preprocessing the Twitter15 and Twitter16 datasets.
- preprocess_fakenews.py: Script for preprocessing the FakeNewsNet dataset.
- generate_summary_table.py: Calculates all classification metrics and generates the final summary table for the rumor detection experiments.
- select_few_shot_example_15.py: Script to pre-select few-shot examples, using the Twitter15 dataset as an example. To generate examples for Twitter16 or FakeNewsNet, update the file paths within the script.
- twitter15_few_shot_examples.json: Pre-selected few-shot examples for the Twitter15 dataset.
- twitter16_few_shot_examples.json: Pre-selected few-shot examples for the Twitter16 dataset.
- fakenewsnet_few_shot_examples.json: Pre-selected few-shot examples for the FakeNewsNet dataset.
- twitter15_llm_results.json: LLM prediction results on the Twitter15 dataset.
- twitter16_llm_results.json: LLM prediction results on the Twitter16 dataset.
- fakenewsnet_llm_results.json: LLM prediction results on the FakeNewsNet dataset.
- visualization.py: Utility script for generating figures.

3. rumor_debunking/
This directory contains all the code and data for the rumor debunking experiments.
- analyze_sentiment.py: Script for analyzing the sentiment of the debunking texts.
- calculate_readability.py: Script for calculating the readability score of the debunking texts.
- plot_readability.py: Utility script for generating figures related to readability.
- fact_checking_with_nli.py: Code for the NLI-based fact-checking experiment.
- debunking_results.json: The dataset containing the debunking results for this experimental section.
- debunking_results_with_readability.json: The dataset containing the debunking results along with readability scores.
- sentiment_analysis/: This directory contains the result file from the sentiment analysis.
  - debunking_results_with_sentiment.json: The dataset containing the debunking results along with sentiment analysis.

Please contact the repository owner if you encounter any problems or have questions about the code or data.
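As an illustration of the NLI-based fact-checking experiment listed above, the sketch below scores a claim against a piece of evidence with an off-the-shelf NLI model; the actual fact_checking_with_nli.py may differ, and the model choice here is an assumption:

from transformers import pipeline

# Off-the-shelf NLI model (an assumption; the repository may use another)
nli = pipeline("text-classification", model="facebook/bart-large-mnli")

evidence = "The city confirmed the bridge remains open to traffic."
claim = "The bridge has been closed indefinitely."

# Score the (premise, hypothesis) pair as entailment / neutral / contradiction
result = nli({"text": evidence, "text_pair": claim})
print(result)  # a 'contradiction' label suggests the claim conflicts with the evidence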
Dataset Card for fine-tuning-llm
This dataset has been created with distilabel.
Dataset Summary
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline that generated it in distilabel using the distilabel CLI: distilabel pipeline run --config "https://huggingface.co/datasets/KarinaH/fine-tuning-llm/raw/main/pipeline.yaml"
or explore the configuration: distilabel pipeline info --config… See the full description on the dataset page: https://huggingface.co/datasets/KarinaH/fine-tuning-llm.
Dataset Card for Evaluation run of NousResearch/Hermes-3-Llama-3.1-8B
Dataset automatically created during the evaluation run of model NousResearch/Hermes-3-Llama-3.1-8B. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Hermes-3-Llama-3.1-8B-details.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Overview
An aggregated, cleaned, and unified dataset assembled from the AI Verify Foundation’s Project Moonshot resources on Kaggle. It combines: (a) prompt templates and prompt-engineering cookbooks, (b) pre-built recipes used to configure benchmark runs (input/target pairs, evaluation metric, grading scales), and (c) metric definitions/outputs for automated evaluation. The material is intended to support reproducible LLM benchmarking, bias/fairness analysis, and prompt-engineering experiments.
Project Moonshot
Project Moonshot is an open-source LLM evaluation toolkit produced by the AI Verify Foundation; it brings benchmarking and red-teaming workflows together and publishes prompt templates, recipes and metrics on GitHub and the Moonshot docs site. Link - https://aiverifyfoundation.sg/project-moonshot/
Recipe
Recipes (in Moonshot) are pre-built benchmark configurations: JSON files that define the dataset (input / target pairs), the prompt template to use, the evaluation metric, and any grading thresholds — enabling reproducible, repeatable test runs. The Moonshot project publishes many such recipes for different evaluation categories (e.g., prompt injection, cybersecurity).
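As a rough illustration of how those pieces fit together, the snippet below models a recipe as a plain Python dictionary; the field names are assumptions for exposition, not the exact Moonshot schema:

# Illustrative only: field names are assumptions, not the Moonshot schema
recipe = {
    "name": "example-prompt-injection-recipe",  # hypothetical recipe name
    "datasets": ["prompt_injection_pairs"],     # input/target pairs to run
    "prompt_templates": ["plain-question"],     # how each input is wrapped
    "metrics": ["exact_str_match"],             # automated evaluation metric
    "grading_scale": {"A": [80, 100], "B": [60, 79], "C": [0, 59]},  # grading thresholds
}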
Cookbook
A cookbook (in the ML/prompting context) is a curated collection of patterns, examples and “how-to” snippets for solving common tasks with LLMs (templates, best practices, and worked examples). Think of a cookbook as a higher-level collection that organizes recipes and templates for reuse.
Intended uses
- Reproducible LLM benchmarking and regression testing.
- Bias and fairness audits (compare performance across social attribute groups).
- Prompt engineering research (compare prompt templates / recipe variants).
- Building evaluation pipelines that combine semantic and factual checks.
Credits: This dataset aggregates content published by the AI Verify Foundation / Project Moonshot. Please follow the original project’s license and attribution requirements when redistributing. See the Moonshot repository for license details. URL: https://aiverifyfoundation.sg/project-moonshot/ GitHub: https://github.com/aiverify-foundation/moonshot
https://choosealicense.com/licenses/cdla-sharing-1.0/
Bitext - Retail Banking Tagged Training Dataset for LLM-based Virtual Assistants
Overview
This hybrid synthetic dataset is designed to be used to fine-tune Large Language Models such as GPT, Mistral and OpenELM, and has been generated using our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how Verticalization/Domain Adaptation for the [Retail Banking] sector can be easily achieved using our two-step approach to LLM Fine-Tuning.… See the full description on the dataset page: https://huggingface.co/datasets/bitext/Bitext-retail-banking-llm-chatbot-training-dataset.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
The AstroChat dataset is a collection of 901 dialogues, synthetically generated, tailored to the specific domain of Astronautics / Space Mission Engineering. This dataset will be frequently updated following feedback from the community. If you would like to contribute, please reach out in the community discussion.
The dataset is intended for supervised fine-tuning of chat LLMs (Large Language Models). Due to its currently limited size, you should use a pre-trained instruct model and ideally augment the AstroChat dataset with other datasets in the areas of Science, Technology, Engineering, and Math (STEM).
To be completed
from datasets import load_dataset
dataset = load_dataset("patrickfleith/AstroChat")

901 generated conversations between a simulated user and an AI assistant (more on the generation method below). Each instance is made of the following fields (columns):
- id: a unique identifier to refer to this specific conversation. Useful for traceability purposes, especially for further processing tasks or merges with other datasets.
- topic: a topic within the domain of Astronautics / Space Mission Engineering. This field is useful to filter the dataset by topic, or to create a topic-based split.
- subtopic: a subtopic of the topic. For instance in the topic of Propulsion, there are subtopics like Injector Design, Combustion Instability, Electric Propulsion, Chemical Propulsion, etc.
- persona: description of the persona used to simulate a user
- opening_question: the first question asked by the user to start a conversation with the AI-assistant
- messages: the whole conversation between the user and the AI assistant, already formatted for rapid use with the transformers library (see the sketch after this list). A list of messages where each message is a dictionary with the following fields:
- role: the role of the speaker, either user or assistant
- content: the message content. For the assistant, it is the answer to the user's question. For the user, it is the question asked to the assistant.
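Because messages already follows the transformers chat format, a conversation can be rendered into a single training prompt in a few lines. A minimal sketch, assuming a "train" split and an illustrative instruct model:

from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("patrickfleith/AstroChat", split="train")  # split name assumed
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # illustrative model

# Render the first conversation into one fine-tuning string
text = tokenizer.apply_chat_template(dataset[0]["messages"], tokenize=False)
print(text[:500])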
Important: See the full list of topics and subtopics covered below.
Dataset is version controlled and commits history is available here: https://huggingface.co/datasets/patrickfleith/AstroChat/commits/main
We used a method inspired by the UltraChat dataset. Specifically, we implemented our own version of the Human-Model interaction from Sector I: Questions about the World of their paper:
Ding, N., Chen, Y., Xu, B., Qin, Y., Zheng, Z., Hu, S., ... & Zhou, B. (2023). Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
The gpt-4-turbo model was used to generate the answers to the opening questions. All instances in the dataset are in English.
901 synthetically-generated dialogues
AstroChat © 2024 by Patrick Fleith is licensed under Creative Commons Attribution 4.0 International
No restrictions. Please provide correct attribution following the license terms.
Patrick Fleith. (2024). AstroChat - A Dataset of synthetically generated conversations for LLM supervised fine-tuning in the domain of Space Mission Engineering and Astronautics (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11531579
Will be updated based on feedback. I am also looking for contributors. Help me create more datasets for Space Engineering LLMs :)
Use the ...
Dataset Card for Evaluation run of deepseek-ai/deepseek-llm-7b-base
Dataset automatically created during the evaluation run of model deepseek-ai/deepseek-llm-7b-base. The dataset is composed of 44 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/deepseek-ai_deepseek-llm-7b-base-details.
https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 50 articles sourced from Medium, focusing on AI-related content. It is designed for business owners, content creators, and AI developers looking to analyze successful articles, improve engagement, and fine-tune AI language models (LLMs). The data can be used to explore what makes articles perform well, including sentiment analysis, follower counts, and headline effectiveness.
The database includes pre-analyzed data such as sentiment scores, follower counts, and headline metadata, helping users gain insights into high-performing content.
This dataset is a valuable tool for anyone aiming to harness the power of data-driven insights to enhance their content or AI models.
Dataset Card for Evaluation run of experiment-llm/exp-3-q-r
Dataset automatically created during the evaluation run of model experiment-llm/exp-3-q-r. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 1 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest results. An… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/experiment-llm_exp-3-q-r-details.
https://creativecommons.org/publicdomain/zero/1.0/
By garage-bAInd (from Hugging Face) [source]
The garage-bAInd/Open-Platypus dataset is a curated collection of data specifically designed to enhance logical reasoning skills in LLMs (Large Language Models). It serves as a training resource for improving the ability of these models to reason logically and provide accurate solutions or answers to various logical reasoning questions.
This dataset, which has been utilized in training the Platypus2 models, consists of multiple datasets that have undergone a meticulous filtering process. Through keyword search and the application of the Sentence Transformers technique, questions with a similarity score above 80% were eliminated, ensuring that only unique and diverse logical reasoning questions are included.
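A sketch of that similarity-based filtering with the sentence-transformers library; only the 80% threshold comes from the card, while the embedding model and exact procedure are assumptions:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
questions = [
    "What is 2 + 2?",
    "Compute the sum of 2 and 2.",
    "Name a prime number greater than 10.",
]

embeddings = model.encode(questions, convert_to_tensor=True)
similarity = util.cos_sim(embeddings, embeddings)

kept = []
for i in range(len(questions)):
    # keep a question only if it is less than 80% similar to everything kept so far
    if all(similarity[i][j] < 0.8 for j in kept):
        kept.append(i)

print([questions[i] for i in kept])  # the near-duplicate phrasing is dropped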
The columns in this dataset include:
- input: The input text or question that requires logical reasoning.
- output: The correct answer or solution to the logical reasoning question.
- instruction: Additional instructions or guidelines for solving the logical reasoning question.
- data_source: The source or origin of the logical reasoning question.

By utilizing this comprehensive and carefully curated dataset, LLM models can be trained more effectively to improve their logical reasoning capabilities.
How to Use This Dataset: Logical Reasoning Improvement
Dataset Overview
Columns
The dataset is organized into several columns, each serving a specific purpose:
- input: The input text or question that requires logical reasoning. This column provides the initial statement or problem that needs solving.
- output: The correct answer or solution to the logical reasoning question. This column contains the expected outcome or response.
- instruction: Additional instructions or guidelines for solving the logical reasoning question. This column provides any specific guidance or steps required to arrive at the correct answer.
- data_source: The source or origin of the logical reasoning question. This column specifies where the question was obtained from.
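A quick way to load the dataset and inspect the four columns described above; the Hugging Face ID follows the card's title, and the split name is an assumption:

from datasets import load_dataset

ds = load_dataset("garage-bAInd/Open-Platypus", split="train")  # split name assumed
row = ds[0]
for column in ("instruction", "input", "output", "data_source"):
    print(column, "->", str(row.get(column))[:80])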
Usage Guidelines
To make effective use of this dataset, follow these guidelines:
- Familiarize Yourself: Take time to understand and familiarize yourself with each entry in the dataset.
- Analyze Inputs: Carefully read and analyze each input text/question provided in the input column.
- Solve Using Logic: Apply logical thinking and reasoning strategies based on your understanding of each problem.
- Confirm Answers: Compare your solutions with those provided in the output column to check their accuracy.
- Follow Instructions: Always consider any additional instructions given in the instruction column while solving a problem.
- Explore Data Sources: Utilize information from different data sources mentioned in the data_source column if needed.
Remember, practice makes perfect! Continuously work through the dataset to improve your logical reasoning skills.
Please note that this guide aims to help you utilize the dataset effectively. It does not provide direct solutions or explanations for specific entries in the dataset.
Contributing and Feedback
We believe in continuous improvement! If you have any feedback or would like to contribute additional logical reasoning questions, please feel free to do so. Together, we can enhance this dataset further and promote logical reasoning skills across LLM models.
Let's get started and embark on a journey of logical reasoning improvement with this curated dataset!
- Training and evaluating logical reasoning models: The dataset can be used to train and evaluate logical reasoning models, such as Platypus2, to enhance their performance in solving a variety of logical reasoning questions.
- Benchmarking logical reasoning algorithms: Researchers and developers can use this dataset as a benchmark for testing and comparing different logical reasoning algorithms and techniques.
- Creating educational resources: The dataset can be utilized to create educational resources or platforms that focus on improving logical reasoning skills. It can serve as a valuable source of practice questions for learners looking to enhance their abilities in this area.
If you use this dataset in your research, please credit the original authors.

Data Source

Licen...
Dataset Card for Evaluation run of Replete-AI/Replete-LLM-Qwen2-7b
Dataset automatically created during the evaluation run of model Replete-AI/Replete-LLM-Qwen2-7b. The dataset is composed of 38 configuration(s), each one corresponding to one of the evaluated tasks. The dataset has been created from 9 run(s). Each run can be found as a specific split in each configuration, the split being named using the timestamp of the run. The "train" split is always pointing to the latest… See the full description on the dataset page: https://huggingface.co/datasets/open-llm-leaderboard/Replete-AI_Replete-LLM-Qwen2-7b-details.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
This dataset provides a valuable benchmark for researchers working on the identification of machine-generated versus human-written text. It features 3,983 short news summaries written by humans, sourced from the News Summary dataset (Inshorts news scraped from The Hindu, Indian Times, and The Guardian, available at: https://www.kaggle.com/datasets/sunnysai12345/news-summary), alongside machine-generated continuations produced using the LLaMA-7b language model under three distinct decoding configurations.
The goal is to aid the development of robust detection models, facilitate studies on the stylistic differences between human and LLM-authored text, and support further research in Natural Language Processing (NLP), particularly in the domains of fake news detection, AI explainability, and text authenticity.
Human-Generated News: Original news summaries collected from the News Summary dataset (Inshorts), licensed under GPL v2.0. These summaries span from February to August 2017. The original dataset is available at: https://www.kaggle.com/datasets/sunnysai12345/news-summary
Machine-Generated News (Setting 1): Continuations generated using LLaMA-7b with balanced generation parameters: (Temperature: 1.0, Top-K: 50, Top-p: 0.9)
Machine-Generated News (Setting 2): Continuations generated using LLaMA-7b with high creativity and diversity: (Temperature: 1.5, Top-K: 100, Top-p: 0.95)
Machine-Generated News (Setting 3): Continuations generated using LLaMA-7b with conservative and deterministic settings: (Temperature: 0.7, Top-K: 20, Top-p: 0.8)
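The three settings map directly onto Hugging Face transformers generation arguments. A sketch, where the checkpoint and prompt are assumptions and the sampling values are those listed above:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "huggyllama/llama-7b"  # assumed location of LLaMA-7b weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

settings = {
    "setting_1_balanced":     dict(temperature=1.0, top_k=50,  top_p=0.9),
    "setting_2_creative":     dict(temperature=1.5, top_k=100, top_p=0.95),
    "setting_3_conservative": dict(temperature=0.7, top_k=20,  top_p=0.8),
}

inputs = tokenizer("The government announced today that", return_tensors="pt")
for name, params in settings.items():
    out = model.generate(**inputs, do_sample=True, max_new_tokens=60, **params)
    print(name, "->", tokenizer.decode(out[0], skip_special_tokens=True))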
Human_News: Human-written summary from the original dataset.
Machine_Generated_Setting_1: LLaMA-7b output using balanced decoding.
Machine_Generated_Setting_2: LLaMA-7b output using creative decoding.
Machine_Generated_Setting_3: LLaMA-7b output using conservative decoding.
This dataset is intended for:
Training and evaluating classifiers to distinguish between human and machine-generated content
Studying linguistic patterns and stylistic variation introduced by different LLM sampling configurations
Analyzing the limits of creativity, coherence, and factuality in large language models
It is particularly useful for research in fake news detection, LLM evaluation, AI safety, and human-likeness scoring.
This dataset is licensed under the GNU General Public License v2.0 (GPL-2.0). This means:
You are free to use, share, and modify this dataset
You must distribute any derivative datasets under the same GPL v2.0 license
You must provide proper attribution and include a copy of the license in your distribution
The human-written news summaries are sourced from:
News Summary dataset by Kondalarao Vonteru, originally scraped from Inshorts, with news from The Hindu, Indian Times, and The Guardian. License: GPL-2.0
The machine-generated continuations were created using the LLaMA-7b model by Meta, configured with three different sampling settings.
Please refer to the included LICENSE.txt file for the full text of the GNU General Public License v2.0.