Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
PersonaHub Keyword Annotations
This dataset contains the first 25,000 personas from the proj-persona/PersonaHub elite_persona config. Each persona has been tagged with keywords generated using the agentlans/flan-t5-small-keywords model. This dataset can be useful for looking up and generating personas related to a given topic.
Limitations
The original dataset contains personas generated en masse, and some may be inconsistent or of uneven quality. The keyword… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/PersonaHub-keywords.
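Once personas are tagged, topic lookup reduces to matching on the keyword annotations. A minimal sketch over a toy slice of the data (the `persona` and `keywords` field names are assumptions about the dataset schema, not confirmed by the card):

```python
# Hypothetical rows mimicking the annotated dataset's schema.
rows = [
    {"persona": "A marine biologist studying coral reefs",
     "keywords": ["marine biology", "coral reefs", "ocean"]},
    {"persona": "A jazz pianist who teaches improvisation",
     "keywords": ["jazz", "piano", "music education"]},
]

def personas_for_topic(rows, topic):
    """Return personas whose keyword list mentions the topic."""
    topic = topic.lower()
    return [r["persona"] for r in rows
            if any(topic in kw.lower() for kw in r["keywords"])]

print(personas_for_topic(rows, "jazz"))  # → ['A jazz pianist who teaches improvisation']
```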
This dataset was created by ATHviii
saumyamalik/personahub-code-v2-34999-unused-gemma3 dataset hosted on Hugging Face and contributed by the HF Datasets community
saumyamalik/personahub-code-v2-34999-unused-qwq dataset hosted on Hugging Face and contributed by the HF Datasets community
Llama 3 license: https://choosealicense.com/licenses/llama3/
Dataset Card for PersonaHub FineWeb-Edu 4 Embeddings
This dataset has been created with distilabel.
Dataset Summary
This dataset provides embeddings for the dataset argilla-warehouse/personahub-fineweb-edu-4-dedup, computed with the Alibaba-NLP/gte-large-en-v1.5 model from Sentence Transformers. The pipeline can be seen at pipe_personahub_embeddings.py.
Dataset structure
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline… See the full description on the dataset page: https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-embeddings.
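With the embeddings loaded, finding personas similar to a query is nearest-neighbour search over cosine similarity. A minimal NumPy sketch with toy 3-D vectors standing in for the real gte-large-en-v1.5 embeddings (which are much higher-dimensional):

```python
import numpy as np

def top_k(query_vec, embeddings, k=2):
    """Indices of the k embeddings most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ q                      # cosine similarity to each row
    return np.argsort(-sims)[:k]     # highest similarity first

# Toy stand-ins for persona embeddings.
embs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.1])
print(top_k(query, embs))  # indices of the two vectors nearest the query
```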
Dataset Card for personahub-fineweb-edu-comparison
This dataset has been created with distilabel. The pipeline script was uploaded to easily reproduce the dataset: pipe_personahub_compare.py. It can be run directly using the CLI:

distilabel pipeline run --script "https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison/raw/main/pipe_personahub_compare.py"
Dataset Summary
This dataset contains a pipeline.yaml which can be used to… See the full description on the dataset page: https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GPT-4 Generated Data
Reference:
Original: https://huggingface.co/datasets/proj-persona/PersonaHub
Original GitHub: https://github.com/tencent-ailab/persona-hub
PersonaHub with Business Keywords
Filtered from https://huggingface.co/datasets/proj-persona/PersonaHub. I am using this code to filter the dataset:

from datasets import load_dataset

dataset = load_dataset('proj-persona/PersonaHub', 'persona', split='train')

def filter_keywords(example):
    keywords = ["business", "management"]
    persona = example['persona'].lower()
    return any(keyword in persona for keyword in keywords)

dataset = dataset.filter(filter_keywords)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Translated proj-persona/PersonaHub using nayohan/llama3-instrucTrans-enko-8b. For this dataset, we only used entries that are 5,000 characters or less in length and whose language is English. Thanks to @proj-persona and @nayohan.
Scaling Synthetic Data Creation with 1,000,000,000 Personas
This repo releases data introduced in our paper Scaling Synthetic Data Creation with 1,000,000,000 Personas: We propose a novel persona-driven data synthesis methodology that leverages various… See the full description on the dataset page: https://huggingface.co/datasets/youjunhyeok/PersonaHub-ko.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and materials used to analyze and replicate the results presented in our paper investigating how persona-based prompting affects the political orientations of Large Language Models (LLMs).
The repository includes files organized by model (Mistral, Llama, Qwen, and Zephyr) and experimental condition (base, right-authoritarian [ra], and left-libertarian [ll]):
*_persona_compass_base.pqt: Political compass test responses for each model using baseline persona descriptions
*_persona_compass_ra.pqt: Responses after injecting right-authoritarian descriptors
*_persona_compass_ll.pqt: Responses after injecting left-libertarian descriptors
personas.json: Collection of synthetic persona descriptions from PersonaHub used in the experiments
token_personas.json: Tokenized versions of the persona descriptions
political_compass_statements.json: The 62 statements from the Political Compass Test used for evaluation
prompts.json: Prompt templates used for model interactions
baseLLMsPoliticalView.json: Default political orientations of the models without persona prompting
The code used to analyze this data and reproduce the results presented in the paper can be found at: https://github.com/d-lab/llm-political-personas
After downloading, organize the files as follows:
Place all the configuration files in the data/raw/ directory.
Rename all model-specific .pqt files to persona_compass.pqt and place them in their respective directories:
data/processed/Llama-3.1-8B-Instruct/base/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/base/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/base/persona_compass.pqt
data/processed/zephyr-7b-beta/base/persona_compass.pqt
data/processed/Llama-3.1-8B-Instruct/right_authoritarian_personas/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/right_authoritarian_personas/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/right_authoritarian_personas/persona_compass.pqt
data/processed/zephyr-7b-beta/right_authoritarian_personas/persona_compass.pqt
data/processed/Llama-3.1-8B-Instruct/left_libertarian_personas/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/left_libertarian_personas/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/left_libertarian_personas/persona_compass.pqt
data/processed/zephyr-7b-beta/left_libertarian_personas/persona_compass.pqt
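The rename-and-place step can be scripted. A minimal POSIX shell sketch; the source filenames below are hypothetical stand-ins for whatever the downloaded .pqt files are actually called (the dummy-file creation is only there to make the sketch self-contained):

```shell
#!/bin/sh
set -e

# Demo inputs so the script runs as-is; replace with the real downloaded files.
for f in llama_base llama_ra llama_ll; do : > "${f}.pqt"; done

place() {  # usage: place <source-file> <model-dir> <condition-dir>
  mkdir -p "data/processed/$2/$3"
  cp "$1" "data/processed/$2/$3/persona_compass.pqt"
}

place llama_base.pqt Llama-3.1-8B-Instruct base
place llama_ra.pqt   Llama-3.1-8B-Instruct right_authoritarian_personas
place llama_ll.pqt   Llama-3.1-8B-Instruct left_libertarian_personas
# ...repeat for Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct, zephyr-7b-beta
```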
This dataset contains synthetic programming tasks and corresponding code samples generated based on diverse, machine-created programmer personas. Unlike typical AI-generated content datasets that rely on human-written prompts or completions, this collection avoids conditioning on prior human generations, aiming to reduce bias in synthetic data creation. The methodology draws inspiration from the PersonaHub dataset. We first defined 9 key features of a programmer, including:
Primary… See the full description on the dataset page: https://huggingface.co/datasets/project-droid/DroidCollection-Personas.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TextBooksPersonaHub
Overview
The TextBooksPersonaHub dataset is an extension of the proj-persona/PersonaHub dataset, created using the technique described in the paper Textbooks Are All You Need II. This dataset contains synthetically generated "textbook-like" passages in French, tailored to specific personas and aimed at enhancing language model training with high-quality and diverse content.
Dataset Creation
Source Data
The original personas… See the full description on the dataset page: https://huggingface.co/datasets/drodin/TextBooksPersonaHub-FR.
Llama 3 license: https://choosealicense.com/licenses/llama3/
Dataset Card for PersonaHub FineWeb-Edu 4 Clustering 100k
This dataset has been created with distilabel. The following figure is a map of the clusters generated from the pipeline. It is produced automatically by the TextClustering step from all the information gathered. It contains 177 different clusters, each assigned a set of 3 labels; the black dots correspond to unclassified examples.
Dataset Summary
This dataset has been… See the full description on the dataset page: https://huggingface.co/datasets/argilla/FinePersonas-v0.1-clustering-100k.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia Personas
Wikipedia Personas is a dataset constructed from paragraphs sampled from the agentlans/wikipedia-paragraphs-complete dataset using the sample_k10000 and sample_k20000 configurations. Each paragraph is paired with a persona crafted as a plausible expert, enthusiast, or stakeholder related to the content of the corresponding Wikipedia text. Personas were initially seeded with 20 handcrafted examples following the style of proj-persona/PersonaHub and then expanded… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/wikipedia-personas.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Diverse Persona 10K
This dataset is a length and diversity filtered subset (5%) of proj-persona/PersonaHub.
Filtering Details
Personas with fewer than 5 spaces were removed to filter out the least informative personas, such as "a supporter of Die Linke". This reduced the number of personas to 196k. Then, maxmin sampling over cosine distances of text embeddings (from voyage-3-lite) was applied to select this 10k subset. Pairwise distance statistics before maxmin filtering… See the full description on the dataset page: https://huggingface.co/datasets/ychen/diverse-persona-10k.
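Maxmin (farthest-point) sampling greedily picks the item whose minimum cosine distance to the already-selected set is largest, so the chosen subset stays spread out in embedding space. A minimal NumPy sketch with toy 2-D vectors, not the actual filtering code used for this dataset:

```python
import numpy as np

def maxmin_sample(embeddings, k, start=0):
    """Greedy farthest-point sampling under cosine distance."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [start]
    # Minimum cosine distance from every point to the selected set so far.
    min_dist = 1.0 - X @ X[start]
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))          # farthest from current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - X @ X[nxt])
    return selected

# Toy example: the two most directionally opposed "embeddings" get picked.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(maxmin_sample(vecs, 2))  # → [0, 3]
```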
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Synthetic Customer Experience Persona
Overview
The Synthetic Customer Experience Persona Dataset is a large-scale synthetic corpus of customer service personas, designed to aid in the development and evaluation of AI models for customer service applications. Inspired by Tencent AI Labs' Persona Hub, this dataset provides a diverse array of customer profiles across multiple industries.
Dataset Statistics
Total Personas: 250,000
Industries Covered: 6 (Retail… See the full description on the dataset page: https://huggingface.co/datasets/CordwainerSmith/CustomerPersonas.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
mPersonas: Multilingual Persona‑Driven Conversational Dataset
Dataset Summary
mPersonas is a multilingual open-source dataset with high-quality persona descriptions synthetically generated by DeepSeek-V3-0324. It follows a persona-driven data synthesis methodology similar to PersonaHub.
Instances: 510,000
Total tokens: 173M
28M in personas
145M in conversations (105M in assistant turns)
Languages: 15
License: Apache 2.0
Methodology
This section… See the full description on the dataset page: https://huggingface.co/datasets/BSC-LT/m-personas.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
How to use
from datasets import load_dataset
ds = load_dataset("jaeyong2/persona-inst", split="train")
ds
Dataset({
    features: ['Level', 'English', 'Korean', 'Thai', 'Vietnamese', 'context'],
    num_rows: 3006572
})
Development Process
Generate persona pairs from proj-persona/PersonaHub. We used the Qwen/Qwen2-72B-Instruct model to generate questions.
License
Qwen/Qwen2.5-72B-Instruct :… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/persona-inst.
How to use
from datasets import load_dataset
ds = load_dataset("jaeyong2/ko-persona-cot-inst", split="train")
ds
Dataset({
    features: ['content', 'text'],
    num_rows: 240000
})
Development Process
Load the question dataset from jaeyong2/persona-inst. We used the Qwen/Qwen2-72B-Instruct model to generate answers with CoT.
License
Qwen/Qwen2.5-72B-Instruct : https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE proj-persona/PersonaHub… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/ko-persona-cot-inst.