Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
PersonaHub Keyword Annotations
This dataset contains the first 25,000 personas from the proj-persona/PersonaHub elite_persona config. Each persona has been tagged with keywords generated using the agentlans/flan-t5-small-keywords model. This dataset can be useful for looking up and generating personas related to a given topic.
Limitations
The original dataset contains personas generated en masse, and some may be inconsistent or of uneven quality. The keyword… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/PersonaHub-keywords.
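Once personas are tagged, topic lookup reduces to matching on the keyword annotations. A minimal sketch over a toy slice of the data (the `persona` and `keywords` field names are assumptions about the dataset schema, not confirmed by the card):

```python
# Hypothetical rows mimicking the annotated dataset's schema.
rows = [
    {"persona": "A marine biologist studying coral reefs",
     "keywords": ["marine biology", "coral reefs", "ocean"]},
    {"persona": "A jazz pianist who teaches improvisation",
     "keywords": ["jazz", "piano", "music education"]},
]

def personas_for_topic(rows, topic):
    """Return personas whose keyword list mentions the topic."""
    topic = topic.lower()
    return [r["persona"] for r in rows
            if any(topic in kw.lower() for kw in r["keywords"])]

print(personas_for_topic(rows, "jazz"))  # → ['A jazz pianist who teaches improvisation']
```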
This dataset was created by ATHviii
saumyamalik/personahub-code-v2-34999-unused-gemma3 dataset hosted on Hugging Face and contributed by the HF Datasets community
saumyamalik/personahub-code-v2-34999-unused-qwq dataset hosted on Hugging Face and contributed by the HF Datasets community
Llama 3 license: https://choosealicense.com/licenses/llama3/
Dataset Card for PersonaHub FineWeb-Edu 4 Embeddings
This dataset has been created with distilabel.
Dataset Summary
This dataset provides embeddings for the dataset argilla-warehouse/personahub-fineweb-edu-4-dedup, computed with the Alibaba-NLP/gte-large-en-v1.5 model from Sentence Transformers. The pipeline can be seen at pipe_personahub_embeddings.py.
Dataset structure
This dataset contains a pipeline.yaml which can be used to reproduce the pipeline… See the full description on the dataset page: https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-embeddings.
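With the embeddings loaded, finding personas similar to a query is nearest-neighbour search over cosine similarity. A minimal NumPy sketch with toy 3-D vectors standing in for the real gte-large-en-v1.5 embeddings (which are much higher-dimensional):

```python
import numpy as np

def top_k(query_vec, embeddings, k=2):
    """Indices of the k embeddings most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ q                      # cosine similarity to each row
    return np.argsort(-sims)[:k]     # highest similarity first

# Toy stand-ins for persona embeddings.
embs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.0, 0.1])
print(top_k(query, embs))  # indices of the two vectors nearest the query
```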
Dataset Card for personahub-fineweb-edu-comparison
This dataset has been created with distilabel. The pipeline script was uploaded to easily reproduce the dataset: pipe_personahub_compare.py. It can be run directly using the CLI:

distilabel pipeline run --script "https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison/raw/main/pipe_personahub_compare.py"
Dataset Summary
This dataset contains a pipeline.yaml which can be used to… See the full description on the dataset page: https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
GPT-4 Generated Data
Reference:
Original: https://huggingface.co/datasets/proj-persona/PersonaHub
Original GitHub: https://github.com/tencent-ailab/persona-hub
PersonaHub with Business Keywords
Filtered from https://huggingface.co/datasets/proj-persona/PersonaHub. I am using this code to filter the dataset:

from datasets import load_dataset

dataset = load_dataset('proj-persona/PersonaHub', 'persona', split='train')

def filter_keywords(example):
    keywords = ["business", "management"]
    persona = example['persona'].lower()
    return any(keyword in persona for keyword in keywords)

dataset = dataset.filter(filter_keywords)
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Translated proj-persona/PersonaHub using nayohan/llama3-instrucTrans-enko-8b. For this dataset, we only used entries that are 5,000 characters or less in length and whose language is English. Thanks to @proj-persona and @nayohan.
Scaling Synthetic Data Creation with 1,000,000,000 Personas
This repo releases data introduced in our paper Scaling Synthetic Data Creation with 1,000,000,000 Personas: We propose a novel persona-driven data synthesis methodology that leverages various… See the full description on the dataset page: https://huggingface.co/datasets/youjunhyeok/PersonaHub-ko.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and materials used to analyze and replicate the results presented in our paper investigating how persona-based prompting affects the political orientations of Large Language Models (LLMs).
The repository includes files organized by model (Mistral, Llama, Qwen, and Zephyr) and experimental condition (base, right-authoritarian [ra], and left-libertarian [ll]):
*_persona_compass_base.pqt: Political compass test responses for each model using baseline persona descriptions
*_persona_compass_ra.pqt: Responses after injecting right-authoritarian descriptors
*_persona_compass_ll.pqt: Responses after injecting left-libertarian descriptors
personas.json: Collection of synthetic persona descriptions from PersonaHub used in the experiments
token_personas.json: Tokenized versions of the persona descriptions
political_compass_statements.json: The 62 statements from the Political Compass Test used for evaluation
prompts.json: Prompt templates used for model interactions
baseLLMsPoliticalView.json: Default political orientations of the models without persona prompting
The code used to analyze this data and reproduce the results presented in the paper can be found at: https://github.com/d-lab/llm-political-personas
After downloading, organize the files as follows:
Place all the configuration files in the data/raw/ directory.
Rename all model-specific .pqt files to persona_compass.pqt and place them in their respective directories:
data/processed/Llama-3.1-8B-Instruct/base/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/base/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/base/persona_compass.pqt
data/processed/zephyr-7b-beta/base/persona_compass.pqt
data/processed/Llama-3.1-8B-Instruct/right_authoritarian_personas/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/right_authoritarian_personas/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/right_authoritarian_personas/persona_compass.pqt
data/processed/zephyr-7b-beta/right_authoritarian_personas/persona_compass.pqt
data/processed/Llama-3.1-8B-Instruct/left_libertarian_personas/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/left_libertarian_personas/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/left_libertarian_personas/persona_compass.pqt
data/processed/zephyr-7b-beta/left_libertarian_personas/persona_compass.pqt
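The rename-and-place step can be scripted. A minimal POSIX shell sketch; the source filenames below are hypothetical stand-ins for whatever the downloaded .pqt files are actually called (the dummy-file creation is only there to make the sketch self-contained):

```shell
#!/bin/sh
set -e

# Demo inputs so the script runs as-is; replace with the real downloaded files.
for f in llama_base llama_ra llama_ll; do : > "${f}.pqt"; done

place() {  # usage: place <source-file> <model-dir> <condition-dir>
  mkdir -p "data/processed/$2/$3"
  cp "$1" "data/processed/$2/$3/persona_compass.pqt"
}

place llama_base.pqt Llama-3.1-8B-Instruct base
place llama_ra.pqt   Llama-3.1-8B-Instruct right_authoritarian_personas
place llama_ll.pqt   Llama-3.1-8B-Instruct left_libertarian_personas
# ...repeat for Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct, zephyr-7b-beta
```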
This dataset contains synthetic programming tasks and corresponding code samples generated based on diverse, machine-created programmer personas. Unlike typical AI-generated content datasets that rely on human-written prompts or completions, this collection avoids conditioning on prior human generations, aiming to reduce bias in synthetic data creation. The methodology draws inspiration from the PersonaHub dataset. We first defined 9 key features of a programmer, including:
Primary… See the full description on the dataset page: https://huggingface.co/datasets/project-droid/DroidCollection-Personas.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
TextBooksPersonaHub
Overview
The TextBooksPersonaHub dataset is an extension of the proj-persona/PersonaHub dataset, created using the technique described in the paper Textbooks Are All You Need II. This dataset contains synthetically generated "textbook-like" passages in French, tailored to specific personas and aimed at enhancing language model training with high-quality and diverse content.
Dataset Creation
Source Data
The original personas… See the full description on the dataset page: https://huggingface.co/datasets/drodin/TextBooksPersonaHub-FR.
Llama 3 license: https://choosealicense.com/licenses/llama3/
Dataset Card for PersonaHub FineWeb-Edu 4 Clustering 100k
This dataset has been created with distilabel. The following figure is a map of the clusters generated from the pipeline. It is produced automatically by the TextClustering step from all the information gathered. It contains 177 different clusters, each assigned a set of 3 labels; the black dots correspond to unclassified examples.
Dataset Summary
This dataset has been… See the full description on the dataset page: https://huggingface.co/datasets/argilla/FinePersonas-v0.1-clustering-100k.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wikipedia Personas
Wikipedia Personas is a dataset constructed from paragraphs sampled from the agentlans/wikipedia-paragraphs-complete dataset using the sample_k10000 and sample_k20000 configurations. Each paragraph is paired with a persona crafted as a plausible expert, enthusiast, or stakeholder related to the content of the corresponding Wikipedia text. Personas were initially seeded with 20 handcrafted examples following the style of proj-persona/PersonaHub and then expanded… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/wikipedia-personas.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Diverse Persona 10K
This dataset is a length and diversity filtered subset (5%) of proj-persona/PersonaHub.
Filtering Details
Personas with fewer than 5 spaces were removed to filter out the least informative personas, such as "a supporter of Die Linke". This reduced the number of personas to 196k. Then, maxmin sampling over cosine distances of text embeddings (from voyage-3-lite) was applied to select this 10k subset. Pairwise distance statistics before maxmin filtering… See the full description on the dataset page: https://huggingface.co/datasets/ychen/diverse-persona-10k.
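Maxmin (farthest-point) sampling greedily picks the item whose minimum cosine distance to the already-selected set is largest, so the chosen subset stays spread out in embedding space. A minimal NumPy sketch with toy 2-D vectors, not the actual filtering code used for this dataset:

```python
import numpy as np

def maxmin_sample(embeddings, k, start=0):
    """Greedy farthest-point sampling under cosine distance."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    selected = [start]
    # Minimum cosine distance from every point to the selected set so far.
    min_dist = 1.0 - X @ X[start]
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))          # farthest from current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - X @ X[nxt])
    return selected

# Toy example: the two most directionally opposed "embeddings" get picked.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(maxmin_sample(vecs, 2))  # → [0, 3]
```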
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Synthetic Customer Experience Persona
Overview
The Synthetic Customer Experience Persona Dataset is a large-scale synthetic corpus of customer service personas, designed to aid in the development and evaluation of AI models for customer service applications. Inspired by Tencent AI Labs' Persona Hub, this dataset provides a diverse array of customer profiles across multiple industries.
Dataset Statistics
Total Personas: 250,000
Industries Covered: 6 (Retail… See the full description on the dataset page: https://huggingface.co/datasets/CordwainerSmith/CustomerPersonas.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
mPersonas: Multilingual Persona‑Driven Conversational Dataset
Dataset Summary
mPersonas is a multilingual open-source dataset with high-quality persona descriptions synthetically generated by DeepSeek-V3-0324. It follows a persona-driven data synthesis methodology similar to PersonaHub.
Instances: 510,000
Total tokens: 173M
28M in personas
145M in conversations (105M in assistant turns)
Languages: 15
License: Apache 2.0
Methodology
This section… See the full description on the dataset page: https://huggingface.co/datasets/BSC-LT/m-personas.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
How to use
from datasets import load_dataset
ds = load_dataset("jaeyong2/persona-inst", split="train")
ds
Dataset({
    features: ['Level', 'English', 'Korean', 'Thai', 'Vietnamese', 'context'],
    num_rows: 3006572
})
Development Process
Generate persona pairs from proj-persona/PersonaHub. We used the Qwen/Qwen2-72B-Instruct model to generate questions.
License
Qwen/Qwen2.5-72B-Instruct :… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/persona-inst.
How to use
from datasets import load_dataset
ds = load_dataset("jaeyong2/ko-persona-cot-inst", split="train")
ds
Dataset({
    features: ['content', 'text'],
    num_rows: 240000
})
Development Process
Load the question dataset from jaeyong2/persona-inst. We used the Qwen/Qwen2-72B-Instruct model to generate answers with CoT.
License
Qwen/Qwen2.5-72B-Instruct : https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE proj-persona/PersonaHub… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/ko-persona-cot-inst.