19 datasets found
  1. PersonaHub-keywords

    • huggingface.co
    Updated Nov 21, 2025
    Cite
    Alan Tseng (2025). PersonaHub-keywords [Dataset]. https://huggingface.co/datasets/agentlans/PersonaHub-keywords
    Explore at:
    Dataset updated
    Nov 21, 2025
    Authors
    Alan Tseng
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    PersonaHub Keyword Annotations

    This dataset contains the first 25 000 personas from the proj-persona/PersonaHub elite_persona config. Each persona has been tagged with keywords generated using the agentlans/flan-t5-small-keywords model. This dataset could be useful for looking up and generating personas related to a given topic.

      Limitations
    

    The original dataset contains personas generated en masse, and some may be inconsistent or of uneven quality. The keyword… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/PersonaHub-keywords.

  2. PersonaHub

    • kaggle.com
    zip
    Updated Jul 25, 2024
    Cite
    ATHviii (2024). PersonaHub [Dataset]. https://www.kaggle.com/datasets/athviii/personahub/code
    Explore at:
    Available download formats: zip (109868616 bytes)
    Dataset updated
    Jul 25, 2024
    Authors
    ATHviii
    Description

    Dataset

    This dataset was created by ATHviii

    Contents

  3. personahub-code-v2-34999-unused-gemma3

    • huggingface.co
    Updated Nov 25, 2025
    Cite
    Saumya Malik (2025). personahub-code-v2-34999-unused-gemma3 [Dataset]. https://huggingface.co/datasets/saumyamalik/personahub-code-v2-34999-unused-gemma3
    Explore at:
    Dataset updated
    Nov 25, 2025
    Authors
    Saumya Malik
    Description

    saumyamalik/personahub-code-v2-34999-unused-gemma3 dataset hosted on Hugging Face and contributed by the HF Datasets community

  4. personahub-code-v2-34999-unused-qwq

    • huggingface.co
    Cite
    Saumya Malik, personahub-code-v2-34999-unused-qwq [Dataset]. https://huggingface.co/datasets/saumyamalik/personahub-code-v2-34999-unused-qwq
    Explore at:
    Authors
    Saumya Malik
    Description

    saumyamalik/personahub-code-v2-34999-unused-qwq dataset hosted on Hugging Face and contributed by the HF Datasets community

  5. personahub-fineweb-edu-4-embeddings

    • huggingface.co
    Updated Sep 18, 2024
    Cite
    Argilla Warehouse (2024). personahub-fineweb-edu-4-embeddings [Dataset]. https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-embeddings
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 18, 2024
    Dataset authored and provided by
    Argilla Warehouse
    License

    https://choosealicense.com/licenses/llama3/

    Description

    Dataset Card for PersonaHub FineWeb-Edu 4 Embeddings

    This dataset has been created with distilabel.

      Dataset Summary
    

    This dataset obtains embeddings for the dataset argilla-warehouse/personahub-fineweb-edu-4-dedup, using the Alibaba-NLP/gte-large-en-v1.5 model from sentence transformers. The pipeline can be seen at: pipe_personahub_embeddings.py.

      Dataset structure
    

    This dataset contains a pipeline.yaml which can be used to reproduce the pipeline… See the full description on the dataset page: https://huggingface.co/datasets/argilla-warehouse/personahub-fineweb-edu-4-embeddings.

  6. personahub-fineweb-edu-comparison

    • huggingface.co
    Updated Mar 15, 2011
    Cite
    Agustín Piqueres Lajarín (2011). personahub-fineweb-edu-comparison [Dataset]. https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison
    Explore at:
    Croissant
    Dataset updated
    Mar 15, 2011
    Authors
    Agustín Piqueres Lajarín
    Description

    Dataset Card for personahub-fineweb-edu-comparison

    This dataset has been created with distilabel. The pipeline script was uploaded to easily reproduce the dataset: pipe_personahub_compare.py. It can be run directly using the CLI: distilabel pipeline run --script "https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison/raw/main/pipe_personahub_compare.py"

      Dataset Summary
    

    This dataset contains a pipeline.yaml which can be used to… See the full description on the dataset page: https://huggingface.co/datasets/plaguss/personahub-fineweb-edu-comparison.

  7. PersonaHub_modified

    • huggingface.co
    Updated Jun 29, 2024
    Cite
    Yusheng Su (2024). PersonaHub_modified [Dataset]. https://huggingface.co/datasets/yushengsu/PersonaHub_modified
    Explore at:
    Croissant
    Dataset updated
    Jun 29, 2024
    Authors
    Yusheng Su
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description
  8. PersonaHub_business

    • huggingface.co
    Updated Sep 23, 2024
    Cite
    Sarin Suriyakoon (2024). PersonaHub_business [Dataset]. https://huggingface.co/datasets/pacozaa/PersonaHub_business
    Explore at:
    Croissant
    Dataset updated
    Sep 23, 2024
    Authors
    Sarin Suriyakoon
    Description

    PersonaHub with Business Keywords

    Filtered from https://huggingface.co/datasets/proj-persona/PersonaHub. I am using this code to filter the dataset:

    from datasets import load_dataset, Dataset
    import random

    # Load the original dataset
    dataset = load_dataset('proj-persona/PersonaHub', 'persona', split='train')

    def filter_keywords(example):
        keywords = ["business", "management"]
        persona = example['persona'].lower()
        return any(keyword in persona for keyword in keywords)

    Apply the… See the full description on the dataset page: https://huggingface.co/datasets/pacozaa/PersonaHub_business.
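The keyword filter quoted in this description can be tried on a few in-memory rows without downloading anything. This is a minimal sketch: the sample personas are made up for illustration, and because the card's own "Apply the…" step is truncated here, the final list comprehension is only an assumed stand-in for it.

```python
# Keyword filter from the dataset card, applied to hypothetical sample rows.
KEYWORDS = ["business", "management"]


def filter_keywords(example):
    """Return True if the persona text mentions any keyword (case-insensitive)."""
    persona = example["persona"].lower()
    return any(keyword in persona for keyword in KEYWORDS)


# Made-up rows standing in for records that carry a 'persona' field.
sample_rows = [
    {"persona": "A Business analyst who studies retail supply chains"},
    {"persona": "A marine biologist cataloguing deep-sea corals"},
    {"persona": "A hospital administrator focused on risk management"},
]

# Assumed stand-in for the truncated application step on the card.
kept = [row for row in sample_rows if filter_keywords(row)]
print(len(kept))
```

Note that this is substring matching, so "business" also matches "businesses". With the Hugging Face `datasets` library, the same function can be passed to `Dataset.filter`.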

  9. PersonaHub-ko

    • huggingface.co
    Updated Jul 15, 2025
    Cite
    유준혁 (2025). PersonaHub-ko [Dataset]. https://huggingface.co/datasets/youjunhyeok/PersonaHub-ko
    Explore at:
    Croissant
    Dataset updated
    Jul 15, 2025
    Authors
    유준혁
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Translated proj-persona/PersonaHub using nayohan/llama3-instrucTrans-enko-8b. For this dataset, we only used data that is 5000 characters or fewer in length and is in English. Thanks to @proj-persona and @nayohan.

      Scaling Synthetic Data Creation with 1,000,000,000 Personas
    

    This repo releases data introduced in our paper Scaling Synthetic Data Creation with 1,000,000,000 Personas: We propose a novel persona-driven data synthesis methodology that leverages various… See the full description on the dataset page: https://huggingface.co/datasets/youjunhyeok/PersonaHub-ko.

  10. Data from: Mapping and Influencing the Political Ideology of Large Language...

    • zenodo.org
    bin, json
    Updated Feb 16, 2025
    Cite
    Pietro Bernardelle; Leon Fröhling; Stefano Civelli; Riccardo Lunardi; KEVIN ROITERO; Gianluca Demartini (2025). Mapping and Influencing the Political Ideology of Large Language Models using Synthetic Personas [Dataset]. http://doi.org/10.5281/zenodo.14816665
    Explore at:
    Available download formats: bin, json
    Dataset updated
    Feb 16, 2025
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Pietro Bernardelle; Leon Fröhling; Stefano Civelli; Riccardo Lunardi; KEVIN ROITERO; Gianluca Demartini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the datasets and materials used to analyze and replicate the results presented in our paper investigating how persona-based prompting affects the political orientations of Large Language Models (LLMs).

    Contents

    The repository includes files organized by model (Mistral, Llama, Qwen, and Zephyr) and experimental condition (base, right-authoritarian [ra], and left-libertarian [ll]):

    Model Response Data

    • *_persona_compass_base.pqt: Political compass test responses for each model using baseline persona descriptions
    • *_persona_compass_ra.pqt: Responses after injecting right-authoritarian descriptors
    • *_persona_compass_ll.pqt: Responses after injecting left-libertarian descriptors

    Configuration and Input Files

    • personas.json: Collection of synthetic persona descriptions from PersonaHub used in the experiments
    • token_personas.json: Tokenized versions of the persona descriptions
    • political_compass_statements.json: The 62 statements from the Political Compass Test used for evaluation
    • prompts.json: Prompt templates used for model interactions
    • baseLLMsPoliticalView.json: Default political orientations of the models without persona prompting

    Related Code Repository

    The code used to analyze this data and reproduce the results presented in the paper can be found at: https://github.com/d-lab/llm-political-personas

    File Placement Instructions

    After downloading, organize the files as follows:

    Configuration and Input Files

    Place all the configuration files in the data/raw/ directory.

    Model Response Files

    Rename all model-specific .pqt files to persona_compass.pqt and place them in their respective directories:

    • Base condition files:
      • data/processed/Llama-3.1-8B-Instruct/base/persona_compass.pqt
      • data/processed/Mistral-7B-Instruct-v0.3/base/persona_compass.pqt
      • data/processed/Qwen2.5-7B-Instruct/base/persona_compass.pqt
      • data/processed/zephyr-7b-beta/base/persona_compass.pqt
    • Right-authoritarian condition files:
      • data/processed/Llama-3.1-8B-Instruct/right_authoritarian_personas/persona_compass.pqt
      • data/processed/Mistral-7B-Instruct-v0.3/right_authoritarian_personas/persona_compass.pqt
      • data/processed/Qwen2.5-7B-Instruct/right_authoritarian_personas/persona_compass.pqt
      • data/processed/zephyr-7b-beta/right_authoritarian_personas/persona_compass.pqt
    • Left-libertarian condition files:
      • data/processed/Llama-3.1-8B-Instruct/left_libertarian_personas/persona_compass.pqt
      • data/processed/Mistral-7B-Instruct-v0.3/left_libertarian_personas/persona_compass.pqt
      • data/processed/Qwen2.5-7B-Instruct/left_libertarian_personas/persona_compass.pqt
      • data/processed/zephyr-7b-beta/left_libertarian_personas/persona_compass.pqt
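The renaming and placement rules above can be expressed as a small path-mapping helper. This is a sketch under assumptions: the directory names come from the list above, but the filename prefixes (`llama`, `mistral`, `qwen`, `zephyr`) are hypothetical, since the listing only shows the downloads as `*_persona_compass_<condition>.pqt`. Adjust the keys to whatever prefixes the downloaded files actually use.

```python
from pathlib import PurePosixPath

# Directory names taken from the placement list; the dictionary KEYS are
# assumed filename prefixes and may not match the real downloads.
MODEL_DIRS = {
    "llama": "Llama-3.1-8B-Instruct",
    "mistral": "Mistral-7B-Instruct-v0.3",
    "qwen": "Qwen2.5-7B-Instruct",
    "zephyr": "zephyr-7b-beta",
}
CONDITION_DIRS = {
    "base": "base",
    "ra": "right_authoritarian_personas",
    "ll": "left_libertarian_personas",
}


def target_path(filename):
    """Map '<prefix>_persona_compass_<cond>.pqt' to its destination path."""
    stem = PurePosixPath(filename).stem  # drop the '.pqt' suffix
    prefix, _, condition = stem.partition("_persona_compass_")
    model_dir = MODEL_DIRS[prefix.lower()]      # KeyError on an unknown prefix
    condition_dir = CONDITION_DIRS[condition]   # KeyError on an unknown condition
    return (PurePosixPath("data/processed") / model_dir
            / condition_dir / "persona_compass.pqt")


print(target_path("mistral_persona_compass_ra.pqt"))
```

To actually move a file, one could create the parent directory of the returned path and then call `shutil.move(source, destination)`.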
  11. DroidCollection-Personas

    • huggingface.co
    Updated Jun 17, 2025
    Cite
    Project Droid (2025). DroidCollection-Personas [Dataset]. https://huggingface.co/datasets/project-droid/DroidCollection-Personas
    Explore at:
    Dataset updated
    Jun 17, 2025
    Dataset authored and provided by
    Project Droid
    Description

    This dataset contains synthetic programming tasks and corresponding code samples generated based on diverse, machine-created programmer personas. Unlike typical AI-generated content datasets that rely on human-written prompts or completions, this collection avoids conditioning on prior human generations, aiming to reduce bias in synthetic data creation. The methodology draws inspiration from the PersonaHub dataset. We first defined 9 key features of a programmer, including:

    Primary… See the full description on the dataset page: https://huggingface.co/datasets/project-droid/DroidCollection-Personas.

  12. TextBooksPersonaHub-FR

    • huggingface.co
    Updated Aug 2, 2024
    + more versions
    Cite
    nacer (2024). TextBooksPersonaHub-FR [Dataset]. http://doi.org/10.57967/hf/2822
    Explore at:
    Croissant
    Dataset updated
    Aug 2, 2024
    Authors
    nacer
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    TextBooksPersonaHub

      Overview
    

    The TextBooksPersonaHub dataset is an extension of the proj-persona/PersonaHub dataset, created using the technique described in the paper Textbooks Are All You Need II. This dataset contains synthetically generated "textbook-like" passages in French, tailored to specific personas and aimed at enhancing language model training with high-quality and diverse content.

      Dataset Creation

      Source Data

    The original personas… See the full description on the dataset page: https://huggingface.co/datasets/drodin/TextBooksPersonaHub-FR.

  13. FinePersonas-v0.1-clustering-100k

    • huggingface.co
    Updated Oct 18, 2024
    + more versions
    Cite
    Argilla (2024). FinePersonas-v0.1-clustering-100k [Dataset]. https://huggingface.co/datasets/argilla/FinePersonas-v0.1-clustering-100k
    Explore at:
    Croissant
    Dataset updated
    Oct 18, 2024
    Dataset authored and provided by
    Argilla
    License

    https://choosealicense.com/licenses/llama3/

    Description

    Dataset Card for PersonaHub FineWeb-Edu 4 Clustering 100k

    This dataset has been created with distilabel. The following figure is a map of the clusters generated from the pipeline. It's automatically generated by the TextClustering with all the information gathered. It contains 177 different clusters, which were assigned a set of 3 labels each, and the black dots correspond to those unclassified examples.

      Dataset Summary
    

    This dataset has been… See the full description on the dataset page: https://huggingface.co/datasets/argilla/FinePersonas-v0.1-clustering-100k.

  14. wikipedia-personas

    • huggingface.co
    Cite
    Alan Tseng, wikipedia-personas [Dataset]. https://huggingface.co/datasets/agentlans/wikipedia-personas
    Explore at:
    Authors
    Alan Tseng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Wikipedia Personas

    Wikipedia Personas is a dataset constructed from paragraphs sampled from the agentlans/wikipedia-paragraphs-complete dataset using the sample_k10000 and sample_k20000 configurations. Each paragraph is paired with a persona crafted as a plausible expert, enthusiast, or stakeholder related to the content of the corresponding Wikipedia text. Personas were initially seeded with 20 handcrafted examples following the style of proj-persona/PersonaHub and then expanded… See the full description on the dataset page: https://huggingface.co/datasets/agentlans/wikipedia-personas.

  15. diverse-persona-10k

    • huggingface.co
    Updated Jan 6, 2025
    Cite
    yy (2025). diverse-persona-10k [Dataset]. https://huggingface.co/datasets/ychen/diverse-persona-10k
    Explore at:
    Croissant
    Dataset updated
    Jan 6, 2025
    Authors
    yy
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Diverse Persona 10K

    This dataset is a length- and diversity-filtered subset (5%) of proj-persona/PersonaHub.

      Filtering Details
    

    Personas with fewer than 5 spaces were removed to filter out the least informative personas, such as "a supporter of Die Linke". This reduced the number of personas to 196k. Then, maxmin sampling over cosine distances of text embeddings (from voyage-3-lite) was applied to select this 10k subset. Pairwise distance statistics before maxmin filtering.… See the full description on the dataset page: https://huggingface.co/datasets/ychen/diverse-persona-10k.
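The two filtering steps described here, a space-count filter followed by maxmin sampling over cosine distances, can be sketched in plain Python. This is only an illustration: the card does not say which maxmin variant was used, so the greedy farthest-point selection below is an assumption, and the toy 2-D vectors stand in for real voyage-3-lite embeddings.

```python
import math


def has_enough_spaces(persona, min_spaces=5):
    """Keep personas with at least `min_spaces` spaces (drops very short ones)."""
    return persona.count(" ") >= min_spaces


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def maxmin_sample(vectors, k, start=0):
    """Greedy maxmin: repeatedly add the point farthest from the chosen set."""
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = [cosine_distance(v, vectors[start]) for v in vectors]
    while len(chosen) < min(k, len(vectors)):
        remaining = [i for i in range(len(vectors)) if i not in chosen]
        candidate = max(remaining, key=lambda i: nearest[i])
        chosen.append(candidate)
        for i, v in enumerate(vectors):
            nearest[i] = min(nearest[i], cosine_distance(v, vectors[candidate]))
    return chosen


# Toy embeddings (hypothetical); indices 0 and 1 are near-duplicates.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(has_enough_spaces("a supporter of Die Linke"))
print(maxmin_sample(toy_vectors, k=3))
```

The greedy loop costs on the order of n·k distance evaluations, so for a pool of roughly 196k personas a vectorized implementation would be preferable to this pure-Python version.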

  16. CustomerPersonas

    • huggingface.co
    Updated Aug 6, 2024
    Cite
    Liran Baba (2024). CustomerPersonas [Dataset]. https://huggingface.co/datasets/CordwainerSmith/CustomerPersonas
    Explore at:
    Croissant
    Dataset updated
    Aug 6, 2024
    Authors
    Liran Baba
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Synthetic Customer Experience Persona

      Overview
    

    The Synthetic Customer Experience Persona Dataset is a large-scale synthetic corpus of customer service personas, designed to aid in the development and evaluation of AI models for customer service applications. Inspired by Tencent AI Labs' Persona Hub, this dataset provides a diverse array of customer profiles across multiple industries.

      Dataset Statistics
    

    Total Personas: 250,000 Industries Covered: 6 (Retail… See the full description on the dataset page: https://huggingface.co/datasets/CordwainerSmith/CustomerPersonas.

  17. m-personas

    • huggingface.co
    Updated Feb 5, 2010
    Cite
    Language Technologies Laboratory @ Barcelona Supercomputing Center (2010). m-personas [Dataset]. https://huggingface.co/datasets/BSC-LT/m-personas
    Explore at:
    Dataset updated
    Feb 5, 2010
    Dataset authored and provided by
    Language Technologies Laboratory @ Barcelona Supercomputing Center
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    mPersonas: Multilingual Persona‑Driven Conversational Dataset

      Dataset Summary
    

    mPersonas is a multilingual open-source dataset with high-quality persona descriptions synthetically generated by DeepSeek-V3-0324. It follows a persona-driven data synthesis methodology, similar to PersonaHub.

    Instances: 510,000
    Total tokens: 173M
      28M in personas
      145M in conversations (105M in assistant turns)
    Languages: 15
    License: Apache 2.0

      Methodology
    

    This section… See the full description on the dataset page: https://huggingface.co/datasets/BSC-LT/m-personas.

  18. persona-inst

    • huggingface.co
    Updated Oct 26, 2024
    Cite
    jeongjaeyong (2024). persona-inst [Dataset]. https://huggingface.co/datasets/jaeyong2/persona-inst
    Explore at:
    Croissant
    Dataset updated
    Oct 26, 2024
    Authors
    jeongjaeyong
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    How to use

    from datasets import load_dataset

    ds = load_dataset("jaeyong2/persona-inst", split="train")
    ds
    Dataset({
        features: ['Level', 'English', 'Korean', 'Thai', 'Vietnamese', 'context'],
        num_rows: 3006572
    })

      Development Process
    

    Generate persona pairs from proj-persona/PersonaHub. We used the Qwen/Qwen2-72B-Instruct model to generate questions.

      License
    

    Qwen/Qwen2.5-72B-Instruct :… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/persona-inst.

  19. ko-persona-cot-inst

    • huggingface.co
    Updated Oct 26, 2024
    + more versions
    Cite
    jeongjaeyong (2024). ko-persona-cot-inst [Dataset]. https://huggingface.co/datasets/jaeyong2/ko-persona-cot-inst
    Explore at:
    Croissant
    Dataset updated
    Oct 26, 2024
    Authors
    jeongjaeyong
    Description

    How to use

    from datasets import load_dataset

    ds = load_dataset("jaeyong2/ko-persona-cot-inst", split="train")
    ds
    Dataset({
        features: ['content', 'text'],
        num_rows: 240000
    })

      Development Process
    

    Load the question dataset from jaeyong2/persona-inst. We used the Qwen/Qwen2-72B-Instruct model to generate answers with CoT.

      License
    

    Qwen/Qwen2.5-72B-Instruct : https://huggingface.co/Qwen/Qwen2-72B-Instruct/blob/main/LICENSE proj-persona/PersonaHub… See the full description on the dataset page: https://huggingface.co/datasets/jaeyong2/ko-persona-cot-inst.
