Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Dataset Card for PERSONAS (Prism Filter)
PERSONAS (Prism filter) is one of the largest datasets of synthetic preferences, with over 200k preferences over thousands of questions and 1k personas. Details on the PERSONAS dataset can be found here (paper link). Note that you MUST also fill out the form on our site to receive access to the full dataset. The form is available here.
Dataset Details
Dataset Description
The personas dataset is a pluralistic… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/PERSONA.
https://choosealicense.com/licenses/other/
Persona-bias
Data accompanying the paper Bias Runs Deep: Implicit Reasoning Biases in Persona-Assigned LLMs at ICLR 2024. Paper || Code || Project website || License
Motivation
This is a dataset of model outputs supporting our extensive study of biases in persona-assigned LLMs. These model outputs can be used for many purposes, for instance:
developing a deeper understanding of persona-induced biases, e.g. by analyzing the inhibiting assumptions underlying model… See the full description on the dataset page: https://huggingface.co/datasets/allenai/persona-bias.
Datasets for paper "Socio-Culturally Aware Evaluation Framework for LLM-Based Content Moderation" https://arxiv.org/abs/2412.13578
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One way to steer generations from large language models (LLMs) is to assign a persona: a role that describes how the user expects the LLM to behave (e.g., a helpful assistant, a teacher, a woman). This paper investigates how personas affect diverse aspects of model behavior. We assign 162 personas from 12 categories, spanning variables like gender, sexual orientation, and occupation, to seven LLMs. We prompt them to answer questions from five datasets covering objective tasks (e.g., questions about math and history) and subjective tasks (e.g., questions about beliefs and values). We also compare the personas' generations to two baseline settings: a control persona setting with 30 paraphrases of "a helpful assistant" to control for the models' prompt sensitivity, and an empty persona setting where no persona is assigned. We find that for all models and datasets, personas show greater variability than the control setting and that some measures of persona behavior generalize across models.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Persona group average ranks (out of 193: 162 personas + 30 control personas + the no-persona baseline; lower is better) for each knowledge domain. The rank of the best persona in each group is shown in parentheses. We show in bold the top persona group for each domain, and we underline the best domain of each persona group. The top-ranked persona for social sciences was the social scientist persona.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for SPC: Synthetic-Persona-Chat Dataset
Abstract from the paper introducing this dataset:
High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user's character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and… See the full description on the dataset page: https://huggingface.co/datasets/google/Synthetic-Persona-Chat.
A novel large-scale multi-domain dataset for persona-based empathetic conversations.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Persona ranks (out of 193, lower is better) for increasingly specialized domains. For persona groups with multiple personas, we show, in addition to the average rank, the rank of the best persona in the category in parentheses.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the datasets and materials used to analyze and replicate the results presented in our paper investigating how persona-based prompting affects the political orientations of Large Language Models (LLMs).
The repository includes files organized by model (Mistral, Llama, Qwen, and Zephyr) and experimental condition (base, right-authoritarian [ra], and left-libertarian [ll]):
*_persona_compass_base.pqt: Political compass test responses for each model using baseline persona descriptions
*_persona_compass_ra.pqt: Responses after injecting right-authoritarian descriptors
*_persona_compass_ll.pqt: Responses after injecting left-libertarian descriptors
personas.json: Collection of synthetic persona descriptions from PersonaHub used in the experiments
token_personas.json: Tokenized versions of the persona descriptions
political_compass_statements.json: The 62 statements from the Political Compass Test used for evaluation
prompts.json: Prompt templates used for model interactions
baseLLMsPoliticalView.json: Default political orientations of the models without persona prompting
The code used to analyze this data and reproduce the results presented in the paper can be found at: https://github.com/d-lab/llm-political-personas
After downloading, organize the files as follows:
Place all the configuration files in the data/raw/ directory.
Rename all model-specific .pqt files to persona_compass.pqt and place them in their respective directories:
data/processed/Llama-3.1-8B-Instruct/base/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/base/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/base/persona_compass.pqt
data/processed/zephyr-7b-beta/base/persona_compass.pqt
data/processed/Llama-3.1-8B-Instruct/right_authoritarian_personas/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/right_authoritarian_personas/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/right_authoritarian_personas/persona_compass.pqt
data/processed/zephyr-7b-beta/right_authoritarian_personas/persona_compass.pqt
data/processed/Llama-3.1-8B-Instruct/left_libertarian_personas/persona_compass.pqt
data/processed/Mistral-7B-Instruct-v0.3/left_libertarian_personas/persona_compass.pqt
data/processed/Qwen2.5-7B-Instruct/left_libertarian_personas/persona_compass.pqt
data/processed/zephyr-7b-beta/left_libertarian_personas/persona_compass.pqt
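The renaming step above can be sketched as a small helper. Note this is our own illustration, not code from the authors' repository: the mapping from the filename suffix to the condition directory, and the `<model>_persona_compass_<condition>.pqt` filename pattern, are assumptions on our part.

```python
from pathlib import Path

# Assumed mapping from the downloaded file's suffix to the condition
# directory names used in the repo layout shown above.
CONDITION_DIRS = {
    "base": "base",
    "ra": "right_authoritarian_personas",
    "ll": "left_libertarian_personas",
}

def target_path(pqt_name: str, root: str = "data/processed") -> Path:
    """Destination for a downloaded file assumed to be named like
    'zephyr-7b-beta_persona_compass_ll.pqt'."""
    model, _, condition = pqt_name.removesuffix(".pqt").partition("_persona_compass_")
    return Path(root) / model / CONDITION_DIRS[condition] / "persona_compass.pqt"
```

A caller could then iterate over the downloaded `.pqt` files and move each one with `shutil.move(name, target_path(name))`, creating parent directories first.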
The ConvAI2 NeurIPS competition aimed at finding approaches to creating high-quality dialogue agents capable of meaningful open-domain conversation. The ConvAI2 dataset for training models is based on the PERSONA-CHAT dataset. The speaker pairs each have assigned profiles coming from a set of 1,155 possible personas (at training time), each consisting of at least 5 profile sentences, with 100 never-before-seen personas set aside for validation. As the original PERSONA-CHAT test set had been released, a new hidden test set consisting of 100 new personas and 1,015 dialogues was created by crowdworkers.
To avoid modeling that takes advantage of trivial word overlap, additional rewritten sets of the same train and test personas were crowdsourced, with related sentences that are rephrases, generalizations or specializations, rendering the task much more challenging. For example “I just got my nails done” is revised as “I love to pamper myself on a regular basis” and “I am on a diet now” is revised as “I need to lose weight.”
The training, validation, and hidden test sets consist of 17,878, 1,000, and 1,015 dialogues, respectively.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Persona ranks for self-bias (out of 193), self-accuracy, overall bias, and overall accuracy.
https://choosealicense.com/licenses/cc/
Dataset Card for PERSONAS (Prism Filter)
PERSONAS (Prism filter) is one of the largest datasets of synthetic preferences, with over 200k preferences over thousands of questions and 1k personas. Details on the PERSONAS dataset can be found here (paper link). Note that this subset is 5% of the training split of PERSONAS. The full dataset is here, strictly available for academic use. You MUST request access to the full persona dataset here.
Dataset Details… See the full description on the dataset page: https://huggingface.co/datasets/SynthLabsAI/PERSONA_subset.
This dataset was collected with the goal of assessing dialog evaluation metrics. In the paper, USR: An Unsupervised and Reference Free Evaluation Metric for Dialog (Mehri and Eskenazi, 2020), the authors collect this data to measure the quality of several existing word-overlap and embedding-based metrics, as well as their newly proposed USR metric.
SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare, if not non-existent, in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.
Dataset Details Dataset Description SynthPAI was created using 300 GPT-4 agents seeded with individual personalities interacting with each other in a simulated online forum and consists of 103 threads and 7823 comments. For each profile, we further provide a set of personal attributes that a human could infer from the profile. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Curated by: The dataset was created by SRILab at ETH Zurich. It was not created on behalf of any outside entity. Funded by: Two authors of this work are supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) (SERI-funded ERC Consolidator Grant). This project did, however, not receive explicit funding by SERI and was devised independently. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the SERI-funded ERC Consolidator Grant. Shared by: SRILab at ETH Zurich Language(s) (NLP): English License: CC-BY-NC-SA-4.0
Dataset Sources
Repository: https://github.com/eth-sri/SynthPAI Paper: https://arxiv.org/abs/2406.07217
Uses The dataset is intended to be used as a privacy-preserving method of (i) evaluating PAI capabilities of language models and (ii) aiding the development of potential defenses against such automated inferences.
Direct Use As in the associated paper, where we include an analysis of the personal attribute inference (PAI) capabilities of 18 state-of-the-art LLMs across different attributes and on anonymized texts.
Out-of-Scope Use The dataset shall not be used as part of any system that performs attribute inferences on real natural persons without their consent or otherwise maliciously.
Dataset Structure We provide the instance descriptions below. Each data point consists of a single comment (that can be a top-level post):
Comment
author str: unique identifier of the person writing
username str: corresponding username
parent_id str: unique identifier of the parent comment
thread_id str: unique identifier of the thread
children list[str]: unique identifiers of children comments
profile Profile: profile making the comment - described below
text str: text of the comment
guesses list[dict]: Dict containing model estimates of attributes based on the comment. Only contains attributes for which a prediction exists.
reviews dict: Dict containing human estimates of attributes based on the comment. Each guess contains a corresponding hardness rating (and certainty rating). Contains all attributes
The associated profiles are structured as follows
Profile
username str: identifier
attributes: set of personal attributes that describe the user (directly listed below)
The corresponding attributes and values are
Attributes
Age continuous [18-99] The age of a user in years.
Place of Birth tuple [city, country] The place of birth of a user. We create tuples jointly for city and country in free-text format. (field name: birth_city_country)
Location tuple [city, country] The current location of a user. We create tuples jointly for city and country in free-text format. (field name: city_country)
Education free-text We use a free-text field to describe the user's education level. This includes additional details such as the degree and major. To ensure comparability with the evaluation of prior work, we later map these to a categorical scale: high school, college degree, master's degree, PhD.
Income Level free-text [low, medium, high, very high] The income level of a user. We first generate a continuous income level in the profile's local currency. In our code, we map this to a categorical value considering the distribution of income levels in the respective profile location. For this, we roughly follow the local equivalents of the following reference levels for the US: Low (<30k USD), Middle (30-60k USD), High (60-150k USD), Very High (>150k USD).
Occupation free-text The occupation of a user, described as a free-text field.
Relationship Status categorical [single, In a Relationship, married, divorced, widowed] The relationship status of a user as one of 5 categories.
Sex categorical [Male, Female] Biological Sex of a profile.
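The income-level bucketing described above can be sketched as a small function. This is a minimal sketch under two assumptions: it handles only the USD case (the card maps local-currency incomes via local equivalents of the same reference levels), and it emits the lowercase labels from the field's listed values ("medium" rather than the card's "Middle"); the released code may differ.

```python
def income_bucket(annual_usd: float) -> str:
    # US reference thresholds from the dataset card:
    # Low (<30k), Middle (30-60k), High (60-150k), Very High (>150k).
    if annual_usd < 30_000:
        return "low"
    if annual_usd < 60_000:
        return "medium"
    if annual_usd <= 150_000:
        return "high"
    return "very high"
```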
Dataset Creation Curation Rationale SynthPAI was created to provide a dataset that can be used to investigate the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare, if not non-existent, in the research community. SynthPAI is a synthetic dataset that aims to fill this gap. We additionally conducted a user study to evaluate the quality of the synthetic comments, establishing that humans can barely distinguish between real and synthetic comments.
Source Data The dataset is fully synthetic and was created using GPT-4 agents (version gpt-4-1106-preview) seeded with individual personalities interacting with each other in a simulated online forum.
Data Collection and Processing The dataset was created by sampling comments from the agents in threads. A human then inferred a set of personal attributes from sets of comments associated with each profile. Further, it was manually reviewed to remove any offensive or inappropriate content. We give a detailed overview of our dataset-creation procedure in the corresponding paper.
Annotations
Annotations are provided by authors of the paper.
Personal and Sensitive Information
All contained personal information is purely synthetic and does not relate to any real individual.
Bias, Risks, and Limitations All profiles are synthetic and do not correspond to any real subpopulations. We provide a distribution of the personal attributes of the profiles in the accompanying paper. As the dataset has been created synthetically, data points can inherit limitations (e.g., biases) from the underlying model, GPT-4. While we manually reviewed comments individually, we cannot provide respective guarantees.
Citation BibTeX:
@misc{2406.07217,
  Author = {Hanna Yukhymenko and Robin Staab and Mark Vero and Martin Vechev},
  Title = {A Synthetic Dataset for Personal Attribute Inference},
  Year = {2024},
  Eprint = {arXiv:2406.07217},
}
APA:
Hanna Yukhymenko, Robin Staab, Mark Vero, Martin Vechev: “A Synthetic Dataset for Personal Attribute Inference”, 2024; arXiv:2406.07217.
Dataset Card Authors
Hanna Yukhymenko Robin Staab Mark Vero
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for PersonaChat
Dataset Description
PersonaChat is a multi-turn dialogue dataset introduced by Zhang et al. (2018) for training and evaluating persona-grounded conversational agents. Each conversation is between two crowdworkers, each assigned a randomly selected persona consisting of several simple facts. The dataset aims to assess whether models can maintain consistent character traits throughout a conversation.
Original Paper: Personalizing Dialogue… See the full description on the dataset page: https://huggingface.co/datasets/awsaf49/persona-chat.
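To make the structure concrete, here is a minimal, hypothetical sketch of a PersonaChat-style record. The field names are ours for illustration, not the dataset's actual schema; the persona facts echo the examples quoted elsewhere on this page.

```python
# Hypothetical PersonaChat-style record (illustrative field names, not
# the official schema): two personas, each a list of simple facts, plus
# alternating utterances grounded in those facts.
example = {
    "persona_a": ["i just got my nails done.", "i am on a diet now."],
    "persona_b": ["i love to hike.", "i have two dogs."],
    "utterances": [
        ("a", "hi! i just got back from the nail salon."),
        ("b", "nice! i spent the afternoon hiking with my dogs."),
    ],
}
```

Consistency evaluation then asks whether a model conditioned on `persona_a` keeps those facts stable across the whole dialogue.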
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example prompts (with an example persona) for all datasets.
We introduce a new dataset, called FoCus, that supports knowledge-grounded answers that reflect the user's persona. One of the situations in which people need different types of knowledge, based on their preferences, occurs when they travel around the world.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Differences between the average accuracy (across all personas) and the accuracy of personas when answering questions involving their own demographic.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Differences between the frequency that each demographic is selected as the answer by the persona of the same demographic and on average (across all personas).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Current research focuses on understanding influencer marketing, the theories behind it, and the factors that contribute to the success of such campaigns. Although many articles and research papers acknowledge that the relationship between the two parties is important and essential for influencer marketing, very few, if any, studies directly conduct empirical analysis on whether the relationship between KOLs and their followers influences, and to what magnitude it influences, the success of influencer marketing campaigns, eventually affecting a brand's choice of marketing tactic or of KOLs. This study, in the form of a case study of KOLs on the Instagram and Red platforms, helps to fill this void by addressing this currently underexplored issue and provides a deep dive into the relationship between influencers and their followers and its impact on those followers. Alongside the deep dive, the paper also reviews other factors that affect the effectiveness of influencer marketing. Empirical evidence from this research confirms that KOLs' ability to influence their followers affects the outcome of influencer marketing, but only through certain methods. Specifically, focusing on the two largest social platforms, Instagram and Red, the paper finds that the alignment of post content with the KOL's persona and the interactivity of the write-up or message are two factors determining the success of influencer marketing. Other factors, such as the relationship built between the KOL and followers, do not appear to influence the outcome of future campaigns, potentially suggesting that the relationship a KOL builds with her followers has a short horizon of influence, as the benefits of a strong relationship do not seem to carry forward. The findings in this paper offer marketers and KOLs theoretical guidance for conducting influencer marketing campaigns on Instagram and Red, as well as in the global and Chinese markets.
Keywords: Brand Marketing Strategy, Influencer Marketing, Key Opinion Leader, Social Media Platforms, Consumer Behavior