Synopsis
Each dataset contains three files: follower-feeds.ndjson as input, labels.ndjson as output, and celebrity-feeds.ndjson for additional study. Each file lists all celebrities as JSON objects, one per line, identified by the id key. The training dataset contains 1,920 celebrities and is balanced with respect to gender and occupation. The supplement dataset contains the remaining 8,265 celebrities but is not balanced in any way.
The follower-feeds.ndjson file contains the English tweets of at least 10 followers for each celebrity. Each follower contributes at least 50 tweets, excluding retweets. It is formatted as:
{"id": 1234, "text": [["a tweet of follower 1", "another tweet of follower 1", ...], ["a tweet of follower 2", ...], ...]}
{"id": 5678, "text": [["a tweet of follower 1", "another tweet of follower 1", ...], ["a tweet of follower 2", ...], ...]}
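As a minimal sketch of how such a file could be read, the following Python snippet parses NDJSON line by line and counts followers and tweets per celebrity. The function names iter_follower_feeds and summarize are hypothetical, and the in-memory sample is made-up data standing in for the real file.

```python
import io
import json
from typing import Dict, Iterator, List, TextIO


def iter_follower_feeds(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty NDJSON line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: dict) -> Dict[str, int]:
    """Count followers and tweets in one follower-feeds record."""
    # "text" is a list of follower timelines; each timeline is a list of tweets.
    timelines: List[List[str]] = record["text"]
    return {
        "id": record["id"],
        "followers": len(timelines),
        "tweets": sum(len(timeline) for timeline in timelines),
    }


# Tiny in-memory stand-in for follower-feeds.ndjson (made-up data).
sample = io.StringIO(
    '{"id": 1234, "text": [["a", "b"], ["c"]]}\n'
    '{"id": 5678, "text": [["d"], ["e"], ["f", "g"]]}\n'
)
summaries = [summarize(record) for record in iter_follower_feeds(sample)]
print(summaries)
```

For the real data, the StringIO object would be replaced by an open file handle. Reading one line at a time keeps memory use bounded, which matters because each record holds many tweets.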
The celebrity-feeds.ndjson file contains the Twitter timelines of the original celebrities, formatted as:
{"id": 1234, "text": ["a tweet of celebrity 1", "another tweet of celebrity 1", ...]}
{"id": 5678, "text": ["a tweet of celebrity 2", "another tweet", ...]}
The labels.ndjson file contains the classes that should be predicted. A valid submission has to produce a labels.ndjson from the given follower-feeds.ndjson and contain an entry for each id given in the input.
{"id": 1234, "occupation": "sports", "gender": "female", "birthyear": 2002}
{"id": 5678, "occupation": "professional", "gender": "male", "birthyear": 1990}
The following values are possible for each of the traits:
occupation := {sports, performer, creator, politics}
birthyear := {1940, ..., 1999}
gender := {male, female}
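The submission rules and value sets above can be checked mechanically. The sketch below is one possible validator, not an official one: the function name check_labels and the message format are assumptions. Note that the two illustrative label lines shown earlier use a birthyear of 2002 and the occupation "professional", which fall outside the listed sets, so this check flags them.

```python
import json
from typing import Iterable, List, Set

# Value sets as listed in the description.
OCCUPATIONS = {"sports", "performer", "creator", "politics"}
GENDERS = {"male", "female"}
BIRTHYEARS = range(1940, 2000)  # 1940 through 1999 inclusive


def check_labels(lines: Iterable[str], expected_ids: Set[int]) -> List[str]:
    """Return a list of problems found in labels.ndjson lines (hypothetical helper)."""
    problems: List[str] = []
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        rec_id = rec.get("id")
        seen.add(rec_id)
        if rec.get("occupation") not in OCCUPATIONS:
            problems.append(f"{rec_id}: occupation {rec.get('occupation')!r}")
        if rec.get("gender") not in GENDERS:
            problems.append(f"{rec_id}: gender {rec.get('gender')!r}")
        if rec.get("birthyear") not in BIRTHYEARS:
            problems.append(f"{rec_id}: birthyear {rec.get('birthyear')!r}")
    # Every id from the input must have an entry.
    for missing in sorted(expected_ids - seen):
        problems.append(f"{missing}: no entry")
    return problems


labels = [
    '{"id": 1234, "occupation": "sports", "gender": "female", "birthyear": 2002}',
    '{"id": 5678, "occupation": "professional", "gender": "male", "birthyear": 1990}',
]
# 9999 is a made-up id used to demonstrate the missing-entry check.
problems = check_labels(labels, expected_ids={1234, 5678, 9999})
print(problems)
```

In practice, expected_ids would be collected from the id key of each line in follower-feeds.ndjson.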
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes all the processed data used for experimentation in the article "Perfilado Demográficos de Celebridades en Redes Sociales" - "Demographic Profiling of Celebrities in Social Networks", published in the journal Research in Computer Science. The dataset is a processed version of the training part from the CLEF 2020 celebrity profiling task (https://pan.webis.de/clef20/pan20-web/celebrity-profiling.html). The dataset consists of 5,066,608 tweets corresponding to 1,920 Twitter celebrities. All the tweets are in English. The dataset includes several files:
The 5,066,608 tweets in English
Four files indicating the gender, age, occupation and user associated with each tweet.
A list of 1,374 common English abbreviations used in social networks
The five features extracted from the tweets and used for the experiments: words, emoticons/emojis, hashtags, ats (@ mentions), abbreviations
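To illustrate what such per-tweet features might look like, here is a rough sketch that pulls hashtags, ats, abbreviations and words out of a single tweet. It is not the article's extraction procedure: the regular expressions, the function name extract_features, and the two-entry abbreviation set are all assumptions, and emoticon/emoji extraction is left out for brevity.

```python
import re
from typing import Dict, List, Set

HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")
WORD = re.compile(r"[A-Za-z']+")


def extract_features(tweet: str, abbreviations: Set[str]) -> Dict[str, List[str]]:
    """Split one tweet into four of the five feature types (hypothetical helper)."""
    hashtags = HASHTAG.findall(tweet)
    mentions = MENTION.findall(tweet)
    # Remove hashtags and mentions so they are not counted as words.
    stripped = MENTION.sub(" ", HASHTAG.sub(" ", tweet))
    tokens = [token.lower() for token in WORD.findall(stripped)]
    return {
        "hashtags": hashtags,
        "ats": mentions,
        "abbreviations": [t for t in tokens if t in abbreviations],
        "words": [t for t in tokens if t not in abbreviations],
    }


# The abbreviation set here is a tiny made-up stand-in for the 1,374-entry list.
features = extract_features("omg @nasa the launch was great #space", {"omg", "lol"})
print(features)
```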