2 datasets found
  1. PAN20 Authorship Analysis: Celebrity Profiling

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Oct 25, 2023
    Cite
    Matti Wiegmann; Benno Stein; Martin Potthast (2023). PAN20 Authorship Analysis: Celebrity Profiling [Dataset]. http://doi.org/10.5281/zenodo.4461887
    Available download formats: zip
    Dataset updated
    Oct 25, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Matti Wiegmann; Benno Stein; Martin Potthast
    Description

    Synopsis

    • Task: Given the Twitter feeds of a celebrity's followers, determine the celebrity's occupation, age, and gender.
    • Evaluation: [code]
    • Baselines: [code]
    • See the full Shared Task [here]

    Each dataset contains three files: follower-feeds.ndjson as input, labels.ndjson as output, and celebrity-feeds.ndjson for additional study. Each file lists all celebrities as JSON objects, one per line, identified by the id key. The training dataset contains 1,920 celebrities and is balanced with respect to gender and occupation. The supplement dataset contains the remaining 8,265 celebrities and is not balanced in any way.

    For each celebrity, follower-feeds.ndjson contains the English tweets of at least 10 followers, each with at least 50 tweets (excluding retweets). It is formatted as:

    {"id": 1234, "text": [["a tweet of follower 1", "another tweet of follower 1", ...], ["a tweet of follower 2", ...], ...]}
    {"id": 5678, "text": [["a tweet of follower 1", "another tweet of follower 1", ...], ["a tweet of follower 2", ...], ...]}
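    The nested layout above (one JSON object per line, one inner list per follower) can be read with the standard library alone. The following is a minimal sketch, not the task's official loader: the helper names iter_follower_feeds and feed_stats are hypothetical, and the sample records are made up and far smaller than the real minimums of 10 followers and 50 tweets.

```python
import io
import json
from typing import Iterator, TextIO


def iter_follower_feeds(stream: TextIO) -> Iterator[tuple[int, list[list[str]]]]:
    """Yield (celebrity id, follower timelines) from an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        # "text" is a list of followers; each follower is a list of tweets.
        yield record["id"], record["text"]


def feed_stats(followers: list[list[str]]) -> tuple[int, int]:
    """Return (number of followers, smallest tweet count among them)."""
    return len(followers), min((len(tweets) for tweets in followers), default=0)


# Tiny in-memory stand-in for follower-feeds.ndjson (made-up data).
sample = io.StringIO(
    '{"id": 1234, "text": [["t1", "t2"], ["t3"]]}\n'
    '{"id": 5678, "text": [["t4", "t5", "t6"]]}\n'
)
stats = {cid: feed_stats(followers) for cid, followers in iter_follower_feeds(sample)}
print(stats)
```

    Reading line by line keeps memory use bounded by a single celebrity's record, which matters because each record holds the full timelines of all its followers. For a real file, pass an open file handle instead of the StringIO object.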

    The celebrity-feeds.ndjson contains the Twitter timelines of the original celebrities, formatted as:

    {"id": 1234, "text": ["a tweet of celebrity 1", "another tweet of celebrity 1", ...]}
    {"id": 5678, "text": ["a tweet of celebrity 2", "another tweet", ...]}

    The labels.ndjson contains the classes to be predicted. A valid submission must produce a labels.ndjson from the follower-feeds.ndjson, with one entry for each id in the input.

    {"id": 1234, "occupation": "sports", "gender": "female", "birthyear": 2002}
    {"id": 5678, "occupation": "professional", "gender": "male", "birthyear": 1990}

    The following values are possible for each of the traits:

    occupation := {sports, performer, creator, politics}
    birthyear  := {1940, ..., 1999}
    gender     := {male, female}
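    The two submission requirements stated above (one entry per input id, and trait values drawn from the listed sets) can be checked mechanically. The sketch below is an assumption about how such a check might look, not the task's evaluation code; validate_labels is a hypothetical name. Note that the illustrative label rows shown earlier use birthyear 2002 and occupation "professional", which fall outside the listed sets, so a strict check like this one flags them.

```python
import json

# Allowed values, copied from the trait definitions in the description.
OCCUPATIONS = {"sports", "performer", "creator", "politics"}
GENDERS = {"male", "female"}
BIRTHYEARS = range(1940, 2000)  # 1940..1999 inclusive


def validate_labels(label_lines, input_ids):
    """Return a list of problems found; an empty list means no issues."""
    problems = []
    seen = set()
    for n, line in enumerate(label_lines, start=1):
        if not line.strip():
            continue
        rec = json.loads(line)
        cid = rec.get("id")
        seen.add(cid)
        if rec.get("occupation") not in OCCUPATIONS:
            problems.append(f"line {n}: id {cid}: occupation {rec.get('occupation')!r} not allowed")
        if rec.get("gender") not in GENDERS:
            problems.append(f"line {n}: id {cid}: gender {rec.get('gender')!r} not allowed")
        if rec.get("birthyear") not in BIRTHYEARS:
            problems.append(f"line {n}: id {cid}: birthyear {rec.get('birthyear')!r} not in 1940..1999")
    # Every id from the input file must have an entry.
    for cid in sorted(set(input_ids) - seen):
        problems.append(f"missing entry for id {cid}")
    return problems


# The two example rows from the description, plus one input id without a label.
rows = [
    '{"id": 1234, "occupation": "sports", "gender": "female", "birthyear": 2002}',
    '{"id": 5678, "occupation": "professional", "gender": "male", "birthyear": 1990}',
]
problems = validate_labels(rows, input_ids=[1234, 5678, 9999])
for p in problems:
    print(p)
```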

  2. Processed data for the article "Perfilado Demográficos de Celebridades en...

    • data.niaid.nih.gov
    Updated May 18, 2021
    Cite
    Gomez, Juan Carlos (2021). Processed data for the article "Perfilado Demográficos de Celebridades en Redes Sociales" - "Demographic Profiling of Celebrities in Social Networks" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4767750
    Dataset updated
    May 18, 2021
    Dataset provided by
    López-Santamaría, Luis-Miguel
    Gomez, Juan Carlos
    Alonso Sánchez, Juan Carlos
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes all the processed data used for experimentation in the article "Perfilado Demográficos de Celebridades en Redes Sociales" - "Demographic Profiling of Celebrities in Social Networks", published in the journal Research in Computer Science. The dataset is a processed version of the training part from the CLEF 2020 celebrity profiling task (https://pan.webis.de/clef20/pan20-web/celebrity-profiling.html). The dataset consists of 5,066,608 tweets corresponding to 1,920 Twitter celebrities. All the tweets are in English. The dataset includes several files:

    1. The 5,066,608 tweets in English

    2. Four files indicating the gender, age, occupation, and user associated with each tweet.

    3. A list of 1,374 common English abbreviations used in social networks.

    4. The five features extracted from the tweets and used for the experiments: words, emoticons/emojis, hashtags, at-mentions ("ats"), and abbreviations.
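    To illustrate how the five feature types in item 4 differ, the following sketch splits a single tweet into them using regular expressions. This is an assumption for illustration only: the article's actual extraction procedure is not described here, the patterns are deliberately rough (for example, the emoticon pattern only covers a few ASCII forms), and the three-entry ABBREVIATIONS set is a hypothetical stand-in for the 1,374-entry list in the dataset.

```python
import re

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
# A few common ASCII emoticons such as :) ;-) =D :P (far from exhaustive).
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
# Rough emoji coverage: main pictograph blocks plus misc symbols/dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
WORD_RE = re.compile(r"[a-z']+")

# Hypothetical stand-in for the dataset's abbreviation list.
ABBREVIATIONS = {"lol", "brb", "omg"}


def extract_features(tweet: str) -> dict[str, list[str]]:
    """Split one tweet into the five feature types named in the description."""
    hashtags = HASHTAG_RE.findall(tweet)
    ats = MENTION_RE.findall(tweet)
    emoticons = EMOTICON_RE.findall(tweet) + EMOJI_RE.findall(tweet)
    # Blank out the tokens already captured so they are not counted as words.
    rest = tweet
    for pattern in (HASHTAG_RE, MENTION_RE, EMOTICON_RE, EMOJI_RE):
        rest = pattern.sub(" ", rest)
    tokens = WORD_RE.findall(rest.lower())
    return {
        "words": [t for t in tokens if t not in ABBREVIATIONS],
        "emoticons": emoticons,
        "hashtags": hashtags,
        "ats": ats,
        "abbreviations": [t for t in tokens if t in ABBREVIATIONS],
    }


features = extract_features("omg @alice that goal was unreal :) #worldcup \U0001F600")
print(features)
```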

