How many people use social media?
Social media usage is one of the most popular online activities. In 2024, over five billion people were using social media worldwide, a number projected to increase to over six billion in 2028.
Who uses social media?
Social networking is one of the most popular digital activities worldwide and it is no surprise that social networking penetration across all regions is constantly increasing. As of January 2023, the global social media usage rate stood at 59 percent. This figure is anticipated to grow as less developed digital markets catch up with other regions when it comes to infrastructure development and the availability of cheap mobile devices. In fact, most of social media’s global growth is driven by the increasing usage of mobile devices. Mobile-first market Eastern Asia topped the global ranking of mobile social networking penetration, followed by established digital powerhouses such as the Americas and Northern Europe.
How much time do people spend on social media?
Social media is an integral part of daily internet usage. On average, internet users spend 151 minutes per day on social media and messaging apps, an increase of 40 minutes since 2015. Internet users in Latin America spent the most time per day on social media.
What are the most popular social media platforms?
Market leader Facebook was the first social network to surpass one billion registered accounts and currently boasts approximately 2.9 billion monthly active users, making it the most popular social network worldwide. In June 2023, the top social media apps in the Apple App Store included mobile messaging apps WhatsApp and Telegram Messenger, as well as the ever-popular app version of Facebook.

The global social media penetration rate was forecast to increase continuously between 2024 and 2028 by a total of 11.6 percentage points (+18.19 percent). After the ninth consecutive year of growth, the penetration rate is estimated to reach 75.31 percent, a new peak, in 2028. Notably, the social media penetration rate had increased continuously over the past years.
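The relationship between the three forecast figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 11.6 figure is an increase in percentage points and the 75.31 figure is a percentage; the variable names are illustrative only. Because the quoted inputs are rounded, the result lands close to, but not exactly on, the quoted 18.19 percent.

```python
# Figures quoted in the forecast above.
rate_2028 = 75.31          # forecast penetration rate in 2028 (percent)
absolute_increase = 11.6   # forecast increase, 2024 to 2028 (assumed percentage points)

# Implied 2024 baseline and the relative growth it corresponds to.
rate_2024 = rate_2028 - absolute_increase
relative_growth = absolute_increase / rate_2024 * 100

print(f"Implied 2024 rate: {rate_2024:.2f} percent")
print(f"Relative growth:   {relative_growth:.2f} percent")
```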

A global survey conducted in the third quarter of 2024 found that the main reason for using social media was to keep in touch with friends and family, with 50.8 percent of social media users saying this was their main reason for using online networks. Overall, 39 percent of social media users said that filling spare time was their main reason for using social media platforms, whilst 34.5 percent of respondents said they used it to read news stories. Fewer than one in five users were on social platforms to follow celebrities and influencers.
The most popular social network
Facebook dominates the social media landscape. The world's most popular social media platform turned 20 in February 2024, and it continues to lead the way in terms of user numbers. As of February 2025, the social network had over three billion global users. YouTube, Instagram, and WhatsApp follow, but none of these well-known brands can surpass Facebook’s audience size.
Moreover, as of the final quarter of 2023, there were almost four billion Meta product users.
Ever-evolving social media usage
Social media remains largely free to use; however, companies have been encouraging users to become paid subscribers to reduce dependence on advertising profits. Meta Verified entices users by offering a blue verification badge and proactive account protection, among other things. X (formerly Twitter), Snapchat, and Reddit also offer users the chance to upgrade their social media accounts for a monthly fee.

Social media companies are starting to offer users the option to subscribe to their platforms in exchange for monthly fees. Until recently, social media has been predominantly free to use, with tech companies relying on advertising as their main revenue generator. However, advertising revenues have been dropping following the COVID-induced boom. As of July 2023, Meta Verified is the most costly of the subscription services, setting users back almost 15 U.S. dollars per month on iOS or Android. Twitter Blue costs between eight and 11 U.S. dollars per month and gives users the blue check mark, the ability to edit tweets, and NFT profile pictures. Snapchat+, drawing in four million users as of the second quarter of 2023, boasts a Story re-watch function, custom app icons, and a Snapchat+ badge.

Context
A Twitter dataset composed of 20,000 rows, Twitter User Data includes the following information: user name, random tweet, account profile, image, and location information.
Content
The dataset contains the following fields:
unit_id: a unique id for user
golden: whether the user was included in the gold standard for the model; TRUE or FALSE
unit_state: state of the observation; one of finalized (for contributor-judged) or golden (for gold standard observations)
trusted_judgments: number of trusted judgments (int); always 3 for non-golden, and what may be a unique id for gold standard observations
last_judgment_at: date and time of last contributor judgment; blank for gold standard observations
gender: one of male, female, or brand (for non-human profiles)
gender:confidence: a float representing confidence in the provided gender
profile_yn: "no" here seems to mean that the profile was meant to be part of the dataset but was not available when contributors went to judge it
profile_yn:confidence: confidence in the existence/non-existence of the profile
created: date and time when the profile was created
description: the user's profile description
fav_number: number of tweets the user has favorited
gender_gold: if the profile is golden, what is the gender?
link_color: the link color on the profile, as a hex value
name: the user's name
profile_yn_gold: whether the profile y/n value is golden
profileimage: a link to the profile image
retweet_count: number of times the user has retweeted (or possibly, been retweeted)
sidebar_color: color of the profile sidebar, as a hex value
text: text of a random one of the user's tweets
tweet_coord: if the user has location turned on, the coordinates as a string with the format "[latitude, longitude]"
tweet_count: number of tweets that the user has posted
tweet_created: when the random tweet (in the text column) was created
tweet_id: the tweet id of the random tweet
tweet_location: location of the tweet; seems to not be particularly normalized
user_timezone: the timezone of the user
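Since tweet_coord is stored as a string rather than as two numeric columns, it usually needs parsing before any geographic analysis. The following is a minimal sketch, assuming the values look like "[40.7128, -74.006]" and that blank cells mean location was turned off; the helper name `parse_tweet_coord` is hypothetical and not part of the dataset.

```python
import json
from typing import Optional, Tuple


def parse_tweet_coord(raw: Optional[str]) -> Optional[Tuple[float, float]]:
    """Parse a "[latitude, longitude]" string into a (lat, lon) tuple.

    Returns None for blank or malformed values.
    """
    if raw is None or not raw.strip():
        return None
    try:
        values = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(values, list) or len(values) != 2:
        return None
    try:
        lat, lon = float(values[0]), float(values[1])
    except (TypeError, ValueError):
        return None
    return lat, lon


# Example values are made up for illustration.
print(parse_tweet_coord("[40.7128, -74.006]"))
print(parse_tweet_coord(""))
```

Returning None instead of raising keeps the helper easy to apply across all 20,000 rows, where many users will not have location turned on.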
Acknowledgements

License: https://creativecommons.org/publicdomain/zero/1.0/
Facebook is a company that nearly everyone has heard of; it is a household name. People of all age groups use the platform. It has helped many people connect with others, and it has earned some investors a good return. This dataset contains details of the stock of Facebook Inc.
This dataset has 7 columns, including the opening price of the stock, its closing price, and its daily high, among others. It contains daily stock data from 2012 to August 2020.

License: https://cdla.io/permissive-1-0/
Show your skills off in the Social Media Extremism Challenge @ https://www.kaggle.com/competitions/social-media-extremism-detection-challenge! Try your luck at tackling this challenging classification problem! After the competition is completed, we will be adding 200+ hand-labelled entries to this dataset so stay tuned!
We would like to thank Assistant Professor Leilani H. Gilpin (UC Santa Cruz) and the AIEA Lab for their guidance and support in the development of this dataset. —*Aditya Suresh, Anthony Lu, Vishnu Iyer*
About this data: Social media has seen a rise in the quantity and intensity of extremist content across many different services. With cases such as white supremacist movements across the world, recruitment for terrorist organizations through affiliated accounts, and a general sense of hate emerging in the modern era of polarization, it is increasingly vital to recognize these patterns and adequately combat the harms of extremism digitally on a global scale.
Citations: Our dataset would not have been possible without the aid of an already preexisting dataset found on Kaggle, Version 1 of "Hate Speech Detection curated Dataset🤬" by Alban Nyantudre in 2023. The link can be found here: https://www.kaggle.com/datasets/waalbannyantudre/hate-speech-detection-curated-dataset/data. Accessed in 2025, it was truly essential to our work. With over 400,000 messages of real, cleaned posts, we would not have been able to source and label our data points without this crucial resource.
Classification: Our team hand labelled nearly 3,000 pieces of data from our sourced database of posts, assigning every one of them a blanket tag of "EXTREMIST" or "NON_EXTREMIST." As many messages rely on context to spread harmful rhetoric, we followed a general rule of classifying terms as extremist so long as they "provoked harm to a person or a group of people, whether it be through advocacy for violence, discrimination, or other hurtful sentiments, based off of a characteristic of the group."
Value of the data: This dataset can be used to build extremist sentiment analysis systems and machine learning algorithms, as it reflects current language use, according to the source material for the data points themselves. In addition, it can be used as a benchmark for comparison with other extremism datasets and other extremist sentiment analysis systems.
Potential Errors: Although we are confident in our labeling, some data points may be mislabeled: the data points lack quantifiable identifiers, so human error is possible. We do not believe this occurs often but, in full transparency, it is an issue that we endeavor to resolve in subsequent updates.

Point-of-interest (POI) is defined as a physical entity (such as a business) in a geo location (point) which may be (of interest).
We strive to provide the most accurate, complete and up to date point of interest datasets for all countries of the world. The Australian POI Dataset is one of our worldwide POI datasets with over 98% coverage.
This is our process flow:
Our machine learning systems continuously crawl for new POI data
Our geoparsing and geocoding calculate their geo locations
Our categorization systems clean up and standardize the datasets
Our data pipeline API publishes the datasets on our data store
POI Data is in a constant flux - especially so during times of drastic change such as the Covid-19 pandemic.
Every minute worldwide on an average day over 200 businesses will move, over 600 new businesses will open their doors and over 400 businesses will cease to exist.
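To put the per-minute figures in perspective, they can be scaled to a full day. This is a rough sketch that simply multiplies the quoted rates by the number of minutes in a day; since each figure is stated as "over" a value, the results are lower bounds, and the dictionary keys are illustrative names only.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440

# Per-minute rates quoted above (lower bounds, since each is "over" the figure).
per_minute = {"moved": 200, "opened": 600, "closed": 400}

per_day = {event: rate * MINUTES_PER_DAY for event, rate in per_minute.items()}
net_new_per_day = per_day["opened"] - per_day["closed"]

for event, count in per_day.items():
    print(f"{event}: at least {count:,} businesses per day")
print(f"net new: about {net_new_per_day:,} businesses per day")
```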
In today's interconnected world, of the approximately 200 million POIs worldwide, over 94% have a public online presence. As a new POI comes into existence its information will appear very quickly in location based social networks (LBSNs), other social media, pictures, websites, blogs, press releases. Soon after that, our state-of-the-art POI Information retrieval system will pick it up.
We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via a recurring payment plan on our data update pipeline.
The main differentiators between us vs the competition are our flexible licensing terms and our data freshness.
The core attribute coverage for Australia is as follows:
Poi Field            Data Coverage (%)
poi_name             100
brand                13
poi_tel              49
formatted_address    100
main_category        94
latitude             100
longitude            100
neighborhood         3
source_url           55
email                10
opening_hours        41
building_footprint   60
The dataset may be viewed online at https://store.poidata.xyz/au and a data sample may be downloaded at https://store.poidata.xyz/datafiles/au_sample.csv

During a January 2024 global survey among marketers, nearly 60 percent reported plans to increase their organic use of YouTube for marketing purposes in the following 12 months. LinkedIn and Instagram followed, respectively mentioned by 57 and 56 percent of the respondents intending to use them more. According to the same survey, Facebook was the most important social media platform for marketers worldwide.

License: https://creativecommons.org/publicdomain/zero/1.0/
By Twitter [source]
This dataset provides an insight into the reach and impact of Jacksepticeye's tweets. With curated content covering everything from gaming to life reflections, these tweets offer a snapshot not only of his global popularity, but also his ability to engage with an audience and ignite conversation. From each tweet, you can learn data points like its content, the number of likes it received, which replies popped up in response, how many times it was retweeted or marked as a favorite, and the overall relevance of that particular tweet in terms of its contribution to conversations worldwide. This comprehensive dataset is a great opportunity to explore the power behind Jacksepticeye's social media presence!
This dataset is in csv format and contains information about different tweets such as their content and the response they received from audiences in terms of likes, retweets and other measures. The following columns are included:
- Tweet ID: A unique identifier for each tweet
- Tweet content: The text contained within a tweet
- Likes: Number of times a user has interacted with a specific tweet by pressing the “like” button
- Replies: Number of direct replies to the original tweet
- Re-Tweets: Number of times users have shared/re-tweeted a specific tweet
- Retweeted: Indicates whether or not it was retweeted by someone else
- Relevance: A measure of how relevant this conversation was at that particular time
This data can be used for an array of tasks such as sentiment analysis (measuring how people feel about certain topics) or network analysis (understanding who was most influential in spreading Jacksepticeye's message). You could also use this data to understand any changes in engagement metrics over time or measure which topics generate greater responses from audiences.
To begin using this dataset, first import it into your scripting language of choice. You can then start exploring it by asking questions such as ‘Which types of posts perform better?’ or ‘What types of conversations does Jacksepticeye tend to have?’ Focusing on one question at a time makes it easier to look for correlations between variables and to understand why certain types of posts perform differently from others. With select and filter operations you can sort posts into ad hoc groups that answer your initial questions ('gaming', 'travel', etc.). Once posts are grouped this way, engagement figures and relevance indices become much easier to manage and interpret, because they are read in a meaningful context rather than as individual observations. Working from an existing notebook can greatly increase efficiency, so if one already exists (and the dataset is not updated frequently), take advantage of it!
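The grouping step described above can be sketched with the standard library alone. The rows, the dictionary keys `content` and `likes`, and the keyword lists below are all made up for illustration; real rows would come from the CSV file, whose exact header names should be checked first.

```python
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration only; real rows would come from the CSV file.
tweets = [
    {"content": "New gaming video is up!", "likes": 1200},
    {"content": "Travel day, see you soon", "likes": 300},
    {"content": "Streaming a horror game tonight", "likes": 900},
    {"content": "Thank you all for the support", "likes": 500},
]

# Hypothetical ad hoc groups: a tweet joins the first group whose keyword it contains.
groups = {"gaming": ("gaming", "game"), "travel": ("travel",)}


def assign_group(text: str) -> str:
    lowered = text.lower()
    for name, keywords in groups.items():
        if any(keyword in lowered for keyword in keywords):
            return name
    return "other"


likes_by_group = defaultdict(list)
for tweet in tweets:
    likes_by_group[assign_group(tweet["content"])].append(tweet["likes"])

# Average likes per group, so figures are compared within a meaningful context.
average_likes = {name: mean(values) for name, values in likes_by_group.items()}
print(average_likes)
```

Simple substring matching is a deliberately crude choice here; a real analysis would likely need better topic assignment.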
Identifying the types of content that perform best on the platform: By analyzing the engagement, reach, and popularity of tweets, marketers can determine which topics generate higher engagement and reach to inform their own strategies.
Assessing user interactions: Examining reply counts and retweet counts reveals how users interact with Jacksepticeye's posts, helping to inform a better understanding of user dynamics on Twitter.
Measuring influencer marketing ROI: Since this dataset contains the number of likes and retweets for each post, marketers can compare these values to assess the success of an influencer marketing campaign by determining whether it had a positive effect on followers' engagement with Jacksepticeye's content.
If you use this dataset in your research, please credit the original authors and Twitter.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication. No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The official Meta-Kaggle dataset contains the Users.csv file which contains Username, DisplayName, RegisterDate, and PerformanceTier fields but doesn't contain location data of the Kaggle Users. This dataset augments that data with additional country and region information.
I haven't included the username and displayname values on purpose, just the userid to be joined back to the Meta-Kaggle official Users.csv file.
It is possible that some users haven't inputted their details when the scraper went through their accounts and thus have missing data. Another possibility is that users may have updated their info after the scraper went through their accounts, thus resulting in inconsistencies.
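Joining the location data back to Users.csv by user id might look like the sketch below. The inline CSV text stands in for the two files, and the column names (`Id` in the users file, `userid`, `country`, `region` in the location file) and sample rows are assumptions for illustration; check the actual headers before adapting it. A left join is used so that users with missing location details are kept rather than dropped.

```python
import csv
import io

# Tiny inline stand-ins for the two files; column names and rows are assumptions.
users_csv = io.StringIO(
    "Id,UserName,DisplayName\n"
    "1,alice,Alice A\n"
    "2,bob,Bob B\n"
    "3,carol,Carol C\n"
)
locations_csv = io.StringIO(
    "userid,country,region\n"
    "1,Germany,Europe\n"
    "3,Brazil,South America\n"
)

# Index the location rows by user id for constant-time lookup.
locations = {row["userid"]: row for row in csv.DictReader(locations_csv)}

# Left join: keep every user, leaving location fields empty when missing.
joined = []
for user in csv.DictReader(users_csv):
    location = locations.get(user["Id"], {})
    joined.append({
        **user,
        "country": location.get("country", ""),
        "region": location.get("region", ""),
    })

for row in joined:
    print(row)
```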

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains the top 50 trending topics (trends) on Twitter, obtained from the Twitter Trends API at an hourly rate. For each hour, the dataset contains a row with the date, time, trending topic, and the related tweet count (if available). The data covers more than 97% of 2018, the share of the year for which our script was available.

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General data collected for the study "Analysis of the Quantitative Impact of Social Networks on Web Traffic of Cybermedia in the 27 Countries of the European Union".
Four research questions are posed: what percentage of the total web traffic generated by cybermedia in the European Union comes from social networks? Is said percentage higher or lower than that provided through direct traffic and through the use of search engines via SEO positioning? Which social networks have a greater impact? And is there any degree of relationship between the specific weight of social networks in the web traffic of a cybermedia and circumstances such as the average duration of the user's visit, the number of page views or the bounce rate understood in its formal aspect of not performing any kind of interaction on the visited page beyond reading its content?
To answer these questions, we have first proceeded to a selection of the cybermedia with the highest web traffic of the 27 countries that are currently part of the European Union after the United Kingdom left on December 31, 2020. In each nation we have selected five media using a combination of the global web traffic metrics provided by the tools Alexa (https://www.alexa.com/), which ceased to be operational on May 1, 2022, and SimilarWeb (https://www.similarweb.com/). We have not used local metrics by country since the results obtained with these first two tools were sufficiently significant and our objective is not to establish a ranking of cybermedia by nation but to examine the relevance of social networks in their web traffic.
In all cases, cybermedia whose property corresponds to a journalistic company have been selected, ruling out those belonging to telecommunications portals or service providers; in some cases they correspond to classic information companies (both newspapers and televisions) while in others they refer to digital natives, without this circumstance affecting the nature of the research proposed.
Next, we examined the web traffic data of these cybermedia. The period selected covers October, November and December 2021 and January, February and March 2022. We believe that this six-month stretch smooths out possible one-off variations in a single month, reinforcing the precision of the data obtained.
To secure this data, we have used the SimilarWeb tool, currently the most precise tool that exists when examining the web traffic of a portal, although it is limited to that coming from desktops and laptops, without taking into account those that come from mobile devices, currently impossible to determine with existing measurement tools on the market.
It includes:
Web traffic general data: average visit duration, pages per visit and bounce rate
Web traffic origin by country
Percentage of traffic generated from social media over total web traffic
Distribution of web traffic generated from social networks
Comparison of web traffic generated from social networks with direct and search procedures

License: https://www.gesis.org/en/institute/data-usage-terms
Social Media Monitoring of the German Federal Election Campaign 2017
This dataset contains results from the social media monitoring of Facebook and Twitter for the German federal election campaign 2017. The project collected the tweets and Facebook posts of political candidates and organizations and the engagement of users with these contents – retweets and @-mentions on Twitter, comments, shares and likes on Facebook. Finally, all messages on Twitter containing at least one keyword denoting central political topics were collected. All data was publicly available at the time of data collection. The collected data is proprietary and owned by Facebook and Twitter. Due to this and with respect to privacy restrictions, only the following aspects of the data can be shared:
(1) A list of all candidates that were considered in the project, their key attributes and the identification of their respective Twitter accounts and Facebook pages.
Candidate dataset: Full surname, all first names of the candidate; academic title and name pre- or suffixes (if they exist); URL of the first Facebook account; URL of the second Facebook account; URL of the Twitter account; candidate is placed on a party list; candidate’s place on the party list; candidate is a direct candidate in one of the constituencies; official number and official name of the constituency in which the candidate is running for a direct mandate; state; candidate is a member of the federal parliament (Bundestag); party of the candidate; sex, age (year of birth); place of residence; place of birth; profession.
Additionally coded was: unique ID.
(2) Lists of organizations relevant during an election campaign, i.e. political parties and important gatekeepers, along with their respective Twitter and Facebook accounts.
(3) A list of tweet IDs which can be used to retrieve the tweets we collected during our research period.

Point-of-interest (POI) is defined as a physical entity (such as a business) in a geo location (point) which may be (of interest).
We strive to provide the most accurate, complete and up to date point of interest datasets for all countries of the world. The Sweden POI Dataset is one of our worldwide POI datasets with over 98% coverage.
This is our process flow:
Our machine learning systems continuously crawl for new POI data
Our geoparsing and geocoding calculate their geo locations
Our categorization systems clean up and standardize the datasets
Our data pipeline API publishes the datasets on our data store
POI Data is in a constant flux - especially so during times of drastic change such as the Covid-19 pandemic.
Every minute worldwide on an average day over 200 businesses will move, over 600 new businesses will open their doors and over 400 businesses will cease to exist.
In today's interconnected world, of the approximately 200 million POIs worldwide, over 94% have a public online presence. As a new POI comes into existence its information will appear very quickly in location based social networks (LBSNs), other social media, pictures, websites, blogs, press releases. Soon after that, our state-of-the-art POI Information retrieval system will pick it up.
We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via a recurring payment plan on our data update pipeline.
The main differentiators between us vs the competition are our flexible licensing terms and our data freshness.
The core attribute coverage is as follows:
Poi Field            Data Coverage (%)
poi_name             100
brand                9
poi_tel              46
formatted_address    100
main_category        97
latitude             100
longitude            100
neighborhood         5
source_url           60
email                12
opening_hours        38
The dataset may be viewed online at https://store.poidata.xyz/se and a data sample may be downloaded at https://store.poidata.xyz/datafiles/se_sample.csv

License: https://creativecommons.org/publicdomain/zero/1.0/
The data was obtained through the utilization of snscrape. The query used for retrieval was based on individual emojis. Relevant data was identified, and subsequently assessed for the presence of emojis as well as the sentence's adherence to English language conventions. The language detection analysis was conducted using pycld3, which was inspired by the paper "The WiLI benchmark dataset for written language identification." Each csv file consists of 20,000 distinct data entries. The file name is created based on emoji package (emoji.EMOJI_DATA) in Python.
It should be noted that given the possible occurrence of small errors associated with pycld3, along with the potential for multiple emojis per data entry, there may exist instances of non-English tweets or duplicated tweets across different CSV files.
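Because a tweet with several emojis can appear in more than one file, it may be worth deduplicating when combining files. The sketch below keeps the first occurrence of each tweet id. The file names, the `id` and `text` column names, and the rows are all assumptions for illustration, with inline text standing in for the real CSV files.

```python
import csv
import io

# Inline stand-ins for two per-emoji CSV files; the file names and the
# "id" and "text" column names are assumptions, not confirmed by the source.
files = {
    "face_with_tears_of_joy.csv": io.StringIO("id,text\n1,so funny 😂\n2,lol 😂❤\n"),
    "red_heart.csv": io.StringIO("id,text\n2,lol 😂❤\n3,love this ❤\n"),
}

seen_ids = set()
unique_rows = []
for name, handle in files.items():
    for row in csv.DictReader(handle):
        if row["id"] in seen_ids:
            continue  # already collected from an earlier file
        seen_ids.add(row["id"])
        # Remember which file the surviving copy came from.
        unique_rows.append({**row, "source_file": name})

print(len(unique_rows), "unique tweets")
```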

Point-of-interest (POI) is defined as a physical entity (such as a business) in a geo location (point) which may be (of interest).
We strive to provide the most accurate, complete and up to date point of interest datasets for all countries of the world. The Turkey POI Dataset is one of our worldwide POI datasets with over 98% coverage.
This is our process flow:
Our machine learning systems continuously crawl for new POI data
Our geoparsing and geocoding calculate their geo locations
Our categorization systems clean up and standardize the datasets
Our data pipeline API publishes the datasets on our data store
POI Data is in a constant flux - especially so during times of drastic change such as the Covid-19 pandemic.
Every minute worldwide on an average day over 200 businesses will move, over 600 new businesses will open their doors and over 400 businesses will cease to exist.
In today's interconnected world, of the approximately 200 million POIs worldwide, over 94% have a public online presence. As a new POI comes into existence its information will appear very quickly in location based social networks (LBSNs), other social media, pictures, websites, blogs, press releases. Soon after that, our state-of-the-art POI Information retrieval system will pick it up.
We offer our customers perpetual data licenses for any dataset representing this ever changing information, downloaded at any given point in time. This makes our company's licensing model unique in the current Data as a Service - DaaS Industry. Our customers don't have to delete our data after the expiration of a certain "Term", regardless of whether the data was purchased as a one time snapshot, or via a recurring payment plan on our data update pipeline.
The main differentiators between us vs the competition are our flexible licensing terms and our data freshness.
The core attribute coverage is as follows:
Poi Field            Data Coverage (%)
poi_name             100
brand                7
poi_tel              49
formatted_address    100
main_category        98
latitude             100
longitude            100
neighborhood         90
source_url           35
email                4
opening_hours        48
The dataset may be viewed online at https://store.poidata.xyz/tr and a data sample may be downloaded at https://store.poidata.xyz/datafiles/tr_sample.csv

How much time do people spend on social media?
As of 2024, the average daily social media usage of internet users worldwide amounted to 143 minutes per day, down from 151 minutes in the previous year. Currently, the country with the most time spent on social media per day is Brazil, with online users spending an average of three hours and 49 minutes on social media each day. In comparison, the daily time spent with social media in the U.S. was just two hours and 16 minutes.
Global social media usage
Currently, the global social network penetration rate is 62.3 percent. Northern Europe had an 81.7 percent social media penetration rate, topping the ranking of global social media usage by region. Eastern and Middle Africa closed the ranking with 10.1 and 9.6 percent usage reach, respectively.
People access social media for a variety of reasons. Users like to find funny or entertaining content and enjoy sharing photos and videos with friends, but mainly use social media to stay in touch with current events and friends.
Global impact of social media
Social media has a wide-reaching and significant impact on not only online activities but also offline behavior and life in general.
During a global online user survey in February 2019, a significant share of respondents stated that social media had increased their access to information, ease of communication, and freedom of expression. On the flip side, respondents also felt that social media had worsened their personal privacy, increased polarization in politics and heightened everyday distractions.

License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In social networks, it is conventionally thought that two individuals with more overlapping friends tend to establish a new friendship, which could be stated as homophily breeding new connections. Meanwhile, the recent hypothesis of maximum information entropy has been presented as the possible origin of effective navigation in small-world networks. Through both theoretical and experimental analysis, we find that there is a competition between information entropy maximization and homophily in local structure. This competition suggests that a newly built relationship between two individuals with more common friends would lead to less information entropy gain for them. We demonstrate that in the evolution of the social network, the two assumptions coexist. The rule of maximum information entropy produces weak ties in the network, while the law of homophily makes the network highly clustered locally, giving individuals strong and trusted ties. A toy model is also presented to demonstrate the competition and evaluate the roles of different rules in the evolution of real networks. Our findings could shed light on social network modeling from a new perspective.
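The homophily side of this competition rests on counting common friends between two unconnected individuals. The following is a minimal sketch of that count on a small made-up network; it is not the toy model from the paper and does not compute information entropy. The node names and the `common_friends` helper are illustrative only.

```python
from itertools import combinations

# A small made-up undirected friendship network, stored as adjacency sets.
friends = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C", "E"},
    "E": {"D"},
}


def common_friends(u: str, v: str) -> set:
    """Return the friends shared by u and v."""
    return friends[u] & friends[v]


# Score every pair that is not yet connected by its number of shared friends.
candidates = [
    (u, v, len(common_friends(u, v)))
    for u, v in combinations(sorted(friends), 2)
    if v not in friends[u]
]

# Under a pure homophily rule, pairs with more common friends link up first.
candidates.sort(key=lambda item: item[2], reverse=True)
for u, v, shared in candidates:
    print(f"{u}-{v}: {shared} common friend(s)")
```

In this example the pair B-D shares the most friends, so a homophily-driven rule would connect it first, while the abstract's argument is that such a link would bring comparatively little information entropy gain.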
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains tweets in Sundanese, the second-largest local language in Indonesia, and is used for emotion classification.
Dataset Characteristics: Tabular
Subject Area: Computer Science
Associated Tasks: Classification
Instances: 2510
For what purpose was the dataset created?
This dataset was created as a contribution to NLP research, particularly in Indonesia.
Who funded the creation of the dataset?
This dataset is self-funded.
What do the instances in this dataset represent?
Tweets
Are there recommended data splits?
No
Was there any data preprocessing performed?
tokenization, stopword removal, stemming
Has Missing Values?
No
Title: Sundanese Twitter Dataset for Emotion Classification
Authors: Oddy Virgantara Putra; Fathin Muhammad Wasmanson; Triana Harmini; Shoffin Nahwa Utama. 2020
Journal: Published in Conference
Link: https://ieeexplore.ieee.org/abstract/document/9297929
The Sundanese are the second-largest tribe in Indonesia, and their language has many dialects. This has drawn the attention of many researchers to emotion analysis, especially on social media. However, because Sundanese datasets are barely available, understanding emotion in Sundanese text is a challenging task. In this research, we propose a dataset for emotion classification of Sundanese text. The preprocessing includes case folding, stopword removal, stemming, tokenizing, and text representation. Prior to classification, we use term frequency-inverse document frequency (TF-IDF) for feature generation. We evaluated our dataset using k-fold cross-validation. Our experiments with the proposed method show effective results for machine learning classification. Furthermore, as far as we know, this is the first Sundanese emotion dataset available to the public.
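The pipeline named in the abstract (case folding, tokenizing, stopword removal, TF-IDF features, k-fold evaluation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list, the English example texts, the particular TF-IDF weighting (relative term frequency times log(N/df)), and all function names are hypothetical, and stemming is omitted because it needs a language-specific stemmer.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; the real dataset is Sundanese, not English.
STOPWORDS = {"the", "a", "is", "so", "i", "am"}

def preprocess(text):
    """Case folding, tokenizing, and stopword removal (stemming omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

# Made-up example texts standing in for tweets.
tweets = ["I am so happy today", "The rain is so sad",
          "Happy happy day", "I am angry"]
docs = [preprocess(t) for t in tweets]
vectors = tfidf(docs)
folds = list(k_fold_indices(len(docs), 2))
```

Each fold's training indices would be used to fit a classifier on the corresponding TF-IDF vectors and the held-out indices to score it. In practice the data would usually be shuffled before splitting, and the document frequencies would be computed on the training fold only to avoid leaking information from the test fold.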
Citation: Putra, Oddy Virgantara. (2021). Sundanese Twitter Dataset. UCI Machine Learning Repository. https://doi.org/10.24432/C5MK8C.
BibTex: @misc{misc_sundanese_twitter_dataset_695,
author = {Putra,Oddy Virgantara},
title = {{Sundanese Twitter Dataset}},
year = {2021},
howpublished = {UCI Machine Learning Repository},
note = {{DOI}: https://doi.org/10.24432/C5MK8C}
}