The Weibo NER dataset is a Chinese Named Entity Recognition dataset drawn from the social media website Sina Weibo.
In 2022, there were around **** billion social media users in China. Despite Facebook, YouTube, and Twitter being blocked in the country, local social networking sites such as WeChat and Weibo have been attracting millions of users, making China the world’s biggest social media market.
What is the role of social media in China? Around ** percent of the Chinese population use the internet. Social networking plays a huge role among netizens, especially the younger generation. Chinese social media, just like its Western equivalents, serves not only as a way to communicate online, but also as one of the main sources of news and entertainment, e-payments, shopping advice, and dating channels. In 2021, over ** percent of surveyed social media users said they mostly appreciated that social networks help them keep in touch with friends and family and share their life moments and thoughts.
What are the most popular social media platforms?
WeChat (Weixin in Chinese) is by far the most commonly used social app in the country, used for anything from texting/calling to photo and video sharing, dating, financial services, game-playing, shopping, ride hailing, and so on. However, the Chinese social media scene is diverse and dynamic, so it is not just about WeChat. Instant messaging app Tencent QQ, microblogging site Weibo, video sharing app Youku Tudou, short-form video app Douyin (aka TikTok), photo editing and sharing app Meitu, restaurant recommendation and food ordering platform Meituan, Quora equivalent Zhihu, and dating app Momo are just a few of the most popular Chinese social media examples.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The proliferation of social media and digital technologies has made it necessary for governments to expand their focus beyond propaganda content in order to disseminate propaganda effectively. We identify a strategy of using clickbait to increase the visibility of political propaganda. We show that such a strategy is used across China by combining ethnography with a computational analysis of a novel dataset of the titles of 197,303 propaganda posts made by 213 Chinese city-level governments on WeChat. We find that Chinese propagandists face intense pressures to demonstrate their effectiveness on social media because their work is heavily quantified---measured, analyzed, and ranked---with metrics such as views and likes. Propagandists use both clickbait and non-propaganda content (e.g., lifestyle tips) to capture clicks, but rely more heavily on clickbait because it does not decrease space available for political propaganda. Government propagandists use clickbait at a rate commensurate with commercial and celebrity social media accounts. Clickbait is associated with more views and likes, and greater reach of government propaganda outlets and messages. These results reveal how the advertising-based business model and affordances of social media influence political propaganda and how government strategies to control information are moving beyond censorship, propaganda, and disinformation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Internet Usage: Social Media Market Share: All Platforms: Mixi data was reported at 0.000 % on 25 May 2024. This stayed constant from the previous number of 0.000 % for 24 May 2024. Internet Usage: Social Media Market Share: All Platforms: Mixi data is updated daily, averaging 0.000 % from May 2024 (Median) to 25 May 2024, with 8 observations. The data reached an all-time high of 0.060 % on 22 May 2024 and a record low of 0.000 % on 25 May 2024. Internet Usage: Social Media Market Share: All Platforms: Mixi data remains in active status in CEIC and is reported by Statcounter Global Stats. The data is categorized under Global Database’s Hong Kong SAR (China) – Table HK.SC.IU: Internet Usage: Social Media Market Share.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Social media can be both a source of information and misinformation during health emergencies. During the COVID-19 pandemic, social media became a ubiquitous tool for people to communicate and represents a rich source of data researchers can use to analyse users’ experiences, knowledge and sentiments. Research on social media posts during COVID-19 has identified, to date, the perpetuity of traditional gendered norms and experiences. Yet these studies are mostly based on Western social media platforms. Little is known about gendered experiences of lockdown communicated on non-Western social media platforms. Using data from Weibo, China’s leading social media platform, we examine gendered user patterns and sentiment during the first wave of the pandemic between 1 January 2020 and 1 July 2020. We find that Weibo posts by self-identified women and men conformed with some gendered norms identified on other social media platforms during the COVID-19 pandemic (posting patterns and keyword usage) but not all (sentiment). This insight may be important for targeted public health messaging on social media during future health emergencies.
To cite: Gan CCR, Feng SA, Feng H, et al. #WuhanDiary and #WuhanLockdown: gendered posting patterns and behaviours on Weibo during the COVID-19 pandemic. BMJ Global Health 2022;0:e008149. doi:10.1136/bmjgh-2021-008149
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/4.1/customlicense?persistentId=doi:10.7910/DVN1/22691
We offer the first large scale, multiple source analysis of the outcome of what may be the most extensive effort to selectively censor human expression ever implemented. To do this, we have devised a system to locate, download, and analyze the content of millions of social media posts originating from nearly 1,400 different social media services all over China before the Chinese government is able to find, evaluate, and censor (i.e., remove from the Internet) the large subset they deem objectionable. Using modern computer-assisted text analytic methods that we adapt to and validate in the Chinese language, we compare the substantive content of posts censored to those not censored over time in each of 85 topic areas. Contrary to previous understandings, posts with negative, even vitriolic, criticism of the state, its leaders, and its policies are not more likely to be censored. Instead, we show that the censorship program is aimed at curtailing collective action by silencing comments that represent, reinforce, or spur social mobilization, regardless of content. Censorship is oriented toward attempting to forestall collective activities that are occurring now or may occur in the future --- and, as such, seem to clearly expose government intent. Notes: Please see our followup article published in Science, "Reverse-Engineering Censorship In China: Randomized Experimentation And Participant Observation." See also: Automated Text Analysis
Covering the period from the beginning of 2020 to April 8th (the day Wuhan reopened), this dataset summarizes social media hotspots and what people focused on in mainland China, as well as the development of the epidemic during this period. The dataset contains four .csv files and covers major social media platforms in the mainland: Sina Weibo, TikTok, Toutiao and Douban.
Sina Weibo is a platform based on fostering user relationships to share, disseminate and receive information. Through either the website or the mobile app, users can upload pictures and videos publicly for instant sharing, with other users being able to comment with text, pictures and videos, or use a multimedia instant messaging service. The company initially invited a large number of celebrities to join the platform, and has since invited many media personalities, government departments, businesses and non-governmental organizations to open accounts as well for the purpose of publishing and communicating information. To avoid the impersonation of celebrities, Sina Weibo uses verification symbols; celebrity accounts have an orange letter "V" and organizations' accounts have a blue letter "V". Sina Weibo has more than 500 million registered users; out of these, 313 million are monthly active users, 85% use the Weibo mobile app, 70% are college-aged, 50.10% are male and 49.90% are female. There are over 100 million messages posted by users each day. With 90 million followers, actress Xie Na holds the record for the most followers on the platform. Despite fierce competition among Chinese social media platforms, Sina Weibo has proven to be the most popular; part of this success may be attributable to the wider use of mobile technologies in China. [https://en.wikipedia.org/wiki/Sina_Weibo]
Douyin (English: TikTok) is a short-video social application for mobile phones. Users can record 15-second short videos with easy lip-syncing and built-in special effects, and other users can leave comments on the videos. Launched online in September 2016 under Toutiao, it is positioned as a short music video community for young Chinese people. The application focuses on music-oriented UGC short videos, and the number of users has grown rapidly since 2017. In June 2018, Douyin reached 500 million monthly active users worldwide and 150 million daily active users in China. [https://zh.wikipedia.org/wiki/%E6%8A%96%E9%9F%B3]
Toutiao or Jinri Toutiao is a Chinese news and information content platform, a core product of the Beijing-based company ByteDance. By analyzing the features of content, users and users’ interaction with content, the company's algorithm models generate a tailored feed list of content for each user. Toutiao is one of China's largest mobile platforms of content creation, aggregation and distribution underpinned by machine learning techniques, with 120 million daily active users as of September 2017. [https://en.wikipedia.org/wiki/Toutiao]
Douban.com (Chinese: 豆瓣; pinyin: Dòubàn), launched on March 6, 2005, is a Chinese social networking service website that allows registered users to record information and create content related to film, books, music, recent events, and activities in Chinese cities. It can be seen as one of the most influential web 2.0 websites in China. Douban also owns an internet radio station, which ranked No. 1 in the iOS App Store in 2012. Douban was formerly open to both registered and unregistered users. For registered users, the site recommends potentially interesting books, movies, and music, in addition to serving as a social networking website (like WeChat or Weibo) and a record keeper; for unregistered users, the site is a place to find ratings and reviews of media. Douban had about 200 million registered users as of 2013. The site serves pan-Chinese users, and its contents are in Chinese. It covers works and media in Chinese and in foreign languages. Some Chinese authors and critics register their official personal pages on the site. [https://en.wikipedia.org/wiki/Douban]
Weibo realTimeHotSearchList can be regarded as a platform for gathering celebrity gossip, social life and major news. In this document, I collect the top 50 topics of the hot search list every 12 hours, so there are 100 hot topics each day. These topics are translated into English with Google Translate, although the translation quality is poor because of sentence segmentation and differences in language background. I also created a new column ['Coron-Related ( 1 yes, 0 not ) '] to mark topics related to COVID-19 (literally the "new crown"): relevant topics are marked 1, and others are left empty or marked 0. The Google translation is extremely inaccurate (so googling the Chinese title to confirm may be the best bet...
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Supporting data for the paper "The Role of 'State Endorsers' in Extending Chinese Propaganda: Evaluating the Reach of Pro-Regime YouTubers" published in the International Journal of Communication (IJOC) in 2023. A previous version of this paper was presented at the 72nd Annual International Communication Association (ICA) Conference on 26-30 May 2022. The dataset contains the code and raw data used in the project, as well as all the resulting graphics. Please feel free to reach out with any questions. Also please let me know if you use the data and send me a link to your publication.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains geotagged social media images of China's terraces, sourced from the Sina Weibo microblogging platform (https://weibo.com). Geotagged images were collected using Weibo cookies and Python-based scraping tools (available at: https://github.com/dataabc/weibo-search). The search keyword used was "terraces", and the collection timeframe spanned from July 2022 to June 2024. We included only images with clear geographic information located within China. Images of poor quality (e.g., synthesized from multiple images, excessively cluttered, or blurry) and irrelevant content such as advertisements, paintings, or text were removed.
The images are classified into seven distinct categories representing different types of cultural ecosystem services (CES): landscape, species, structures, indoor, food, activities, and posing. Specifically: (1) Landscape images depict open natural landscapes, such as rice terraces, often with a visible sky. (2) Species images consist of close-up shots of animals or plants. (3) Structures images mainly feature man-made structures, often traditional houses. (4) Indoor images show the interiors of buildings, including dining rooms, bedrooms, etc. (5) Food images depict food, dishes, and beverages. (6) Activities images capture people physically interacting with the environment, including group photos and folkloric activities. (7) Posing images show people looking into the camera.
The dataset includes a subset of 2,720 randomly selected and manually labeled images, accounting for approximately 5% of the total collected images. Among them, landscape images were the most numerous (1,347), followed by activities (480), structures (408), food (153), posing (146), species (111), and indoor (75). These images can be used for training classification models.
All code used for model training and testing is available at: https://github.com/chen7092/Deep-learning-for-cultural-ecosystem-services-of-terraces.
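Because the labeled subset is dominated by landscape images, anyone training a classifier on it may want to inspect the class balance first. The sketch below uses only the seven label counts reported above; the inverse-frequency weighting is a common convention chosen here for illustration and is not stated to be what the linked repository does.

```python
# Label counts as reported in the dataset description (sum: 2,720).
LABEL_COUNTS = {
    "landscape": 1347,
    "activities": 480,
    "structures": 408,
    "food": 153,
    "posing": 146,
    "species": 111,
    "indoor": 75,
}


def class_shares(counts):
    """Share of each class in the labeled subset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count).

    This is one common heuristic for imbalanced data (an assumption here,
    not the authors' documented method).
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


shares = class_shares(LABEL_COUNTS)
weights = balanced_class_weights(LABEL_COUNTS)
```

With these weights, rare classes such as indoor receive a larger weight than landscape, and the weighted counts still sum to the original subset size.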
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository contains the replication-ready survey dataset used in Zhao & Liu (2025): *Explaining the Trust Paradox: How Foreign Media Strengthens Government Confidence via Political–Economic Awareness in China*.
* `data.csv` – Cleaned respondent-level dataset (N = 3,788; 110 variables) used for all analyses reported in the article. Personally identifiable information has been removed in compliance with the CNSDA license.
* `codebook.pdf` – Variable names, wording, scales, and basic descriptive statistics. *(to be added by uploader if needed)*
The dataset originates from the publicly available "2021 Internet Users' Social Awareness Survey" released by the Chinese National Survey Data Archive (CNSDA). We performed basic cleaning (variable renaming, numeric recoding, and removal of direct identifiers). Cleaning scripts are available upon request.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset accompanies the manuscript “Explaining the Trust Paradox: How Foreign Media Strengthens Government Confidence via Political–Economic Awareness—and for Whom—in China” (Zhao & Liu, 2025, submitted to Political Communication). It contains the fully de-identified, replication-ready microdata from the 2021 Internet Users’ Social Awareness Survey (N = 3,788), as well as metadata and documentation files.
Contents:
Provenance:
The original survey was conducted by the Chinese National Survey Data Archive (CNSDA). This version has been processed to ensure full compliance with data protection and journal transparency requirements.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This study constructed a dataset of online media in Gansu Province from 2013 to 2022, with data from six major online media platforms in Linxia Hui Autonomous Prefecture and Gannan Tibetan Autonomous Prefecture, including Linxia Prefecture Government Website, Ethnic Daily, China Linxia Website, Shambhala Online, and China Gannan Website. The dataset covers a wide range of social, cultural, and linguistic aspects of the ethnic areas in Gansu over a decade, and all the data are Chinese-language news reports and commentaries. Neologism extraction was carried out on each year's data, and the extracted neologisms were analyzed in terms of word frequency, part of speech, word length, cohesion, degrees of freedom, and neologism probability. The dataset was constructed with strict quality control measures, including manual proofreading, noise filtering, deduplication and language annotation, to ensure the accuracy and completeness of the data. The dataset provides important foundational data for the study of language use, social and cultural dynamics and bilingual education development in ethnic areas, and can be widely used in policy analysis, public opinion monitoring and language policy research.
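The description names cohesion and degrees of freedom as neologism features but does not give formulas. The sketch below shows one widely used formulation for unsegmented Chinese text, offered as an assumption rather than the authors' method: cohesion is the minimum ratio of a candidate's probability to the product of the probabilities of its two parts, and degrees of freedom is the smaller of the entropies of its left and right neighbouring characters. The function names and the `max_len` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def candidate_stats(text, max_len=4):
    """Return {candidate: (frequency, cohesion, freedom)} for substrings of 2..max_len characters.

    Probabilities are approximated as count / len(text) for every length.
    """
    n = len(text)
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for size in range(1, max_len + 1):
        for i in range(n - size + 1):
            gram = text[i:i + size]
            freq[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + size < n:
                right[gram][text[i + size]] += 1

    stats = {}
    for gram, count in freq.items():
        if len(gram) < 2:
            continue
        p = count / n
        # Cohesion: weakest binding over all ways to split the candidate in two.
        cohesion = min(
            p / ((freq[gram[:k]] / n) * (freq[gram[k:]] / n))
            for k in range(1, len(gram))
        )
        # Freedom: how varied the contexts are on the less varied side.
        freedom = min(entropy(left[gram]), entropy(right[gram]))
        stats[gram] = (count, cohesion, freedom)
    return stats
```

Candidates with high frequency, high cohesion and high freedom that are absent from an existing lexicon would then be shortlisted as possible neologisms; the thresholds are left to the analyst.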
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
On July 30, 2020, US President Donald Trump announced his plan to use executive orders or emergency economic powers to ban TikTok and expressed opposition to Microsoft’s acquisition of TikTok in the US. ByteDance, TikTok’s parent company, subsequently issued several Chinese-language crisis communications on Toutiao, a platform owned by ByteDance that provides information to Chinese people. However, these announcements were reposted, and sometimes rephrased or reformatted, by third-party users on other Chinese social media platforms. These third-party users included both well-known influencers and general users. For example, the discussions became more salient on Sina Weibo, China’s largest online social media platform, than on any other platform, including Toutiao. Therefore, comparing crisis communications across different social media platforms is necessary. The entire dataset comprises 50,702 data points. To keep manual labeling efficient, 8,793 data points were selected from the dataset by stratified random sampling.
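The description reports that 8,793 of 50,702 data points (about 17.3%) were drawn by stratified random sampling, but does not say which variable defined the strata. As a minimal sketch, the function below performs proportional stratified sampling with the standard library; stratifying by platform, the record layout, and the `stratified_sample` name are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from every stratum.

    `key` maps a record to its stratum label (hypothetical: the platform).
    Each stratum contributes at least one record.
    """
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)

    sample = []
    for name in sorted(strata):
        group = strata[name]
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical toy records; the real dataset's fields are not documented here.
toy_records = (
    [{"platform": "weibo", "id": i} for i in range(80)]
    + [{"platform": "toutiao", "id": i} for i in range(20)]
)
toy_sample = stratified_sample(toy_records, key=lambda r: r["platform"], fraction=0.25)
```

Sampling within each stratum keeps the platform mix of the labeled subset close to that of the full collection, which matters when platforms are later compared.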
China Retail Investor Sentiment Analytics provides sentiment analytics of Chinese retail investors based on two stock forums, Guba (GACRIS dataset) and Xueqiu (XACRIS dataset), the most popular stock forums in China, with data since 2007.
The dataset relies on in-house NLP models that are specifically optimized for Chinese stock forum posts and trained on proprietary, manually labeled and cross-checked training data. It provides accurate text analytics of post content, including but not limited to quality, sentiment, and relevant stocks with relevance scores. In addition to aggregated statistics of stock sentiment and popularity, the dataset provides rich, fine-grained information for each user and post at the record level. For example, it reports the registration time and number of followers for each user, as well as the replies, reads and province of publication for each post. Moreover, these metadata have been processed in a point-in-time (PIT) manner since 2019.
The dataset can help clients easily capture sentiment and popularity among millions of Chinese retail investors. It also gives clients the flexibility to build custom analytics, such as studying the sentiment (conformity/divergence) of users with different levels of influence or of posts with different levels of popularity, or simply filtering out posts by users who are too active, positive or negative in a time window when aggregating the statistics.
Coverage: All A-share and Hong Kong stocks, 300+ popular US stocks
Update Frequency: Daily or intra-day
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Welcome to the Open Weiboscope Data Access website. Weiboscope is a data collection and visualization project developed by the research team at the Journalism and Media Studies Centre, The University of Hong Kong (JMSC). One of the objectives of the project is to make censored Sina Weibo posts of a selected group of Chinese microbloggers publicly accessible, enabling academic use of the data to better understand social media in China and to make the Chinese media system more transparent. Since January 2011, the project has been regularly sampling timelines of more than 350,000 Chinese microbloggers who have more than 1,000 followers. The methodology is detailed in an IEEE Internet Computing article (Fu, Chan, Chau, 2013). In addition, we have sampled Sina Weibo accounts randomly since 2012, and the samples' most recent timelines were collected and stored in the dataset. Our sampling approach is reported in a PLOS ONE article (Fu, Chau, 2013). This site contains all the Weiboscope data collected in the year 2012. We are delighted to share the data for open access. For ethical reasons, however, the data are anonymized, i.e. real user and message IDs are replaced by pseudo IDs. When using the data, please cite the paper below.
King-wa Fu, CH Chan, Michael Chau. Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and Impact Evaluation of the 'Real Name Registration' Policy. IEEE Internet Computing. 2013; 17(3): 42-50. http://doi.ieeecomputersociety.org/10.1109/MIC.2013.28
Data Set Statistics:
* Number of weibo messages: 226841122
* Number of deleted messages: 10865955
* Number of censored ('Permission Denied') messages: 86083
* Number of unique weibo users: 14387628
Enquiry: Send your question/comment to weiboscope@gmail.com.
The project is funded by the University of Hong Kong Seed Funding Program for Basic Research.
Citation: Fu KW, Chan CH, Chau M. Assessing Censorship on Microblogs in China: Discriminatory Keyword Analysis and the Real-Name Registration Policy. Internet Computing, IEEE. 2013; 17(3): 42-50.
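The headline statistics listed for the 2012 Weiboscope data can be turned into rates with simple arithmetic. The sketch below uses only the four counts given in the description; it makes no assumption about whether the 'Permission Denied' messages are a subset of the deleted messages, since the description does not say, and the variable names are illustrative.

```python
# Counts copied from the Data Set Statistics in the description.
TOTAL_MESSAGES = 226_841_122
DELETED_MESSAGES = 10_865_955
CENSORED_MESSAGES = 86_083  # 'Permission Denied'
UNIQUE_USERS = 14_387_628


def share(part, whole):
    """Fraction of `whole` represented by `part`."""
    return part / whole


deleted_share = share(DELETED_MESSAGES, TOTAL_MESSAGES)
censored_share = share(CENSORED_MESSAGES, TOTAL_MESSAGES)
messages_per_user = TOTAL_MESSAGES / UNIQUE_USERS

print(f"deleted:  {deleted_share:.2%} of all messages")
print(f"censored: {censored_share:.3%} of all messages")
print(f"average messages per user: {messages_per_user:.1f}")
```

On these figures, a little under 5% of sampled messages were deleted, while explicitly censored messages make up well under 0.1% of the total.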
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes six figures used in a journal paper that examines China's recent initiative on international social media and assesses its effectiveness in counteracting Western dominance in international communication. The figures are interactive D3.js visualisations written in JavaScript and displayed in HTML files. They are uploaded so that readers of the paper can access this tool and better understand the methodology described in the paper.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset comprises responses from 1,053 online participants surveyed across China. Collected through social media and email distributions, it includes detailed queries about the usage and perceptions of generative AI, categorized into task-oriented and social-oriented applications. The dataset features demographic variables (age, gender, education, location) alongside indicators measuring both offline and online social capital. Analysis reveals that generative AI users generally possess higher social capital than non-users, with task-oriented usage enhancing offline social capital yet detracting from online interactions. Conversely, socially-oriented usage boosts social capital across both spheres. This data, rich with insights on the interplay between technology use and social structures, is pivotal for understanding technological impacts on societal dynamics and could guide future technology policy and integration strategies.
This dataset provides user reviews for mobile games, collected from TapTap, a popular mobile game community and distribution platform in China. Its primary purpose is to facilitate sentiment analysis of Chinese game reviews. The reviews are mostly in Chinese and cover the 1,000 most recent comments for 20 popular games up until April 5, 2025. Each entry includes the user's rating, the text content of the review, the number of likes the review received, the publication timestamp, the device model used (where available), the name of the game reviewed, and a sentiment label. User identifiers have been removed to protect privacy.
The dataset is typically provided in CSV format. It contains reviews for 20 distinct mobile games. While the source mentions it covers the 1,000 most recent comments, other details indicate larger counts for specific fields, with ratings having around 39,592 unique entries and sentiment labels totalling approximately 39,985 entries across positive and negative categories. The number of likes varies, with a large majority in the 0-142.75 range. The dataset structure has user identifiers removed for privacy.
This dataset is ideal for conducting sentiment analysis on Chinese mobile game reviews. It can also be used for understanding user feedback trends, identifying common issues or praises within game reviews, and developing natural language processing models tailored to Chinese text.
The dataset's geographic scope is China, as the data is collected from the TapTap platform, which is popular there. The time range for reviews spans from June 6, 2017, up to April 5, 2025, according to the latest comments. Device model information is included, though it contains inconsistencies and 'unknown' entries. No specific demographic information about the users is available.
CC-BY
This dataset is suitable for a variety of users, including:
* Data scientists and machine learning engineers: For building and testing sentiment analysis models for Chinese text.
* Game developers: To gain insights into player satisfaction, identify areas for improvement in their games, and understand market reception in China.
* Market researchers: For analysing trends in the mobile gaming industry and understanding consumer behaviour in the Chinese market.
* Academics and students: For research projects involving natural language processing, data analysis, and social media sentiment.
Original Data Source: TapTap Mobile Game Reviews (Chinese)
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This is an open-access database in which scholars can find first-hand sources on networks in Chinese politics. If you want to update or revise the data, please contact me (Reinhardt114514@outlook.com). Contributor: Daqi (Reinhardt) Fang, Student at Hangzhou Yungu School
This article explores the effects of social media on government accountability under authoritarian regimes. It examines whether online discussions have a disciplining effect on officials’ scandals. We use a unique dataset containing records of scandals discussed on microblogs in China to systematically study their effects on the government response process and officials’ disciplining. We find that the government employs clear strategies: higher levels of online discussion lead to quicker government responses and more severe punishment of the officials involved. Scandals involving sexual and economic factors, which initially capture more attention, involve quicker responses and more severe punishments. Even when we exploit rainfall as the instrumental variable to mitigate the endogeneity, the results are still robust. Our findings highlight the accountability mechanism facilitated by social media and the power of social media empowerment.