100+ datasets found

Modern China Geospatial Database - Main Dataset
data.niaid.nih.gov
zenodo.org
Updated Feb 28, 2025
Cite
Christian Henriot (2025). Modern China Geospatial Database - Main Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5735393
Dataset updated
Feb 28, 2025
Dataset authored and provided by
Christian Henriot
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
China
Description
MCGD_Data_V2.2 contains all the data that we have collected on locations in modern China, plus a number of locations outside of China that we encounter frequently in historical sources on China. All further updates will appear under the name "MCGD_Data" with a time stamp (e.g., MCGD_Data2023-06-21).

You can also have access to this dataset and all the datasets that the ENP-China makes available on GitLab: https://gitlab.com/enpchina/IndexesEnp

Altogether there are 464,970 entries. The data include the names of locations and their variants in Chinese, pinyin, and any recorded transliteration; the name of the province in Chinese and in pinyin; the Province ID; the latitude and longitude; and the Name ID, Location ID, and NameID_Legacy. The Name IDs all start with H followed by seven digits. This is the internal ID system of MCGD (the NameID_Legacy column records the Name IDs in their original format depending on the source). Location IDs that start with "DH" are data points extracted from China Historical GIS (Harvard University); those that start with "D" are locations extracted from the data points in Geonames; those that have only digits (8 digits) are data points we have added from various map sources.
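The ID scheme described above can be sketched as a small classifier. This is only an illustration: the function names and the sample IDs in the usage note are hypothetical, and because the description does not say how many characters follow "DH" or "D", only the prefix is checked for those two cases.

```python
import re

# Name IDs: "H" followed by exactly seven digits, as stated in the description.
NAME_ID = re.compile(r"H\d{7}")


def is_name_id(name_id: str) -> bool:
    """Return True if the string has the MCGD Name ID shape (H + 7 digits)."""
    return NAME_ID.fullmatch(name_id) is not None


def location_source(location_id: str) -> str:
    """Guess the upstream source of a Location ID from its prefix.

    "DH" must be tested before "D", since every "DH" ID also starts with "D".
    """
    if location_id.startswith("DH"):
        return "China Historical GIS"
    if location_id.startswith("D"):
        return "Geonames"
    if re.fullmatch(r"\d{8}", location_id):
        return "MCGD map sources"
    return "unknown"
```

For example, `location_source("DH000123")` returns `"China Historical GIS"`, while an 8-digit string such as `"12345678"` is attributed to the map sources added by the MCGD team.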

One of the main features of the MCGD Main Dataset is the systematic collection and compilation of place names from non-Chinese language historical sources. Locations were designated in transliteration systems that are hardly comprehensible today, which makes it very difficult to find the actual locations they correspond to. This dataset allows for the conversion from these obsolete transliterations to the current names and geocoordinates.

From June 2021 onward, we have adopted a different file naming system to keep track of versions. From MCGD_Data_V1 we have moved to MCGD_Data_V2. In June 2022, we introduced time stamps, which result in the following naming convention: MCGD_Data_YYYY.MM.DD.

UPDATES

MCGD_Data2025_02_28 includes a major change: all the locations listed under Beijing, Shanghai, Tianjin, and Chongqing (北京, 上海, 天津, 重慶) have been duplicated and are also listed under the names of the provinces to which they originally belonged before the creation of the four special municipalities after 1949. This is meant to facilitate the matching of data from historical sources. Each location has a unique NameID. Altogether there are 472,818 entries.

MCGD_Data2025_02_27 includes an update on locations extracted from Minguo zhengfu ge yuanhui keyuan yishang zhiyuanlu 國民政府各院部會科員以上職員錄 (Directory of staff members and above in the ministries and committees of the National Government; Nanjing: Guomin zhengfu wenguanchu yinzhuju 國民政府文官處印鑄局, 1944). We also made corrections in the Prov_Py and Prov_Zh columns, as there were some misalignments between the pinyin name and the name in Chinese characters. The file now includes 465,128 entries.

MCGD_Data2024_03_23 includes an update on locations in Taiwan from the Asia Directories. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat and Long columns as "Unknown").

MCGD_Data2023.12.22 contains all the data that we have collected on locations in China, whatever the period. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat and Long columns as "Unknown"). The dataset also includes locations outside of China for the purpose of matching such locations to the place names extracted from historical sources. For example, one may need to locate individuals born outside of China. Rather than maintaining two separate files, we made the decision to incorporate all the place names found in historical sources in the gazetteer. Such place names can easily be removed by selecting all the entries where the 'Province' data is missing.
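A minimal sketch of the province-based filter mentioned above, assuming the data is read as CSV and that a missing province appears as an empty cell. Prov_Py is a column named in the update notes; the NameID and Name_Py headers and the two sample rows are hypothetical, and the real file may encode missing values differently.

```python
import csv
import io


def drop_non_china(rows, province_field="Prov_Py"):
    """Keep only rows whose province field is present and non-blank."""
    return [row for row in rows if (row.get(province_field) or "").strip()]


# Hypothetical two-row sample standing in for the real file.
sample = io.StringIO(
    "NameID,Name_Py,Prov_Py\n"
    "H0000001,Shanghai,Jiangsu\n"
    "H0000002,Paris,\n"
)
rows = list(csv.DictReader(sample))
kept = drop_non_china(rows)
```

Here `kept` contains only the Shanghai row, because the Paris row has an empty province cell.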
Chinese Macroeconomic Data (2005 - 2022)
kaggle.com
Updated May 14, 2022
Cite
Francisco Feng (2022). Chinese Macroeconomic Data (2005 - 2022) [Dataset]. https://www.kaggle.com/datasets/franciscofeng/chinese-macroeconomic-data-20052022
Dataset updated
May 14, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Francisco Feng
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Macroeconomic data is an important source that gives institutions and companies a rough sense of where government policy and the economy are heading. This dataset can help macroeconomic and fundamental analysts research the Chinese market or macroeconomy. Quantitative researchers can also use it as a reference when developing strategies. SHIBOR rates for different maturities are recorded at daily frequency, so users can construct the yield curve for economic research. Quantitative researchers can use it to see how SHIBOR influences the overall Chinese stock and fixed-income markets. Many Chinese indices, which are also important for research on the Chinese market and economy, are likewise recorded at daily frequency. Other macroeconomic data are recorded at monthly frequency and can be used for broader economic and financial research.
chinese_conversation_and_spam
huggingface.co
Updated Nov 13, 2024
Cite
Paul Liu (2024). chinese_conversation_and_spam [Dataset]. https://huggingface.co/datasets/paulkm/chinese_conversation_and_spam
Dataset updated
Nov 13, 2024
Authors
Paul Liu
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Caution! This dataset contains explicit language and fraud information. Use at your own risk!

For AutoTrain use: please select Text Classification (Binary) as Task.

What is included

Conversations in Chinese, under tag 0
Spam conversations, under tag 1

Where does the data come from

Part of the data came from conversations in Chinese Telegram groups
Part came from the logging channels of anti-spam bots

How much data is included

A total of 9.9k… See the full description on the dataset page: https://huggingface.co/datasets/paulkm/chinese_conversation_and_spam.
Global China Data
aiddata.org
Updated Sep 29, 2021
Cite
(2021). Global China Data [Dataset]. https://www.aiddata.org/data/aiddatas-global-chinese-development-finance-dataset-version-2-0
Dataset updated
Sep 29, 2021
Description
This uniquely granular dataset captures 13,427 development projects worth $843 billion financed by more than 300 Chinese government institutions and state-owned entities across 165 countries in every major region of the world from 2000 to 2017.
Chinese Open Ended Question Answer Text Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Chinese Open Ended Question Answer Text Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/chinese-open-ended-question-answer-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
The Chinese Open-Ended Question Answering Dataset is a meticulously curated collection of comprehensive Question-Answer pairs. It serves as a valuable resource for training Large Language Models (LLMs) and Question-answering models in the Chinese language, advancing the field of artificial intelligence.
Dataset Content:
This QA dataset comprises a diverse set of open-ended questions paired with corresponding answers in Chinese. There is no context paragraph given to choose an answer from, and each question is answered without any predefined context content. The questions cover a broad range of topics, including science, history, technology, geography, literature, current affairs, and more.
Each question is accompanied by an answer, providing valuable information and insights to enhance the language model training process. Both the questions and answers were manually curated by native Chinese people, and references were taken from diverse sources like books, news articles, websites, and other reliable references.
This question-answer prompt completion dataset contains different types of prompts, including instruction type, continuation type, and in-context learning (zero-shot, few-shot) type. The dataset also contains questions and answers with different types of rich text, including tables, code, JSON, etc., with proper markdown.
Question Diversity:
To ensure diversity, this Q&A dataset includes questions with varying complexity levels, ranging from easy to medium and hard. Different types of questions, such as multiple-choice, direct, and true/false, are included. Additionally, questions are further classified into fact-based and opinion-based categories, creating a comprehensive variety. The QA dataset also contains questions with constraints and persona restrictions, which makes it even more useful for LLM training.
Answer Formats:
To accommodate varied learning experiences, the dataset incorporates different types of answer formats. These formats include single-word, short-phrase, single-sentence, and paragraph answers. Answers also contain text strings, numerical values, and date and time formats. Such diversity strengthens the language model's ability to generate coherent and contextually appropriate answers.
Data Format and Annotation Details:
This fully labeled Chinese Open Ended Question Answer Dataset is available in JSON and CSV formats. It includes annotation details such as id, language, domain, question_length, prompt_type, question_category, question_type, complexity, answer_type, rich_text.
Quality and Accuracy:
The dataset upholds the highest standards of quality and accuracy. Each question undergoes careful validation, and the corresponding answers are thoroughly verified. To prioritize inclusivity, the dataset incorporates questions and answers representing diverse perspectives and writing styles, ensuring it remains unbiased and avoids perpetuating discrimination.
Both the question and answers in Chinese are grammatically accurate without any word or grammatical errors. No copyrighted, toxic, or harmful content is used while building this dataset.
Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Continuous efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to collect custom question-answer data tailored to specific needs, providing flexibility and customization options.
License:
The dataset, created by FutureBeeAI, is now ready for commercial use. Researchers, data scientists, and developers can utilize this fully labeled and ready-to-deploy Chinese Open Ended Question Answer Dataset to enhance the language understanding capabilities of their generative AI models, improve response generation, and explore new approaches to NLP question-answering tasks.
Mandarin General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Mandarin General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-mandarin-china
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
• Participant Diversity:
  • Speakers: 60 verified native Mandarin Chinese speakers from FutureBeeAI’s contributor community.
  • Regions: Representing various provinces of China to ensure dialectal diversity and demographic balance.
  • Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
• Recording Details:
  • Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  • Duration: Each conversation ranges from 15 to 60 minutes.
  • Audio Format: Stereo WAV files, 16-bit depth, recorded at 16 kHz sample rate.
  • Environment: Quiet, echo-free settings with no background noise.
Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
• Sample Topics Include:
  • Family & Relationships
  • Food & Recipes
  • Education & Career
  • Healthcare Discussions
  • Social Issues
  • Technology & Gadgets
  • Travel & Local Culture
  • Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
• Transcription Highlights:
  • Speaker-segmented dialogues
  • Time-coded utterances
  • Non-speech elements (pauses, laughter, etc.)
  • High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
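For readers unfamiliar with the WER figure quoted above, here is a minimal sketch of how a word error rate is commonly computed: the Levenshtein edit distance between reference and hypothesis token sequences, divided by the reference length. How FutureBeeAI tokenizes Mandarin text or computes its figure is not stated, so the character-level tokens in the usage note are an assumption.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance between two token lists divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(reference)
```

With character tokens, `word_error_rate(list("我今天去学校"), list("我明天去学校"))` counts one substitution over six reference characters, which is usually reported as a character error rate for Chinese.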
Metadata
The dataset comes with granular metadata for both speakers and recordings:
• Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
• Recording Metadata: Topic, duration, audio format, device type, and sample rate.
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Mandarin speech and language AI applications:
• ASR Development: Train accurate speech-to-text systems for Mandarin Chinese.
• Voice Assistants: Build smart assistants capable of understanding natural Chinese conversations.
2023 Contributions of Plant Specimen Data inside China
gbif.org
demo.gbif.org
Updated Mar 5, 2023
Cite
Ling Yang (2023). 2023 Contributions of Plant Specimen Data inside China [Dataset]. http://doi.org/10.15468/9us6fb
Unique identifier
https://doi.org/10.15468/9us6fb
Dataset updated
Mar 5, 2023
Dataset provided by
Global Biodiversity Information Facility (https://www.gbif.org/)
Chinese Academy of Sciences (CAS)
Authors
Ling Yang
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
As one of the most biodiverse countries in the world, China has been digitizing specimens for many years. In recent years it has shared millions of specimens on several occasions, with good results in data application and international influence. It is now continuing with another large publication of plant specimens.
Data from: Information dataset of China’s overseas industrial parks from...
scidb.cn
Updated Jul 4, 2019
Cite
李祜梅; 邬明权; 牛铮; 李旗 (2019). Information dataset of China’s overseas industrial parks from 1992 to 2018 [Dataset]. http://doi.org/10.11922/sciencedb.797
Unique identifier
https://doi.org/10.11922/sciencedb.797
Dataset updated
Jul 4, 2019
Dataset provided by
Science Data Bank
Authors
李祜梅; 邬明权; 牛铮; 李旗
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
China
Description
Since China launched the Belt and Road Initiative, overseas industrial parks have become an important vehicle for economic and trade cooperation and an important force helping Chinese enterprises go global. However, although Chinese companies have invested in many industrial parks abroad, there is not yet a comprehensive statistical survey of them, which is crucial for national and corporate investors. The start-up dates of some parks and the names of the Chinese enterprises building them are difficult to find, which makes comprehensive statistics relatively difficult to compile. This paper collects data through web crawling, the public account of the Belt and Road International Industrial Park, the official websites of the major enterprises participating in Belt and Road construction, and the database of the Ministry of Commerce. From the most comprehensive collection possible, a detailed dataset of China's overseas industrial parks from 1992 to 2018 was compiled. The dataset summarizes the existing park names and determines the total number of parks China has built so far; counts the parks on each continent to show their distribution; analyzes park types to show how resources are distributed across regions by type; and finally compares each park's construction date with the date its host country joined the Asian Infrastructure Investment Bank (AIIB) to examine the relationship between the AIIB and the parks.
Chinese Chain of Thought Prompt & Response Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Chinese Chain of Thought Prompt & Response Dataset [Dataset]. https://www.futurebeeai.com/dataset/prompt-response-dataset/chinese-chain-of-thought-text-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Welcome to the Chinese Chain of Thought prompt-response dataset, a meticulously curated collection containing 3000 comprehensive prompt and response pairs. This dataset is an invaluable resource for training Language Models (LMs) to generate well-reasoned answers and minimize inaccuracies. Its primary utility lies in enhancing LLMs' reasoning skills for solving arithmetic, common sense, symbolic reasoning, and complex problems.

Dataset Content:
This COT dataset comprises a diverse set of instructions and questions paired with corresponding answers and rationales in the Chinese language. These prompts and completions cover a broad range of topics and questions, including mathematical concepts, common sense reasoning, complex problem-solving, scientific inquiries, puzzles, and more.

Each prompt is meticulously accompanied by a response and rationale, providing essential information and insights to enhance the language model training process. These prompts, completions, and rationales were manually curated by native Chinese people, drawing references from various sources, including open-source datasets, news articles, websites, and other reliable references.

Our chain-of-thought prompt-completion dataset includes various prompt types, such as instructional prompts, continuations, and in-context learning (zero-shot, few-shot) prompts. Additionally, the dataset contains prompts and completions enriched with various forms of rich text, such as lists, tables, code snippets, JSON, and more, with proper markdown format.

Prompt Diversity:
To ensure a wide-ranging dataset, we have included prompts from a plethora of topics related to mathematics, common sense reasoning, and symbolic reasoning. These topics encompass arithmetic, percentages, ratios, geometry, analogies, spatial reasoning, temporal reasoning, logic puzzles, patterns, and sequences, among others.

These prompts vary in complexity, spanning easy, medium, and hard levels. Various question types are included, such as multiple-choice, direct queries, and true/false assessments.

Response Formats:
To accommodate diverse learning experiences, our dataset incorporates different types of answers depending on the prompt and provides step-by-step rationales. The detailed rationale aids the language model in building a reasoning process for complex questions.

These responses encompass text strings, numerical values, and date and time formats, enhancing the language model's ability to generate reliable, coherent, and contextually appropriate answers.

Data Format and Annotation Details:
This fully labeled Chinese Chain of Thought Prompt Completion Dataset is available in JSON and CSV formats. It includes annotation details such as a unique ID, prompt, prompt type, prompt complexity, prompt category, domain, response, rationale, response type, and rich text presence.

Quality and Accuracy:
Our dataset upholds the highest standards of quality and accuracy. Each prompt undergoes meticulous validation, and the corresponding responses and rationales are thoroughly verified. We prioritize inclusivity, ensuring that the dataset incorporates prompts and completions representing diverse perspectives and writing styles, maintaining an unbiased and discrimination-free stance.

The Chinese version is grammatically accurate without any spelling or grammatical errors. No copyrighted, toxic, or harmful content is used during the construction of this dataset.

Continuous Updates and Customization:
The entire dataset was prepared with the assistance of human curators from the FutureBeeAI crowd community. Ongoing efforts are made to add more assets to this dataset, ensuring its growth and relevance. Additionally, FutureBeeAI offers the ability to gather custom chain of thought prompt completion data tailored to specific needs, providing flexibility and customization options.

License:
The dataset, created by FutureBeeAI, is now available for commercial use. Researchers, data scientists, and developers can leverage this fully labeled and ready-to-deploy Chinese Chain of Thought Prompt Completion Dataset to enhance the rationale and accurate response generation capabilities of their generative AI models and explore new approaches to NLP tasks.
China Unemployment Rate
tradingeconomics.com
jp.tradingeconomics.com
csv, excel, json, xml
Updated Aug 15, 2025
Cite
TRADING ECONOMICS (2025). China Unemployment Rate [Dataset]. https://tradingeconomics.com/china/unemployment-rate
Available download formats: csv, xml, excel, json
Dataset updated
Aug 15, 2025
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Sep 30, 2002 - Jul 31, 2025
Area covered
China
Description
The unemployment rate in China increased to 5.20 percent in July 2025 from 5 percent in June 2025. This dataset provides actual values, historical data, forecasts, charts, statistics, an economic calendar, and news for the China unemployment rate.
Contrasting English and Chinese - Dataset - B2FIND
b2find.eudat.eu
Updated Aug 26, 2004
Cite
(2004). Contrasting English and Chinese - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/bef97882-48d8-57b9-857e-54291c63f5e7
Dataset updated
Aug 26, 2004
Description
As an extension of the ESRC project Contrasting aspect and tense in English and Chinese (RES-000-22-0135), this project will be comparing and contrasting aspect-related grammatical categories across the two languages on the basis of four corpora of English and Chinese. The compositional nature of aspect is well recognised. Indeed, the effect of verbs and their objects on aspect has been intensively studied. Nevertheless, the potential contributions of other grammatical categories to aspect, such as negation, remain largely unexplored. As aspect is a frequent and important linguistic phenomenon, this research covers nearly the whole grammar in English and Chinese. The project will look for similarities and differences between the two languages and seek unexpected insights into the workings of the languages from the corpora used. In doing so, such issues will be explored as which grammatical categories are involved in the generation of aspect and which are not. Given that English and Chinese display many differences, in exploring this question with relation to English and Chinese the project will simultaneously establish a basis for the comparison of the two languages and provide a robust test for the account of aspect developed on this project. No data collected - existing data reformatted and annotated
United States Imports from China
tradingeconomics.com
csv, excel, json, xml
Updated May 29, 2017
Cite
TRADING ECONOMICS (2017). United States Imports from China [Dataset]. https://tradingeconomics.com/united-states/imports/china
Available download formats: xml, json, csv, excel
Dataset updated
May 29, 2017
Dataset authored and provided by
TRADING ECONOMICS
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 1990 - Dec 31, 2025
Area covered
United States
Description
United States imports from China were US$462.62 billion during 2024, according to the United Nations COMTRADE database on international trade. The data, historical chart, and statistics for United States imports from China were last updated in September 2025.
Chinese Shopping List OCR Image Dataset
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Chinese Shopping List OCR Image Dataset [Dataset]. https://www.futurebeeai.com/dataset/ocr-dataset/chinese-shopping-list-ocr-image-dataset
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
What’s Included
Introducing the Chinese Shopping List Image Dataset - a diverse and comprehensive collection of handwritten text images carefully curated to propel the advancement of text recognition and optical character recognition (OCR) models designed specifically for the Chinese language.
Dataset Content & Diversity:
Containing more than 2000 images, this Chinese OCR dataset offers a wide distribution of different types of shopping list images. Within this dataset, you'll discover a variety of handwritten text, including sentences, individual item names, quantities, comments, etc. on shopping lists. The images in this dataset showcase distinct handwriting styles, fonts, font sizes, and writing variations.
To ensure diversity and robustness in training your OCR model, we allow limited (less than three) unique images in a single handwriting. This ensures we have diverse types of handwriting to train your OCR model on. Stringent measures have been taken to exclude any personally identifiable information (PII) and to ensure that in each image a minimum of 80% of space contains visible Chinese text.
The images have been captured under varying lighting conditions, including day and night, as well as different capture angles and backgrounds. This diversity helps build a balanced OCR dataset, featuring images in both portrait and landscape modes.
All these shopping lists were written and images were captured by native Chinese people to ensure text quality, prevent toxic content, and exclude PII text. We utilized the latest iOS and Android mobile devices with cameras above 5MP to maintain image quality. Images in this training dataset are available in both JPEG and HEIC formats.
Metadata:
In addition to the image data, you will receive structured metadata in CSV format. For each image, this metadata includes information on image orientation, country, language, and device details. Each image is correctly named to correspond with the metadata.
This metadata serves as a valuable resource for understanding and characterizing the data, aiding informed decision-making in the development of Chinese text recognition models.
Update & Custom Collection:
We are committed to continually expanding this dataset by adding more images with the help of our native Chinese crowd community.
If you require a customized OCR dataset containing shopping list images tailored to your specific guidelines or device distribution, please don't hesitate to contact us. We have the capability to curate specialized data to meet your unique requirements.
Additionally, we can annotate or label the images with bounding boxes or transcribe the text in the images to align with your project's specific needs using our crowd community.
License:
This image dataset, created by FutureBeeAI, is now available for commercial use.
Conclusion:
Leverage this shopping list image OCR dataset to enhance the training and performance of text recognition, text detection, and optical character recognition models for the Chinese language. Your journey to improved language understanding and processing begins here.
THUCNews Chinese News Text Classification Dataset
figshare.com
7z
Updated Jan 25, 2025
Cite
jinhao zhong (2025). THUCNews Chinese News Text Classification Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.28279964.v2
Available download formats: 7z
Unique identifier
https://doi.org/10.6084/m9.figshare.28279964.v2
Dataset updated
Jan 25, 2025
Dataset provided by
figshare
Authors
jinhao zhong
License
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
THUCTC (THU Chinese Text Classification) is a Chinese text classification toolkit developed by the Natural Language Processing Laboratory of Tsinghua University. It can efficiently automate the training, evaluation, and classification of user-defined text classification corpora. Text classification typically involves three steps: feature selection, feature dimensionality reduction, and model training. Selecting appropriate text features and performing dimensionality reduction are challenging problems in Chinese text classification. Based on years of research experience in Chinese text classification, our team has chosen bigram (two-character strings) as the feature unit in THUCTC, with Chi-square as the dimensionality reduction method, tf-idf as the weight calculation method, and LibSVM or LibLinear as the classification model. THUCTC demonstrates good versatility for open-domain long texts, is independent of the performance of any Chinese word segmentation tool, and offers the advantages of high accuracy and fast testing speed.
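The feature pipeline named above (character bigrams, Chi-square selection, tf-idf weighting) can be illustrated with a minimal pure-Python sketch. This is not THUCTC's own code or API: the function names and the toy corpus are hypothetical, the tf-idf variant shown (raw term count times log of inverse document frequency) is one common formulation chosen as an assumption, and the final LibSVM/LibLinear training step is omitted.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the overlapping two-character strings in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def chi_square(docs, labels, term, target):
    """Chi-square statistic of a bigram's presence against one class (2x2 table)."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in char_bigrams(doc)
        if present and label == target:
            a += 1  # term present, target class
        elif present:
            b += 1  # term present, other class
        elif label == target:
            c += 1  # term absent, target class
        else:
            d += 1  # term absent, other class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def tf_idf(docs, doc_index, term):
    """Raw term count in one document times log(N / document frequency)."""
    tf = Counter(char_bigrams(docs[doc_index]))[term]
    df = sum(1 for doc in docs if term in char_bigrams(doc))
    return tf * math.log(len(docs) / df) if df else 0.0


# Toy corpus (hypothetical, for illustration only).
docs = ["股票上涨", "股票下跌", "球队获胜", "球队失利"]
labels = ["finance", "finance", "sports", "sports"]
print(chi_square(docs, labels, "股票", "finance"))
print(tf_idf(docs, 0, "股票"))
```

Bigrams with the highest Chi-square scores would be kept as features, and their tf-idf weights would form the vectors passed to the classifier.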
instruct_chat_50k.jsonl
huggingface.co
Updated Nov 29, 2022
Chinese-Vicuna (2022). instruct_chat_50k.jsonl [Dataset]. https://huggingface.co/datasets/Chinese-Vicuna/instruct_chat_50k.jsonl
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 29, 2022
Dataset authored and provided by
Chinese-Vicuna
License
Apache License, v2.0 https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
instruct_chat_50k.jsonl is composed of 30k entries from the Chinese sharegpt dataset and 20k entries from alpaca-instruction-Chinese-dataset.
Number of internet users in China 2005-2024
statista.com
Updated Jun 30, 2025
Statista (2025). Number of internet users in China 2005-2024 [Dataset]. https://www.statista.com/statistics/265140/number-of-internet-users-in-china/
Explore at:
Dataset updated
Jun 30, 2025
Dataset authored and provided by
Statista http://statista.com/
Area covered
China
Description
In 2025, China reported adding *** million new users to its massive **** billion internet population. The first half-year data in 2024 revealed that nearly ***** of the new internet users were between 10 and 18 years old, while a ***** were older adults aged above 50 years.

The largest online community

In 2024, China accounted for about ********* of the *** billion internet users worldwide. However, relative to its total population, China's internet penetration rate is lower than in other Asian countries. Penetration rates in both South Korea and Japan were significantly higher.

The market potential

Internet usage in China is further characterized by a large regional discrepancy. In rural regions, the internet access rate is much lower than the national level. On the other hand, China is a mobile-first market. Since 2014, more Chinese people have accessed the internet via mobile devices than computers. The number of mobile internet users in China increased steadily over the previous decade.
Online data privacy and security preferences in China Q3 2024
statista.com
Updated Mar 18, 2025
Statista (2025). Online data privacy and security preferences in China Q3 2024 [Dataset]. https://www.statista.com/statistics/1367136/china-internet-data-privacy-and-security-user-preferences-and-actions/
Explore at:
Dataset updated
Mar 18, 2025
Dataset authored and provided by
Statista http://statista.com/
Area covered
China
Description
According to a 2024 survey on digital usage in China, over a third of Chinese internet users said they used ad blockers when surfing the internet. One in five respondents expressed concerns about internet companies misusing their digital data.
Chinese Natural Speech Complex Emotion Dataset
scidb.cn
Updated Feb 24, 2025
Xiaolong Wu; Mingxing Xu; Askar Hamdulla; Thomas Fang Zheng (2025). Chinese Natural Speech Complex Emotion Dataset [Dataset]. http://doi.org/10.57760/sciencedb.20968
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.57760/sciencedb.20968
Dataset updated
Feb 24, 2025
Dataset provided by
Science Data Bank
Authors
Xiaolong Wu; Mingxing Xu; Askar Hamdulla; Thomas Fang Zheng
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Area covered
China
Description
Although Chinese speech affective computing has received increasing attention, existing datasets still suffer from a lack of naturalness, a single pronunciation style, and unreliable annotation, which seriously hinders research in this field. To address these issues, this paper introduces the first Chinese Natural Speech Complex Emotion Dataset (CNSCED) to provide natural data resources for Chinese speech affective computing. CNSCED was collected from publicly broadcast civil dispute and interview television programs in China, reflecting the authentic emotional characteristics of Chinese people in daily life. The dataset includes 14 hours of speech data from 454 speakers of various ages, totaling 15,777 samples. Based on the inherent complexity and ambiguity of natural emotions, this paper proposes an emotion vector annotation method. This method uses a vector composed of six meta-emotional dimensions (angry, sad, aroused, happy, surprise, and fear), each with its own intensity, to describe any single or complex emotion. CNSCED defines two subtasks: complex emotion classification and complex emotion intensity regression. In the experimental section, we evaluated the CNSCED dataset using deep neural network models and provided a baseline result. To the best of our knowledge, CNSCED is the first public Chinese natural speech complex emotion dataset, and it can be used for scientific research free of charge.
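As a rough illustration of the emotion vector idea described above, the sketch below stores one intensity per meta-emotional dimension and lists the dimensions that are active in a sample. The six dimension names come from the description; the 0-to-1 intensity scale, the 0.5 threshold, and both function names are assumptions made for this example, not part of the CNSCED annotation scheme.

```python
# The six meta-emotional dimensions named in the dataset description.
DIMENSIONS = ("angry", "sad", "aroused", "happy", "surprise", "fear")


def make_emotion_vector(**intensities):
    """Build a six-element intensity vector; unspecified dimensions default to 0.0.

    The 0-to-1 scale is an assumption for illustration.
    """
    unknown = set(intensities) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return tuple(float(intensities.get(dim, 0.0)) for dim in DIMENSIONS)


def active_emotions(vector, threshold=0.5):
    """Return the dimensions whose intensity reaches the (hypothetical) threshold."""
    return [dim for dim, value in zip(DIMENSIONS, vector) if value >= threshold]


# A complex emotion: mostly angry, with noticeable sadness.
sample = make_emotion_vector(angry=0.8, sad=0.6)
print(sample)
print(active_emotions(sample))
```

Under this representation, the classification subtask would predict which dimensions are active, and the regression subtask would predict the intensities themselves.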
Chinese-financed Port Infrastructure
aiddata.org
Updated Jul 25, 2023
(2023). Chinese-financed Port Infrastructure [Dataset]. https://www.aiddata.org/data/chinas-official-seaport-finance-dataset-2000-2021
Explore at:
Dataset updated
Jul 25, 2023
Area covered
China
Description
This dataset tracks 123 seaport projects worth $29.9 billion officially financed by China to construct or expand 78 ports in 46 low-income and middle-income countries from 2000 to 2021.
China, TX Population Breakdown by Gender and Age Dataset: Male and Female...
neilsberg.com
csv, json
Updated Feb 24, 2025
Neilsberg Research (2025). China, TX Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e1d7b2bf-f25d-11ef-8c1b-3860777c1fe6/
Explore at:
Available download formats: json, csv
Dataset updated
Feb 24, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
China, Texas
Variables measured
Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the population of China by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for China. The dataset can be used to understand the population distribution of China by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in China. It can also be used to see how the male-to-female ratio changes across age groups, from the youngest to the oldest.

Key observations

Largest age group (population): Male # 15-19 years (52) | Female # 20-24 years (65). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over

Scope of gender:

Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are supposed to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

Variables / Data Columns

Age Group: This column displays the age group for the China population analysis. There are 18 expected values, defined above in the age groups section.

Population (Male): This column shows the male population in China for each age group.

Population (Female): This column shows the female population in China for each age group.

Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in China for each age group.
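The Gender Ratio column follows directly from the two population columns. A minimal sketch of the arithmetic is shown here; the function name and the sample counts are hypothetical and are not taken from the dataset.

```python
def gender_ratio(male, female):
    """Number of males per 100 females, or None when there are no females."""
    if female == 0:
        return None  # the ratio is undefined for an empty female group
    return 100 * male / female


# Hypothetical counts for a single age group.
print(gender_ratio(50, 40))
```

Because the underlying figures are ACS estimates for a small place, ratios computed for individual age groups can swing widely from one group to the next.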

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for China Population by Gender.

Christian Henriot (2025). Modern China Geospatial Database - Main Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5735393

Modern China Geospatial Database - Main Dataset

Explore at:

Dataset updated

Feb 28, 2025

Dataset authored and provided by

Christian Henriot

License

Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Area covered

China

Description

MCGD_Data_V2.2 contains all the data that we have collected on locations in modern China, plus a number of locations outside of China that we encounter frequently in historical sources on China. All further updates will appear under the name "MCGD_Data" with a time stamp (e.g., MCGD_Data2023-06-21).

You can also have access to this dataset and all the datasets that the ENP-China makes available on GitLab: https://gitlab.com/enpchina/IndexesEnp

Altogether there are 464,970 entries. The data include the name of locations and their variants in Chinese, pinyin, and any recorded transliteration; the name of the province in Chinese and in pinyin; Province ID; the latitude and longitude; the Name ID and Location ID, and NameID_Legacy. The Name IDs all start with H followed by seven digits. This is the internal ID system of MCGD (the NameID_Legacy column records the Name IDs in their original format depending on the source). Location IDs that start with "DH" are data points extracted from China Historical GIS (Harvard University); those that start with "D" are locations extracted from the data points in Geonames; those that have only digits (8 digits) are data points we have added from various map sources.
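The ID conventions described above can be checked mechanically. The following is a minimal sketch, assuming the IDs are plain strings; the function names and the example IDs are hypothetical. Note that "DH" has to be tested before "D", since every "DH" ID also starts with "D".

```python
import re

# Name IDs: the letter H followed by exactly seven digits.
NAME_ID_PATTERN = re.compile(r"H[0-9]{7}")
# Location IDs added from map sources: exactly eight digits.
MAP_ID_PATTERN = re.compile(r"[0-9]{8}")


def is_valid_name_id(name_id):
    """Check that a Name ID has the form H + seven digits."""
    return NAME_ID_PATTERN.fullmatch(name_id) is not None


def location_source(location_id):
    """Infer the origin of a data point from its Location ID prefix."""
    if location_id.startswith("DH"):
        return "China Historical GIS"
    if location_id.startswith("D"):
        return "Geonames"
    if MAP_ID_PATTERN.fullmatch(location_id):
        return "map sources"
    return "unknown"


# Hypothetical IDs, for illustration only.
for loc_id in ("DH12345", "D67890", "12345678"):
    print(loc_id, "->", location_source(loc_id))
```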

One of the main features of the MCGD Main Dataset is the systematic collection and compilation of place names from non-Chinese language historical sources. Locations were designated in transliteration systems that are hardly comprehensible today, which makes it very difficult to find the actual locations they correspond to. This dataset allows for the conversion from these obsolete transliterations to the current names and geocoordinates.

From June 2021 onward, we have adopted a different file naming system to keep track of versions. From MCGD_Data_V1 we have moved to MCGD_Data_V2. In June 2022, we introduced time stamps, which result in the following naming convention: MCGD_Data_YYYY.MM.DD.

UPDATES

MCGD_Data2025_02_28 includes a major change: all the locations listed under Beijing, Shanghai, Tianjin, and Chongqing (北京, 上海, 天津, 重慶) have been duplicated and also listed under the name of the provinces to which they originally belonged before the creation of the four special municipalities after 1949. This is meant to facilitate the matching of data from historical sources. Each location has a unique NameID. Altogether there are 472,818 entries.

MCGD_Data2025_02_27 includes an update on locations extracted from Minguo zhengfu ge yuanhui keyuan yishang zhiyuanlu 國民政府各院部會科員以上職員錄 (Directory of staff members and above in the ministries and committees of the National Government; Nanjing: Guomin zhengfu wenguanchu yinzhuju 國民政府文官處印鑄局, 1944). We also made corrections in the Prov_Py and Prov_Zh columns, as there were some misalignments between the pinyin name and the name in Chinese characters. The file now includes 465,128 entries.

MCGD_Data2024_03_23 includes an update on locations in Taiwan from the Asia Directories. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat Long columns as "Unknown").

MCGD_Data2023.12.22 contains all the data that we have collected on locations in China, whatever the period. Altogether there are 465,603 entries (of which 187 are place names without geocoordinates, labelled in the Lat Long columns as "Unknown"). The dataset also includes locations outside of China for the purpose of matching such locations to the place names extracted from historical sources. For example, one may need to locate individuals born outside of China. Rather than maintaining two separate files, we made the decision to incorporate all the place names found in historical sources in the gazetteer. Such place names can easily be removed by selecting all the entries where the 'Province' data is missing.
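The two filters mentioned in these notes (dropping entries with no province, and setting aside entries whose coordinates are "Unknown") could be sketched as follows. The exact column headers in the distributed file are an assumption here: `Prov_Py` is named in the update notes, while `Name`, `Lat`, and `Long` are stand-ins, and the sample rows are invented purely for illustration.

```python
import csv
import io

# Invented sample rows; real files may use different headers and delimiters.
SAMPLE = """Name,Prov_Py,Lat,Long
Place A,Jiangsu,32.06,118.79
Place B,,48.85,2.35
Place C,Zhejiang,Unknown,Unknown
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(io.StringIO(text)))


def in_china(rows):
    """Keep entries that have province data (non-China entries leave it empty)."""
    return [row for row in rows if row["Prov_Py"].strip()]


def geocoded(rows):
    """Keep entries whose coordinates are not marked as 'Unknown'."""
    return [row for row in rows if "Unknown" not in (row["Lat"], row["Long"])]


rows = load_rows(SAMPLE)
print([row["Name"] for row in in_china(rows)])
print([row["Name"] for row in geocoded(rows)])
```

For a file on disk, the same filters apply after replacing `io.StringIO(text)` with a file handle opened using `newline=""` and an encoding that preserves the Chinese characters, such as UTF-8.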
