Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Data Description
Gender: Balanced between males and females
Accent: Mandarin accent
Recording environment: Quiet background, SNR > 20 dB
Number of participants: Around 300 people, 150 from southern China and 150 from northern China
Distance from microphone: Less than 20 cm
Style: Normal speaking speed and volume, realistic and natural
Equipment: Various types of mobile phones
Content: Recording texts focus on education and entertainment topics, covering daily…
See the full description on the dataset page: https://huggingface.co/datasets/longmaodata/Chinese-dialect.
https://creativecommons.org/publicdomain/zero/1.0/
Dataset Description: The AI-Enhanced English and Chinese Language Learning Dataset is a comprehensive collection of data aimed at advancing language education through the use of artificial intelligence. This dataset includes detailed records from various language learning platforms, combining both traditional classroom activities and AI-driven learning applications. The dataset is suitable for exploring different AI techniques to improve English and Chinese language acquisition, focusing on adaptive learning, feedback analysis, and language practice. Data spans from February 2019 to August 2024, covering diverse language learning scenarios across multiple institutions, including digital language labs, mobile apps, and AI-powered tutoring systems.
The dataset includes hourly data collected from language learners engaging in various activities such as grammar exercises, conversational practice, writing assessments, and interactive quizzes. The data is sourced from multiple regions, including English-speaking and Mandarin-speaking communities, making it ideal for comparative studies on AI-driven learning outcomes. The records encompass a variety of linguistic features and learning metrics, offering valuable insights into student engagement, progress, and performance across different learning contexts.
Features:
Timestamp: Hourly timestamp indicating the time of each learning session.
Learner ID: A unique identifier for each learner.
Age: The age of the learner.
Gender: Gender of the learner (Male, Female, Other).
Native Language: The primary language spoken by the learner.
Country of Residence: The country where the learner is based.
Language Proficiency Level (Initial): The learner's initial language proficiency in English or Chinese (Beginner, Intermediate, Advanced).
Type of Activity: Type of learning activity (Listening, Speaking, Reading, Writing).
Lesson Content Type: The specific focus of the lesson (Grammar, Vocabulary, Pronunciation, etc.).
Number of Lessons Completed: Cumulative count of lessons completed by the learner.
Time Spent on Learning: Total time spent on language learning (in minutes).
Learning Platform or Tool Used: Platform or tool used for learning (App, Website, Classroom Software).
Homework Completion Rate: Percentage of homework assignments completed.
Participation in Interactive Exercises: Frequency of participation in interactive exercises like quizzes and games.
Frequency of Practice Sessions: Number of practice sessions per week.
Test Scores: Scores from language proficiency tests, covering various areas such as grammar, listening, and vocabulary.
Speaking Fluency Scores: Scores evaluating pronunciation accuracy and speech rate.
Reading Comprehension Scores: Assessment scores for reading comprehension tasks.
Writing Quality: Evaluation of writing quality based on grammatical accuracy and vocabulary use.
Change in Proficiency Level: Measured change in language proficiency over time.
Assignment Grades: Grades received on language assignments.
Error Correction Rate: The rate at which learners correct their mistakes.
Feedback from Instructors/Tutors: Qualitative feedback provided by instructors or AI tutors.
Study Session Duration: Average duration of study sessions.
Learning Consistency: Number of days per week studied.
User Activity Type: Type of user activity (Active or Passive Participation).
Engagement with Additional Learning Materials: Frequency of accessing extra learning resources (e.g., videos, articles).
Peer Interaction Score: Score representing participation in study groups or discussion forums.
Motivation Level: Self-reported level of motivation.
Learning Environment: Type of learning environment (Home, School, Language Center).
Learning Mode: Mode of learning (Self-Paced or Instructor-Led).
Accessibility of Learning Resources: Availability of learning materials to the learner.
Use of AI Tools: Whether AI tools like chatbots or speech recognition software were used.
Language Learning Goals: Purpose of language learning (Academic, Professional, Personal).
This dataset offers rich data for researchers and educators to analyze the impact of AI on language learning outcomes, make cross-linguistic comparisons, and develop personalized AI-driven language education models.
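As a rough illustration of how the feature list might be used, the sketch below groups learners by whether they used AI tools and averages their test scores. The column headers are assumed to match the feature names in the description, and the four rows are made-up placeholder values, not records from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical rows for illustration only; the real file's headers and
# values may differ from the feature names listed in the description.
SAMPLE_CSV = """Learner ID,Use of AI Tools,Test Scores
L001,Yes,80
L002,No,70
L003,Yes,90
L004,No,60
"""


def mean_score_by_group(rows, group_col, score_col):
    """Return the mean of score_col for each distinct value of group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(float(row[score_col]))
    return {key: mean(values) for key, values in groups.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
result = mean_score_by_group(rows, "Use of AI Tools", "Test Scores")
print(result)
```

A plain group mean like this does not control for age, initial proficiency, or time spent learning, so it should be read as a first look rather than an estimate of the effect of AI tools.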
Mandarin Chinese (China) Noisy Monologue Smartphone Speech Dataset (Guiding): monologue speech collected from given prompts, covering generic domains such as in-car, smart home, and voice assistant, and recorded in noisy conditions. Recordings are transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse set of speakers (205 people), which is intended to improve model performance on real, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and use; all our datasets comply with GDPR, CCPA, and PIPL.
Speech is essential for human communication, but millions of people lose the ability to speak due to conditions such as amyotrophic lateral sclerosis (ALS) or stroke. Assistive technologies like brain-computer interfaces (BCIs), which can convert brain signals into speech, offer hope for these patients. However, these technologies still face challenges in decoding accuracy. This issue is especially challenging for tonal languages like Mandarin Chinese. Their processing requires phoneme encoding and precise tonal information handling, which complicates the decoding of brain signals. Furthermore, most existing speech datasets are based on Indo-European languages, which hinders our understanding of how tonal information is encoded in the brain. To address this, we introduce a comprehensive open dataset, which includes multimodal signals from 30 subjects using Mandarin Chinese across overt, silent, and imagined speech modes, covering electroencephalogram (EEG), surface electromyogram (sEMG), and speech recordings. Unlike many datasets that focus on a single speech mode, this one integrates three speech modes, providing a more comprehensive view of speech-related activity. Incorporating Mandarin facilitates an in-depth examination of the inner mechanisms that encode tonal variations and their interaction with motor and auditory speech representations. This is crucial for enhancing tonal language decoding in BCIs. Beyond BCI applications, this dataset lays a valuable groundwork for exploring the neural encoding of tonal languages, investigating tone-related brain dynamics, and improving assistive communication strategies. It supports cross-linguistic speech processing research and contributes to data-driven neural speech decoding technology innovations.
The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prevent people with limited English proficiency (also known as LEP persons) from obtaining services and information or participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited-English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan. Public Use Microdata Areas (PUMAs) are geographies of at least 100,000 people that are nested within states or equivalent entities. States are able to delineate PUMAs within their borders, or use PUMA criteria provided by the Census Bureau. Data were gathered from the 2019-2023 American Community Survey (ACS) 5-Year Estimates, Table B16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.
Vietnamese
Source of PUMA boundaries: US Census Bureau, TIGER/Line Files. Please refer to U:_OngoingProjects\LEP\ACS_5YR_B16001_PUMAs_metadata.xlsx for the full attribute lookup and the fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.
Hmong–Mien (HM)-speaking populations, widely distributed in South China, northern Thailand, Laos, and Vietnam, have experienced different settlement environments, dietary habits, and pathogenic exposure. However, their specific biological adaptation remains largely uncharacterized, although it is important for population evolutionary genetics and Trans-Omics for regional Precision Medicine. In addition, the origin and genetic diversity of HM people and their phylogenetic relationship with surrounding modern and ancient populations are also unknown. Here, we reported genome-wide SNPs in 52 representative Miao people and combined them with 144 HM people from 13 geographically representative populations to characterize the full genetic admixture and adaptive landscape of HM speakers. We found obvious genetic substructure among geographically different HM populations: one group localized in the HM cline, and the others showed affinity with Han Chinese. We also identified one new ancestral lineage specific to HM people, distributed from Sichuan and Guizhou in the north to Thailand in the south. The sharing patterns of the newly identified homogeneous ancestry component, combined with the admixture times estimated via the decay of linkage disequilibrium and haplotype sharing in GLOBETROTTER, suggested that modern HM-speaking populations originated in Southwest China and migrated southward in the historic period, which is consistent with reconstructions from linguistic and archeological documents. Additionally, we identified specific adaptive signatures associated with several important biological functions of the human nervous system. Our pilot work emphasized the importance of anthropologically informed sampling and deep genetic structure reconstruction via whole-genome sequencing as the next step in the deep Chinese Population Genomic Diversity Project (CPGDP), especially in regions with rich ethnolinguistic diversity.
The Sui people, who belong to the Tai-Kadai-speaking family, remain poorly characterized due to a lack of genome-wide data. To infer the fine-scale population genetic structure and putative genetic sources of the Sui people, we genotyped 498,655 genome-wide single-nucleotide polymorphisms (SNPs) using SNP arrays in 68 Sui individuals from seven indigenous populations in Guizhou province and Guangxi Zhuang Autonomous Region in Southwest China, and co-analyzed them with available East Asian data via a series of population genetic methods including principal component analysis (PCA), ADMIXTURE, pairwise Fst genetic distance, f-statistics, qpWave, and qpAdm. Our results revealed that Guangxi and Guizhou Sui people showed a strong genetic affinity with populations from southern China and Southeast Asia, especially Tai-Kadai- and Hmong-Mien-speaking populations as well as ancient Iron Age Taiwan Hanben and Gongguan individuals, supporting the hypothesis that the Sui people originally came from southern China. The indigenous Tai-Kadai-related ancestry (represented by Li), Northern East Asian-related ancestry, and Hmong-Mien-related lineage contributed to the formation of the Sui people. We identified genetic substructure within Sui groups: Guizhou Sui people were relatively homogeneous and possessed genetic profiles similar to those of neighboring Tai-Kadai-related populations, such as Maonan, while Sui people in Yizhou and Huanjiang of Guangxi might have received unique, additional gene flow from Hmong-Mien-speaking populations and Northern East Asians, respectively, after diverging from other Sui populations. Sui people could be modeled as an admixture of ancient Yellow River Basin farmer-related ancestry (36.2–54.7%) and ancient coastal Southeast Asian-related ancestry (45.3–63.8%).
We also identified the potential positive selection signals related to the disease susceptibility in Sui people via integrated haplotype score (iHS) and number of segregating sites by length (nSL) scores. These genomic findings provided new insights into the demographic history of Tai-Kadai-speaking Sui people and their interaction with neighboring populations in Southern China.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Guizhou Province harbors extensive ethnolinguistic and cultural diversity with Sino-Tibetan-, Hmong–Mien-, and Tai–Kadai-speaking populations. However, previous genetic analyses mainly focused on the genetic admixture history of the former two linguistic groups. The admixture history of Tai–Kadai-speaking populations in Guizhou needed to be characterized further. Thus, we genotyped genome-wide SNP data from 41 Tai–Kadai-speaking Maonan people and made a comprehensive population genetic analysis to explore their genetic origin and admixture history based on patterns of shared alleles and haplotypes. We found a genetic affinity among geographically different Tai–Kadai-speaking populations, especially between Guizhou Maonan people and reference Maonan from Guangxi. Furthermore, formal tests based on f3/f4-statistics identified a close connection between Maonan and geographically adjacent Hmong–Mien and Sino-Tibetan people, which was consistent with their historically documented shared material culture (Zhang et al., iScience, 2020, 23, 101032). Fitted qpAdm-based two-way admixture models with ancestral sources from northern and southern East Asians demonstrated that Maonan people were an admixed population with primary ancestry related to Guangxi historical people and a minor proportion of ancestry from Northeast Asians, consistent with their linguistically supported southern China origin. Here, we presented the landscape of genetic structure and diversity of Maonan people and a simple demographic model for their evolutionary process. Future whole-genome-sequencing projects could provide more detailed information about the population history and adaptive history of the Guizhou Maonan people.
The Delaware Valley Regional Planning Commission (DVRPC) is committed to upholding the principles and intentions of the 1964 Civil Rights Act and related nondiscrimination statutes in all of the Commission’s work, including publications, products, communications, public input, and decision-making processes. Language barriers may prevent people with limited English proficiency (also known as LEP persons) from obtaining services and information or participating in public planning processes. To better identify LEP populations and thoroughly evaluate the Commission’s efforts to provide meaningful access, DVRPC has produced this Limited-English Proficiency Plan. This is the data that was used to make the maps for the upcoming plan. Data were gathered from the 2019-2023 American Community Survey (ACS) 5-Year Estimates, Table C16001: Language Spoken at Home by Ability to Speak English for the Population 5 Years and Over. ACS data are derived from a survey and are subject to sampling variability.
Vietnamese
Source of tract boundaries: US Census Bureau, TIGER/Line Files. Please refer to U:_OngoingProjects\LEP\ACS_5YR_C16001_LEP_metadata.xlsx for the full attribute lookup and the fields used in making the DVRPC LEP Map Series. Please contact Chris Pollard (cpollard@dvrpc.org) should you have any questions about this dataset.
The equal access accommodations Indicator is measured by comparing the percent of public contact position (PCP) employees who speak Spanish to the percent of Spanish speakers who have limited English proficiency (LEP) citywide. The Equal Access to Services Ordinance includes a requirement for City departments to offer bilingual services based on citywide demographics. In FY2016-2017, the two languages required by the ordinance were Spanish and Chinese. We chose to measure Spanish-speaking PCP employees for this Indicator because Spanish speakers comprise a larger proportion of the population.
This is a dataset hosted by the city of Oakland in California. The organization has an open data platform found here and they update their information according to the amount of data that is brought in. Explore Oakland's Data using Kaggle and all of the data sources available through the city of Oakland organization page!
This dataset is maintained using Socrata's API and Kaggle's API. Socrata has assisted countless organizations with hosting their open data and has been an integral part of the process of bringing more data to the public.
Cover photo by Luís Eusébio on Unsplash
Unsplash Images are distributed under a unique Unsplash License.
This dataset is distributed under NA
Freezing of gait (FOG) is a disabling symptom with a complex episodic nature that is frequently experienced by people with Parkinson's disease (PD). Although China has the largest population with PD in the world, no Chinese version of the freezing of gait questionnaire (FOGQ), the instrument most widely used to assess FOG, has yet been developed. This study aimed to translate and adapt the original version of the FOGQ to create a Chinese version, the FOGQ-CH, and then to assess its reliability, calculate the Minimal Detectable Change (MDC), and investigate its validity. The forward-backward translation model was adopted, and cultural adaptation included expert review and pretesting. For the reliability study, 31 native Chinese-speaking patients with PD were assessed twice at an interval of 7–10 days. Internal consistency and test-retest reliability of the FOGQ-CH were measured by Cronbach's alpha (Cα) and the Intraclass Correlation Coefficient (ICC). For the validity study, 34 native speakers of Chinese with PD were included. To explore convergent validity, relationships were tested between the FOGQ-CH and the Unified Parkinson's Disease Rating Scale Part II (UPDRS II) and Part III (UPDRS III), the Timed Up and Go Test (TUGT), the Timed Up and Go Test with a cognitive task (TUGT-Cog), and walking speed (10 MWT speed) and step length (10 MWT step length) in a 10-m Walk Test. To explore predictive validity, the number of falls over a 6-month follow-up was assessed. The area under the ROC curve (AUC) was employed to test the capacity of the FOGQ-CH to discriminate those with falls. From the reliability study, Cα = 0.823, ICC = 0.786. The MDC0.90 = 4.538.
From the validity study, the FOGQ-CH showed moderate correlations with UPDRS II (rho = 0.560, p = 0.001), UPDRS III (rho = 0.451, p = 0.007), TUGT (rho = 0.556, p = 0.007), TUGT-Cog (rho = 0.557, p = 0.001), 10MWT-speed (rho = −0.478, p = 0.004), 10MWT-step length (rho = −0.419, p = 0.014), and the number of falls followed up for 6 months (rho = 0.356, p = 0.045). The AUC = 0.777 (p = 0.036) for predicting whether the participants will have multiple falls (two or more) in the following 6 months. The FOGQ-CH showed good reliability and validity for assessing Chinese native speaking patients with PD. In addition, the FOGQ-CH showed good efficacy for predicting multiple falls in the following 6 months.
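The abstract reports an ICC and an MDC0.90 but not the formula behind them. A minimal sketch, assuming the conventional definitions (SEM = SD × sqrt(1 − ICC) and MDC = z × sqrt(2) × SEM, with z = 1.645 at the 90% level), is given below. The standard deviation passed in is a hypothetical placeholder, since the abstract does not report one, and the authors may have computed their value differently.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC), the conventional reliability-based form."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change(sd: float, icc: float, z: float = 1.645) -> float:
    """MDC = z * sqrt(2) * SEM; z = 1.645 corresponds to the 90% level."""
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, icc)


# ICC is taken from the abstract; the SD is a hypothetical value chosen
# only to exercise the function, not a figure from the study.
example_sd = 4.2
print(round(minimal_detectable_change(example_sd, icc=0.786), 3))
```

Under these definitions, a change in FOGQ-CH score smaller than the MDC cannot be distinguished from measurement error at the chosen confidence level.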
Additional file 2: Table S2. The pairwise Fst values (below the diagonal) and p-values (above the diagonal) of the Guizhou Buyei population and 53 other published populations worldwide. The pairwise Fst and p-values of the Guizhou Buyei population and 13 other published populations in China were calculated. The genetic differentiation between the Guizhou Buyei and Guizhou Miao was the smallest (the closest genetic affinity, Fst = 0.01508), followed by the Henan Han (Fst = 0.01799). The genetic distance between the northwest Hui and Guizhou Buyei was the largest (the farthest genetic affinity, Fst = 0.05908). The pairwise Fst genetic distances and p-values between the Guizhou Buyei and 40 other reference populations in the world (outside China) showed that the Guizhou Buyei had the smallest genetic distance from the Pakistan Hazara (the closest genetic affinity, Fst = 0.01783), followed by the Kashmiri (Fst = 0.02084), and the largest genetic differentiation (the farthest genetic affinity, Fst = 0.12165) from the Gdansk people in Poland.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This study examined the spoken narrative skills of a group of bilingual Mandarin–English speaking 3–6-year-olds (N = 25) in Australia, using a remote online story-retell task. Bilingual preschoolers are an understudied population, especially those who are speaking typologically distinct languages such as Mandarin and English which have fewer structural overlaps compared to language pairs that are typologically closer, reducing cross-linguistic positive transfer. We examined these preschoolers’ spoken narrative skills as measured by macrostructures (the global organization of a story) and microstructures (linguistic structures, e.g., total number of utterances, nouns, verbs, phrases, and modifiers) across and within each language, and how various factors such as age and language experiences contribute to individual variability. The results indicate that our bilingual preschoolers acquired spoken narrative skills similarly across their two languages, i.e., showing similar patterns of productivity for macrostructure and microstructure elements in both of their two languages. While chronological age was positively correlated with macrostructures in both languages (showing developmental effects), there were no significant correlations between measures of language experiences and the measures of spoken narrative skills (no effects for language input/output). The findings suggest that although these preschoolers acquire two typologically diverse languages in different learning environments, Mandarin at home with highly educated parents, and English at preschool, they displayed similar levels of oral narrative skills as far as these macro−/micro-structure measures are concerned. This study provides further evidence for the feasibility of remote online assessment of preschoolers’ narrative skills.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clickers might have a bright future in China if properly introduced, although they have not yet been widely acknowledged as an effective tool to facilitate English learning and teaching in Chinese contexts. By randomly selecting participants from undergraduates in a university in China over four academic years, this study aims to identify the impact of clickers on college English listening and speaking skills, and differences in cognitive loads between clicker-assisted and traditional multimedia-assisted instruction modes. It was concluded that in China's college English classes, compared with multimedia-assisted instruction, (1) clickers could improve college English listening skills; (2) clickers could improve college English speaking skills; and (3) clickers could reduce undergraduates' cognitive loads in college English classes. Reasons for the results and limitations of this study were also explored and discussed, based on learning, teaching, and cognitive load theories. Some suggestions for future research were also raised.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Note: Three of our subjects (A, B, C) all speak fluent Taiwanese. Their Japanese and Mandarin Chinese had degenerated to a low level. In cases A and C, strong emotion was combined with the language they perceived. The symbols have the following meanings:
○: Fluent
Δ: Low fluency, with only partial understanding and production
▲: Uses Japanese only when counting numbers
×: Very low fluency
+ to +++: Emotional response to languages other than the mother language (Taiwanese)
Language ability and related emotion triggered by language.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Dong people in Southwest China are officially recognised as an ethnic group, but there has been a lack of population genetic research on this group, especially based on mitochondrial DNA data. This study aimed to characterise the sequences and haplogroups of the mitochondrial DNA control region in a typical Dong population, and to support the construction of a forensic mitochondrial DNA analysis reference database in East Asia. The sequences of the mitochondrial DNA control region were analysed in 200 Dong individuals in Guizhou. The haplotype frequencies, haplogroup distribution and pairwise Fst values of the Guizhou Dong and 51 other populations in the world were calculated and interpreted to explore genetic polymorphism and population relationships. A total of 180 haplotypes were detected, with frequencies of 0.005–0.02. All haplotypes were assigned to 97 different haplogroups. The haplotype diversity and random matching probability were 0.998643 and 0.00635, respectively. The pairwise Fst values and corresponding p values of the 52 populations showed that the Guizhou Dong had the closest genetic relationship with the Henan Han and the Guizhou Miao in China, and were closest to the Punjab population in Pakistan and the Kashmiri population when compared with the world populations. Our study characterised the matrilineal genetic structure of the Guizhou Dong using mitochondrial DNA, which helps to promote the establishment of the forensic DNA reference database in East Asia and provides a reference for anthropological research.
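The two summary statistics reported above can be sketched as follows, assuming the standard forensic definitions: random matching probability as the sum of squared haplotype frequencies, and haplotype diversity as n/(n − 1) × (1 − sum of squared frequencies). The abstract does not give the per-haplotype counts, so the sample below is a hypothetical count distribution chosen only because it matches the reported totals (200 individuals, 180 haplotypes, frequencies between 0.005 and 0.02); other distributions fit those totals equally well.

```python
from collections import Counter


def random_match_probability(haplotypes):
    """Sum of squared haplotype frequencies in the sample."""
    n = len(haplotypes)
    counts = Counter(haplotypes)
    return sum((c / n) ** 2 for c in counts.values())


def haplotype_diversity(haplotypes):
    """Unbiased haplotype diversity: n / (n - 1) * (1 - sum of p_i squared)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two individuals")
    return n / (n - 1) * (1.0 - random_match_probability(haplotypes))


# Hypothetical distribution (not from the source): 1 haplotype seen 4 times,
# 4 seen 3 times, 9 seen twice, and 166 seen once.
copies_to_haplotype_count = {4: 1, 3: 4, 2: 9, 1: 166}
sample = []
label = 0
for copies, how_many in copies_to_haplotype_count.items():
    for _ in range(how_many):
        sample.extend([f"H{label}"] * copies)
        label += 1

print(len(sample), len(set(sample)))
print(round(random_match_probability(sample), 5))
print(round(haplotype_diversity(sample), 6))
```

With this placeholder distribution the two functions return values that agree with the figures in the abstract to the reported precision, which suggests, without confirming, that the authors used the same definitions.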