In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, slightly more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.
Languages in the United States
The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.
Different languages at home
The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California. About 45 percent of its population spoke a language other than English at home in 2023.
As of February 2025, English was the most popular language for web content, with over 49.4 percent of websites using it. Spanish ranked second, with six percent of web content, while content in German followed, with 5.6 percent.
English as the leading online language
The United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets. The internet user base in both countries combined, as of January 2023, was over a billion individuals. This has led to most online information being created in English. Consequently, even those who are not native speakers may use it for convenience.
Global internet usage by regions
As of October 2024, the number of internet users worldwide was 5.52 billion. In the same period, Northern Europe and North America led in internet penetration rates worldwide, with around 97 percent of their populations accessing the internet.
In 2023, around 43.37 million people in the United States spoke Spanish at home. In comparison, approximately 998,179 people were speaking Russian at home during the same year.
Mexico is the country with the largest number of native Spanish speakers in the world. As of 2024, 132.5 million people in Mexico spoke Spanish with a native command of the language. Colombia was the nation with the second-highest number of native Spanish speakers, at around 52.7 million. Spain came in third, with 48 million, and Argentina fourth, with 46 million.
Spanish, a world language
As of 2023, Spanish ranked as the fourth most spoken language in the world, behind only English, Chinese, and Hindi, with over half a billion speakers. Spanish is the official language of over 20 countries, most of them on the American continent; it is also one of the official languages of Equatorial Guinea in Africa. It also has a strong influence in other countries, such as the United States, Morocco, and Brazil, which are included in the list of non-Hispanic countries with the highest number of Spanish speakers.
The second most spoken language in the U.S.
In the most recent data, Spanish ranked as the language other than English with the highest number of speakers in the U.S., with 12 times as many speakers as the language in second place. This comes as no surprise given the long history of migration from Latin American countries to the United States. Moreover, during fiscal year 2022 alone, 5 of the top 10 countries of origin of naturalized people in the U.S. were Spanish-speaking countries.
As of 2024, JavaScript and HTML/CSS were the most commonly used programming languages among software developers around the world, with more than 62 percent of respondents stating that they used JavaScript and around 53 percent using HTML/CSS. Python, SQL, and TypeScript rounded out the top five most widely used programming languages around the world.
Programming languages
At a very basic level, programming languages serve as sets of instructions that direct computers on how to behave and carry out tasks. Thanks to the increased prevalence of, and reliance on, computers and electronic devices in today’s society, these languages play a crucial role in the everyday lives of people around the world. An increasing number of people are interested in furthering their understanding of these tools through courses and bootcamps, while current developers are constantly seeking new languages and resources to add to their skills. Furthermore, programming knowledge is becoming an important skill to possess within various industries throughout the business world. Job seekers with skills in Python, R, and SQL will find that these are among the most highly desirable data science skills and will likely help in their search for employment.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Please cite the following paper when using this dataset:
N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)
Abstract
The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.
For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.
The Instagram posts in this dataset are present in 161 different languages, out of which the top 10 languages in terms of frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), and Turkish (4632 posts).
There are 535,021 distinct hashtags in this dataset, with the top 10 hashtags in terms of frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), and #coronavirusoutbreak (34567 posts).
The following is a description of the attributes present in this dataset:
Post ID: Unique ID of each Instagram post
Post Description: Complete description of each post in the language in which it was originally published
Date: Date of publication in MM/DD/YYYY format
Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API
Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API
Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
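The attribute list above can be read with a few lines of standard-library Python. The following is only a sketch: the column headers are assumed to match the attribute names exactly, the file format is assumed to be CSV, and the four sample rows are invented for illustration rather than taken from the dataset.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical sample rows: the headers mirror the attribute names listed
# above (an assumption about the file), and the posts are made up.
SAMPLE_CSV = """Post ID,Post Description,Date,Language code,Full Language,Sentiment
1,Stay safe everyone,03/15/2020,en,English,positive
2,Usen mascarilla,04/02/2021,es,Spanish,positive
3,Another lockdown announced,01/10/2022,en,English,negative
4,Testing site opens today,09/05/2024,en,English,neutral
"""


def load_posts(handle):
    """Read rows and parse the Date column, which uses MM/DD/YYYY."""
    posts = []
    for row in csv.DictReader(handle):
        row["Date"] = datetime.strptime(row["Date"], "%m/%d/%Y").date()
        posts.append(row)
    return posts


def sentiment_by_language(posts):
    """Count sentiment labels for each value of the Full Language column."""
    counts = defaultdict(Counter)
    for post in posts:
        counts[post["Full Language"]][post["Sentiment"]] += 1
    return counts


posts = load_posts(io.StringIO(SAMPLE_CSV))
counts = sentiment_by_language(posts)
for language, tally in sorted(counts.items()):
    print(language, dict(tally))
```

Replacing `io.StringIO(SAMPLE_CSV)` with an open file handle would apply the same grouping to the real data, provided the headers match.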
Open Research Questions
This dataset is expected to be helpful for the investigation of the following research questions and even beyond:
How does sentiment toward COVID-19 vary across different languages?
How has public sentiment toward COVID-19 evolved from 2020 to the present?
How do cultural differences affect social media discourse about COVID-19 across various languages?
How has COVID-19 impacted mental health, as reflected in social media posts across different languages?
How effective were public health campaigns in shifting public sentiment in different languages?
What patterns of vaccine hesitancy or support are present in different languages?
How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?
What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?
How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?
What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?
All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks annual reading and language arts proficiency from 2010 to 2022 for Top Of The World Elementary School vs. California and Laguna Beach Unified School District
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of descriptive data about vowel dissimilation patterns across 116 languages and 133 unique patterns, representing an array of 38 linguistic families, including Afro-Asiatic, Austronesian, Indo-European, Mayan, Niger-Congo, Sino-Tibetan, Turkic, as well as isolates like Basque and Ainu. The largest representation within the language families is for the Austronesian family, represented with 28 patterns observed in 23 languages, followed by 17 patterns in 15 Indo-European languages, 10 patterns in 10 Atlantic-Congo and 8 Afro-Asiatic languages, respectively, and 8 distinct patterns found in 7 Mayan languages. Other smaller language families are also represented and make up half of the total patterns collected in the survey. Linguistic isolates are represented with 5 patterns in 3 languages.
The dissimilation is signaled from the input-output pairs and the morpho-syntactic context in which the pattern is noticed (a specific morpheme, group of morphemes, or unlimited with respect to the morpho-syntax). The information and the examples are sourced from grammatical descriptions, primarily reference and descriptive grammars, but also dictionaries, wordlists, corpora, various online materials like forums, song lyrics, news portals, magazines, religious texts. Valuable sources are phonological descriptions and phonological studies on individual languages as well as descriptive and theoretical papers offering analyses of dissimilative patterns in various frameworks.
Genetic information for individual languages is sourced from Glottolog v. 5.2, supplemented with information from the sources themselves where necessary. For example, in some cases the language name in Glottolog is different from the name in the source, in which case priority is given to the source. Linguistic systems are presented in alphabetical order according to the major name followed by the modifier. This means that variants of the same major linguistic system are presented after one another, as in the case of Basque, Guere, etc.
Every linguistic system / languoid is identified with an ISO-3 code or Glottocode (if the ISO-3 code is not available, usually for smaller variants), language family, area(s) where spoken, and the list of sources the data are retrieved from. The general information is followed by the marker 'phonological' or 'morpho-phonological', depending on the observed nature of the pattern. Next is the information on the dissimilative regularity and the morpho-syntactic context in which the pattern functions, followed by the data, represented as lists of examples showing the regular pattern in contrast to the dissimilative one, including notes about exceptions and general phonological tendencies in the language. The amount of data available is sadly not uniform and is in several cases scarce. In some cases only representative examples are available, and in others all of the available data are taken into account, even if that means the pattern is represented with only five examples.
glottocode - Unique language identifier from Glottolog v. 5.2 (e.g., adyg1241)
language.x - Language name (e.g., "Adyghe")
iso.x - ISO 639-3 code (e.g., ady)
family - Language family (e.g., "Abkhaz-Adyge")
subfamily - Subgroup (e.g., "Circasian")
language_glottolog - Glottolog's standardized language name
language_glottolog.1 - Secondary Glottolog reference
iso.y - Alternate ISO code (if different from iso.x)
level - Language/dialect classification ("language" or "dialect")
area - Macro-region (e.g., "Eurasia", "Africa")
latitude - Decimal degrees
longitude - Decimal degrees
countries - ISO country codes (e.g., "RU;TR")
VD.type - Pattern type (P = phonological, MP = morpho-phonological)
feature.INPUT - Underlying vowel feature (e.g. [+low])
feature.OUTPUT - Resulting feature (e.g. [-low])
feature.CONTEXT - Phonological context triggering change
other.features - Additional relevant features (e.g. [+round])
type.of.identity - What kind of identity is necessary for dissimilation ("full" or "partial")
vowel.length - Sensitivity to vowel length ("no", "feeds", "bleeds")
adjacent - Locality condition ("syllable", "root node", "foot", "unlimited", "variable")
morphemes.involved - Morpho-syntactic context (e.g., "pl", "poss")
another - Secondary morpheme category (if applicable)
class - Word class affected ("noun", "verb", "both")
direction - "regressive" or "progressive" dissimilation
trigger - Where the dissimilation originates ("prefix", "suffix", "root")
location - Locus of change ("root", "suffix", etc.)
prosody.related - Stress/tone involvement ("yes"/"no")
alternative - Alternative value to dissimilative (e.g. "default", "harmony", "reduplication")
feature_change - Descriptive string (e.g. [[+low]] → [[-low]])
morpheme_categories - Grammatical categories (e.g. "pers/num")
affiliation - Language family with sub-branches
subclassification - Detailed genealogical tree
countries - Repeat of ISO country codes
kase1253 | Kasem | xsm | Atlantic-Congo | Grusi | MP | [-low] | [+low] | [+high] | [+round] | partial | feeds | syllable | pl | no | noun | regressive | suffix | root | no | default | Kasem | Kasem | xsm | language | Africa | 11.0824 | -1.39076 | BF;GH | Atlantic-Congo, Volta-Congo, North Volta-Congo, Gur, Central Gur, Southern Central Gur, Grusi, Northern Grusi, Nuna-Kasem | (East_Kasem:1,Fere:1,Lela:1,Nuclear_Kasem:1,Nunuma:1,West_Kasem:1)kase1253:1; | [[-low]] → [[+low]] | pl |
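As an illustration of how such a row maps onto the attributes described above, the sketch below splits the Kasem example on the pipe character and pairs each value with a column name. The column order used here is an assumption inferred from the values in the example row (it differs from the order of the attribute list), so it should be checked against the actual file header before use.

```python
# Column order inferred from the values of the example row; this is an
# assumption, not a documented layout of the dataset.
COLUMNS = [
    "glottocode", "language.x", "iso.x", "family", "subfamily", "VD.type",
    "feature.INPUT", "feature.OUTPUT", "feature.CONTEXT", "other.features",
    "type.of.identity", "vowel.length", "adjacent", "morphemes.involved",
    "another", "class", "direction", "trigger", "location",
    "prosody.related", "alternative", "language_glottolog",
    "language_glottolog.1", "iso.y", "level", "area", "latitude",
    "longitude", "countries", "affiliation", "subclassification",
    "feature_change", "morpheme_categories",
]

ROW = (
    "kase1253 | Kasem | xsm | Atlantic-Congo | Grusi | MP | [-low] | [+low] | "
    "[+high] | [+round] | partial | feeds | syllable | pl | no | noun | "
    "regressive | suffix | root | no | default | Kasem | Kasem | xsm | "
    "language | Africa | 11.0824 | -1.39076 | BF;GH | Atlantic-Congo, "
    "Volta-Congo, North Volta-Congo, Gur, Central Gur, Southern Central Gur, "
    "Grusi, Northern Grusi, Nuna-Kasem | (East_Kasem:1,Fere:1,Lela:1,"
    "Nuclear_Kasem:1,Nunuma:1,West_Kasem:1)kase1253:1; | "
    "[[-low]] → [[+low]] | pl |"
)


def parse_row(line, columns=COLUMNS):
    """Split a pipe-delimited row and pair each field with a column name."""
    fields = [field.strip() for field in line.strip().strip("|").split("|")]
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    record = dict(zip(columns, fields))
    # Coordinates are decimal degrees; country codes are semicolon-separated.
    record["latitude"] = float(record["latitude"])
    record["longitude"] = float(record["longitude"])
    record["countries"] = record["countries"].split(";")
    return record


record = parse_row(ROW)
print(record["language.x"], record["VD.type"], record["feature_change"])
```

The length check is deliberate: if the real file uses a different column order or count, the function fails loudly instead of silently mislabeling values.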
The statistic reflects the distribution of languages in Canada in 2022. In 2022, 87.1 percent of the total population in Canada spoke English as their native tongue.
https://www.cognitivemarketresearch.com/privacy-policy
According to Cognitive Market Research, the global Language Learning App market size is USD 3258.2 million in 2024 and will expand at a compound annual growth rate (CAGR) of 17.00% from 2024 to 2031.
North America held the largest share, more than 40% of global revenue, with a market size of USD 1303.3 million in 2024, and will grow at a compound annual growth rate (CAGR) of 15.2% from 2024 to 2031.
Europe accounted for a share of over 30% of the global market, with a market size of USD 977.5 million.
Asia Pacific held around 23% of global revenue, with a market size of USD 749.4 million in 2024, and will grow at a CAGR of 19.0% from 2024 to 2031.
Latin America held more than 5% of global revenue, with a market size of USD 162.9 million in 2024, and will grow at a CAGR of 16.4% from 2024 to 2031.
Middle East and Africa held around 2% of global revenue, with a market size of USD 65.2 million in 2024, and will grow at a CAGR of 16.7% from 2024 to 2031.
The Offline Type held the highest Language Learning App market revenue share in 2024.
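The figures above can be sanity-checked with the standard compound-growth formula, value = base × (1 + CAGR)^years. The sketch below assumes seven compounding periods between 2024 and 2031 and uses only the numbers quoted above; the report excerpt does not state a projected 2031 value, so the result is an illustration of the formula rather than a figure from the source.

```python
def project_value(base, cagr, years):
    """Compound a base value forward: base * (1 + cagr) ** years."""
    return base * (1.0 + cagr) ** years


# Figures quoted in the text, in USD million (2024).
GLOBAL_2024 = 3258.2
GLOBAL_CAGR = 0.17
YEARS = 2031 - 2024  # assumption: seven annual compounding periods

REGIONS_2024 = {
    "North America": 1303.3,
    "Europe": 977.5,
    "Asia Pacific": 749.4,
    "Latin America": 162.9,
    "Middle East and Africa": 65.2,
}

projected_2031 = project_value(GLOBAL_2024, GLOBAL_CAGR, YEARS)
print(f"Implied 2031 global size: USD {projected_2031:.1f} million")

# Regional shares of the 2024 global figure.
for region, size in REGIONS_2024.items():
    print(f"{region}: {size / GLOBAL_2024:.1%}")
```

The regional sizes sum to roughly the global figure, which suggests the quoted shares are rounded values of the same breakdown.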
Market Dynamics of Language Learning App Market
Key Drivers for Language Learning App Market
Rise in Online Education to Increase the Demand Globally
The rise in online education is a significant driver of the Language Learning App Market. As traditional education moves towards digital platforms, there is a growing demand for convenient and accessible language learning solutions. Language learning apps offer flexibility, allowing users to learn at their own pace and schedule, without the constraints of traditional classroom settings. Moreover, the availability of a wide range of languages and learning materials on these apps caters to diverse learner needs and preferences. Additionally, the interactive and engaging nature of language learning apps, with features such as quizzes, games, and multimedia content, enhances user engagement and motivation. With the increasing popularity of online education, language learning apps are poised to experience continued growth as essential tools for language acquisition in the digital age.
Increasing Focus on Skill Development to Propel Market Growth
The increasing focus on skill development, particularly in an ever-evolving job market, is driving the Language Learning App Market. Employers increasingly value language proficiency as a valuable skill, especially in multinational and diverse workplaces. As a result, individuals are seeking efficient and accessible ways to acquire new language skills to enhance their employability and career prospects. Language learning apps offer a convenient and flexible solution, allowing users to learn languages at their own pace and convenience. Moreover, these apps often provide interactive and engaging learning experiences, with features such as quizzes, games, and real-world scenarios, making language learning more enjoyable and effective. With the growing emphasis on skill development, the demand for language learning apps is expected to continue to rise as individuals seek to expand their linguistic abilities for personal and professional growth.
Restraint Factor for the Language Learning App Market
Engagement and Retention
Engagement and retention pose significant challenges in the Language Learning App Market. Sustaining user interest and motivation over the long term is essential for effective language acquisition, yet many users experience difficulty maintaining consistency in their learning habits. Language learning can be a daunting and time-consuming endeavor, leading to user fatigue and drop-off rates. Additionally, competing demands on users' time and attention, as well as the abundance of alternative learning resources, further exacerbate these challenges. Language learning apps must continuously innovate to provide engaging and personalized learning experiences, incorporating features such as gamification, social interaction, and progress tracking to enhance user engagement and retention. Moreover, effective communication strategies and targeted interventions are necessary to re-engage users who may become disengaged or inactive, thereby improving overall retention rates in the Language Learning App Market.
Impact of Covid-19 on the Language Learning App Market
The COVID-19 pandemic has had a profound impact on the Language Learni...
https://dataintelo.com/privacy-and-policy
The global instructor-led language training market size was valued at approximately USD 8 billion in 2023 and is projected to grow significantly, reaching nearly USD 12.5 billion by 2032, with a compound annual growth rate (CAGR) of around 5%. The growth of this market is being driven by several factors, including the increasing globalization of businesses, the rising demand for multilingual employees, and the growing emphasis on effective communication skills in both personal and professional settings. As the world becomes more interconnected, the ability to communicate in multiple languages is increasingly seen as a valuable asset, leading to a surge in demand for language training programs.
One of the primary growth factors for the instructor-led language training market is the globalization of businesses and the need for companies to operate effectively across different linguistic and cultural contexts. As companies expand their operations into new regions, the ability to communicate with local clients, partners, and employees becomes crucial. This has led to a growing demand for language training programs that can equip employees with the necessary language skills. Moreover, the rise of remote work and virtual teams has further emphasized the need for effective communication across diverse geographies, fueling the demand for language training.
Another significant factor contributing to the growth of this market is the increasing emphasis on personal development and lifelong learning. In a rapidly changing world, individuals are increasingly seeking to enhance their skills and knowledge to remain competitive in the job market. Language learning is seen as a key component of personal development, providing individuals with the ability to connect with different cultures and communities. As a result, there is a growing demand for language training programs that are tailored to individual learning needs and preferences, offering flexibility and convenience.
The rise of digital technology and the increasing availability of online learning platforms have also played a crucial role in the growth of the instructor-led language training market. While traditional in-person language classes remain popular, virtual language training programs have gained significant traction due to their convenience and accessibility. These programs allow learners to access high-quality language instruction from anywhere in the world, making language learning more accessible to a wider audience. The integration of technology in language training programs has also enabled the development of innovative teaching methodologies and interactive learning experiences, further driving the growth of this market.
In the context of globalization and the increasing need for multilingual communication, Study Abroad Training has emerged as a crucial component in language education. This type of training provides learners with immersive experiences in foreign countries, allowing them to practice language skills in real-world settings while gaining cultural insights. Study Abroad Training not only enhances language proficiency but also broadens learners' perspectives, making them more adaptable and culturally aware. As more students and professionals seek international exposure, the demand for Study Abroad Training is expected to rise, contributing to the growth of the language training market. This trend highlights the importance of experiential learning in achieving language fluency and intercultural competence.
Regionally, the instructor-led language training market is experiencing significant growth across various parts of the world. North America and Europe are currently the largest markets for language training, driven by the presence of a large number of multinational companies and a strong emphasis on language education. However, the Asia Pacific region is expected to witness the highest growth during the forecast period, driven by the rapid economic development in countries like China and India and the increasing demand for English language proficiency. The growing importance of language skills in Latin America and the Middle East & Africa is also expected to contribute to the growth of the instructor-led language training market in these regions.
The instructor-led language training market is segmented by training type into in-person and virtual training. In-person training remains a traditio
Papua New Guinea is the most linguistically diverse country in the world. As of 2025, it was home to 840 different languages. Indonesia ranked second with 709 languages spoken. In the United States, 335 languages were spoken in that same year.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Finnish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Finnish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Finnish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Finnish speech models that understand and respond to authentic Finnish accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Finnish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Finnish speech and language AI applications:
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the French General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of French speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world French communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade French speech models that understand and respond to authentic French accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of French. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple French speech and language AI applications:
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Release Date: 17.01.22
Welcome to Common Phone 1.0
Legal Information
Common Phone is a subset of the Common Voice corpus collected by Mozilla Corporation. By using Common Phone, you agree to the Common Voice Legal Terms. Common Phone is maintained and distributed by speech researchers at the Pattern Recognition Lab of Friedrich-Alexander-University Erlangen-Nuremberg (FAU) under the CC0 license.
Like for Common Voice, you must not make any attempt to identify speakers that contributed to Common Phone.
About Common Phone
This corpus aims to provide a basis for Machine Learning (ML) researchers and enthusiasts to train and test their models against a wide variety of speakers, hardware/software ecosystems and acoustic conditions to improve generalization and availability of ML in real-world speech applications.
The current version of Common Phone comprises 116.5 hours of speech samples, collected from 11,246 speakers in 6 languages:
| Language | Speakers (train / dev / test) | Hours (train / dev / test) |
|---|---|---|
| English | 4716 / 771 / 774 | 14.1 / 2.3 / 2.3 |
| French | 796 / 138 / 135 | 13.6 / 2.3 / 2.2 |
| German | 1176 / 202 / 206 | 14.5 / 2.5 / 2.6 |
| Italian | 1031 / 176 / 178 | 14.6 / 2.5 / 2.5 |
| Spanish | 508 / 88 / 91 | 16.5 / 3.0 / 3.1 |
| Russian | 190 / 34 / 36 | 12.7 / 2.6 / 2.8 |
| Total | 8417 / 1409 / 1420 | 85.8 / 15.2 / 15.5 |
The presented train, dev and test splits are not identical to those shipped with Common Voice. Speaker separation among splits was realized by only using those speakers that had provided age and gender information. This information can only be provided as a registered user on the website. When logged in, the session ID of contributed recordings is always linked to your user, thus we could easily link recordings to individual speakers. Keep in mind this would not be possible for unregistered users, as their session ID changes if they decide to contribute more than once.
During speaker selection, we considered that some speakers had contributed to more than one of the six Common Voice datasets (one for each language). In Common Phone, a speaker will only appear in one language.
The dataset is structured as follows:
Where does the phonetic annotation come from?
Phonetic annotation was computed via BAS Web Services. We used the regular Pipeline (G2P-MAUS) without ASR to create an alignment of text transcripts with audio signals. We chose International Phonetic Alphabet (IPA) output symbols as they work well even in a multi-lingual setup. Common Phone annotation comprises 101 phonetic symbols, including silence.
Why Common Phone?
Is there any publication available?
Yes, a paper describing Common Phone in detail is currently under revision for LREC 2022. You can access a pre-print version on arXiv entitled “Common Phone: A Multilingual Dataset for Robust Acoustic Modelling”.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Card for "x-fact"
Dataset Description
Dataset Summary
X-FACT is a multilingual dataset for fact-checking with real-world claims. The dataset contains short statements in 25 languages, each with the top five evidence documents retrieved by performing a Google search with the claim statement. The dataset contains two additional evaluation splits (in addition to a traditional test set): ood and zeroshot. ood measures out-of-domain generalization where while the language… See the full description on the dataset page: https://huggingface.co/datasets/utahnlp/x-fact.
Using data from reports such as the EF English Proficiency Index (EF EPI) from Education First, one can see the significant impact of culture, education and globalization on the ability of citizens of different countries to speak English.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mandarin Chinese General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Mandarin speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mandarin Chinese communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Mandarin speech models that understand and respond to authentic Chinese accents and dialects.
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mandarin Chinese. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
The dataset comes with granular metadata for both speakers and recordings:
Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
This dataset is a versatile resource for multiple Mandarin speech and language AI applications:
Language Services Market Size 2025-2029
The language services market size is forecast to increase by USD 26.18 billion at a CAGR of 7.2% between 2024 and 2029.
The market is experiencing significant growth, driven by the increasing adoption of technology to enhance the translation process and improve efficiency. Companies are recognizing the value of transcreation, which goes beyond simple translation to ensure cultural appropriateness and localization of content. However, the market faces challenges that require careful navigation. Regulatory hurdles impact adoption, particularly in industries with stringent compliance requirements. Supply chain inconsistencies also temper growth potential, as companies seek to maintain quality and meet tight deadlines. To capitalize on market opportunities and navigate these challenges effectively, companies must invest in building a team of skilled professionals with expertise in language services and cultural nuances. One key factor driving this growth is the increasing use of machine learning and AI-driven tools, which allow for faster processing and more reliable translations. Social networking sites have become essential platforms for businesses to engage with diverse audiences, necessitating translation services.
Strategic partnerships and technology integration can also help streamline operations and improve overall competitiveness. In summary, the market is poised for growth, with technology adoption and transcreation driving demand. However, regulatory hurdles and supply chain inconsistencies present challenges that require a strategic approach to talent acquisition and operational efficiency. Companies that invest in building a strong team and leveraging technology will be best positioned to capitalize on market opportunities and maintain a competitive edge.
What will be the Size of the Language Services Market during the forecast period?
In the dynamic global marketing landscape, effective language services have become essential for businesses aiming to expand their reach. The market encompasses various offerings, including language analysis, cultural analysis, language engineering, and linguistic analysis, among others. These services enable businesses to navigate language and cultural nuances, ensuring accurate language data and user interface localization. Language automation and integration have gained significant traction, streamlining translation workflows and language testing processes. Language certification and audiovisual translation are also crucial components, ensuring language and technology alignment for global customer support and multilingual marketing initiatives. Language analytics and localization APIs play a pivotal role in managing language data and optimizing localization workflows. Localization services cater to IT and telecommunications, BFSI, and artificial intelligence (AI) driven approaches in the global marketplace.
Content localization, game localization, and multilingual customer support are essential for businesses seeking to engage with diverse audiences. Language management systems and translation management platforms facilitate efficient language and culture adaptation, while language acquisition and global communication are ongoing priorities for businesses aiming to expand their market presence. Language segmentation and translation workflows are continuously evolving, with a focus on improving user experience and fostering effective business interactions across borders. Ultimately, language and technology convergence is driving innovation and growth in the market. AI, machine learning, and an AI-driven approach enhance translation and interpretation services for e-commerce companies, product descriptions, user interfaces, customer engagement, and industries like automotive and collaborative platforms.
How is this Language Services Industry segmented?
The language services industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Service: Translation, Interpretation, Others
End-user: Healthcare, ICT, BFSI, Government, Others
Learning Method: Online, Offline
Geography: North America (US, Canada); Europe (France, Germany, Italy, UK); APAC (China, India, Japan, South Korea); Rest of World (ROW)
By Service Insights
The translation segment is estimated to witness significant growth during the forecast period. The market is experiencing significant growth due to the increasing globalization of businesses and the prevalence of digital communication. This trend is driven by the need for companies to connect with diverse audiences in multiple languages, leading to an increased demand for
https://www.ibisworld.com/about/termsofuse/
Translation services have enjoyed solid growth through the past five years as domestic clients looked to diversify revenue streams by capturing demand in new foreign markets. Tensions abroad have highlighted the need for interpreters to help government officials broker trade deals and diplomatic agreements. The US's relationship with China has continued to deteriorate, boosting revenue for translators working in various East Asian languages. Overall, industry revenue has climbed at a CAGR of 1.1% to an estimated $8.4 billion over the five years through 2024. Advancing technology has distinctly impacted the industry, though real human translators remain the most accurate option for downstream clients. Despite translation software's efficiency and accessibility, well-resourced companies and governments still prefer the culturally nuanced and highly accurate language skills translation services provide. However, the rise of technologies like neural machine translation (NMT) and other real-time translation software threatens the industry. Still, the industry has adapted by integrating machine-assisted translation tools to speed up delivery, cut labor costs and boost profit by enabling their staff to edit software-translated texts instead of starting from scratch. Translation services are expected to see slightly stronger growth through the next period as tensions with China and its allies keep demand strong even amid declining defense spending. Recent changes to the Affordable Care Act will directly boost the presence of translators in the healthcare sector, bringing solid growth from one of the industry's top markets. Persisting immigration rates and rising globalization will keep legal document translation and webpage localization vital to the success of many companies, and shifting trade policies will make translators essential to government officials looking to consolidate trading partners. 
Ultimately, industry revenue is set to climb at a CAGR of 1.4% to an estimated $9.0 billion through the end of 2029.
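The growth figures above can be sanity-checked with the standard compound annual growth rate relation, end = start × (1 + rate)^years. The sketch below assumes the $8.4 billion estimate is the 2024 base and that the forecast covers five years (2024 to 2029); the function names are illustrative only.

```python
def project(start, rate, years):
    """Compound `start` at the annual `rate` for `years` years."""
    return start * (1 + rate) ** years


def cagr(start, end, years):
    """Annual growth rate that takes `start` to `end` over `years` years."""
    return (end / start) ** (1 / years) - 1


# Assumed inputs read from the text: $8.4bn in 2024, 1.4% CAGR, 5 years.
projected_2029 = project(8.4, 0.014, 5)
implied_rate = cagr(8.4, 9.0, 5)

print(f"Projected 2029 revenue: ${projected_2029:.2f} billion")
print(f"Implied CAGR from $8.4bn to $9.0bn: {implied_rate:.2%}")
```

Under these assumptions the projection comes out at roughly $9.0 billion, which is consistent with the figure quoted in the text.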