100+ datasets found
  1. The most spoken languages worldwide 2025

    • statista.com
    Updated Apr 14, 2025
    Cite
    Statista (2025). The most spoken languages worldwide 2025 [Dataset]. https://www.statista.com/statistics/266808/the-most-spoken-languages-worldwide/
    Dataset updated
    Apr 14, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2025
    Area covered
    World
    Description

    In 2025, around 1.53 billion people worldwide spoke English either natively or as a second language, more than the 1.18 billion Mandarin Chinese speakers at the time of the survey. Hindi and Spanish were the third and fourth most widely spoken languages that year.

    Languages in the United States

    The United States does not have an official language, but the country uses English, specifically American English, for legislation, regulation, and other official pronouncements. The United States is a land of immigration, and the languages spoken there vary as a result of the multicultural population. The second most common language spoken in the United States is Spanish or Spanish Creole, which more than 43 million people spoke at home in 2023. There were also 3.5 million Chinese speakers (including both Mandarin and Cantonese), 1.8 million Tagalog speakers, and 1.57 million Vietnamese speakers counted in the United States that year.

    Different languages at home

    The percentage of people in the United States speaking a language other than English at home varies from state to state. The state with the highest percentage of its population speaking a language other than English is California, where about 45 percent of residents spoke a language other than English at home in 2023.

  2. Top Languages Spoken in the United States

    • kaggle.com
    Updated Oct 22, 2022
    Cite
    The Devastator (2022). Top Languages Spoken in the United States [Dataset]. https://www.kaggle.com/datasets/thedevastator/top-languages-spoken-in-the-united-states/code
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 22, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    Area covered
    United States
    Description

    Top Languages Spoken in the United States

    The Impact of Linguistics on Community and Business in America

    About this dataset

    Languages are an important part of daily life in the USA. Here is a table that shows the most common languages spoken in the USA, as well as a big spreadsheet which shows each CBSA (Core-Based Statistical Area, or urban area).

    Language usage varies widely throughout the United States. According to the latest census data, over 350 different languages are represented in homes across the country. The following table and spreadsheet provide more detailed information on language usage throughout the various states and cities in the US:

    Columns:

    - index: Index column for dataframe
    - Table with column headers in row 5 and row headers in column A: Contains language data for each CBSA (Core Based Statistical Area)
    - Unnamed: 1: Rank of CBSA by total number of speakers of all languages
    - Unnamed: 2: Name of CBSA
    - Unnamed: 3: Population of CBSA
    - Unnamed: 4: Percent of population that speaks English very well
    - Unnamed: 5 through Unnamed: 58: Languages spoken by at least 0.1% of the population, with corresponding percentages

    How to use the dataset

    1. This dataset can be used to understand the linguistic diversity of the United States, and to compare languages spoken across different states and cities.
    2. This data can also be used to explore trends in language usage over time.
    3. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and tailor their marketing or customer service accordingly.
    4. Schools could use this dataset to plan language-learning programs based on the needs of their community.
    5. Policymakers could use this data to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

    Research Ideas

    1. Businesses can use this dataset to identify which languages are most commonly spoken in the areas in which they operate and cater their marketing or customer service accordingly.
    2. Schools could use this data to plan language-learning programs based on the needs of their community.
    3. Policymakers could use this dataset to better understand linguistic diversity in the United States and design programs to support bilingualism or multilingualism.

    Acknowledgements

    This dataset was created by Gary Hoover. The data was sourced from https://www.kaggle.com/garyhoov/us-languages

    License

    Unknown License - Please check the dataset description for more information.

    Columns

    File: Languages Spoken at Home by Urban Area = CBSA.csv

    File: US Languages Spoken at Home 2014.csv

    | Column name | Description |
    |:-------------------------------------------------------------------|:--------------|
    | Table with column headers in row 5 and row headers in column A | |

  3. Population and Languages of the Limited English Proficient (LEP) Speakers by...

    • catalog.data.gov
    • data.cityofnewyork.us
    Updated Jan 19, 2024
    + more versions
    Cite
    data.cityofnewyork.us (2024). Population and Languages of the Limited English Proficient (LEP) Speakers by Community District [Dataset]. https://catalog.data.gov/dataset/population-and-languages-of-the-limited-english-proficient-lep-speakers-by-community-distr
    Dataset updated
    Jan 19, 2024
    Dataset provided by
    data.cityofnewyork.us
    Description

    Many residents of New York City speak more than one language; a number of them speak and understand non-English languages more fluently than English. This dataset, derived from the Census Bureau's American Community Survey (ACS), includes information on over 1.7 million limited English proficient (LEP) residents and a subset of that population called limited English proficient citizens of voting age (CVALEP) at the Community District level. There are 59 community districts throughout NYC, with each district being represented by a Community Board.

  4. Main language spoken - Dataset - Data Place Plymouth

    • plymouth.thedata.place
    Updated Oct 6, 2016
    + more versions
    Cite
    (2016). Main language spoken - Dataset - Data Place Plymouth [Dataset]. https://plymouth.thedata.place/dataset/main-language-spoken-detailed-plymouth
    Dataset updated
    Oct 6, 2016
    License

    Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
    License information was derived automatically

    Area covered
    Plymouth
    Description

    Data showing the main languages spoken in Plymouth by population numbers.

  5. Caribbean Netherlands; Spoken languages and main language, characteristics

    • cloud.csiss.gmu.edu
    • ckan.mobidatalab.eu
    • +3 more
    Updated Jul 10, 2019
    + more versions
    Cite
    Netherlands (2019). Caribbean Netherlands; Spoken languages and main language, characteristics [Dataset]. https://cloud.csiss.gmu.edu/uddi/dataset/60050-caribbean-netherlands-spoken-languages-and-main-language-characteristics
    Explore at:
    Available download formats: http://publications.europa.eu/resource/authority/file-type/atom, http://publications.europa.eu/resource/authority/file-type/json
    Dataset updated
    Jul 10, 2019
    Dataset provided by
    Netherlands
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Caribbean Netherlands
    Description

    This table provides data on languages spoken by the population of the Caribbean Netherlands aged 15 years and older in private households. Breakdowns by sex, age and level of education are presented. These aspects are shown for the Caribbean Netherlands and also for the islands Bonaire, St Eustatius and Saba separately. The research is a sample survey. This means that the figures shown are estimates for which reliability margins apply. These margins are also included in the table. The Omnibus survey was carried out for the first time on Bonaire, Saba and St. Eustatius in 2013 during the month of June and the first week of July. For the second time the Omnibus survey was carried out on Bonaire during the months of October and November 2017, and on Saba and St. Eustatius in the period January to March 2018.

    Data available from: 2013

    Status of the figures: The figures in this table are final.

    Changes as of 4 April 2019: None, this is a new table.

    When will new figures be published? New data will be published every four years.

  6. Data from: Language Spoken at Home

    • linc.osbm.nc.gov
    • ncosbm.opendatasoft.com
    csv, excel, geojson +1
    Updated Oct 3, 2024
    Cite
    (2024). Language Spoken at Home [Dataset]. https://linc.osbm.nc.gov/explore/dataset/language-spoken-at-home/
    Explore at:
    Available download formats: geojson, csv, json, excel
    Dataset updated
    Oct 3, 2024
    Description

    Language spoken at home and the ability to speak English for the population age 5 and over, as reported in the US Census Bureau's American Community Survey (ACS) 5-year estimates, table C16001.

  7. LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER (C16001)

    • catalog.data.gov
    Updated Jan 31, 2025
    + more versions
    Cite
    City of Seattle ArcGIS Online (2025). LANGUAGE SPOKEN AT HOME FOR THE POPULATION 5 YEARS AND OVER (C16001) [Dataset]. https://catalog.data.gov/dataset/language-spoken-at-home-for-the-population-5-years-and-over-c16001
    Dataset updated
    Jan 31, 2025
    Dataset provided by
    https://arcgis.com/
    Description

    Table from the American Community Survey (ACS) C16001 of language spoken at home for the population 5 years and over. These are multiple, nonoverlapping vintages of the 5-year ACS estimates of population and housing attributes starting in 2010 shown by the corresponding census tract vintage. Also includes the most recent release annually.

    King County, Washington census tracts with nonoverlapping vintages of the 5-year American Community Survey (ACS) estimates starting in 2010. Vintage identified in the "ACS Vintage" field.

    The census tract boundaries match the vintage of the ACS data (currently 2010 and 2020) so please note the geographic changes between the decades. Tracts have been coded as being within the City of Seattle as well as assigned to neighborhood groups called "Community Reporting Areas". These areas were created after the 2000 census to provide geographically consistent neighborhoods through time for reporting U.S. Census Bureau data. This is not an attempt to identify neighborhood boundaries as defined by neighborhoods themselves.

    Vintages: 2010, 2015, 2020, 2021, 2022, 2023
    ACS Table(s): C16001
    Data downloaded from: https://data.census.gov/

  8. UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages

    • scidb.cn
    Updated Mar 26, 2025
    Cite
    Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia (2025). UGSpeechData: A Multilingual Speech Dataset of Ghanaian Languages [Dataset]. http://doi.org/10.57760/sciencedb.22298
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 26, 2025
    Dataset provided by
    Science Data Bank
    Authors
    Wiafe, Isaac; Abdulai, Jamal-Deen; Ekpezu, Akon Obu; Helegah, Raynard Dodzi; Atsakpo, Elikem Doe; Nutrokpor, Charles; Winful, Fiifi Baffoe Payin; Solaga, Kafui Kwashia
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Area covered
    Ghana
    Description

    UGSpeechData is a collection of audio speech data in Akan, Ewe, Dagaare, Dagbani, and Ikposo. These languages are among the most spoken languages in Ghana. The uploaded dataset contains a total of 970,148 audio files (5,384.28 hours) and 93,262 transcribed audio files (518 hours). The audio files are descriptions of 1,000 culturally relevant images collected from indigenous speakers of each of the languages. Each audio file is between 15 and 30 seconds long. More specifically, the dataset contains five subfolders, one for each of the five languages. Each language has at least 1,000 hours of speech data and 100 hours of transcribed speech data. Fig. 1 provides details of the transcribed audio corpus, including gender and recording environments for each language.

    Fig. 1. Details of transcribed audio files
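
As a quick consistency check on the figures quoted in this description, the total hours and file counts can be converted into a mean clip length and compared with the stated 15 to 30 second range. This is only a sketch using the numbers from the description; the function name `mean_clip_seconds` is illustrative and not part of the dataset.

```python
def mean_clip_seconds(total_hours: float, n_files: int) -> float:
    """Average clip length in seconds, given total duration and file count."""
    return total_hours * 3600 / n_files


# Figures taken from the dataset description.
all_audio = mean_clip_seconds(5384.28, 970148)   # full corpus
transcribed = mean_clip_seconds(518, 93262)      # transcribed subset

for label, seconds in [("all audio", all_audio), ("transcribed", transcribed)]:
    in_range = 15 <= seconds <= 30
    print(f"{label}: {seconds:.1f} s per file, within 15-30 s: {in_range}")
```

Both averages come out at roughly 20 seconds, which is consistent with the clip-length range given in the description.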

  9. 785 Million Language Translation Database for AI

    • kaggle.com
    Updated Aug 28, 2023
    Cite
    Ramakrishnan Lakshmanan (2023). 785 Million Language Translation Database for AI [Dataset]. https://www.kaggle.com/datasets/ramakrishnan1984/785-million-language-translation-database-ai-ml
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ramakrishnan Lakshmanan
    License

    http://www.gnu.org/licenses/lgpl-3.0.html

    Description

    Our groundbreaking translation dataset represents a monumental advancement in the field of natural language processing and machine translation. Comprising a staggering 785 million records, this corpus bridges language barriers by offering translations from English to an astonishing 548 languages. The dataset promises to be a cornerstone resource for researchers, engineers, and developers seeking to enhance their machine translation models, cross-lingual analysis, and linguistic investigations.

    Size of the dataset: 41 GB uncompressed, 20 GB compressed

    Key Features:

    Scope and Scale: With a comprehensive collection of 785 million records, this dataset provides an unparalleled wealth of translated text. Each record consists of an English sentence paired with its translation in one of the 548 target languages, enabling multi-directional translation applications.

    Language Diversity: Encompassing translations into 548 languages, this dataset represents a diverse array of linguistic families, dialects, and scripts. From widely spoken languages to those with limited digital representation, the dataset bridges communication gaps on a global scale.

    Quality and Authenticity: The translations have been meticulously curated, verified, and cross-referenced to ensure high quality and authenticity. This attention to detail guarantees that the dataset is not only extensive but also reliable, serving as a solid foundation for machine learning applications. The data was collected from various open datasets for my personal ML projects, and I am looking to share it with the team.

    Use Case Versatility: Researchers and practitioners across a spectrum of domains can harness this dataset for a myriad of applications. It facilitates the training and evaluation of machine translation models, empowers cross-lingual sentiment analysis, aids in linguistic typology studies, and supports cultural and sociolinguistic investigations.

    Machine Learning Advancement: Machine translation models, especially neural machine translation (NMT) systems, can leverage this dataset to enhance their training. The large-scale nature of the dataset allows for more robust and contextually accurate translation outputs.

    Fine-tuning and Customization: Developers can fine-tune translation models using specific language pairs, offering a powerful tool for specialized translation tasks. This customization capability ensures that the dataset is adaptable to various industries and use cases.

    Data Format: The dataset is provided in a structured JSON format, facilitating easy integration into existing machine learning pipelines. This structured approach expedites research and experimentation. Each JSON record contains an English word and its equivalent word in the target language. The data was exported from a MongoDB database to ensure the uniqueness of the records. Each record is unique, and the records are sorted.

    Access: The dataset is available for academic and research purposes, enabling the global AI community to contribute to and benefit from its usage. A well-documented API and sample code are provided to expedite exploration and integration.

    The English-to-548-languages translation dataset represents an incredible leap forward in advancing multilingual communication, breaking down barriers to understanding, and fostering collaboration on a global scale. It holds the potential to reshape how we approach cross-lingual communication, linguistic studies, and the development of cutting-edge translation technologies.

    Dataset Composition: The dataset is a culmination of translations from English, a widely spoken and understood language, into 548 distinct languages. Each language represents a unique linguistic and cultural background, providing a rich array of translation contexts. This diverse range of languages spans across various language families, regions, and linguistic complexities, making the dataset a comprehensive repository for linguistic research.

    Data Volume and Scale: With a staggering 785 million records, the dataset boasts an immense scale that captures a vast array of translations and linguistic nuances. Each translation entry consists of an English source text paired with its corresponding translation in one of the 548 target languages. This vast corpus allows researchers and practitioners to explore patterns, trends, and variations across languages, enabling the development of robust and adaptable translation models.

    Linguistic Coverage: The dataset covers an extensive set of languages, including but not limited to Indo-European, Afroasiatic, Sino-Tibetan, Austronesian, Niger-Congo, and many more. This broad linguistic coverage ensures that languages with varying levels of grammatical complexity, vocabulary richness, and syntactic structures are included, enhancing the applicability of translation models across diverse linguistic landscapes.

    Dataset Preparation: The translation ...

  10. Data from: LaFresCat: a Catalan multi-accent speech dataset for...

    • zenodo.org
    application/gzip, txt
    Updated Feb 18, 2025
    + more versions
    Cite
    Zenodo (2025). LaFresCat: a Catalan multi-accent speech dataset for text-to-speech [Dataset]. http://doi.org/10.21437/iberspeech.2024-42
    Explore at:
    Available download formats: txt, application/gzip
    Dataset updated
    Feb 18, 2025
    Dataset provided by
    Zenodo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    LaFresCat Multiaccent

    We present LaFresCat, the first Catalan multiaccented and multispeaker dataset.

    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Commercial use is only possible through licensing by the voice artists. For further information, contact langtech@bsc.es and lafrescaproduccions@gmail.com.

    Dataset Details

    Dataset Description

    The audio in this dataset comes from professional studio recordings made by professional voice actors at Lafresca Creative Studio. This is the raw version of the dataset; no resampling or trimming has been applied to the audio. Audio files are stored in WAV format at a 48 kHz sampling rate.

    In total, there are 4 different accents, with 2 speakers per accent (female and male). After trimming, the audio totals 3.75 h, divided by speaker ID as follows:

    • Balear

      • olga -> 23.5 min
      • quim -> 30.93 min
    • Central

      • elia -> 33.14 min
      • grau -> 37.86 min
    • Occidental (North-Western)

      • emma -> 28.67 min
      • pere -> 25.12 min
    • Valencia

      • gina -> 22.25 min
      • lluc -> 23.58 min
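
The per-speaker durations listed above can be summed to confirm the stated total of roughly 3.75 hours. The sketch below simply re-enters the listed minutes by hand; the dictionary name `MINUTES` is illustrative and not part of the dataset.

```python
# Per-speaker durations in minutes, copied from the list above.
MINUTES = {
    "Balear": {"olga": 23.5, "quim": 30.93},
    "Central": {"elia": 33.14, "grau": 37.86},
    "Occidental": {"emma": 28.67, "pere": 25.12},
    "Valencia": {"gina": 22.25, "lluc": 23.58},
}

# Total per accent, then overall total converted to hours.
per_accent = {accent: sum(speakers.values()) for accent, speakers in MINUTES.items()}
total_minutes = sum(per_accent.values())
total_hours = total_minutes / 60

for accent, minutes in per_accent.items():
    print(f"{accent}: {minutes:.2f} min")
print(f"Total: {total_minutes:.2f} min = {total_hours:.2f} h")
```

The per-speaker figures add up to about 225 minutes, which matches the quoted total once rounded to two decimals.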

    Uses

    The purpose of this dataset is mainly for training text-to-speech and automatic speech recognition models in Catalan accents.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    The dataset consists of 2858 audios and transcriptions in the following structure:

    lafresca_multiaccent_raw
    ├── balear
    │   ├── olga
    │   ├── olga.txt
    │   ├── quim
    │   └── quim.txt
    ├── central
    │   ├── elia
    │   ├── elia.txt
    │   ├── grau
    │   └── grau.txt
    ├── full_filelist.txt
    ├── occidental
    │   ├── emma
    │   ├── emma.txt
    │   ├── pere
    │   └── pere.txt
    └── valencia
        ├── gina
        ├── gina.txt
        ├── lluc
        └── lluc.txt

    Metadata for the dataset can be found in the file `full_filelist.txt`. Each line represents one audio file and follows the format:

    audio_path | speaker_id | transcription

    The speaker ids have the following mapping:

    "quim": 0,
    "olga": 1,
    "grau": 2,
    "elia": 3,
    "pere": 4,
    "emma": 5,
    "lluc": 6,
    "gina": 7

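A line of `full_filelist.txt` could be parsed along the following lines. This is a minimal sketch under several assumptions: that the speaker field holds the numeric ID from the mapping above, that fields are separated by a pipe with optional surrounding spaces, and that the sample line (path and sentence) is invented purely for illustration. The names `Utterance` and `parse_filelist_line` are hypothetical helpers, not part of the dataset.

```python
from typing import NamedTuple

# Speaker-name to ID mapping, copied from the dataset description.
SPEAKER_IDS = {
    "quim": 0, "olga": 1, "grau": 2, "elia": 3,
    "pere": 4, "emma": 5, "lluc": 6, "gina": 7,
}
ID_TO_SPEAKER = {idx: name for name, idx in SPEAKER_IDS.items()}


class Utterance(NamedTuple):
    audio_path: str
    speaker: str
    transcription: str


def parse_filelist_line(line: str) -> Utterance:
    """Parse one 'audio_path | speaker_id | transcription' line."""
    # Split on the first two pipes only, so a pipe inside the text survives.
    audio_path, speaker_id, transcription = (
        part.strip() for part in line.split("|", 2)
    )
    return Utterance(audio_path, ID_TO_SPEAKER[int(speaker_id)], transcription)


# Hypothetical example line; the real paths and sentences will differ.
sample = "central/elia/elia_0001.wav | 3 | Bon dia a tothom."
utt = parse_filelist_line(sample)
print(utt)
```

Limiting the split to two pipes keeps the transcription intact even if it happens to contain the separator character.
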
    Dataset Creation

    This dataset has been created by members of the Language Technologies unit from the Life Sciences department of the Barcelona Supercomputing Center, except for the Valencian sentences, which were created with the support of Cenid, the Digital Intelligence Center of the University of Alicante. The voices belong to professional voice actors and were recorded at Lafresca Creative Studio.

    Source Data

    The data presented in this dataset is the source data.

    Data Collection and Processing

    These are the technical details of the data collection and processing:

    • Microphone: Austrian Audio oc818

    • Preamp: Focusrite ISA Two

    • Audio Interface: Antelope Orion 32+

    • DAW: ProTools 2023.6.0

    Processing:

    • Noise Gate: C1 Gate

    • Compression: BF-76

    • De-Esser: Renaissance

    • EQ: Maag EQ2

    • EQ: FabFilter Pro-Q3

    • Limiter: L1 Ultramaximizer

    Here's the information about the speakers:

    Dialect      Gender   County
    Central      male     Barcelonès
    Central      female   Barcelonès
    Balear       female   Pla de Mallorca
    Balear       male     Llevant
    Occidental   male     Baix Ebre
    Occidental   female   Baix Ebre
    Valencian    female   Ribera Alta
    Valencian    male     La Plana Baixa

    Who are the source data producers?

    The Language Technologies team from the Life Sciences department at the Barcelona Supercomputing Center developed this dataset. It features recordings by professional voice actors made at Lafresca Creative Studio.

    Annotations

    To check for errors in the transcriptions of the audio, we created a Label Studio space. In that space, we manually listened to a subset of the dataset and compared what we heard with the transcription. If the transcription was mistaken, we corrected it.

    Personal and Sensitive Information

    The dataset consists of professional voice actors who have recorded their voice. You agree to not attempt to determine the identity of speakers in this dataset.

    Bias, Risks, and Limitations

    Training a Text-to-Speech (TTS) model by fine-tuning with a Catalan speaker who speaks a particular dialect presents significant limitations. Mostly, the challenge is in capturing the full range of variability inherent in that accent. Each dialect has its own unique phonetic, intonational, and prosodic characteristics that can vary greatly even within a single linguistic region. Consequently, a TTS model trained on a narrow dialect sample will struggle to generalize across different accents and sub-dialects, leading to reduced accuracy and naturalness. Additionally, achieving a standard representation is exceedingly difficult because linguistic features can differ markedly not only between dialects but also among individual speakers within the same dialect group. These variations encompass subtle nuances in pronunciation, rhythm, and speech patterns that are challenging to standardize in a model trained on a limited dataset.

    Funding

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project. In addition, the Valencian sentences were created within the framework of the NEL-VIVES project 2022/TL22/00215334.

    Dataset Card Contact

    langtech@bsc.es

  11. ACS Population Characteristics: Spoken Languages

    • dcra-cdo-dcced.opendata.arcgis.com
    • gis.data.alaska.gov
    • +2 more
    Updated Sep 4, 2019
    Cite
    Dept. of Commerce, Community, & Economic Development (2019). ACS Population Characteristics: Spoken Languages [Dataset]. https://dcra-cdo-dcced.opendata.arcgis.com/datasets/acs-population-characteristics-spoken-languages
    Dataset updated
    Sep 4, 2019
    Dataset authored and provided by
    Dept. of Commerce, Community, & Economic Development
    Description

    Counts and breakdown of languages used, with margins of error, for Alaskan Communities/Places, aggregated at Borough/CDA and State level, for recent 5-year American Community Survey (ACS) intervals. The 5-year interval data sets are published approximately 1/2 a period later than the End Year listed; for instance, the interval ending in 2019 is published in mid-2021.

    Source: US Census Bureau, American Community Survey

    This data has been visualized in a Geographic Information Systems (GIS) format and is provided as a service in the DCRA Information Portal by the Alaska Department of Commerce, Community, and Economic Development Division of Community and Regional Affairs (SOA DCCED DCRA), Research and Analysis section. SOA DCCED DCRA Research and Analysis is not the authoritative source for this data. For more information and for questions about this data, see: US Census - Language Use

    USE CONSTRAINTS: The Alaska Department of Commerce, Community, and Economic Development (DCCED) provides the data in this application as a service to the public. DCCED makes no warranty, representation, or guarantee as to the content, accuracy, timeliness, or completeness of any of the data provided on this site. DCCED shall not be liable to the user for damages of any kind arising out of the use of data or information provided. DCCED is not the authoritative source for American Community Survey data, and any data or information provided by DCCED is provided "as is". Data or information provided by DCCED shall be used and relied upon only at the user's sole risk.

  12. SLURP Dataset

    • paperswithcode.com
    Updated Apr 12, 2023
    Cite
    Emanuele Bastianelli; Andrea Vanzo; Pawel Swietojanski; Verena Rieser (2023). SLURP Dataset [Dataset]. https://paperswithcode.com/dataset/slurp
    Dataset updated
    Apr 12, 2023
    Authors
    Emanuele Bastianelli; Andrea Vanzo; Pawel Swietojanski; Verena Rieser
    Description

    A new challenging dataset in English spanning 18 domains, which is substantially bigger and linguistically more diverse than existing datasets.

  13. Top Languages Spoken in London Boroughs and MSOAs

    • data.europa.eu
    unknown
    Updated Jul 19, 2021
    Cite
    census2011@london.gov.uk (2021). Top Languages Spoken in London Boroughs and MSOAs [Dataset]. https://data.europa.eu/data/datasets/top-languages-spoken-in-london-boroughs-and-msoas?locale=ga
    Explore at:
    Available download formats: unknown
    Dataset updated
    Jul 19, 2021
    Dataset authored and provided by
    census2011@london.gov.uk
    Area covered
    London
    Description

    This dataset shows the most spoken languages by borough and MSOAs in London. It provides numbers of the population aged 3+ who speak specified languages as their main language.

    Main language is from 2011 Census (detailed) - Census table QS204EW.

    This data is presented alongside Annual Population Survey (APS) data showing the top nationalities of residents in January - December 2019 by borough. The top 3 non-British nationalities are at the far right of the table. This is to highlight areas which may now have other common non-British languages spoken compared to 2011 (the year in which the Census information was gathered). The top non-British nationalities in 2019, which did not feature in 2011 as one of the most spoken non-British languages, are highlighted in column AD.

    The APS has a sample of around 320,000 people in the UK (around 28,000 in London). As such all figures must be treated with some caution. Estimates for non-British nationalities at borough level that are below 10,000 are considered too small to be reliable and should be treated with additional caution.

    MSOA codes have now been linked to House of Commons MSOA names.

  14. Data from: Language structure is influenced by the number of speakers but...

    • datadryad.org
    • data.niaid.nih.gov
    zip
    Updated Jan 31, 2019
    Cite
    Alexander Koplenig (2019). Language structure is influenced by the number of speakers but seemingly not by the proportion of non-native speakers [Dataset]. http://doi.org/10.5061/dryad.g0m3b82
    Explore at:
    Available download formats: zip
    Dataset updated
    Jan 31, 2019
    Dataset provided by
    Dryad
    Authors
    Alexander Koplenig
    Time period covered
    2019
    Description

    Data table to reproduce all results: data_table.xlsx
    Stata_files: Stata 14 code to reproduce all results

  15. Languages spoken by tract, ACS

    • hub.arcgis.com
    • massachsuetts-environmental-justice-datasets-mass-eoeea.hub.arcgis.com
    Updated May 19, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    MA Executive Office of Energy and Environmental Affairs (2021). Languages spoken by tract, ACS [Dataset]. https://hub.arcgis.com/datasets/Mass-EOEEA::languages-spoken-by-tract-acs-/about
    Explore at:
    Dataset updated
    May 19, 2021
    Dataset provided by
    Massachusetts Executive Office of Energy and Environmental Affairs
    Authors
    MA Executive Office of Energy and Environmental Affairs
    Area covered
    Description

    The American Community Survey Table B16001 provided detailed individual-level language estimates at the tract level for 42 non-English language categories, tabulated by English-speaking ability. Two sets of language data are included here, with population counts and percentages for both: the tract population speaking languages other than English, regardless of English-speaking ability, identified by the language name; and the languages spoken other than English by the tract population who do not speak English 'very well', identified by the language name followed by "_Enw". The default pop-up for this service presents the second of these: languages spoken other than English by the tract population who do not speak English 'very well'.

    In part because of privacy concerns with the very small counts in some categories in Table B16001, the Census changed the American Community Survey estimates of the languages spoken by individuals. In 2016, the number of categories previously presented in Table B16001 was reduced to reflect the most commonly spoken languages, and several languages spoken in Massachusetts were grouped into generalized (i.e., "Other...") categories. Table B16001 has been renamed Table C16001 with these generalized categories. Therefore, the information presented in this datalayer is not current, and these data cannot be updated.

  16. Proportion of Population by Language Spoken Most Often at Home, Alberta...

    • ouvert.canada.ca
    • data.urbandatacentre.ca
    • +3 more
    csv, html, pdf
    Updated Jul 24, 2024
    Cite
    Government of Alberta (2024). Proportion of Population by Language Spoken Most Often at Home, Alberta Economic Regions [Dataset]. https://ouvert.canada.ca/data/dataset/8d334793-ff24-42bc-8692-0bb86b5211d2
    Explore at:
    Available download formats: csv, html, pdf
    Dataset updated
    Jul 24, 2024
    Dataset provided by
    Government of Alberta
    License

    Open Government Licence - Canada 2.0 (https://open.canada.ca/en/open-government-licence-canada)
    License information was derived automatically

    Time period covered
    Jun 10, 2006 - Jun 10, 2011
    Area covered
    Alberta
    Description

    This Alberta Official Statistic describes the proportion of the population by language spoken most often at home in each economic region, as reported in the 2011 population census. Alberta is divided into eight economic regions as follows: Lethbridge – Medicine Hat; Camrose – Drumheller; Calgary; Banff – Jasper – Rocky Mountain House; Red Deer; Edmonton; Athabasca – Grande Prairie – Peace River; and Wood Buffalo – Cold Lake.

  17. Primary language spoken by the Medicaid and CHIP population

    • catalog.data.gov
    • data.virginia.gov
    • +2 more
    Updated Feb 3, 2025
    + more versions
    Cite
    Centers for Medicare & Medicaid Services (2025). Primary language spoken by the Medicaid and CHIP population [Dataset]. https://catalog.data.gov/dataset/primary-language-spoken-by-the-medicaid-and-chip-population
    Explore at:
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    Centers for Medicare & Medicaid Services
    Description

    This data set includes annual counts and percentages of Medicaid and Children’s Health Insurance Program (CHIP) enrollees by primary language spoken (English, Spanish, and all other languages). Results are shown overall; by state; and by five subpopulation topics: race and ethnicity, age group, scope of Medicaid and CHIP benefits, urban or rural residence, and eligibility category. These results were generated using Transformed Medicaid Statistical Information System (T-MSIS) Analytic Files (TAF) Release 1 data and the Race/Ethnicity Imputation Companion File. This data set includes Medicaid and CHIP enrollees in all 50 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands who were enrolled for at least one day in the calendar year, except where otherwise noted. Enrollees in Guam, American Samoa, the Northern Mariana Islands, and select states with data quality issues with the primary language variable in TAF are not included. Results shown for the race and ethnicity subpopulation topic exclude enrollees in the U.S. Virgin Islands. Results shown overall (where subpopulation topic is "Total enrollees") exclude enrollees younger than age 5 and enrollees in the U.S. Virgin Islands. Results for states with TAF data quality issues in the year have a value of "Unusable data." Some rows in the data set have a value of "DS," which indicates that data were suppressed according to the Centers for Medicare & Medicaid Services’ Cell Suppression Policy for values between 1 and 10. This data set is based on the brief: "Primary language spoken by the Medicaid and CHIP population in 2020." Enrollees are assigned to a primary language category based on their reported ISO language code in TAF (English/missing, Spanish, and all other language codes) (Primary Language). 
Enrollees are assigned to a race and ethnicity subpopulation using the state-reported race and ethnicity information in TAF when it is available and of good quality; if it is missing or unreliable, race and ethnicity is indirectly estimated using an enhanced version of Bayesian Improved Surname Geocoding (BISG) (Race and ethnicity of the national Medicaid and CHIP population in 2020). Enrollees are assigned to an age group subpopulation using age as of December 31st of the calendar year. Enrollees are assigned to the comprehensive benefits or limited benefits subpopulation according to the criteria in the "Identifying Beneficiaries with Full-Scope, Comprehensive, and Limited Benefits in the TAF" DQ Atlas brief. Enrollees are assigned to an urban or rural subpopulation based on the 2010 Rural-Urban Commuting Area (RUCA) code associated with their home or mailing address ZIP code in TAF (Rural Medicaid and CHIP enrollees in 2020). Enrollees are assigned to an eligibility category subpopulation using their latest reported eligibility group code, CHIP code, and age in the calendar year. Please refer to the full brief for additional context about the methodology and detailed findings. Future updates to this data set will include more recent data years as the TAF data become available.
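The description above names two non-numeric cell values, "DS" (suppressed) and "Unusable data". A minimal Python sketch of how a reader might separate those markers from numeric values when loading the file is shown below. The function name `parse_cell`, the status labels, and the assumption that numeric cells may contain thousands separators are all illustrative and are not taken from the data set's documentation.

```python
def parse_cell(raw):
    """Split one raw cell into (value, status).

    "DS" and "Unusable data" are the markers named in the description;
    the status labels returned here are hypothetical.
    """
    text = raw.strip()
    if text == "DS":
        return None, "suppressed"   # value between 1 and 10, withheld
    if text == "Unusable data":
        return None, "unusable"     # state-year with data quality issues
    # Assumption: numeric cells may use commas as thousands separators.
    return float(text.replace(",", "")), "ok"
```

Keeping the status next to the value lets later steps tell a withheld small count apart from a state-year that should be dropped entirely, rather than collapsing both into a generic missing value.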

  18. Tamil (Tamizh) Wikipedia Text Dataset for NLP

    • kaggle.com
    Updated Nov 12, 2024
    Cite
    Younus_Mohamed (2024). Tamil (Tamizh) Wikipedia Text Dataset for NLP [Dataset]. http://doi.org/10.34740/kaggle/dsv/9884525
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Nov 12, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Younus_Mohamed
    License

    Attribution-ShareAlike 3.0 (CC BY-SA 3.0) (https://creativecommons.org/licenses/by-sa/3.0/)
    License information was derived automatically

    Description

    This dataset is part of a larger mission to transform Tamil into a high-resource language in the field of Natural Language Processing (NLP). As one of the oldest and most culturally rich languages, Tamil has a unique linguistic structure, yet it remains underrepresented in the NLP landscape. This dataset, extracted from Tamil Wikipedia, serves as a foundational resource to support Tamil language processing, text mining, and machine learning applications.

    What’s Included

    - Text Data: This dataset contains over 569,000 articles in raw text form, extracted from Tamil Wikipedia. The collection is ideal for language model training, word frequency analysis, and text mining.

    - Scripts and Processing Tools: Code snippets are provided for processing .bz2 compressed files, generating word counts, and handling data for NLP applications.
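
The dataset's own scripts are not reproduced in this listing. As a rough illustration of the kind of processing described above (reading a .bz2 compressed file and generating word counts), here is a minimal Python sketch. The function name `count_words` and the choice to treat any run of characters from the Tamil Unicode block as one word are assumptions, not the author's method.

```python
import bz2
import re
from collections import Counter

# Assumption: a "word" is a run of characters in the Tamil Unicode
# block (U+0B80 to U+0BFF). Real tokenization may need more care.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BFF]+")

def count_words(path, encoding="utf-8"):
    """Stream a .bz2-compressed text file and count Tamil-script tokens."""
    counts = Counter()
    # bz2.open in text mode decompresses lazily, line by line,
    # so the whole dump never has to fit in memory.
    with bz2.open(path, mode="rt", encoding=encoding) as handle:
        for line in handle:
            counts.update(TAMIL_WORD.findall(line))
    return counts
```

Calling `count_words(path).most_common(20)` would then list the twenty most frequent tokens in the file at `path`.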

    Why This Dataset?

    Although Tamil has a documented lexicon of over 100,000 words, only a fraction of these are actively used in everyday communication. The largest available Tamil treebank currently holds only 10,000 words, limiting the scope for training accurate language models. This dataset aims to bridge that gap by providing a robust, open-source corpus for researchers, developers, and linguists who want to work on Tamil language technologies.

    How You Can Use This Dataset

    - Language Modeling: Train or fine-tune models like BERT, GPT, or LSTM-based language models for Tamil.

    - Linguistic Research: Analyze Tamil morphology, syntax, and vocabulary usage.

    - Data Augmentation: Use the raw text to generate augmented data for multilingual NLP applications.

    - Word Embeddings and Semantic Analysis: Create embeddings for Tamil words, useful in multilingual setups or standalone applications.

    Let’s Collaborate!

    I believe that advancing Tamil in NLP cannot be a solo effort. Contributions in the form of additional data, annotations, or even new tools for Tamil language processing are welcome! By working together, we can make Tamil a truly high-resource language in NLP.

    License

    This dataset is based on content from Tamil Wikipedia and is shared under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). Proper attribution to Wikipedia is required when using this data.

  19. Population according to language, age and sex 1990-2017 - Datasets - This...

    • store.smartdatahub.io
    Updated Feb 12, 2019
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2019). Population according to language, age and sex 1990-2017 - Datasets - This service has been deprecated - please visit https://www.smartdatahub.io/ to access data. See the About page for details. // [Dataset]. https://store.smartdatahub.io/dataset/fi_statistics_finland_population_according_to_language_age_and_sex_1990_2017
    Explore at:
    Dataset updated
    Feb 12, 2019
    Description

    Population according to language, age and sex 1990-2017

  20. Census 21 - Main Language MSOA

    • data.leicester.gov.uk
    csv, excel, geojson +1
    Updated Aug 22, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2023). Census 21 - Main Language MSOA [Dataset]. https://data.leicester.gov.uk/explore/dataset/census-21-main-language-msoa/
    Explore at:
    Available download formats: json, geojson, excel, csv
    Dataset updated
    Aug 22, 2023
    License

    Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
    License information was derived automatically

    Description

    The census is undertaken by the Office for National Statistics every 10 years and gives us a picture of all the people and households in England and Wales. The most recent census took place in March of 2021.

    The census asks every household questions about the people who live there and the type of home they live in. In doing so, it helps to build a detailed snapshot of society. Information from the census helps the government and local authorities to plan and fund local services, such as education, doctors' surgeries and roads.

    Key census statistics for Leicester are published on the open data platform to make information accessible to local services, voluntary and community groups, and residents. There is also a dashboard published showcasing various datasets from the census, allowing users to view data for the MSOAs of Leicester and compare this with Leicester overall statistics.

    Further information about the census and full datasets can be found on the ONS website - https://www.ons.gov.uk/census/aboutcensus/censusproducts

    Main language

    This dataset provides Census 2021 estimates that classify usual residents in England and Wales by their main language. The estimates are as at Census Day, 21 March 2021.

    Main language is a person's first or preferred language. They may speak other languages as well. A main language is provided only for residents age 3 and above. Residents age below 3 years will appear as 'Does not apply'. Please note that some organisations exclude those below 3 years when calculating percentages for this variable.

    This dataset contains information for the MSOAs of Leicester City.
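
The note about excluding residents below 3 years changes the denominator of every percentage. A minimal Python sketch of that calculation follows; the function name `language_shares` and the dictionary layout (category label mapped to resident count) are hypothetical and do not reflect the dataset's actual column names.

```python
def language_shares(counts, exclude=("Does not apply",)):
    """Return each category's percentage share, leaving out excluded labels.

    `counts` maps a main-language category to a resident count.
    Categories listed in `exclude` are dropped from both the numerator
    and the denominator, as some organisations do for under-3s.
    """
    kept = {label: n for label, n in counts.items() if label not in exclude}
    base = sum(kept.values())
    if base == 0:
        return {}
    return {label: 100.0 * n / base for label, n in kept.items()}
```

Passing `exclude=()` keeps 'Does not apply' in the denominator instead, which is useful for checking which convention a published figure follows.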

The most spoken languages worldwide 2025

Explore at:
435 scholarly articles cite this dataset (View in Google Scholar)
