100+ datasets found
  1. MCB_languages_county

    • kaggle.com
    Updated Oct 1, 2019
    Cite
    Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 1, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Marisol Brewster
    Description

    Context

    This is a dataset I found online through the Google Dataset Search portal.

    Content

    The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

    The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

    The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

    These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.
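
    The API access mentioned above can be sketched as follows. This is an illustrative assumption, not taken from this dataset's documentation: the dataset path and the variable code below follow the Census Bureau's generic API pattern (ACS 5-year table B16001, language spoken at home), and the real endpoint and variables should be checked on the Census developers page.

```python
from urllib.parse import urlencode

# Hypothetical base path following the generic Census API pattern;
# verify the actual endpoint on the Census developers page.
BASE = "https://api.census.gov/data/2013/acs/acs5"

def build_acs_url(variables, state_fips, api_key=None):
    """Compose an ACS 5-year API query URL for all counties in a state."""
    params = {
        "get": ",".join(variables),       # variables to retrieve
        "for": "county:*",                # one row per county
        "in": f"state:{state_fips}",      # restrict to one state
    }
    if api_key:
        params["key"] = api_key
    return f"{BASE}?{urlencode(params)}"

# B16001 ("Language spoken at home") total estimate, California (FIPS 06):
url = build_acs_url(["NAME", "B16001_001E"], "06")
print(url)
```

Fetching the URL (e.g. with urllib.request) returns a JSON array whose first row is the header.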

    Acknowledgements

    Sources:

    Google Dataset Search: https://toolbox.google.com/datasetsearch

    2009-2013 American Community Survey

    Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

    Downloaded From: https://data.world/kvaughn/languages-county

    Banner and thumbnail photo by Farzad Mohsenvand on Unsplash

  2. ‘Languages spoken across various nations’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Feb 13, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Languages spoken across various nations’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-languages-spoken-across-various-nations-a8e8/latest
    Explore at:
    Dataset updated
    Feb 13, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Languages spoken across various nations’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shubhamptrivedi/languages-spoken-across-various-nations on 13 February 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    I was fascinated by this type of data, as it offers a glimpse into the cultural diversity of a nation and the kind of literary work to be expected from it.

    Content

    This dataset is a collection of all the languages spoken by the different nations around the world. Nowadays, most nations are bilingual or even trilingual, owing to different cultures and groups of people living together in the same nation. This type of data can be very useful for linguistic research, market research, advertising, and more.

    Acknowledgements

    This dataset was published on Infoplease, a general-information website.

    Inspiration

    I think this dataset can be useful for understanding which type of literature publication can achieve maximum penetration of the market base.

    --- Original source retains full ownership of the source dataset ---

  3. GlobalPhone Polish

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Polish [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0320/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary; the articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, together with information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2,100 native adult speakers. Audio data is compressed with the shorten program written by Tony Robinson; alternatively, the data can be delivered uncompressed.

    The Polish part of GlobalPhone was collected from 102 native speakers in Poland, of whom 48 were female and 54 were male. The majority of speakers are between 20 and 39 years old; the full age distribution ranges from 18 to 65 years. Most of the speakers are non-smokers in good health. Each speaker read on average about 100 utterances from newspaper articles; 10,130 utterances were recorded in total. The speech was recorded using a close-talking microphone (Sennheiser HM420) in a push-to-talk scenario. All data were recorded at 16 kHz with 16-bit resolution in PCM format. The data collection took place in small and large rooms; about half of the recordings were made under very quiet conditions, the other half with moderate background noise. Information on the recording place and environmental noise conditions is provided in a separate speaker session file for each speaker. The text data used for reco...

  4. jampatoisnli

    • huggingface.co
    Updated Jul 21, 2023
    + more versions
    Cite
    Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 21, 2023
    Authors
    Ruth-Ann Armstrong
    License

    https://choosealicense.com/licenses/other/

    Description

    Dataset Card for [Dataset Name]

      Dataset Summary
    

    JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.

  5. XLingHealth

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Georgia Tech CLAWS Lab (2024). XLingHealth [Dataset]. https://huggingface.co/datasets/claws-lab/XLingHealth
    Explore at:
    Dataset updated
    Feb 7, 2024
    Dataset authored and provided by
    Georgia Tech CLAWS Lab
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for "XLingHealth"

    XLingHealth is a Cross-Lingual Healthcare benchmark for clinical health inquiry that features the top four most spoken languages in the world: English, Spanish, Chinese, and Hindi.

      Statistics

    Dataset        Examples   Words (Q)       Words (A)
    HealthQA       1,134      7.72 ± 2.41     242.85 ± 221.88
    LiveQA         246        41.76 ± 37.38   115.25 ± 112.75
    MedicationQA   690        6.86 ± 2.83     61.50 ± 69.44

    Words (Q) and Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.

  6. Language spoken at Home (Census 2016)

    • pacificgeoportal.com
    • cacgeoportal.com
    • +1more
    Updated May 26, 2019
    Cite
    Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://www.pacificgeoportal.com/datasets/esriau::language-spoken-at-home-census-2016/about
    Explore at:
    Dataset updated
    May 26, 2019
    Dataset provided by
    Esri (http://esri.com/)
    Authors
    Esri Australia
    Description

    Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.
    Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies not coloured are either tied between two languages or have insufficient data.
    Colour intensity: The colour intensity compares the value of the winner to all other values and returns its dominance over other languages in the same geography.
    Notes: Only the top 6 languages are considered for VIC. See also: Census 2016 DataPacks, Predominance Visualisations, Source Code.
    Notice that while one language appears to dominate certain geographies, it doesn't necessarily represent the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population, due to the number of languages considered.
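
    The winner-plus-dominance logic described above can be sketched as follows, using made-up counts. The exact dominance formula used by the map is not documented here, so the winner's share of all responses is an assumption for illustration.

```python
def predominant_language(counts):
    """Return (winner, dominance) for one geography, or (None, 0.0) on a tie.

    counts: mapping of language -> number of speakers in the geography.
    Dominance is taken as the winner's share of all responses; the map's
    actual formula is not documented, so this measure is an assumption.
    """
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, 0.0  # tied geographies are left uncoloured
    winner, top = ranked[0]
    return winner, top / total

# Hypothetical SA1 geography: English wins while being far from a majority,
# illustrating the note above about predominance vs. majority.
sa1 = {"English": 120, "Mandarin": 90, "Greek": 80,
       "Vietnamese": 60, "Arabic": 50, "Italian": 40}
winner, dominance = predominant_language(sa1)
print(winner, round(dominance, 2))  # English, 0.27 — a winner with ~27% share
```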

  7. GlobalPhone Portuguese (Brazilian)

    • catalogue.elra.info
    • live.european-language-grid.eu
    Updated Jun 26, 2017
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
    Explore at:
    Dataset updated
    Jun 26, 2017
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    Area covered
    Brazil
    Description

    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary; the articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects like laughing and hesitations. Speaker information such as age, gender, and occupation, together with information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2,100 native adult speakers. Audio data is compressed with the shorten program written by Tony Robinson; alternatively, the data can be delivered uncompressed.

    The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The age distribution is as follows: 6 speakers are below 19, 58 are between 20 and 29, 27 are between 30 and 39, 5 are between 40 and 49, and 5 are over 50 (one speaker's age is unknown).

  8. English Conversation and Monologue speech dataset

    • kaggle.com
    Updated Jun 7, 2024
    Cite
    Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    English(America) Real-world Casual Conversation and Monologue speech dataset

    Description

    The English (America) Real-world Casual Conversation and Monologue speech dataset covers self-media, conversation, live streams, lectures, variety shows, etc., mirroring real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

    Format

    16kHz, 16 bit, wav, mono channel;

    Content category

    Including self-media, conversation, live, lecture, variety-show, etc;

    Recording environment

    Low background noise;

    Country

    America(USA);

    Language(Region) Code

    en-US;

    Language

    English;

    Features of annotation

    Transcription text, timestamp, speaker ID, gender.

    Accuracy Rate

    Sentence Accuracy Rate (SAR) 95%

    Licensing Information

    Commercial License

  9. Spanish (Spain) Call Center Data for Realestate AI

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). Spanish (Spain) Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-spanish-spain
    Explore at:
    wav. Available download formats
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Spain
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    This Spanish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.

    Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.

    Speech Data

    The dataset features 30 hours of dual-channel call center recordings between native Spanish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.

    Participant Diversity:
    Speakers: 60 native Spanish speakers from our verified contributor community.
    Regions: Representing different provinces across Spain to ensure accent and dialect variation.
    Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
    Recording Details:
    Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
    Call Duration: Average 5–15 minutes per call.
    Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
    Recording Environment: Captured in noise-free and echo-free conditions.

    Topic Diversity

    This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.

    Inbound Calls:
    Property Inquiries
    Rental Availability
    Renovation Consultation
    Property Features & Amenities
    Investment Property Evaluation
    Ownership History & Legal Info, and more
    Outbound Calls:
    New Listing Notifications
    Post-Purchase Follow-ups
    Property Recommendations
    Value Updates
    Customer Satisfaction Surveys, and others

    Such domain-rich variety ensures model generalization across common real estate support conversations.

    Transcription

    All recordings are accompanied by precise, manually verified transcriptions in JSON format.

    Transcription Includes:
    Speaker-Segmented Dialogues
    Time-coded Segments
    Non-speech Tags (e.g., background noise, pauses)
    High transcription accuracy with word error rate below 5% via dual-layer human review.

    These transcriptions streamline ASR and NLP development for Spanish real estate voice applications.

    Metadata

    Detailed metadata accompanies each participant and conversation:

    Participant Metadata: ID, age, gender, location, accent, and dialect.
    Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.

    This enables smart filtering, dialect-focused model training, and structured dataset exploration.
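
    The metadata-driven filtering described above might look like the following sketch. The record layout and values are hypothetical, inferred only from the field names this entry lists (ID, age, gender, location, accent, dialect), not from the actual delivery format.

```python
# Hypothetical participant metadata records; field names follow this entry's
# listed metadata, but the values and structure are invented for illustration.
participants = [
    {"id": "P001", "age": 34, "gender": "male",   "location": "Madrid",   "dialect": "Castilian"},
    {"id": "P002", "age": 52, "gender": "female", "location": "Seville",  "dialect": "Andalusian"},
    {"id": "P003", "age": 27, "gender": "male",   "location": "Valencia", "dialect": "Castilian"},
]

def filter_participants(records, **criteria):
    """Keep records whose fields match every keyword criterion."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]

# Dialect-focused subset, e.g. for training a Castilian-specific model:
castilian = filter_participants(participants, dialect="Castilian")
print([r["id"] for r in castilian])  # → ['P001', 'P003']
```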

    Usage and Applications

    This dataset is ideal for voice AI and NLP systems built for the real estate sector:

  10. English language corpora.

    • plos.figshare.com
    xls
    Updated Jun 2, 2025
    + more versions
    Cite
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). English language corpora. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t002
    Explore at:
    xls. Available download formats
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID), called UC-23-RY, to fill the gaps in Urdu image captioning; its 159,816 Urdu captions are based on the Flickr30k dataset. The study also proposes deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84, respectively, as demonstrated through evaluations assessing each model's impact on caption quality. In addition to providing a useful dataset, the study shows how well suited sophisticated deep learning models are to improving automatic Urdu image captioning.

  11. BanglaNLP

    • huggingface.co
    Updated Jun 1, 2025
    Cite
    Likhon Sheikh (2025). BanglaNLP [Dataset]. https://huggingface.co/datasets/likhonsheikh/BanglaNLP
    Explore at:
    Dataset updated
    Jun 1, 2025
    Authors
    Likhon Sheikh
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    BanglaNLP: Bengali-English Parallel Dataset Tools

    BanglaNLP is a comprehensive toolkit for creating high-quality Bengali-English parallel datasets from news sources, designed to improve machine translation and other cross-lingual NLP tasks for the Bengali language. Our work addresses the critical shortage of high-quality parallel data for Bengali, the 7th most spoken language in the world with over 230 million speakers.

      🏆 Impact & Recognition
    

    120K+ Sentence Pairs:… See the full description on the dataset page: https://huggingface.co/datasets/likhonsheikh/BanglaNLP.

  12. IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran

    • data.niaid.nih.gov
    Updated Jan 27, 2024
    Cite
    Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
    Explore at:
    Dataset updated
    Jan 27, 2024
    Dataset provided by
    Firmansyah, Asep Fajar
    Gusmita, Ria Hari
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    IndQNER

    IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

    3117 sentences

    62027 tokens

    2475 named entities

    18 named entity categories
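
    Since the annotations use the BIO (Beginning-Inside-Outside) tagging format, decoding tags back into entity spans can be sketched as follows. The token sequence and tags in the example are invented for illustration, loosely in the style of the dataset's classes.

```python
def bio_to_entities(tokens, tags):
    """Collect (entity_text, label) spans from BIO-tagged tokens."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # B- opens a new entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)         # I- continues the open entity
        else:                             # "O" (or inconsistent I-) closes it
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

# Invented Indonesian example with hypothetical class labels:
tokens = ["Allah", "menurunkan", "Al", "Quran", "kepada", "Muhammad"]
tags = ["B-Allah", "O", "B-HolyBook", "I-HolyBook", "O", "B-Prophet"]
print(bio_to_entities(tokens, tags))
# → [('Allah', 'Allah'), ('Al Quran', 'HolyBook'), ('Muhammad', 'Prophet')]
```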

    Named Entity Classes

    The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

    Allah

    Allah's Throne

    Artifact

    Astronomical body

    Event

    False deity

    Holy book

    Language

    Angel

    Person

    Messenger

    Prophet

    Sentient

    Afterlife location

    Geographical location

    Color

    Religion

    Food

    Fruit

    The book of Allah

    Annotation Stage

    There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

    Anggita Maharani Gumay Putri

    Muhammad Destamal Junas

    Naufaldi Hafidhigbal

    Nur Kholis Azzam Ubaidillah

    Puspitasari

    Septiany Nur Anggita

    Wilda Nurjannah

    William Santoso

    Verification Stage

    We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

    Dr. Eva Nugraha, M.Ag.

    Dr. Jauhar Azizy, MA

    Dr. Lilik Ummi Kultsum, MA

    Evaluation

    We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

    Supervised Learning Setting

    The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

    Maximum sequence length   Epochs   Precision   Recall   F1 score
    256                       10       0.94        0.92     0.93
    256                       20       0.99        0.97     0.98
    256                       40       0.96        0.96     0.96
    256                       100      0.97        0.96     0.96
    512                       10       0.92        0.92     0.92
    512                       20       0.96        0.95     0.96
    512                       40       0.97        0.95     0.96
    512                       100      0.97        0.95     0.96

    Transfer Learning Setting

    We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

    Maximum sequence length   Epochs   Precision   Recall   F1 score
    256                       10       0.67        0.65     0.65
    256                       20       0.60        0.59     0.59
    256                       40       0.75        0.72     0.71
    256                       100      0.73        0.68     0.68
    512                       10       0.72        0.62     0.64
    512                       20       0.62        0.57     0.58
    512                       40       0.72        0.66     0.67
    512                       100      0.68        0.68     0.67
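
    As a quick sanity check on the tables above: F1 is the harmonic mean of precision and recall, so the first supervised-learning row (P = 0.94, R = 0.92) gives F1 ≈ 0.93. Rows where the reported F1 differs slightly from this formula are presumably macro-averaged per class, which need not equal the F1 computed from the aggregate precision and recall.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# First supervised-learning row: P = 0.94, R = 0.92
print(round(f1(0.94, 0.92), 2))  # → 0.93, matching the reported score
```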

    This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

    How to Cite

    @InProceedings{10.1007/978-3-031-35320-8_12,
      author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
      editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
      title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
      booktitle="Natural Language Processing and Information Systems",
      year="2023",
      publisher="Springer Nature Switzerland",
      address="Cham",
      pages="170--185",
      abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside the IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM- and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model, achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation benchmark for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
      isbn="978-3-031-35320-8"
    }

    Contact

    If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id

  13. Pashtu Language Digits Dataset (PLDD)

    • data.mendeley.com
    Updated Mar 25, 2022
    + more versions
    Cite
    khalil khan (2022). Pashtu Language Digits Dataset (PLDD) [Dataset]. http://doi.org/10.17632/zbyc7sgp63.2
    Explore at:
    Dataset updated
    Mar 25, 2022
    Authors
    khalil khan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Pashtu is spoken by more than 50 million people worldwide and is a national language of Afghanistan. It is also spoken in two of Pakistan's largest provinces, Khyber Pakhtunkhwa and Baluchistan. While optical character recognition systems for many other languages are well developed, very little work has been reported for Pashtu. As an initial step, we introduce this dataset for digit recognition.

  14. GLips - German Lipreading Dataset

    • fdr.uni-hamburg.de
    zip
    Updated Mar 1, 2022
    Cite
    Schwiebert, Gerald; Weber, Cornelius; Qu, Leyuan; Siqueira, Henrique; Wermter, Stefan; Schwiebert, Gerald; Weber, Cornelius; Qu, Leyuan; Siqueira, Henrique; Wermter, Stefan (2022). GLips - German Lipreading Dataset [Dataset]. http://doi.org/10.25592/uhhfdm.10048
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 1, 2022
    Dataset provided by
    University of Hamburg
    Authors
    Schwiebert, Gerald; Weber, Cornelius; Qu, Leyuan; Siqueira, Henrique; Wermter, Stefan; Schwiebert, Gerald; Weber, Cornelius; Qu, Leyuan; Siqueira, Henrique; Wermter, Stefan
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words, ranging from 4 to 18 characters in length, each have 500 instances with separate MPEG-4 audio and text metadata files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also included. The size of the uncompressed dataset is 16GB.
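    Since the release also ships the complete Praat TextGrid files for each session, word segments can be recovered with a small parser. A minimal sketch; the fragment below is a hypothetical TextGrid snippet, not taken from the dataset:

```python
import re

# Hypothetical interval fragment in Praat TextGrid syntax.
SNIPPET = """
intervals [1]:
    xmin = 0.00
    xmax = 1.16
    text = "landtag"
"""

def extract_intervals(textgrid):
    """Return (xmin, xmax, word) tuples for each labeled interval."""
    pattern = re.compile(
        r'xmin = ([\d.]+)\s+xmax = ([\d.]+)\s+text = "([^"]*)"')
    return [(float(a), float(b), w) for a, b, w in pattern.findall(textgrid)]

print(extract_intervals(SNIPPET))  # [(0.0, 1.16, 'landtag')]
```

    A regex pass like this is enough for flat interval tiers; a full TextGrid parser would be needed for nested tiers.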

  15. Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...

    • zenodo.org
    • data.niaid.nih.gov
    bin
    Updated Oct 21, 2024
    Cite
    Nirmalya Thakur, Ph.D.; Nirmalya Thakur, Ph.D. (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.13896353
    Explore at:
    Available download formats: bin
    Dataset updated
    Oct 21, 2024
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Nirmalya Thakur, Ph.D.; Nirmalya Thakur, Ph.D.
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 6, 2024
    Description

    Please cite the following paper when using this dataset:

    N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

    Abstract

    The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

    For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

    The Instagram posts in this dataset are present in 161 different languages, of which the top 10 by frequency are English (343041 posts), Spanish (30220 posts), Hindi (15832 posts), Portuguese (15779 posts), Indonesian (11491 posts), Tamil (9592 posts), Arabic (9416 posts), German (7822 posts), Italian (5162 posts), and Turkish (4632 posts).

    There are 535,021 distinct hashtags in this dataset, with the top 10 by frequency being #covid19 (169865 posts), #covid (132485 posts), #coronavirus (117518 posts), #covid_19 (104069 posts), #covidtesting (95095 posts), #coronavirusupdates (75439 posts), #corona (39416 posts), #healthcare (38975 posts), #staysafe (36740 posts), and #coronavirusoutbreak (34567 posts).

    The following is a description of the attributes present in this dataset:

    • Post ID: Unique ID of each Instagram post
    • Post Description: Complete description of each post in the language in which it was originally published
    • Date: Date of publication in MM/DD/YYYY format
    • Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API
    • Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API
    • Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
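    Given these attributes, a per-language sentiment breakdown can be computed with the standard library alone. A minimal sketch; only the field names come from the attribute list above, and the sample records are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical records following the attribute list above; the real
# dataset stores one such record per Instagram post.
posts = [
    {"Post ID": "1", "Language code": "en", "Sentiment": "positive"},
    {"Post ID": "2", "Language code": "en", "Sentiment": "negative"},
    {"Post ID": "3", "Language code": "es", "Sentiment": "neutral"},
]

def sentiment_by_language(rows):
    """Count sentiment labels per language code."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["Language code"]][row["Sentiment"]] += 1
    return counts

dist = sentiment_by_language(posts)
print(dist["en"]["positive"])  # 1
```

    The same grouping, keyed on the Date attribute instead, would support the temporal questions below.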

    Open Research Questions

    This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

    1. How does sentiment toward COVID-19 vary across different languages?
    2. How has public sentiment toward COVID-19 evolved from 2020 to the present?
    3. How do cultural differences affect social media discourse about COVID-19 across various languages?
    4. How has COVID-19 impacted mental health, as reflected in social media posts across different languages?
    5. How effective were public health campaigns in shifting public sentiment in different languages?
    6. What patterns of vaccine hesitancy or support are present in different languages?
    7. How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?
    8. What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?
    9. How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?
    10. What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

    All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).

  16. 120 Million Word Spanish Corpus

    • marketplace.sshopencloud.eu
    Updated Apr 24, 2020
    + more versions
    Cite
    (2020). 120 Million Word Spanish Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/XTUFXt
    Explore at:
    Dataset updated
    Apr 24, 2020
    Description

    Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
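    The fixed "ENDOFARTICLE." sentinel makes these files easy to split without a full XML parser. A minimal sketch; the `<doc>` tag name and attributes in the sample string are assumptions for illustration, since the listing does not name the tag:

```python
def split_articles(raw_text):
    """Split one corpus file into individual articles using the
    'ENDOFARTICLE.' sentinel that closes each article."""
    parts = raw_text.split("ENDOFARTICLE.")
    # The last fragment is whatever trails the final sentinel
    # (closing tags, whitespace); keep only non-empty article bodies.
    return [p.strip() for p in parts[:-1] if p.strip()]

# Hypothetical two-article file; the real tag name and metadata
# attributes may differ.
sample = ("<doc id='1' title='A'>Texto uno. ENDOFARTICLE.</doc>"
          "<doc id='2' title='B'>Texto dos. ENDOFARTICLE.</doc>")
articles = split_articles(sample)
print(len(articles))  # 2
```

    Each returned chunk still carries its opening tag, so the article id and title can be pulled out afterwards with a small regex.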

  17. Mexican Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    Mexico
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Mexico to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.
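    The stated audio spec (stereo WAV, 16-bit depth, 16 kHz) can be verified programmatically before training. A minimal sketch using Python's standard `wave` module:

```python
import wave

def matches_spec(path, channels=2, sample_width=2, rate=16000):
    """Check a WAV file against the stated recording spec:
    stereo (2 channels), 16-bit samples (2 bytes), 16 kHz."""
    with wave.open(path, "rb") as wf:
        return (wf.getnchannels() == channels
                and wf.getsampwidth() == sample_width
                and wf.getframerate() == rate)
```

    Files that fail the check can be flagged for inspection before they enter an ASR pipeline.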

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Mexican conversations.

  18. GLips - German Lipreading Dataset - Dataset - B2FIND

    • b2find.eudat.eu
    Updated May 1, 2023
    Cite
    (2023). GLips - German Lipreading Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/aca4d0c4-9e81-560c-b3af-de011686ecc6
    Explore at:
    Dataset updated
    May 1, 2023
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
    License information was derived automatically

    Description

    The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words, ranging from 4 to 18 characters in length, each have 500 instances with separate MPEG-4 audio and text metadata files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also included. The size of the uncompressed dataset is 16GB. Copyright of original data: Hessian Parliament (https://hessischer-landtag.de). If you use this dataset, you agree to use it for research purposes only and to cite the following reference in any works that make any use of the dataset.

  19. ‘Extinct Languages’ analyzed by Analyst-2

    • analyst-2.ai
    Updated Jan 28, 2022
    Cite
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Extinct Languages’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-extinct-languages-6686/latest
    Explore at:
    Dataset updated
    Jan 28, 2022
    Dataset authored and provided by
    Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Analysis of ‘Extinct Languages’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/the-guardian/extinct-languages on 28 January 2022.

    --- Dataset description provided by original source is as follows ---

    Context

    A recent Guardian blog post asks: "How many endangered languages are there in the World and what are the chances they will die out completely?" The United Nations Educational, Scientific and Cultural Organization (UNESCO) regularly publishes a list of endangered languages, using a classification system that describes each language's degree of endangerment, up to complete extinction.

    Content

    The full detailed dataset includes names of languages, number of speakers, the names of countries where the language is still spoken, and the degree of endangerment. The UNESCO endangerment classification is as follows:

    • Vulnerable: most children speak the language, but it may be restricted to certain domains (e.g., home)
    • Definitely endangered: children no longer learn the language as a 'mother tongue' in the home
    • Severely endangered: language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves
    • Critically endangered: the youngest speakers are grandparents and older, and they speak the language partially and infrequently
    • Extinct: there are no speakers left
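    The five UNESCO categories form an ordered scale, so for sorting or plotting it helps to encode them ordinally. A minimal sketch; the numeric values are an arbitrary choice for illustration, not part of the dataset:

```python
# Ordinal encoding of the UNESCO endangerment scale listed above.
ENDANGERMENT_LEVELS = {
    "Vulnerable": 1,
    "Definitely endangered": 2,
    "Severely endangered": 3,
    "Critically endangered": 4,
    "Extinct": 5,
}

def sort_by_severity(languages):
    """Sort (name, degree) pairs from least to most endangered."""
    return sorted(languages, key=lambda item: ENDANGERMENT_LEVELS[item[1]])

# Hypothetical rows in the dataset's (language, degree) shape.
sample = [("Ainu", "Critically endangered"), ("Sicilian", "Vulnerable")]
print(sort_by_severity(sample)[0][0])  # Sicilian
```

    The same mapping can drive a color scale when plotting languages on a map by degree of endangerment.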

    Acknowledgements

    Data was originally organized and published by The Guardian, and can be accessed via this Datablog post.

    Inspiration

    • How can you best visualize this data?
    • Which rare languages are more isolated (Sicilian, for example) versus more spread out? Can you come up with a hypothesis for why that is the case?
    • Can you compare the number of rare speakers with more relatable figures? For example, are there more Romani speakers in the world than there are residents in a small city in the United States?

    --- Original source retains full ownership of the source dataset ---

  20. Colombian Spanish General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). Colombian Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-colombia
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the Colombian Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Colombian Spanish communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Colombian accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Colombian Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native Colombian Spanish speakers from FutureBeeAI’s contributor community.
    Regions: Representing various provinces of Colombia to ensure dialectal diversity and demographic balance.
    Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through a double QA pass; average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple Spanish speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for Colombian Spanish.
    Voice Assistants: Build smart assistants capable of understanding natural Colombian conversations.

MCB_languages_county

This dataset was used to list all languages spoken in the United States: 2009-13


These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.
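As a sketch of what an API call might look like, the request URL can be composed with the standard library. The endpoint path and variable names below are assumptions for illustration; the Census Bureau's developers page documents the real ones:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and variables for the 2009-2013 language tables;
# check the Census Bureau's developers page for the actual ones.
BASE = "https://api.census.gov/data/2013/language"

def build_query(variables, geography):
    """Compose an API request URL from variable names and a geography."""
    params = {"get": ",".join(variables), "for": geography}
    return BASE + "?" + urlencode(params)

url = build_query(["EST", "LANLABEL"], "state:*")
print(url)
```

The response would then be fetched with any HTTP client and parsed as JSON.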

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
