100+ datasets found

MCB_languages_county
kaggle.com
Updated Oct 1, 2019
Cite
Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Oct 1, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Marisol Brewster
Description
Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
‘Languages spoken across various nations’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Languages spoken across various nations’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-languages-spoken-across-various-nations-a8e8/latest
Explore at:
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Languages spoken across various nations’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shubhamptrivedi/languages-spoken-across-various-nations on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Context

I was fascinated by this type of data because it gives a glimpse of a nation's cultural diversity and of the kind of literary work to expect from that nation.

Content

This dataset is a collection of all the languages that are spoken by the different nations around the world. Nowadays, most nations are bilingual or even trilingual, which can be because different cultures and different groups of people live in the same nation in harmony. This type of data can be very useful for linguistic research, market research, advertising purposes, and the list goes on.

Acknowledgements

This dataset was published on Infoplease, a general information website.

Inspiration

I think this dataset can be useful for understanding which types of literary publication would achieve maximum penetration of the market base.

--- Original source retains full ownership of the source dataset ---
GlobalPhone Polish
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Polish [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0320/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshortened.

The Polish part of GlobalPhone was collected from altogether 102 native speakers in Poland, of which 48 speakers were female and 54 speakers were male. The majority of speakers are between 20 and 39 years old; the age distribution ranges from 18 to 65 years. Most of the speakers are non-smokers in good health. Each speaker read on average about 100 utterances from newspaper articles; in total, 10130 utterances were recorded. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small and large rooms; about half of the recordings took place under very quiet noise conditions, the other half with moderate background noise. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker. The text data used for reco...
jampatoisnli
huggingface.co
Updated Jul 21, 2023
+ more versions
Cite
Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jul 21, 2023
Authors
Ruth-Ann Armstrong
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for [Dataset Name]

Dataset Summary

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.
XLingHealth
huggingface.co
Updated Feb 7, 2024
Cite
Georgia Tech CLAWS Lab (2024). XLingHealth [Dataset]. https://huggingface.co/datasets/claws-lab/XLingHealth
Explore at:
Dataset updated
Feb 7, 2024
Dataset authored and provided by
Georgia Tech CLAWS Lab
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for "XLingHealth"

XLingHealth is a Cross-Lingual Healthcare benchmark for clinical health inquiry that features the top four most spoken languages in the world: English, Spanish, Chinese, and Hindi.

Statistics

Dataset        Examples   Words (Q)        Words (A)
HealthQA       1,134      7.72 ± 2.41      242.85 ± 221.88
LiveQA         246        41.76 ± 37.38    115.25 ± 112.75
MedicationQA   690        6.86 ± 2.83      61.50 ± 69.44

Words (Q) and #Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.
Language spoken at Home (Census 2016)
pacificgeoportal.com
cacgeoportal.com
+1more
Updated May 26, 2019
Cite
Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://www.pacificgeoportal.com/datasets/esriau::language-spoken-at-home-census-2016/about
Explore at:
Dataset updated
May 26, 2019
Dataset provided by
Esri (http://esri.com/)
Authors
Esri Australia
Description
Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.

Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies not coloured are either tied between two languages or do not have enough data.

Colour Intensity: The colour intensity compares the values of the winner to all other values and returns its dominance over other languages in the same geography.

Notes: Only considers top 6 languages for VIC.

Census 2016 DataPacks
Predominance Visualisations
Source Code

Notice that while one language level appears to dominate certain geographies, it doesn't necessarily mean it represents the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population due to the number of languages considered.
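The predominance idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the map's exact intensity formula is not given here, so the sketch assumes dominance is the winner's share of all counted speakers, and the function name and the counts are hypothetical.

```python
from typing import Dict, Optional, Tuple


def predominant_language(counts: Dict[str, int]) -> Optional[Tuple[str, float]]:
    """Return (winning language, share of counted speakers), or None.

    None is returned when there is no data or when the top two
    languages are tied, mirroring the "not coloured" case.
    The share-based dominance measure is an assumption.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    # Rank languages from most to fewest speakers.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    top_language, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie between the leading languages
    return top_language, top_count / total


# Hypothetical counts for one SA1 geography (not real census values).
area = {"Mandarin": 40, "Greek": 25, "Vietnamese": 20, "Italian": 15}
print(predominant_language(area))
```

With these made-up counts the winner holds a 0.4 share, which illustrates the closing remark: a language can be predominant without representing a majority.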
GlobalPhone Portuguese (Brazilian)
catalogue.elra.info
live.european-language-grid.eu
Updated Jun 26, 2017
+ more versions
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2017). GlobalPhone Portuguese (Brazilian) [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0201/
Explore at:
Dataset updated
Jun 26, 2017
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Brazil
Description
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322). In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers. Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data could be delivered unshortened.

The Portuguese (Brazilian) corpus was produced using the Folha de Sao Paulo newspaper. It contains recordings of 102 speakers (54 males, 48 females) recorded in Porto Velho and Sao Paulo, Brazil. The following age distribution has been obtained: 6 speakers are below 19, 58 speakers are between 20 and 29, 27 speakers are between 30 and 39, 5 speakers are between 40 and 49, and 5 speakers are over 50 (1 speaker age is unknown).
English Conversation and Monologue speech dataset
kaggle.com
Updated Jun 7, 2024
Cite
Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
Explore at:
Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Jun 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
English(America) Real-world Casual Conversation and Monologue speech dataset

Description

The English (America) Real-world Casual Conversation and Monologue speech dataset covers self-media, conversation, live, lecture, variety-show, etc., mirroring real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse set of speakers, which is intended to improve model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

Format

16kHz, 16 bit, wav, mono channel;

Content category

Including self-media, conversation, live, lecture, variety-show, etc;

Recording environment

Low background noise;

Country

America(USA);

Language(Region) Code

en-US;

Language

English;

Features of annotation

Transcription text, timestamp, speaker ID, gender.

Accuracy Rate

Sentence Accuracy Rate (SAR) 95%

Licensing Information

Commercial License
Spanish (Spain) Call Center Data for Realestate AI
futurebeeai.com
wav
Updated Aug 1, 2022
+ more versions
Cite
FutureBee AI (2022). Spanish (Spain) Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-spanish-spain
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Spain
Dataset funded by
FutureBeeAI
Description
Introduction
This Spanish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
Speech Data
The dataset features 30 hours of dual-channel call center recordings between native Spanish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
•Participant Diversity:
  •Speakers: 60 native Spanish speakers from our verified contributor community.
  •Regions: Representing different provinces across Spain to ensure accent and dialect variation.
  •Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
•Recording Details:
  •Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
  •Call Duration: Average 5–15 minutes per call.
  •Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
  •Recording Environment: Captured in noise-free and echo-free conditions.
Topic Diversity
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
•Inbound Calls:
  •Property Inquiries
  •Rental Availability
  •Renovation Consultation
  •Property Features & Amenities
  •Investment Property Evaluation
  •Ownership History & Legal Info, and more
•Outbound Calls:
  •New Listing Notifications
  •Post-Purchase Follow-ups
  •Property Recommendations
  •Value Updates
  •Customer Satisfaction Surveys, and others
Such domain-rich variety ensures model generalization across common real estate support conversations.
Transcription
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-coded Segments
•Non-speech Tags (e.g., background noise, pauses)
•High transcription accuracy with word error rate below 5% via dual-layer human review.
These transcriptions streamline ASR and NLP development for Spanish real estate voice applications.
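For readers unfamiliar with the word error rate (WER) figure quoted above, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the sample sentences are hypothetical, and the vendor's actual scoring procedure (text normalization, tag handling) is not described in this listing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref_words)


# Hypothetical transcript pair: one substituted word out of five.
print(word_error_rate("el piso tiene tres habitaciones",
                      "el piso tiene dos habitaciones"))
```

Under this definition, a WER below 5% means fewer than one word-level error per twenty reference words on average.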
Metadata
Detailed metadata accompanies each participant and conversation:
•Participant Metadata: ID, age, gender, location, accent, and dialect.
•Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
Usage and Applications
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
English language corpora.
plos.figshare.com
xls
Updated Jun 2, 2025
+ more versions
Cite
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem (2025). English language corpora. [Dataset]. http://doi.org/10.1371/journal.pone.0320701.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0320701.t002
Dataset updated
Jun 2, 2025
Dataset provided by
PLOS ONE
Authors
Rimsha Muzaffar; Syed Yasser Arafat; Junaid Rashid; Jungeun Kim; Usman Naseem
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Advancements in deep learning have revolutionized numerous real-world applications, including image recognition, visual question answering, and image captioning. Among these, image captioning has emerged as a critical area of research, with substantial progress achieved in Arabic, Chinese, Uyghur, Hindi, and predominantly English. However, despite Urdu being a morphologically rich and widely spoken language, research in Urdu image captioning remains underexplored due to a lack of resources. This study creates a new Urdu Image Captioning Dataset (UCID) called UC-23-RY to fill in the gaps in Urdu image captioning. The Flickr30k dataset inspired the 159,816 Urdu captions in the dataset. Additionally, it suggests deep learning architectures designed especially for Urdu image captioning, including NASNetLarge-LSTM and ResNet-50-LSTM. The NASNetLarge-LSTM and ResNet-50-LSTM models achieved notable BLEU-1 scores of 0.86 and 0.84 respectively, as demonstrated through evaluation in this study assessing the model’s impact on caption quality. Additionally, it provides useful datasets and shows how well-suited sophisticated deep learning models are for improving automatic Urdu image captioning.
BanglaNLP
huggingface.co
Updated Jun 1, 2025
Cite
Likhon Sheikh (2025). BanglaNLP [Dataset]. https://huggingface.co/datasets/likhonsheikh/BanglaNLP
Explore at:
Dataset updated
Jun 1, 2025
Authors
Likhon Sheikh
License
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Description
BanglaNLP: Bengali-English Parallel Dataset Tools

BanglaNLP is a comprehensive toolkit for creating high-quality Bengali-English parallel datasets from news sources, designed to improve machine translation and other cross-lingual NLP tasks for the Bengali language. Our work addresses the critical shortage of high-quality parallel data for Bengali, the 7th most spoken language in the world with over 230 million speakers.

🏆 Impact & Recognition

120K+ Sentence Pairs:… See the full description on the dataset page: https://huggingface.co/datasets/likhonsheikh/BanglaNLP.
IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the...
data.niaid.nih.gov
Updated Jan 27, 2024
Cite
Gusmita, Ria Hari (2024). IndQNER: Indonesian Benchmark Dataset from the Indonesian Translation of the Quran [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7454891
Explore at:
Dataset updated
Jan 27, 2024
Dataset provided by
Firmansyah, Asep Fajar
Gusmita, Ria Hari
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
IndQNER

IndQNER is a Named Entity Recognition (NER) benchmark dataset that was created by manually annotating 8 chapters in the Indonesian translation of the Quran. The annotation was performed using a web-based text annotation tool, Tagtog, and the BIO (Beginning-Inside-Outside) tagging format. The dataset contains:

3117 sentences

62027 tokens

2475 named entities

18 named entity categories
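To illustrate the BIO (Beginning-Inside-Outside) tagging format mentioned above, the following sketch groups tagged tokens into entity spans. It is a generic decoder, not code from the IndQNER repository; the label strings (e.g. `B-Prophet`) and the example sentence are hypothetical and may not match the dataset's actual label names.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity class) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_class: Optional[str] = None

    def close_entity() -> None:
        # Store the open entity, if any, and reset the buffer.
        nonlocal current_tokens, current_class
        if current_tokens and current_class is not None:
            spans.append((" ".join(current_tokens), current_class))
        current_tokens, current_class = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_entity()  # a B- tag always starts a new entity
            current_tokens, current_class = [token], tag[2:]
        elif tag.startswith("I-") and tag[2:] == current_class:
            current_tokens.append(token)  # continue the open entity
        else:
            close_entity()  # "O" or an inconsistent I- tag ends the entity
    close_entity()
    return spans


# Hypothetical example; labels are illustrative only.
tokens = ["Musa", "pergi", "ke", "Bukit", "Sinai"]
tags = ["B-Prophet", "O", "O", "B-GeographicalLocation", "I-GeographicalLocation"]
print(bio_to_spans(tokens, tags))
```

An `I-` tag that does not continue an open entity of the same class is treated here like `O`; other decoders choose to start a new entity instead.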

Named Entity Classes

The named entity classes were initially defined by analyzing the existing Quran concepts ontology. The initial classes were updated based on the information acquired during the annotation process. Finally, there are 20 classes, as follows:

Allah

Allah's Throne

Artifact

Astronomical body

Event

False deity

Holy book

Language

Angel

Person

Messenger

Prophet

Sentient

Afterlife location

Geographical location

Color

Religion

Food

Fruit

The book of Allah

Annotation Stage

There were eight annotators who contributed to the annotation process. They were informatics engineering students at the State Islamic University Syarif Hidayatullah Jakarta.

Anggita Maharani Gumay Putri

Muhammad Destamal Junas

Naufaldi Hafidhigbal

Nur Kholis Azzam Ubaidillah

Puspitasari

Septiany Nur Anggita

Wilda Nurjannah

William Santoso

Verification Stage

We found many named entity and class candidates during the annotation stage. To verify the candidates, we consulted Quran and Tafseer (content) experts who are lecturers at Quran and Tafseer Department at the State Islamic University Syarif Hidayatullah Jakarta.

Dr. Eva Nugraha, M.Ag.

Dr. Jauhar Azizy, MA

Dr. Lilik Ummi Kultsum, MA

Evaluation

We evaluated the annotation quality of IndQNER by performing experiments in two settings: supervised learning (BiLSTM+CRF) and transfer learning (IndoBERT fine-tuning).

Supervised Learning Setting

The implementation of BiLSTM and CRF utilized IndoBERT to provide word embeddings. All experiments used a batch size of 16. These are the results:

Maximum sequence length   Number of epochs   Precision   Recall   F1 score
256                       10                 0.94        0.92     0.93
256                       20                 0.99        0.97     0.98
256                       40                 0.96        0.96     0.96
256                       100                0.97        0.96     0.96
512                       10                 0.92        0.92     0.92
512                       20                 0.96        0.95     0.96
512                       40                 0.97        0.95     0.96
512                       100                0.97        0.95     0.96
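As a reminder of how the three reported metrics relate, the F1 score is the harmonic mean of precision and recall. The sketch below applies that standard formula to two rows of the results above; the function name is an illustrative choice, and because the published precision and recall are rounded (and may be averaged per class), recomputed values need not match every reported F1 exactly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sequence length 256, 10 epochs: precision 0.94, recall 0.92.
print(round(f1_score(0.94, 0.92), 2))
# Sequence length 256, 20 epochs: precision 0.99, recall 0.97.
print(round(f1_score(0.99, 0.97), 2))
```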

Transfer Learning Setting

We performed several experiments with different parameters in IndoBERT fine-tuning. All experiments used a learning rate of 2e-5 and a batch size of 16. These are the results:

Maximum sequence length   Number of epochs   Precision   Recall   F1 score
256                       10                 0.67        0.65     0.65
256                       20                 0.60        0.59     0.59
256                       40                 0.75        0.72     0.71
256                       100                0.73        0.68     0.68
512                       10                 0.72        0.62     0.64
512                       20                 0.62        0.57     0.58
512                       40                 0.72        0.66     0.67
512                       100                0.68        0.68     0.67

This dataset is also part of the NusaCrowd project which aims to collect Natural Language Processing (NLP) datasets for Indonesian and its local languages.

How to Cite

@InProceedings{10.1007/978-3-031-35320-8_12,
  author="Gusmita, Ria Hari and Firmansyah, Asep Fajar and Moussallem, Diego and Ngonga Ngomo, Axel-Cyrille",
  editor="M{\'e}tais, Elisabeth and Meziane, Farid and Sugumaran, Vijayan and Manning, Warren and Reiff-Marganiec, Stephan",
  title="IndQNER: Named Entity Recognition Benchmark Dataset from the Indonesian Translation of the Quran",
  booktitle="Natural Language Processing and Information Systems",
  year="2023",
  publisher="Springer Nature Switzerland",
  address="Cham",
  pages="170--185",
  abstract="Indonesian is classified as underrepresented in the Natural Language Processing (NLP) field, despite being the tenth most spoken language in the world with 198 million speakers. The paucity of datasets is recognized as the main reason for the slow advancements in NLP research for underrepresented languages. Significant attempts were made in 2020 to address this drawback for Indonesian. The Indonesian Natural Language Understanding (IndoNLU) benchmark was introduced alongside IndoBERT pre-trained language model. The second benchmark, Indonesian Language Evaluation Montage (IndoLEM), was presented in the same year. These benchmarks support several tasks, including Named Entity Recognition (NER). However, all NER datasets are in the public domain and do not contain domain-specific datasets. To alleviate this drawback, we introduce IndQNER, a manually annotated NER benchmark dataset in the religious domain that adheres to a meticulously designed annotation guideline. Since Indonesia has the world's largest Muslim population, we build the dataset from the Indonesian translation of the Quran. The dataset includes 2475 named entities representing 18 different classes. To assess the annotation quality of IndQNER, we perform experiments with BiLSTM and CRF-based NER, as well as IndoBERT fine-tuning. The results reveal that the first model outperforms the second model achieving 0.98 F1 points. This outcome indicates that IndQNER may be an acceptable evaluation metric for Indonesian NER tasks in the aforementioned domain, widening the research's domain range.",
  isbn="978-3-031-35320-8"
}

Contact

If you have any questions or feedback, feel free to contact us at ria.hari.gusmita@uni-paderborn.de or ria.gusmita@uinjkt.ac.id
Pashtu Language Digits Dataset (PLDD)
data.mendeley.com
Updated Mar 25, 2022
+ more versions
Cite
khalil khan (2022). Pashtu Language Digits Dataset (PLDD) [Dataset]. http://doi.org/10.17632/zbyc7sgp63.2
Explore at:
Unique identifier
https://doi.org/10.17632/zbyc7sgp63.2
Dataset updated
Mar 25, 2022
Authors
khalil khan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Pashtu is a language spoken by more than 50 million people in the world. It is also the national language of Afghanistan. Pashtu is also spoken in the two largest provinces of Pakistan (Khyber Pakhtun Khwa and Baluchistan). Although optical character recognition systems for other languages are well developed, very little work has been reported for the Pashtu language. As an initial step, we are introducing this dataset for digit recognition.
GLips - German Lipreading Dataset
fdr.uni-hamburg.de
zip
Updated Mar 1, 2022
Cite
Schwiebert, Gerald; Weber, Cornelius; Qu, Leyuan; Siqueira, Henrique; Wermter, Stefan (2022). GLips - German Lipreading Dataset [Dataset]. http://doi.org/10.25592/uhhfdm.10048
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.25592/uhhfdm.10048
Dataset updated
Mar 1, 2022
Dataset provided by
University of Hamburg
Authors
Schwiebert, Gerald; Weber, Cornelius; Qu, Leyuan; Siqueira, Henrique; Wermter, Stefan
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English language Lip Reading in the Wild (LRW) dataset, with each H264-compressed MPEG-4 video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. The 500 different spoken words, ranging between 4 and 18 characters in length, each have 500 instances and separate MPEG-4 audio and text metadata files, originating from 1018 parliamentary sessions. Additionally, the complete TextGrid files containing the segmentation information of those sessions are also included. The size of the uncompressed dataset is 16GB.
Data from: Five Years of COVID-19 Discourse on Instagram: A Labeled...
zenodo.org
data.niaid.nih.gov
bin
Updated Oct 21, 2024
Nirmalya Thakur, Ph.D. (2024). Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.13896353
Available download formats: bin
Unique identifier
https://doi.org/10.5281/zenodo.13896353
Dataset updated
Oct 21, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nirmalya Thakur, Ph.D.
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Oct 6, 2024
Description
Please cite the following paper when using this dataset:

N. Thakur, “Five Years of COVID-19 Discourse on Instagram: A Labeled Instagram Dataset of Over Half a Million Posts for Multilingual Sentiment Analysis”, Proceedings of the 7th International Conference on Machine Learning and Natural Language Processing (MLNLP 2024), Chengdu, China, October 18-20, 2024 (Paper accepted for publication, Preprint available at: https://arxiv.org/abs/2410.03293)

Abstract

The outbreak of COVID-19 served as a catalyst for content creation and dissemination on social media platforms, as such platforms serve as virtual communities where people can connect and communicate with one another seamlessly. While there have been several works related to the mining and analysis of COVID-19-related posts on social media platforms such as Twitter (or X), YouTube, Facebook, and TikTok, there is still limited research that focuses on the public discourse on Instagram in this context. Furthermore, the prior works in this field have only focused on the development and analysis of datasets of Instagram posts published during the first few months of the outbreak. The work presented in this paper aims to address this research gap and presents a novel multilingual dataset of 500,153 Instagram posts about COVID-19 published between January 2020 and September 2024. This dataset contains Instagram posts in 161 different languages. After the development of this dataset, multilingual sentiment analysis was performed using VADER and twitter-xlm-roberta-base-sentiment. This process involved classifying each post as positive, negative, or neutral. The results of sentiment analysis are presented as a separate attribute in this dataset.

For each of these posts, the Post ID, Post Description, Date of publication, language code, full version of the language, and sentiment label are presented as separate attributes in the dataset.

The Instagram posts in this dataset are present in 161 different languages, of which the top 10 by frequency are English (343,041 posts), Spanish (30,220 posts), Hindi (15,832 posts), Portuguese (15,779 posts), Indonesian (11,491 posts), Tamil (9,592 posts), Arabic (9,416 posts), German (7,822 posts), Italian (5,162 posts), and Turkish (4,632 posts).

There are 535,021 distinct hashtags in this dataset, with the top 10 by frequency being #covid19 (169,865 posts), #covid (132,485 posts), #coronavirus (117,518 posts), #covid_19 (104,069 posts), #covidtesting (95,095 posts), #coronavirusupdates (75,439 posts), #corona (39,416 posts), #healthcare (38,975 posts), #staysafe (36,740 posts), and #coronavirusoutbreak (34,567 posts).

The following is a description of the attributes present in this dataset:

Post ID: Unique ID of each Instagram post

Post Description: Complete description of each post in the language in which it was originally published

Date: Date of publication in MM/DD/YYYY format

Language code: Language code (for example: “en”) that represents the language of the post as detected using the Google Translate API

Full Language: Full form of the language (for example: “English”) that represents the language of the post as detected using the Google Translate API

Sentiment: Results of sentiment analysis (using the preprocessed version of each post) where each post was classified as positive, negative, or neutral
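
As a rough illustration of how these attributes might be used, the sketch below counts sentiment labels per language code and parses the MM/DD/YYYY date. It assumes the data can be read as CSV with column headers matching the attribute names above. The actual file layout is not stated here, and the sample rows are invented placeholders, not records from the dataset.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime


def sentiment_by_language(rows):
    """Count sentiment labels per language code."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["Language code"]][row["Sentiment"].strip().lower()] += 1
    return counts


def parse_date(text):
    """Parse the MM/DD/YYYY format described for the Date attribute."""
    return datetime.strptime(text, "%m/%d/%Y").date()


# Invented placeholder rows for illustration only (not taken from the dataset).
# The header names are an assumption based on the attribute list.
SAMPLE = """Post ID,Post Description,Date,Language code,Full Language,Sentiment
1,placeholder text,03/15/2020,en,English,negative
2,placeholder text,07/04/2021,en,English,positive
3,texto de ejemplo,01/09/2022,es,Spanish,neutral
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
counts = sentiment_by_language(rows)
first_date = parse_date(rows[0]["Date"])
```

Grouping on the language code rather than the full language name keeps the keys short. Either column should work if both come from the same detection step.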

Open Research Questions

This dataset is expected to be helpful for the investigation of the following research questions and even beyond:

How does sentiment toward COVID-19 vary across different languages?

How has public sentiment toward COVID-19 evolved from 2020 to the present?

How do cultural differences affect social media discourse about COVID-19 across various languages?

How has COVID-19 impacted mental health, as reflected in social media posts across different languages?

How effective were public health campaigns in shifting public sentiment in different languages?

What patterns of vaccine hesitancy or support are present in different languages?

How did geopolitical events influence public sentiment about COVID-19 in multilingual social media discourse?

What role does social media discourse play in shaping public behavior toward COVID-19 in different linguistic communities?

How does the sentiment of minority or underrepresented languages compare to that of major world languages regarding COVID-19?

What insights can be gained by comparing the sentiment of COVID-19 posts in widely spoken languages (e.g., English, Spanish) to those in less common languages?

All the Instagram posts that were collected during this data mining process to develop this dataset were publicly available on Instagram and did not require a user to log in to Instagram to view the same (at the time of writing this paper).
120 Million Word Spanish Corpus
marketplace.sshopencloud.eu
Updated Apr 24, 2020
(2020). 120 Million Word Spanish Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/XTUFXt
Dataset updated
Apr 24, 2020
Description
Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
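
A minimal sketch of how one corpus file might be split into articles using the “ENDOFARTICLE.” marker. The tag name is not given in the description above, so the pattern below accepts any tag name and assumes the opening tag carries `id="..."` and `title="..."` attributes in that order. The `<article>` tag and the sample text are hypothetical.

```python
import re

END_MARKER = "ENDOFARTICLE."  # marker named in the description

# Assumption: an opening tag with id="..." followed by title="...".
# The real tag name is unknown here, so any tag name is accepted.
OPEN_TAG = re.compile(r'<(\w+)\b[^>]*\bid="([^"]*)"[^>]*\btitle="([^"]*)"[^>]*>')


def split_articles(text):
    """Yield (id, title, body) for each article in one corpus file."""
    pos = 0
    while True:
        match = OPEN_TAG.search(text, pos)
        if match is None:
            return
        end = text.find(END_MARKER, match.end())
        if end == -1:
            return
        yield match.group(2), match.group(3), text[match.end():end].strip()
        pos = end + len(END_MARKER)


# Hypothetical sample in the assumed layout.
sample = (
    '<article id="1" title="Uno">\nPrimer texto.\nENDOFARTICLE.\n</article>\n'
    '<article id="2" title="Dos">\nSegundo texto.\nENDOFARTICLE.\n</article>\n'
)
articles = list(split_articles(sample))
```

Splitting on the end marker rather than parsing the files as XML is a deliberate choice: the sketch does not depend on the files being well-formed.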
Mexican Spanish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Mexico
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
•Regions: Representing various provinces of Mexico to ensure dialectal diversity and demographic balance.
•Demographics: A gender ratio of 60% male to 40% female, with participant ages ranging from 18 to 70 years.

•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation ranges from 15 to 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.
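
The recording details above can be checked against a downloaded file with Python's standard `wave` module. The sketch below is only an illustration: the expected values are read off the Audio Format and Duration bullets, and the in-memory silent file stands in for a real recording, since no file names are given here.

```python
import io
import wave

# Expected values taken from the "Audio Format" bullet above.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def read_wav_format(source):
    """Return channel count, sample width, sample rate and duration of a WAV file."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_spec(info):
    """True if the file is stereo, 16-bit, 16 kHz."""
    return all(info[key] == value for key, value in EXPECTED.items())


def make_silent_wav(seconds=1):
    """Build an in-memory stereo 16-bit 16 kHz WAV so the sketch runs without the dataset."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        # 2 channels * 2 bytes per sample * 16000 frames per second
        wav.writeframes(b"\x00" * (2 * 2 * 16000 * seconds))
    buffer.seek(0)
    return buffer


info = read_wav_format(make_silent_wav())
```

For a real recording, `read_wav_format` also accepts a file path, and `duration_s` can be compared against the stated 15 to 60 minute range.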

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
•Voice Assistants: Build smart assistants capable of understanding natural Mexican conversations.

GLips - German Lipreading Dataset - Dataset - B2FIND
b2find.eudat.eu
Updated May 1, 2023
(2023). GLips - German Lipreading Dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/aca4d0c4-9e81-560c-b3af-de011686ecc6
Dataset updated
May 1, 2023
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0), https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
The German Lipreading dataset consists of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which were processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English-language Lip Reading in the Wild (LRW) dataset: each H264-compressed MPEG-4 video encodes one word of interest in a context of 1.16 seconds duration, which makes the two datasets compatible for studying transfer learning between them. Choosing video material based on naturally spoken language in a natural environment ensures more robust results for real-world applications than artificially generated datasets with as little noise as possible. Each of the 500 different spoken words, ranging from 4 to 18 characters in length, has 500 instances with separate MPEG-4 audio files and text metadata files, originating from 1018 parliamentary sessions. The complete TextGrid files containing the segmentation information of those sessions are also included. The size of the uncompressed dataset is 16GB. Copyright of original data: Hessian Parliament (https://hessischer-landtag.de). If you use this dataset, you agree to use it for research purposes only and to cite the following reference in any works that make any use of the dataset.
‘Extinct Languages’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Extinct Languages’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-extinct-languages-6686/latest
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Extinct Languages’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/the-guardian/extinct-languages on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

A recent Guardian blog post asks: "How many endangered languages are there in the World and what are the chances they will die out completely?" The United Nations Educational, Scientific and Cultural Organization (UNESCO) regularly publishes a list of endangered languages, using a classification system that describes each language's degree of endangerment, up to and including extinction.

Content

The full detailed dataset includes names of languages, number of speakers, the names of countries where the language is still spoken, and the degree of endangerment. The UNESCO endangerment classification is as follows:

Vulnerable: most children speak the language, but it may be restricted to certain domains (e.g., home)

Definitely endangered: children no longer learn the language as a 'mother tongue' in the home

Severely endangered: language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves

Critically endangered: the youngest speakers are grandparents and older, and they speak the language partially and infrequently

Extinct: there are no speakers left
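
Because the five levels form an ordered scale, one way to work with them is as an ordered enumeration. The sketch below is illustrative only: the numeric values are an assumption chosen to preserve the order listed above, and the label spelling in the actual data file may differ.

```python
from enum import IntEnum


class Endangerment(IntEnum):
    """UNESCO levels in the order listed above; the numbers only encode that order."""
    VULNERABLE = 1
    DEFINITELY_ENDANGERED = 2
    SEVERELY_ENDANGERED = 3
    CRITICALLY_ENDANGERED = 4
    EXTINCT = 5


def parse_level(label):
    """Map a label such as 'Severely endangered' to its enum member."""
    return Endangerment[label.strip().upper().replace(" ", "_")]


labels = ["Extinct", "Vulnerable", "Critically endangered"]
ordered = sorted(parse_level(label) for label in labels)
```

With this ordering, comparisons such as `parse_level("Vulnerable") < parse_level("Extinct")` can be used to filter or sort languages by severity.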

Acknowledgements

Data was originally organized and published by The Guardian, and can be accessed via this Datablog post.

Inspiration

How can you best visualize this data?

Which rare languages are more isolated (Sicilian, for example) versus more spread out? Can you come up with a hypothesis for why that is the case?

Can you compare the number of rare speakers with more relatable figures? For example, are there more Romani speakers in the world than there are residents in a small city in the United States?

--- Original source retains full ownership of the source dataset ---
Colombian Spanish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). Colombian Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-colombia
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Colombian Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Colombian Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Colombian accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Colombian Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native Colombian Spanish speakers from FutureBeeAI’s contributor community.
•Regions: Representing various provinces of Colombia to ensure dialectal diversity and demographic balance.
•Demographics: A gender ratio of 60% male to 40% female, with participant ages ranging from 18 to 70 years.

•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation ranges from 15 to 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
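
For readers unfamiliar with the WER figure quoted above, the sketch below shows the usual way word error rate is computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It is a generic illustration, not the provider's QA procedure, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


# Invented example: one substituted word out of four reference words.
wer = word_error_rate("hola como estas hoy", "hola como esta hoy")
```

In practice, both transcripts are usually normalized (case, punctuation, non-speech tags) before scoring. How that is done here is not stated.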
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Colombian Spanish.
•Voice Assistants: Build smart assistants capable of understanding natural Colombian conversations.



Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code

MCB_languages_county

This dataset was used to list all languages spoken in the United States: 2009-13


Dataset updated

Oct 1, 2019

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Marisol Brewster

Description

Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
