100+ datasets found

MCB_languages_county
kaggle.com
Updated Oct 1, 2019
Cite
Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code
Dataset updated
Oct 1, 2019
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Marisol Brewster
Description
Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
‘Languages spoken across various nations’ analyzed by Analyst-2
analyst-2.ai
Updated Feb 13, 2022
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Languages spoken across various nations’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-languages-spoken-across-various-nations-a8e8/latest
Dataset updated
Feb 13, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Languages spoken across various nations’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/shubhamptrivedi/languages-spoken-across-various-nations on 13 February 2022.

--- Dataset description provided by original source is as follows ---

Context

I was fascinated by this type of data, as it gives a slight peek into the cultural diversity of a nation and the kind of literary work to be expected from that nation.

Content

This dataset is a collection of the languages spoken by the different nations around the world. Nowadays, most nations are bilingual or even trilingual; this can be because different cultures and different groups of people live in the same nation in harmony. This type of data can be very useful for linguistic research, market research, advertising purposes, and more.

Acknowledgements

This dataset was published on the site Infoplease which is a general information website.

Inspiration

I think this dataset can be useful for understanding which types of literature could be published for maximum penetration of the market base.

--- Original source retains full ownership of the source dataset ---
English Conversation and Monologue speech dataset
kaggle.com
Updated Jun 7, 2024
Cite
Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
Dataset updated
Jun 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
English(America) Real-world Casual Conversation and Monologue speech dataset

Description

This English (America) real-world casual conversation and monologue speech dataset covers self-media, conversation, live, lecture, variety-show and other content, mirroring real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers, which is intended to enhance model performance in real and complex tasks. Its quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

Format

16kHz, 16 bit, wav, mono channel;

Content category

Including self-media, conversation, live, lecture, variety-show, etc;

Recording environment

Low background noise;

Country

America(USA);

Language(Region) Code

en-US;

Language

English;

Features of annotation

Transcription text, timestamp, speaker ID, gender.

Accuracy Rate

Sentence Accuracy Rate (SAR) 95%

Licensing Information

Commercial License
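
The format stated above (16 kHz, 16-bit, mono WAV) can be checked on a downloaded file before training. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_format`, `matches_listing`, and `make_silent_wav` are illustrative and not part of the dataset, and the silent clip is synthetic stand-in data.

```python
import io
import wave

# Values taken from the "Format" entry of the listing: 16 kHz, 16 bit, mono.
EXPECTED = {"sample_rate": 16000, "sample_width_bytes": 2, "channels": 1}


def wav_format(source):
    """Return sample rate, sample width in bytes, and channel count of a WAV file."""
    with wave.open(source, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "sample_width_bytes": wav.getsampwidth(),
            "channels": wav.getnchannels(),
        }


def matches_listing(source):
    """True when the file header agrees with the format stated in the listing."""
    return wav_format(source) == EXPECTED


def make_silent_wav(sample_rate=16000, sample_width=2, channels=1, seconds=0.1):
    """Build a short silent WAV in memory (synthetic data for demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        frames = int(sample_rate * seconds)
        wav.writeframes(b"\x00" * frames * sample_width * channels)
    buffer.seek(0)
    return buffer


print(matches_listing(make_silent_wav()))
print(matches_listing(make_silent_wav(sample_rate=8000)))
```

`wave.open` also accepts a file path, so `matches_listing("clip.wav")` would work the same way on a real recording (the file name here is hypothetical).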
Language spoken at Home (Census 2016)
digital-earth-pacificcore.hub.arcgis.com
pacificgeoportal.com
Updated May 26, 2019
Cite
Esri Australia (2019). Language spoken at Home (Census 2016) [Dataset]. https://digital-earth-pacificcore.hub.arcgis.com/datasets/esriau::language-spoken-at-home-census-2016/about
Dataset updated
May 26, 2019
Dataset provided by
Esri (http://esri.com/)
Authors
Esri Australia
Description
Does the person speak a language other than English at home? This map takes a look at answers to this question from Census Night.

Colour: For each SA1 geography, the colour indicates which language 'wins'. SA1 geographies not coloured are either tied between two languages or not enough data.

Colour Intensity: The colour intensity compares the values of the winner to all other values and returns its dominance over other languages in the same geography.

Notes: Only considers top 6 languages for VIC.

Census 2016 DataPacks
Predominance Visualisations
Source Code

Notice that while one language level appears to dominate certain geographies, it doesn't necessarily mean it represents the majority of the population. In fact, as you explore most areas, you will find the predominant language makes up just a fraction of the population due to the number of languages considered.
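
The predominance idea described above (pick the 'winning' language per geography, then measure how strongly it dominates) can be sketched as follows. This is an illustration under assumptions: the listing does not state the exact intensity formula, so the winner's share of the total across the languages considered is used here as one plausible measure, and the function name and sample counts are made up.

```python
def predominance(counts):
    """Return (winner, strength) for one geography.

    counts maps language -> number of speakers.
    winner is None when there is no data or the top value is tied,
    mirroring the "not coloured" case in the description.
    strength is the winner's share of the total (an assumed measure).
    """
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, 0.0
    winner, value = ranked[0]
    return winner, value / total


# Made-up counts for a single area.
print(predominance({"Mandarin": 30, "Greek": 10, "Italian": 10}))
print(predominance({"Greek": 5, "Italian": 5}))
```

A winner with a strength well below 0.5 illustrates the caveat in the description: the predominant language need not represent a majority.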
GlobalPhone Vietnamese
live.european-language-grid.eu
catalogue.elra.info
audio format
Cite
GlobalPhone Vietnamese [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/2100
Available download formats: audio format
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Description
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.

The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.

Data is shortened by means of the shorten program written by Tony Robinson. Alternatively, the data can be delivered unshortened.

The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers in Karlsruhe, Germany. All speakers are native speakers of Vietnamese, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduate students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person; in total we recorded 22,112 utterances. The speech was recorded using a Sennheiser HM420 close-talking microphone in a push-to-talk scenario using an in-house developed, laptop-based data collection toolkit. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small-sized rooms with very low background noise. Information on recording place and environmental noise conditions is provided in a separate speaker session file for each speaker. The speech data was recorded in two phases. In the first phase, data was collected from 140 speakers in the cities of Hanoi and Ho Chi Minh City. In the second phase, we selected utterances from the text corpus in order to cover rare Vietnamese phonemes. This second recording phase was carried out with 20 Vietnamese graduate students who live in Karlsruhe. In sum, 22,112 utterances were spoken, corresponding to 25.25 hours of speech. The text data used for recording mainly came from the news posted in online editions of 15 Vietnamese newspaper websites, where the first 12 were used for the training set, while the last three were used for the development and evaluation set.
The text data collected from the first 12 websites cover almost 4 million word tokens with a vocabulary of 30,000 words, resulting in an Out-of-Vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [Schultz 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics encoded in UTF-8. The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers (6 from North and 4 from South Vietnam) with 1:40 hours of speech, and an evaluation set of 10 speakers with the same gender and dialect distribution as the development set with 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found in [Vu and Schultz, 2009, 2010].
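
The Out-of-Vocabulary rate quoted above is the share of tokens in a held-out text that do not appear in the training vocabulary. A minimal sketch of that calculation follows; it assumes a token-level rate and whitespace-separated tokens, neither of which is confirmed by the description, and the function name and toy data are illustrative.

```python
def oov_rate(vocabulary, tokens):
    """Percentage of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    vocab = set(vocabulary)
    missing = sum(1 for token in tokens if token not in vocab)
    return 100.0 * missing / len(tokens)


# Toy data, not taken from the corpus.
training_vocabulary = ["tin", "kinh", "chinh"]
evaluation_tokens = ["tin", "kinh", "moi", "tin"]
print(oov_rate(training_vocabulary, evaluation_tokens))
```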

[Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.
[Vu and Schultz, 2009] Ngoc Thang Vu, Tanja Schultz (2009): Vietnamese Large Vocabulary Continuous Speech Recognition, Automatic Speech Recognition and Understanding, ASRU 2009, Merano.
[Vu and Schultz, 2010] Ngoc Thang Vu, Tanja Schultz (2010): Optimization On Vietnamese Large Vocabulary Speech Recognition, 2nd Workshop on Spoken Languages Technologies for Under-resourced Languages, SLTU 2010, Penang, Malaysia, May 2010.
Global Country Information 2023
zenodo.org
data.niaid.nih.gov
csv
Updated Jun 15, 2024
Cite
Nidula Elgiriyewithana (2024). Global Country Information 2023 [Dataset]. http://doi.org/10.5281/zenodo.8165229
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.8165229
Dataset updated
Jun 15, 2024
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Nidula Elgiriyewithana
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

This comprehensive dataset provides a wealth of information about all countries worldwide, covering a wide range of indicators and attributes. It encompasses demographic statistics, economic indicators, environmental factors, healthcare metrics, education statistics, and much more. With every country represented, this dataset offers a complete global perspective on various aspects of nations, enabling in-depth analyses and cross-country comparisons.

Key Features

Country: Name of the country.

Density (P/Km2): Population density measured in persons per square kilometer.

Abbreviation: Abbreviation or code representing the country.

Agricultural Land (%): Percentage of land area used for agricultural purposes.

Land Area (Km2): Total land area of the country in square kilometers.

Armed Forces Size: Size of the armed forces in the country.

Birth Rate: Number of births per 1,000 population per year.

Calling Code: International calling code for the country.

Capital/Major City: Name of the capital or major city.

CO2 Emissions: Carbon dioxide emissions in tons.

CPI: Consumer Price Index, a measure of inflation and purchasing power.

CPI Change (%): Percentage change in the Consumer Price Index compared to the previous year.

Currency_Code: Currency code used in the country.

Fertility Rate: Average number of children born to a woman during her lifetime.

Forested Area (%): Percentage of land area covered by forests.

Gasoline_Price: Price of gasoline per liter in local currency.

GDP: Gross Domestic Product, the total value of goods and services produced in the country.

Gross Primary Education Enrollment (%): Gross enrollment ratio for primary education.

Gross Tertiary Education Enrollment (%): Gross enrollment ratio for tertiary education.

Infant Mortality: Number of deaths per 1,000 live births before reaching one year of age.

Largest City: Name of the country's largest city.

Life Expectancy: Average number of years a newborn is expected to live.

Maternal Mortality Ratio: Number of maternal deaths per 100,000 live births.

Minimum Wage: Minimum wage level in local currency.

Official Language: Official language(s) spoken in the country.

Out of Pocket Health Expenditure (%): Percentage of total health expenditure paid out-of-pocket by individuals.

Physicians per Thousand: Number of physicians per thousand people.

Population: Total population of the country.

Population: Labor Force Participation (%): Percentage of the population that is part of the labor force.

Tax Revenue (%): Tax revenue as a percentage of GDP.

Total Tax Rate: Overall tax burden as a percentage of commercial profits.

Unemployment Rate: Percentage of the labor force that is unemployed.

Urban Population: Percentage of the population living in urban areas.

Latitude: Latitude coordinate of the country's location.

Longitude: Longitude coordinate of the country's location.

Potential Use Cases

Analyze population density and land area to study spatial distribution patterns.

Investigate the relationship between agricultural land and food security.

Examine carbon dioxide emissions and their impact on climate change.

Explore correlations between economic indicators such as GDP and various socio-economic factors.

Investigate educational enrollment rates and their implications for human capital development.

Analyze healthcare metrics such as infant mortality and life expectancy to assess overall well-being.

Study labor market dynamics through indicators such as labor force participation and unemployment rates.

Investigate the role of taxation and its impact on economic development.

Explore urbanization trends and their social and environmental consequences.
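
As an illustration of the first use case (population density and land area), the sketch below derives density from the Population and Land Area (Km2) columns using only the standard library. It assumes the CSV headers match the feature names listed above exactly and that numeric cells may carry thousands separators; both are assumptions, and the two sample rows are made-up countries rather than real records.

```python
import csv
import io


def to_number(text):
    """Parse a numeric string that may carry thousands separators, '%' or '$'."""
    cleaned = text.replace(",", "").replace("%", "").replace("$", "").strip()
    return float(cleaned) if cleaned else None


def population_density(rows):
    """Map country -> persons per square kilometre, skipping incomplete rows."""
    result = {}
    for row in rows:
        population = to_number(row["Population"])
        area = to_number(row["Land Area (Km2)"])
        if population is None or not area:
            continue  # missing population, or missing/zero area
        result[row["Country"]] = population / area
    return result


# Made-up sample in the assumed layout; replace with the real CSV file.
SAMPLE = '''Country,Population,Land Area (Km2)
Examplia,"1,000,000","50,000"
Nowhere,,"10"
'''
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
densities = population_density(rows)
print(densities)
```

The computed values can then be compared with the dataset's own Density (P/Km2) column as a consistency check.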
Language
png-data.sprep.org
pacificdata.org
pdf
Updated Nov 2, 2022
Cite
PNG Department of Education (2022). Language [Dataset]. https://png-data.sprep.org/dataset/language
Available download formats: pdf (509319)
Dataset updated
Nov 2, 2022
Dataset provided by
PNG Department of Education
License
Public Domain Mark 1.0, https://creativecommons.org/publicdomain/mark/1.0/
License information was derived automatically
Area covered
Papua New Guinea
Description
How can linguistics contribute to our knowledge about human dispersals in the distant past? We will consider the case of New Guinea and surrounding islands, one of the most linguistically diverse areas of the world. This study is a follow-up on the Eurocores OMLL project Pioneers of Island Melanesia, reported in Dunn et al. (2005).

A possible scenario would assume at least two major migration waves (Summerhayes 2007, see above) through Wallacea into Sahul: perhaps the oldest one, ~40,000 BP, following the northern route (Sulawesi, Halmahera, Bird’s Head and further to the east along the north coast), the ancestors of non-TNG, and a second one, ~20,000 BP, through the Lesser Sundas directly onto present-day north Australia and Aru island, with a northward trek into the Highlands, the ancestors of TNG. This scenario would have the TAP and, possibly, the South Papuan families as stay-behind descendants of the TNG precursors.
jampatoisnli
huggingface.co
Updated Jul 21, 2023
Cite
Ruth-Ann Armstrong (2023). jampatoisnli [Dataset]. https://huggingface.co/datasets/Ruth-Ann/jampatoisnli
Dataset updated
Jul 21, 2023
Authors
Ruth-Ann Armstrong
License
https://choosealicense.com/licenses/other/
Description
Dataset Card for [Dataset Name]

Dataset Summary

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the… See the full description on the dataset page: https://huggingface.co/datasets/Ruth-Ann/jampatoisnli.
Mexican Spanish Call Center Data for Realestate AI
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Mexican Spanish Call Center Data for Realestate AI [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/realestate-call-center-conversation-spanish-mexico
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Mexico
Dataset funded by
FutureBeeAI
Description
Introduction
This Mexican Spanish Call Center Speech Dataset for the Real Estate industry is purpose-built to accelerate the development of speech recognition, spoken language understanding, and conversational AI systems tailored for Spanish-speaking Real Estate customers. With over 30 hours of unscripted, real-world audio, this dataset captures authentic conversations between customers and real estate agents, ideal for building robust ASR models.
Curated by FutureBeeAI, this dataset equips voice AI developers, real estate tech platforms, and NLP researchers with the data needed to create high-accuracy, production-ready models for property-focused use cases.
Speech Data
The dataset features 30 hours of dual-channel call center recordings between native Mexican Spanish speakers. Captured in realistic real estate consultation and support contexts, these conversations span a wide array of property-related topics, from inquiries to investment advice, offering deep domain coverage for AI model development.
•Participant Diversity:
•Speakers: 60 native Mexican Spanish speakers from our verified contributor community.
•Regions: Representing different provinces across Mexico to ensure accent and dialect variation.
•Participant Profile: Balanced gender mix (60% male, 40% female) and age range from 18 to 70.
•Recording Details:
•Conversation Nature: Naturally flowing, unscripted agent-customer discussions.
•Call Duration: Average 5–15 minutes per call.
•Audio Format: Stereo WAV, 16-bit, recorded at 8kHz and 16kHz.
•Recording Environment: Captured in noise-free and echo-free conditions.
Topic Diversity
This speech corpus includes both inbound and outbound calls, featuring positive, neutral, and negative outcomes across a wide range of real estate scenarios.
•Inbound Calls:
•Property Inquiries
•Rental Availability
•Renovation Consultation
•Property Features & Amenities
•Investment Property Evaluation
•Ownership History & Legal Info, and more
•Outbound Calls:
•New Listing Notifications
•Post-Purchase Follow-ups
•Property Recommendations
•Value Updates
•Customer Satisfaction Surveys, and others
Such domain-rich variety ensures model generalization across common real estate support conversations.
Transcription
All recordings are accompanied by precise, manually verified transcriptions in JSON format.
•Transcription Includes:
•Speaker-Segmented Dialogues
•Time-coded Segments
•Non-speech Tags (e.g., background noise, pauses)
•High transcription accuracy with word error rate below 5% via dual-layer human review.
These transcriptions streamline ASR and NLP development for Spanish real estate voice applications.
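
The word error rate mentioned in the transcription notes above is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. The sketch below implements that standard definition; it is not the vendor's evaluation procedure, which the listing does not describe, and the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("la casa tiene tres cuartos", "la casa tiene dos cuartos"))
```

A stated threshold of 5% corresponds to a return value below 0.05 when averaged over a test set.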
Metadata
Detailed metadata accompanies each participant and conversation:
•Participant Metadata: ID, age, gender, location, accent, and dialect.
•Conversation Metadata: Topic, call type, sentiment, sample rate, and technical details.
This enables smart filtering, dialect-focused model training, and structured dataset exploration.
Usage and Applications
This dataset is ideal for voice AI and NLP systems built for the real estate sector:
Global Freelancers (Raw) Dataset
kaggle.com
Updated Jul 5, 2025
Cite
Urvish Ahir (2025). Global Freelancers (Raw) Dataset [Dataset]. https://www.kaggle.com/datasets/urvishahir/global-freelancers-raw-dataset
Dataset updated
Jul 5, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Urvish Ahir
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Description :

This dataset contains 1,000 fictional freelancer profiles from around the world, designed to reflect realistic variability and messiness often encountered in real-world data collection.

Each entry includes demographic, professional, and platform-related information such as:

Name, gender, age, and country

Primary skill and years of experience

Hourly rate (with mixed formatting), client rating, and satisfaction score

Language spoken (based on country)

Inconsistent and unclean values across several fields (e.g., gender, is_active, satisfaction)

Key Features :

Gender-based names using Faker’s male/female name generators

Realistic age and experience distribution (with missing and noisy values)

Country-language pairs mapped using actual linguistic data

Messy formatting: mixed data types, missing values, inconsistent casing

Generated entirely in Python using the faker library; no real data used

Use Cases :

Practicing data cleaning and preprocessing

Performing EDA (Exploratory Data Analysis)

Developing data pipelines: raw → clean → model-ready

Teaching feature engineering and handling real-world dirty data

Exercises in data validation, outlier detection, and format standardization

File : global_freelancers_raw.csv

| Column Name | Description |
| --------------------- | ------------------------------------------------------------------------ |
| `freelancer_ID` | Unique ID starting with `FL` (e.g., FL250001) |
| `name` | Full name of freelancer (based on gender) |
| `gender` | Gender (messy values and case inconsistency) |
| `age` | Age of the freelancer (20–60, with occasional nulls/outliers) |
| `country` | Country name (with random formatting/casing) |
| `language` | Language spoken (mapped from country) |
| `primary_skill` | Key freelance domain (e.g., Web Dev, AI, Cybersecurity) |
| `years_of_experience` | Work experience in years (some missing values or odd values included) |
| `hourly_rate (USD)` | Hourly rate with currency symbols or missing data |
| `rating` | Rating between 1.0–5.0 (some zeros and nulls included) |
| `is_active` | Active status (inconsistently represented as strings, numbers, booleans) |
| `client_satisfaction` | Satisfaction percentage (e.g., "85%" or 85, may include NaNs) |
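
The cleaning exercises suggested above could start with small normalisers for the three messiest columns. The sketch below is one possible approach, not the author's reference solution: the helper names are illustrative, and the accepted spellings in `TRUE_VALUES` and `FALSE_VALUES` are guesses, since the table only says the values are inconsistent.

```python
import re

# Assumed spellings of the is_active flag; adjust after inspecting the data.
TRUE_VALUES = {"1", "true", "yes", "y", "active"}
FALSE_VALUES = {"0", "false", "no", "n", "inactive"}


def parse_rate(value):
    """Extract a float from strings such as '$40' or 'USD 40.5'; None if absent."""
    if value is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def parse_active(value):
    """Map mixed string/number/boolean flags to True, False or None."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None


def parse_percent(value):
    """Turn '85%' or 85 into 85.0; None for blanks and NaN-like values."""
    text = str(value).strip().rstrip("%")
    try:
        number = float(text)
    except ValueError:
        return None
    # NaN is the only float that is not equal to itself.
    return None if number != number else number


print(parse_rate("$40"), parse_active("Yes"), parse_percent("85%"))
```

Returning `None` for anything unrecognised keeps unexpected encodings visible instead of silently coercing them.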
Gamayun-kits
huggingface.co
Updated Dec 26, 2023
Cite
CLEAR Global (2023). Gamayun-kits [Dataset]. https://huggingface.co/datasets/CLEAR-Global/Gamayun-kits
Dataset updated
Dec 26, 2023
Dataset authored and provided by
CLEAR Global
Description
Gamayun Language Data Kits

There are more than 7,000 languages in the world, yet only a small proportion of them have language data presence in public. CLEAR Global's Gamayun kits are a starting point for developing audio and text corpora for languages without pre-existing data resources. We create parallel data for a language by translating a pre-compiled set of general-domain sentences in English. If audio data is needed, these translated sentences are recorded by native speakers.… See the full description on the dataset page: https://huggingface.co/datasets/CLEAR-Global/Gamayun-kits.
Mexican Spanish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Mexico
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
•Regions: Representing various provinces of Mexico to ensure dialectal diversity and demographic balance.
•Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation ranges from 15 to 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.
Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
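
Speaker-segmented, time-coded transcriptions of the kind described above lend themselves to simple checks such as talk time per speaker. The listing does not publish the JSON schema, so the sketch below uses an entirely hypothetical layout (a list of segments with `speaker`, `start`, `end`, and `text` keys, times in seconds); the key names would need to be adapted to the real files.

```python
import json
from collections import defaultdict

# Hypothetical transcript layout; the real key names may differ.
SAMPLE = """[
  {"speaker": "S1", "start": 0.0, "end": 2.5, "text": "Hola, buenos dias"},
  {"speaker": "S2", "start": 2.5, "end": 4.0, "text": "Buenos dias"},
  {"speaker": "S1", "start": 4.0, "end": 6.0, "text": "Como estas"}
]"""


def talk_time(segments):
    """Sum segment durations in seconds for each speaker."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment["speaker"]] += segment["end"] - segment["start"]
    return dict(totals)


print(talk_time(json.loads(SAMPLE)))
```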
Metadata
The dataset comes with granular metadata for both speakers and recordings:
• Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
• Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
• ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
• Voice Assistants: Build smart assistants capable of understanding natural Mexican conversations.

XLingHealth
huggingface.co
Updated Feb 7, 2024
Georgia Tech CLAWS Lab (2024). XLingHealth [Dataset]. https://huggingface.co/datasets/claws-lab/XLingHealth
Explore at:
Dataset updated
Feb 7, 2024
Dataset authored and provided by
Georgia Tech CLAWS Lab
License
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
Description
Dataset Card for "XLingHealth"

XLingHealth is a Cross-Lingual Healthcare benchmark for clinical health inquiry that features the top four most spoken languages in the world: English, Spanish, Chinese, and Hindi.

Statistics

| Dataset | #Examples | #Words (Q) | #Words (A) |
|---|---|---|---|
| HealthQA | 1,134 | 7.72 ± 2.41 | 242.85 ± 221.88 |
| LiveQA | 246 | 41.76 ± 37.38 | 115.25 ± 112.75 |
| MedicationQA | 690 | 6.86 ± 2.83 | 61.50 ± 69.44 |

#Words (Q) and #Words (A) represent the average number of words… See the full description on the dataset page: https://huggingface.co/datasets/claws-lab/XLingHealth.
Pashtu Language Digits Dataset (PLDD)
data.mendeley.com
Updated Mar 25, 2022
khalil khan (2022). Pashtu Language Digits Dataset (PLDD) [Dataset]. http://doi.org/10.17632/zbyc7sgp63.2
Explore at:
Unique identifier
https://doi.org/10.17632/zbyc7sgp63.2
Dataset updated
Mar 25, 2022
Authors
khalil khan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Pashtu is a language spoken by more than 50 million people worldwide. It is the national language of Afghanistan and is also spoken in two provinces of Pakistan (Khyber Pakhtunkhwa and Baluchistan). Although optical character recognition (OCR) systems for other languages are well developed, very little work has been reported for Pashtu. As an initial step, we are introducing this dataset for digit recognition.
‘Extinct Languages’ analyzed by Analyst-2
analyst-2.ai
Updated Jan 28, 2022
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2022). ‘Extinct Languages’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-extinct-languages-6686/latest
Explore at:
Dataset updated
Jan 28, 2022
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Analysis of ‘Extinct Languages’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/the-guardian/extinct-languages on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Context

A recent Guardian blog post asks: "How many endangered languages are there in the World and what are the chances they will die out completely?" The United Nations Educational, Scientific and Cultural Organization (UNESCO) regularly publishes a list of endangered languages, using a classification system that describes how close each language is to extinction (or whether it is already extinct).

Content

The full detailed dataset includes names of languages, number of speakers, the names of countries where the language is still spoken, and the degree of endangerment. The UNESCO endangerment classification is as follows:

Vulnerable: most children speak the language, but it may be restricted to certain domains (e.g., home)

Definitely endangered: children no longer learn the language as a 'mother tongue' in the home

Severely endangered: language is spoken by grandparents and older generations; while the parent generation may understand it, they do not speak it to children or among themselves

Critically endangered: the youngest speakers are grandparents and older, and they speak the language partially and infrequently

Extinct: there are no speakers left
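Because the five categories above are ordered by severity, they can be encoded as an ordinal value for sorting or filtering. The sketch below is one way to do that; the numeric values, the class name `Endangerment`, and the helper `parse_endangerment` are assumptions for illustration, and the exact label strings used in the dataset file are not confirmed by this description.

```python
from enum import IntEnum


class Endangerment(IntEnum):
    """UNESCO endangerment levels, ordered from least to most severe.

    The integer values are arbitrary ranks chosen for this sketch.
    """
    VULNERABLE = 1
    DEFINITELY_ENDANGERED = 2
    SEVERELY_ENDANGERED = 3
    CRITICALLY_ENDANGERED = 4
    EXTINCT = 5


def parse_endangerment(label):
    """Map a label such as 'Severely endangered' to its Endangerment member."""
    key = label.strip().upper().replace(" ", "_")
    return Endangerment[key]
```

With this encoding, a comparison such as `parse_endangerment("Extinct") > parse_endangerment("Vulnerable")` holds, which makes it straightforward to select languages at or above a given level.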

Acknowledgements

Data was originally organized and published by The Guardian, and can be accessed via this Datablog post.

Inspiration

How can you best visualize this data?

Which rare languages are more isolated (Sicilian, for example) versus more spread out? Can you come up with a hypothesis for why that is the case?

Can you compare the number of rare speakers with more relatable figures? For example, are there more Romani speakers in the world than there are residents in a small city in the United States?

--- Original source retains full ownership of the source dataset ---
Multilingual Sentiment Datasets Dataset
paperswithcode.com
Updated Jan 2, 2024
(2024). Multilingual Sentiment Datasets Dataset [Dataset]. https://paperswithcode.com/dataset/multilingual-sentiment-datasets
Explore at:
Dataset updated
Jan 2, 2024
Description
A collection of multilingual sentiment datasets grouped into 3 classes -- positive, neutral, and negative.

Most multilingual sentiment datasets are either 2-class positive or negative, 5-class ratings of product reviews (e.g. Amazon multilingual dataset), or multiple classes of emotions. However, to an average person, sometimes positive, negative, and neutral classes suffice and are more straightforward to perceive and annotate. Also, a positive/negative classification is too naive, because most of the text in the world is neutral in sentiment. Furthermore, most multilingual sentiment datasets don't include Asian languages (e.g. Malay, Indonesian) and are dominated by Western languages (e.g. English, German).

For emotions-related datasets, I group the negative (respectively positive) emotions into the negative (respectively positive) class. For rating datasets, I assign 1-star reviews to the negative class, 3-star reviews to the neutral class, and 5-star reviews to the positive class.
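The star-rating rule described above can be sketched as a small lookup. The names `STAR_TO_SENTIMENT` and `label_review` are hypothetical; the description does not say what happens to 2-star and 4-star reviews, so this sketch returns `None` for them rather than guessing.

```python
# Mapping stated in the description: 1 star is negative, 3 is neutral, 5 is positive.
STAR_TO_SENTIMENT = {1: "negative", 3: "neutral", 5: "positive"}


def label_review(stars):
    """Return the 3-class sentiment label for a star rating.

    Returns None for ratings the description does not cover (2 and 4 stars).
    """
    return STAR_TO_SENTIMENT.get(stars)
```

For example, `label_review(5)` gives `"positive"`, while `label_review(4)` gives `None`, so uncovered ratings can be filtered out explicitly.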
German General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). German General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-german-germany
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the German General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of German speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world German communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade German speech models that understand and respond to authentic German accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of German. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
• Participant Diversity:
  • Speakers: 60 verified native German speakers from FutureBeeAI’s contributor community.
  • Regions: Representing various federal states of Germany to ensure dialectal diversity and demographic balance.
  • Demographics: A 60% male, 40% female gender split, with participant ages ranging from 18 to 70 years.
• Recording Details:
  • Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
  • Duration: Each conversation ranges from 15 to 60 minutes.
  • Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
  • Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
• Sample Topics Include:
  • Family & Relationships
  • Food & Recipes
  • Education & Career
  • Healthcare Discussions
  • Social Issues
  • Technology & Gadgets
  • Travel & Local Culture
  • Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
• Transcription Highlights:
  • Speaker-segmented dialogues
  • Time-coded utterances
  • Non-speech elements (pauses, laughter, etc.)
  • High transcription accuracy, achieved through a double QA pass (average WER < 5%)
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
Metadata
The dataset comes with granular metadata for both speakers and recordings:
• Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
• Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple German speech and language AI applications:
• ASR Development: Train accurate speech-to-text systems for German.
• Voice Assistants: Build smart assistants capable of understanding natural German conversations.

Saraiki-Language-Character-Dataset
huggingface.co
data.mendeley.com
Updated Jun 21, 2025
Taha Arif (2025). Saraiki-Language-Character-Dataset [Dataset]. https://huggingface.co/datasets/tahaListens/Saraiki-Language-Character-Dataset
Explore at:
Dataset updated
Jun 21, 2025
Authors
Taha Arif
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This dataset was contributed by Muhammad Ahmad Khan. Citation: Khan, Muhammad Ahmad (2023), “Saraiki Language Character Dataset”, Mendeley Data, V1, doi: 10.17632/tc9zv2wf2k.1, https://data.mendeley.com/datasets/tc9zv2wf2k/1

Over 26 million people speak Saraiki worldwide, with a concentration in South Punjab and a few districts in Sindh. Calligraphers write Saraiki in an extremely complex manner. For the most part, most languages in the world have highly developed optical character recognition systems;… See the full description on the dataset page: https://huggingface.co/datasets/tahaListens/Saraiki-Language-Character-Dataset.
Duolingo Spaced Repetition Data
kaggle.com
Updated Feb 11, 2024
Vinicius Araujo (2024). Duolingo Spaced Repetition Data [Dataset]. https://www.kaggle.com/datasets/aravinii/duolingo-spaced-repetition-data
Explore at:
Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)
Dataset updated
Feb 11, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Vinicius Araujo
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
Duolingo is an American educational technology company that produces learning apps and provides language certification. Its main app is considered the most popular language learning app in the world.

To progress in their learning journey, each user of the application needs to complete a set of lessons in which they are presented with words of the language they want to learn. Across an open-ended series of lessons, each word appears in a different context. On top of that, Duolingo uses a spaced repetition approach, in which the user sees an already-known word again to reinforce their learning.

Each line in this file refers to a Duolingo lesson that had a target word to practice.

The columns are as follows:

p_recall - proportion of exercises from this lesson/practice where the word/lexeme was correctly recalled

timestamp - UNIX timestamp of the current lesson/practice

delta - time (in seconds) since the last lesson/practice that included this word/lexeme

user_id - student user ID who did the lesson/practice (anonymized)

learning_language - language being learned

ui_language - user interface language (presumably native to the student)

lexeme_id - system ID for the lexeme tag (i.e., word)

lexeme_string - lexeme tag (see below)

history_seen - total times user has seen the word/lexeme prior to this lesson/practice

history_correct - total times user has been correct for the word/lexeme prior to this lesson/practice

session_seen - times the user saw the word/lexeme during this lesson/practice

session_correct - times the user got the word/lexeme correct during this lesson/practice
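The column definitions above suggest that `p_recall` is the ratio of `session_correct` to `session_seen` for a given row. The sketch below illustrates that reading along with two related derived values; the helper names and the sample numbers are hypothetical, and this relationship is inferred from the column descriptions rather than stated outright.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def session_recall(session_correct, session_seen):
    """Proportion of exercises in this lesson where the word was recalled."""
    if session_seen <= 0:
        raise ValueError("session_seen must be positive")
    return session_correct / session_seen


def history_accuracy(history_correct, history_seen):
    """Prior accuracy on the word, or None if the user has never seen it."""
    if history_seen == 0:
        return None
    return history_correct / history_seen


def delta_in_days(delta_seconds):
    """Convert the delta column (seconds since last practice) to days."""
    return delta_seconds / SECONDS_PER_DAY
```

For a made-up row with `session_correct=3` and `session_seen=4`, `session_recall` returns 0.75, which is the value one would expect to find in `p_recall` under this reading.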

The lexeme_string column contains a string representation of the "lexeme tag" used by Duolingo for each lesson/practice (data instance) in our experiments. The lexeme_string field uses the following format:

`surface-form/lemma
MMLU Dataset
paperswithcode.com
Updated Feb 15, 2022
Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt (2022). MMLU Dataset [Dataset]. https://paperswithcode.com/dataset/mmlu
Explore at:
Dataset updated
Feb 15, 2022
Authors
Dan Hendrycks; Collin Burns; Steven Basart; Andy Zou; Mantas Mazeika; Dawn Song; Jacob Steinhardt
Description
MMLU (Massive Multitask Language Understanding) is a new benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the benchmark more challenging and more similar to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. It ranges in difficulty from an elementary level to an advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas, such as mathematics and history, to more specialized areas like law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model’s blind spots.


Marisol Brewster (2019). MCB_languages_county [Dataset]. https://www.kaggle.com/mcbrewster/mcb-languages-county/code

MCB_languages_county

This dataset was used to list all languages spoken in the United States: 2009-13

Explore at:

Croissant (Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.)

Dataset updated

Oct 1, 2019

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Marisol Brewster

Description

Context

This is a dataset I found online through the Google Dataset Search portal.

Content

The American Community Survey (ACS) 2009-2013 multi-year data are used to list all languages spoken in the United States that were reported during the sample period. These tables provide detailed counts of many more languages than the 39 languages and language groups that are published annually as a part of the routine ACS data release. This is the second tabulation beyond 39 languages since ACS began.

The tables include all languages that were reported in each geography during the 2009 to 2013 sampling period. For the purpose of tabulation, reported languages are classified in one of 380 possible languages or language groups. Because the data are a sample of the total population, there may be languages spoken that are not reported, either because the ACS did not sample the households where those languages are spoken, or because the person filling out the survey did not report the language or reported another language instead.

The tables also provide information about self-reported English-speaking ability. Respondents who reported speaking a language other than English were asked to indicate their ability to speak English in one of the following categories: "Very well," "Well," "Not well," or "Not at all." The data on ability to speak English represent the person’s own perception about his or her own ability or, because ACS questionnaires are usually completed by one household member, the responses may represent the perception of another household member.

These tables are also available through the Census Bureau's application programming interface (API). Please see the developers page for additional details on how to use the API to access these data.

Acknowledgements

Sources:

Google Dataset Search: https://toolbox.google.com/datasetsearch

2009-2013 American Community Survey

Original dataset: https://www.census.gov/data/tables/2013/demo/2009-2013-lang-tables.html

Downloaded From: https://data.world/kvaughn/languages-county

Banner and thumbnail photo by Farzad Mohsenvand on Unsplash
