88 datasets found

Mexican Spanish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Mexican Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-mexico
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Mexico
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Mexican Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Mexican Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Mexican accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Mexican Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native Mexican Spanish speakers from FutureBeeAI’s contributor community.
•Regions: Speakers come from various states of Mexico, providing dialectal diversity and demographic balance.
•Demographics: Gender split of 60% male and 40% female, with participant ages ranging from 18 to 70 years.
•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation lasts between 15 and 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.
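The recording details above (stereo, 16-bit, 16 kHz WAV) can be checked on delivery with Python's standard `wave` module. The following is a minimal sketch, not part of the vendor's tooling: the helper names and the generated silent test file are illustrative assumptions, and real file names in the dataset are not known from this listing.

```python
import io
import wave

# Format stated in the listing: stereo, 16-bit (2 bytes per sample), 16 kHz.
EXPECTED = {"channels": 2, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(source):
    """Read the WAV header and return its basic format plus duration in seconds."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": wav.getnframes() / rate,
        }


def matches_listing(info):
    """True when the header agrees with the format stated in the listing."""
    return all(info[key] == value for key, value in EXPECTED.items())


def make_silent_wav(seconds=1):
    """Build an in-memory silent WAV (hypothetical stand-in for a real recording)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)
        wav.setframerate(16000)
        # 2 channels * 2 bytes * 16000 frames per second of silence.
        wav.writeframes(b"\x00" * (2 * 2 * 16000 * seconds))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav())
print(info, matches_listing(info))
```

With a real recording, `describe_wav` can be given a file path instead of the in-memory buffer; the reported duration can then be compared with the stated 15 to 60 minute range.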

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through a double QA pass, with an average WER below 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
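For readers unfamiliar with the WER figure quoted above: word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The listing does not say how its figure was measured, so the sketch below only illustrates the standard definition; the function name and the sample sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the processed reference prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("cómo" -> "como") and one deletion ("hoy") over 5 reference words.
print(word_error_rate("buenos días cómo estás hoy", "buenos días como estás"))
```

In practice, text is usually normalized (case, punctuation, non-speech tags) before scoring, and that choice can change the result noticeably.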
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Mexican Spanish.
•Voice Assistants: Build smart assistants capable of understanding natural Mexican conversations.

US Spanish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). US Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-us-spanish
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
United States
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the US Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic US accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native US Spanish speakers from FutureBeeAI’s contributor community.
•Regions: Speakers come from various US states, providing dialectal diversity and demographic balance.
•Demographics: Gender split of 60% male and 40% female, with participant ages ranging from 18 to 70 years.
•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation lasts between 15 and 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through a double QA pass, with an average WER below 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for US Spanish.
•Voice Assistants: Build smart assistants capable of understanding natural US conversations.

Data from: Spanish (Mexico) Dataset
shaip.com
hmn.shaip.com
Updated Feb 20, 2023
Cite
Shaip (2023). Spanish (Mexico) Dataset [Dataset]. https://www.shaip.com/offerings/speech-data-catalog/spanish-mexico-dataset/
Dataset updated
Feb 20, 2023
Dataset authored and provided by
Shaip
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Area covered
Mexico
Description
High-Quality Spanish Mexico TTS Dataset for AI & Speech Models. Title: Spanish (Mexico) Language Dataset. Dataset Type: TTS. Description: Single-utterance recordings, which tend to fall in…
Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...
datarade.ai
Updated Jul 11, 2025
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS | Dictionary Display | Translations | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Available download formats: .json, .xml, .csv, .xls, .txt, .mp3, .wav
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languages (https://lexico.com/es)
Area covered
Colombia, Paraguay, Ecuador, Honduras, Bolivia (Plurinational State of), Chile, Nicaragua, Costa Rica, Cuba, Panama
Description
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Spanish Word List Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Headwords: 73,000

Senses: 123,000

Sentence examples: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, providing a large volume of high-quality translated words.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are well suited to NLP model training and total approximately 20 million words. The sentences cover a broad range of Spanish-speaking countries, and each is tagged with its country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Update frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
Data from: IA Tweets Analysis Dataset (Spanish)
produccioncientifica.uca.es
Updated 2024
Cite
Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés (2024). IA Tweets Analysis Dataset (Spanish) [Dataset]. https://produccioncientifica.uca.es/documentos/67321e53aea56d4af04854c2
Dataset updated
2024
Authors
Guerrero-Contreras, Gabriel; Balderas-Díaz, Sara; Serrano-Fernández, Alejandro; Muñoz, Andrés
Description
Cite as

Guerrero-Contreras, G., Balderas-Díaz, S., Serrano-Fernández, A., & Muñoz, A. (2024, June). Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In 2024 International Conference on Intelligent Environments (IE) (pp. 62-69). IEEE.

General Description

This dataset comprises 4,038 tweets in Spanish related to discussions about artificial intelligence (AI). It was created and used in the publication "Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights" (DOI: 10.1109/IE61493.2024.10599899), presented at the 20th International Conference on Intelligent Environments. It is designed to support research on public perception, sentiment, and engagement with AI topics on social media from a Spanish-speaking perspective. Each entry includes annotations covering sentiment polarity, user engagement metrics, and user profile characteristics.

Data Collection Method

Tweets were gathered through the Twitter API v1.1 by targeting keywords and hashtags associated with artificial intelligence, focusing specifically on content in Spanish. The dataset captures a wide array of discussions, offering a holistic view of the Spanish-speaking public's sentiment towards AI.

Dataset Content

ID: A unique identifier for each tweet.

text: The textual content of the tweet. It is a string with a maximum allowed length of 280 characters.

polarity: The tweet's sentiment polarity (e.g., Positive, Negative, Neutral).

favorite_count: Indicates how many times the tweet has been liked by Twitter users. It is a non-negative integer.

retweet_count: The number of times this tweet has been retweeted. It is a non-negative integer.

user_verified: When true, indicates that the user has a verified account, which helps the public recognize the authenticity of accounts of public interest. It is a boolean data type with two allowed values: True or False.

user_default_profile: When true, indicates that the user has not altered the theme or background of their user profile. It is a boolean data type with two allowed values: True or False.

user_has_extended_profile: When true, indicates that the user has an extended profile. An extended profile on Twitter allows users to provide more detailed information about themselves, such as an extended biography, a header image, details about their location, website, and other additional data. It is a boolean data type with two allowed values: True or False.

user_followers_count: The current number of followers the account has. It is a non-negative integer.

user_friends_count: The number of users that the account is following. It is a non-negative integer.

user_favourites_count: The number of tweets this user has liked since the account was created. It is a non-negative integer.

user_statuses_count: The number of tweets (including retweets) posted by the user. It is a non-negative integer.

user_protected: When true, indicates that this user has chosen to protect their tweets, meaning their tweets are not publicly visible without their permission. It is a boolean data type with two allowed values: True or False.

user_is_translator: When true, indicates that the user posting the tweet is a verified translator on Twitter. This means they have been recognized and validated by the platform as translators of content in different languages. It is a boolean data type with two allowed values: True or False.

Potential Use Cases

This dataset is aimed at academic researchers and practitioners with interests in:

Sentiment analysis and natural language processing (NLP) with a focus on AI discussions in the Spanish language.

Social media analysis on public engagement and perception of artificial intelligence among Spanish speakers.

Exploring correlations between user engagement metrics and sentiment in discussions about AI.

Data Format and File Type

The dataset is provided in CSV format, ensuring compatibility with a wide range of data analysis tools and programming environments.
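As a rough illustration of how the columns described above could be used, the sketch below reads CSV rows with Python's standard `csv` module and averages engagement (likes plus retweets) per polarity label. The column names come from the description; the sample rows, the helper names, and the assumption that booleans are stored as the strings `True`/`False` are illustrative only, and the real file name is not given in this listing.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration; only the column names come from the description.
SAMPLE_CSV = """ID,text,polarity,favorite_count,retweet_count,user_verified
1,La IA ayuda mucho,Positive,10,2,True
2,No confío en la IA,Negative,4,1,False
3,Nueva charla sobre IA,Neutral,0,0,False
4,Me encanta este modelo,Positive,6,4,False
"""


def parse_bool(value):
    """Interpret the 'True'/'False' strings assumed for the boolean columns."""
    return value.strip().lower() == "true"


def mean_engagement_by_polarity(rows):
    """Average favorite_count + retweet_count for each polarity label."""
    totals = defaultdict(lambda: [0, 0])  # polarity -> [engagement sum, tweet count]
    for row in rows:
        engagement = int(row["favorite_count"]) + int(row["retweet_count"])
        bucket = totals[row["polarity"]]
        bucket[0] += engagement
        bucket[1] += 1
    return {polarity: total / count for polarity, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(mean_engagement_by_polarity(rows))
print(sum(parse_bool(row["user_verified"]) for row in rows), "verified account(s)")
```

To run this against the real file, replace the in-memory sample with an opened file handle, for example `open(path, newline="", encoding="utf-8")`, after checking the actual header row and encoding.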

License

The dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, permitting sharing, copying, distribution, transmission, and adaptation of the work for any purpose, including commercial, provided proper attribution is given.
EpaDB
huggingface.co
Updated Jul 11, 2025
Cite
Koel Labs (2025). EpaDB [Dataset]. https://huggingface.co/datasets/KoelLabs/EpaDB
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Koel Labs
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
EpaDB

EpaDB is a speech database of 50 native Spanish speakers (25 male, 25 female) from Argentina speaking English. It contains phonemic annotations using mainly the sounds supported by ARPABet, with a few extensions to model Spanish-influenced dialects of English. It was developed by Jazmin Vidal, Luciana Ferrer, and Leonardo Brambilla at the Speech Lab. Read more on their official GitHub and paper.

This Processed Version

We have processed the dataset into an easily… See the full description on the dataset page: https://huggingface.co/datasets/KoelLabs/EpaDB.
Colombian Spanish General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). Colombian Spanish General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-spanish-colombia
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Colombian Spanish General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of Spanish speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world Colombian Spanish communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade Spanish speech models that understand and respond to authentic Colombian accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of Colombian Spanish. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native Colombian Spanish speakers from FutureBeeAI’s contributor community.
•Regions: Speakers come from various departments of Colombia, providing dialectal diversity and demographic balance.
•Demographics: Gender split of 60% male and 40% female, with participant ages ranging from 18 to 70 years.
•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation lasts between 15 and 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through a double QA pass, with an average WER below 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple Spanish speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for Colombian Spanish.
•Voice Assistants: Build smart assistants capable of understanding natural Colombian conversations.

Dictionary of Spanish Language 22 ed. (2001) - DLE22 (ELEXIS) - Dataset -...
b2find.eudat.eu
Updated Apr 29, 2023
Cite
(2023). Dictionary of Spanish Language 22 ed. (2001) - DLE22 (ELEXIS) - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/9b9efbb0-942c-57b9-b555-e38492b83cb8
Dataset updated
Apr 29, 2023
Description
Diccionario de la lengua española 22 ed. (2001). The Diccionario de la lengua española is the standard dictionary of Spanish (a.k.a. Castilian), edited and produced by the Royal Spanish Academy (RAE). Its first edition dates from 1780, and its latest is the 23rd edition, published in 2014. The online version comprises the 22nd edition plus some of the work done for the 23rd edition. DLE is considered the most authoritative dictionary for the Spanish language. It includes words commonly used in any of the Spanish-speaking countries, as well as numerous archaic and unusual words intended to help readers understand older Spanish literature.
120 Million Word Spanish Corpus
marketplace.sshopencloud.eu
Updated Apr 24, 2020
Cite
(2020). 120 Million Word Spanish Corpus [Dataset]. https://marketplace.sshopencloud.eu/dataset/XTUFXt
Dataset updated
Apr 24, 2020
Description
Spanish is the second most widely-spoken language on Earth; over one in 20 humans alive today is a native speaker of Spanish. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-Language Wikipedia in 2010. This dataset is made up of 57 text files. Each contains multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The initial tag also contains metadata about the article, including the article’s id and the title of the article. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
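The article layout described above (tagged articles with id and title metadata, each ending in "ENDOFARTICLE.") can be split with a small parser. This is only a sketch under stated assumptions: the tag name `doc` and the attribute names `id` and `title` are guesses, since the tag names are not visible in this listing, and a regular expression is used because the files may not be strictly well-formed XML.

```python
import re

# ASSUMPTION: articles look like <doc id="..." title="...">text</doc>.
# The real tag and attribute names are not shown in the listing.
ARTICLE_RE = re.compile(
    r'<doc\s+id="(?P<id>[^"]*)"\s+title="(?P<title>[^"]*)"[^>]*>(?P<body>.*?)</doc>',
    re.DOTALL,
)
END_MARKER = "ENDOFARTICLE."  # marker named in the description


def iter_articles(text):
    """Yield one dict per article, with the end marker stripped from the body."""
    for match in ARTICLE_RE.finditer(text):
        body = match.group("body").strip()
        if body.endswith(END_MARKER):
            body = body[: -len(END_MARKER)].rstrip()
        yield {"id": match.group("id"), "title": match.group("title"), "text": body}


# Tiny made-up sample in the assumed layout.
SAMPLE = """<doc id="1" title="Ejemplo">
Primer artículo de prueba.
ENDOFARTICLE.
</doc>
<doc id="2" title="Otro">
Segundo artículo.
ENDOFARTICLE.
</doc>"""

articles = list(iter_articles(SAMPLE))
for article in articles:
    print(article["id"], article["title"], len(article["text"].split()), "words")
```

Before applying this to the 57 corpus files, it is worth inspecting one file to confirm the actual tag names and the text encoding.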
SALA Spanish Venezuelan Database
catalogue.elra.info
live.european-language-grid.eu
Updated Dec 14, 2009
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2009). SALA Spanish Venezuelan Database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0141/
Dataset updated
Dec 14, 2009
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
Area covered
Venezuela
Description
The SALA Spanish Venezuelan database contains the recordings of 1,000 Venezuelan speakers (504 males, 496 females) recorded over the Venezuelan fixed telephone network. This database is partitioned into 5 CD-ROMs. The speech files are stored as sequences of 8-bit, 8 kHz mu-law speech files and are not compressed, according to the specifications of SALA. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file.
This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA format and content specifications.
Each speaker uttered the following items:
•6 application words
•1 sequence of 10 isolated digits
•4 connected digits (1 sheet number of 6 digits, 1 telephone number of 9/11 digits, 1 credit card number of 14/16 digits, 1 PIN code of 6 digits)
•3 dates (1 spontaneous date, e.g. birthday; 1 word-style prompted date; 1 relative and general date expression)
•1 spotting phrase using an embedded application word
•1 isolated digit
•3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
•1 currency money amount
•1 natural number
•5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 "forename surname" out of a set of 150)
•2 yes/no questions (1 predominantly "yes" question, 1 predominantly "no" question)
•9 phonetically rich sentences
•1 additional sentence
•2 time phrases (1 spontaneous time of day, 1 word-style time phrase)
•4 phonetically rich words
The following age distribution has been obtained: 7 speakers are under 16, 476 speakers are between 16 and 30, 330 speakers are between 31 and 45, 177 speakers are between 46 and 60, and 10 speakers are over 60.
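Because the speech files are described as 8-bit, 8 kHz mu-law, most toolchains need them expanded to 16-bit linear PCM before further processing. The sketch below implements the standard G.711 mu-law expansion in plain Python; it assumes the files are raw, headerless byte streams (the listing does not confirm this), and the function names are illustrative.

```python
SAMPLE_RATE_HZ = 8000  # rate stated in the description
BIAS = 0x84            # bias constant used by G.711 mu-law


def decode_mulaw_byte(code):
    """Expand one 8-bit mu-law code to a signed linear PCM sample."""
    code = ~code & 0xFF               # mu-law bytes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def decode_mulaw(data):
    """Decode a bytes object of mu-law codes into a list of PCM samples."""
    return [decode_mulaw_byte(byte) for byte in data]


# Four hand-picked codes: the two zero codes and the two extreme values.
samples = decode_mulaw(bytes([0xFF, 0x7F, 0x80, 0x00]))
print(samples)
print(len(samples) / SAMPLE_RATE_HZ, "seconds of audio")
```

The decoded values fit in 16 bits, so they can be written out as 16-bit PCM, for example with the standard `wave` module at 8000 Hz and one channel, assuming the recordings are mono telephone audio.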
Census Data - Languages spoken in Chicago, 2008 – 2012
data.cityofchicago.org
healthdata.gov
csv, xlsx, xml
Updated Sep 12, 2014
Cite
U.S. Census Bureau (2014). Census Data - Languages spoken in Chicago, 2008 – 2012 [Dataset]. https://data.cityofchicago.org/Health-Human-Services/Census-Data-Languages-spoken-in-Chicago-2008-2012/a2fk-ec6q
Available download formats: xlsx, xml, csv
Dataset updated
Sep 12, 2014
Dataset provided by
United States Census Bureau (http://census.gov/)
Authors
U.S. Census Bureau
Area covered
Chicago
Description
This dataset contains estimates of the number of residents aged 5 years or older in Chicago who “speak English less than very well,” by the non-English language spoken at home and community area of residence, for the years 2008 – 2012. See the full dataset description for more information at: https://data.cityofchicago.org/api/views/fpup-mc9v/files/dK6ZKRQZJ7XEugvUavf5MNrGNW11AjdWw0vkpj9EGjg?download=true&filename=P:\EPI\OEPHI\MATERIALS\REFERENCES\ECONOMIC_INDICATORS\Dataset_Description_Languages_2012_FOR_PORTAL_ONLY.pdf
Speaker Township, Michigan Non-Hispanic Population Breakdown By Race...
neilsberg.com
csv, json
Updated Feb 21, 2025
Cite
Neilsberg Research (2025). Speaker Township, Michigan Non-Hispanic Population Breakdown By Race Dataset: Non-Hispanic Population Counts and Percentages for 7 Racial Categories as Identified by the US Census Bureau // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/9a0ae989-ef82-11ef-9e71-3860777c1fe6/
Available download formats: csv, json
Dataset updated
Feb 21, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Area covered
Michigan, Speaker Township, Speaker
Variables measured
Non-Hispanic Asian Population, Non-Hispanic Black Population, Non-Hispanic White Population, Non-Hispanic Some other race Population, Non-Hispanic Two or more races Population, Non-Hispanic American Indian and Alaska Native Population, Non-Hispanic Native Hawaiian and Other Pacific Islander Population, Non-Hispanic Asian Population as Percent of Total Non-Hispanic Population, Non-Hispanic Black Population as Percent of Total Non-Hispanic Population, Non-Hispanic White Population as Percent of Total Non-Hispanic Population, and 4 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2017-2021 5-Year Estimates. To measure the two variables, namely (a) Non-Hispanic population and (b) population as a percentage of the total Non-Hispanic population, we initially analyzed and categorized the data for each of the racial categories identified by the US Census Bureau. The population estimates used in this dataset pertain exclusively to the identified racial categories and are part of the Non-Hispanic classification. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the Non-Hispanic population of Speaker township by race. It includes the distribution of the Non-Hispanic population of Speaker township across various race categories as identified by the Census Bureau. The dataset can be utilized to understand the Non-Hispanic population distribution of Speaker township across relevant racial categories.

Key observations

Of the Non-Hispanic population in Speaker township, the largest racial group is White alone with a population of 1,337 (94.89% of the total Non-Hispanic population).

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Racial categories include:

White

Black or African American

American Indian and Alaska Native

Asian

Native Hawaiian and Other Pacific Islander

Some other race

Two or more races (multiracial)

Variables / Data Columns

Race: This column displays the racial categories (for Non-Hispanic) for Speaker township.

Population: The population of the racial category (for Non-Hispanic) in the Speaker township is shown in this column.

% of Total Population: This column displays the percentage distribution of each race as a proportion of Speaker township total Non-Hispanic population. Please note that the sum of all percentages may not equal one due to rounding of values.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is part of the main dataset for Speaker township Population by Race & Ethnicity.
SALA Spanish Mexican Database
live.european-language-grid.eu
catalogue.elra.info
audio format
Cite
SALA Spanish Mexican Database [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/1885
Available download formats: audio format
License
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
http://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
Mexico
Description
The SALA Spanish Mexican Database comprises 1260 Mexican speakers (554 males, 706 females) recorded over the Mexican fixed telephone network. This database is partitioned into 7 CD-ROMs. The speech databases made within the SALA project were validated by SPEX, the Netherlands, to assess their compliance with the SALA format and content specifications.

The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SALA. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file. Each speaker uttered the following items:

* 6 application words;
* 1 sequence of 10 isolated digits;
* 4 connected digits: 1 sheet number (6 digits), 1 telephone number (9-11 digits), 1 credit card number (14-16 digits), 1 PIN code (6 digits);
* 3 dates: 1 spontaneous date (e.g. birthday), 1 prompted date (word style), 1 relative and general date expression;
* 1 spotting phrase using an application word (embedded);
* 1 isolated digit;
* 3 spelled-out words (letter sequences): 1 spelling of surname; 1 spelling of directory assistance city name; 1 real/artificial name for coverage;
* 1 currency money amount;
* 1 natural number;
* 5 directory assistance names: 1 surname (out of 500); 1 city of birth / growing up (spontaneous); 1 most frequent city (out of 500); 1 most frequent company/agency (out of 500); 1 "forename surname" (set of 150);
* 2 questions, including "fuzzy" yes/no: 1 predominantly "yes" question, 1 predominantly "no" question;
* 9 phonetically rich sentences;
* 9 additional spontaneous items;
* 2 time phrases: 1 time of day (spontaneous), 1 time phrase (word style);
* 4 phonetically rich words.

The following age distribution has been obtained: 20 speakers are under 16 years old, 801 speakers are between 16 and 30, 291 speakers are between 31 and 45, 124 speakers are between 46 and 60, and 24 speakers are over 60. A phonetic lexicon with canonical transcriptions in SAMPA is also provided.
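
The 8-bit, 8kHz A-law storage described above can be expanded to 16-bit linear samples with the standard G.711 A-law expansion. The following is a minimal pure-Python sketch, not part of the SALA distribution. The function names are hypothetical, and it assumes each speech file is a headerless stream of A-law bytes; the database documentation should be consulted for the actual file layout.

```python
SAMPLE_RATE_HZ = 8000  # 8kHz, as stated in the database description


def alaw_byte_to_pcm16(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed linear sample."""
    code ^= 0x55                      # undo the alternate-bit inversion used by A-law
    exponent = (code & 0x70) >> 4     # 3-bit segment number
    mantissa = code & 0x0F            # 4-bit step within the segment
    magnitude = (mantissa << 4) + 8   # centre of the quantisation step
    if exponent >= 1:
        magnitude = (magnitude + 0x100) << (exponent - 1)
    # After the inversion, a set top bit marks a positive sample.
    return magnitude if code & 0x80 else -magnitude


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte string (assumed headerless) to linear samples."""
    return [alaw_byte_to_pcm16(b) for b in data]


def duration_seconds(data: bytes) -> float:
    """One byte per sample at 8 kHz, so duration is byte count / 8000."""
    return len(data) / SAMPLE_RATE_HZ


samples = decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A]))
```

A pure-Python decoder is used here so the sketch has no dependencies; for large files a vectorised implementation would be faster.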
2013 American Community Survey - Table Packages: Detailed Language Spoken in...
catalog.data.gov
Updated Jul 19, 2023
Cite
U.S. Census Bureau (2023). 2013 American Community Survey - Table Packages: Detailed Language Spoken in the U.S. [Dataset]. https://catalog.data.gov/dataset/2013-american-community-survey-table-packages-detailed-language-spoken-in-the-u-s
Dataset updated
Jul 19, 2023
Dataset provided by
United States Census Bureau (http://census.gov/)
Area covered
United States
Description
This data set uses the 2009-2013 American Community Survey to tabulate the number of speakers of languages spoken at home and the number of speakers of each language who speak English less than very well. These tabulations are available for the following geographies: nation; each of the 50 states, plus Washington, D.C. and Puerto Rico; counties with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish; core-based statistical areas (metropolitan statistical areas and micropolitan statistical areas) with 100,000 or more total population and 25,000 or more speakers of languages other than English and Spanish.
COVID-19 Tweets : A dataset contaning more than 600k tweets on the novel...
data.niaid.nih.gov
zenodo.org
Updated Jan 23, 2021
Cite
Habiba Drias (2021). COVID-19 Tweets : A dataset contaning more than 600k tweets on the novel CoronaVirus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4024176
Dataset updated
Jan 23, 2021
Dataset provided by
Habiba Drias
Yassine Drias
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 653 996 tweets related to the Coronavirus topic and highlighted by hashtags such as: #COVID-19, #COVID19, #COVID, #Coronavirus, #NCoV and #Corona. The tweets' crawling period started on the 27th of February and ended on the 25th of March 2020, which is spread over four weeks.

The tweets were generated by 390 458 users from 133 different countries and were written in 61 languages. English was the most used language, with almost 400k tweets, followed by Spanish with around 80k tweets.

The data is stored as a CSV file, where each line represents a tweet. The CSV file provides information on the following fields:

Author: the user who posted the tweet

Recipient: contains the name of the user in case of a reply, otherwise it would have the same value as the previous field

Tweet: the full content of the tweet

Hashtags: the list of hashtags present in the tweet

Language: the language of the tweet

Relationship: gives information on the type of the tweet, whether it is a retweet, a reply, a tweet with a mention, etc.

Location: the country of the author of the tweet, which is unfortunately not always available

Date: the publication date of the tweet

Source: the device or platform used to send the tweet

The dataset can also be used to construct a social graph, since it includes the relations "Replies to", "Retweet", "MentionsInRetweet" and "Mentions".
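
As a rough illustration of how the fields above could feed a social graph, the sketch below builds a weighted, directed edge list from Author to Recipient. It rests on several assumptions that the description does not confirm: that the CSV header names match the field labels exactly, that the Relationship column holds the four relation names verbatim, and that plain tweets carry some other value (shown here as "Tweet"). The sample rows are invented for the example.

```python
import csv
import io
from collections import Counter

# Relation names as listed in the dataset description.
RELATIONS = {"Replies to", "Retweet", "MentionsInRetweet", "Mentions"}


def build_edges(csv_file) -> Counter:
    """Count directed (author, recipient, relation) edges in a tweet CSV."""
    edges = Counter()
    for row in csv.DictReader(csv_file):
        author, recipient = row["Author"], row["Recipient"]
        # When a tweet is not a reply, Recipient repeats Author, so
        # self-loops are skipped.
        if row["Relationship"] in RELATIONS and author != recipient:
            edges[(author, recipient, row["Relationship"])] += 1
    return edges


# Invented sample rows; the real header names may differ.
SAMPLE = """Author,Recipient,Tweet,Hashtags,Language,Relationship,Location,Date,Source
alice,bob,Stay safe,#COVID19,en,Replies to,US,2020-03-01,Twitter Web App
carol,carol,Quedate en casa,#Coronavirus,es,Tweet,MX,2020-03-02,Twitter for Android
bob,alice,RT stay safe,#COVID19,en,Retweet,US,2020-03-02,Twitter for iPhone
"""

edges = build_edges(io.StringIO(SAMPLE))
for (source, target, relation), weight in edges.items():
    print(f"{source} -> {target} [{relation}] x{weight}")
```

To run this against the actual file, an open file handle would replace the `io.StringIO` object, after checking the real header row.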
SALA II Spanish from Mexico database
catalogue.elra.info
live.european-language-grid.eu
Updated Aug 28, 2007
Cite
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). SALA II Spanish from Mexico database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0171/
Dataset updated
Aug 28, 2007
Dataset provided by
ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
ELRA (European Language Resources Association)
License
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
Area covered
Mexico
Description
The SALA II Spanish from Mexico database collected in Mexico was recorded within the scope of the SALA II project. The SALA II Spanish from Mexico database contains the recordings of 1,075 Mexican speakers (539 males and 536 females) recorded over the Mexican mobile telephone network.

The following acoustic conditions were selected as representative of a mobile user's environment:

* Passenger in moving car, railway, bus, etc. (155 speakers)
* Public place (279 speakers)
* Stationary pedestrian by road side (223 speakers)
* Home/office environment (364 speakers)
* Passenger in moving car using a hands-free kit (54 speakers)

This database is distributed as 1 DVD-ROM. The speech files are stored as sequences of 8-bit, 8kHz A-law speech files and are not compressed, according to the specifications of SALA II. Each prompt utterance is stored within a separate file and has an accompanying ASCII SAM label file. This speech database was validated by SPEX (the Netherlands) to assess its compliance with the SALA II format and content specifications.

Each speaker uttered the following items:

* 6 application words
* 1 sequence of 10 isolated digits
* 4 connected digits (1 sheet number -6 digits, 1 telephone number -9/11 digits, 1 credit card number -14/16 digits, 1 PIN code -6 digits)
* 3 dates (1 spontaneous date e.g. birthday, 1 word style prompted date, 1 relative and general date expression)
* 2 spotting phrases using an embedded application word
* 2 isolated digits
* 3 spelled words (1 surname, 1 directory assistance city name, 1 real/artificial name for coverage)
* 1 currency money amount
* 1 natural number
* 5 directory assistance names (1 surname out of a set of 500, 1 city of birth/growing up, 1 most frequent city out of a set of 500, 1 most frequent company/agency out of a set of 500, 1 "forename surname" out of a set of 150)
* 2 yes/no questions (1 predominantly "yes" question, 1 predominantly "no" question)
* 9 phonetically rich sentences
* 2 time phrases (1 spontaneous time of day, 1 word style time phrase)
* 4 phonetically rich words

The following age distribution has been obtained: 7 speakers are under 16, 643 speakers are between 16 and 30, 248 speakers are between 31 and 45, 169 speakers are between 46 and 60, and 8 speakers are over 60. A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
COVID-19 Cases and Deaths by Race/Ethnicity - ARCHIVE
catalog.data.gov
data.ct.gov
Updated Aug 12, 2023
Cite
data.ct.gov (2023). COVID-19 Cases and Deaths by Race/Ethnicity - ARCHIVE [Dataset]. https://catalog.data.gov/dataset/covid-19-cases-and-deaths-by-race-ethnicity
Dataset updated
Aug 12, 2023
Dataset provided by
data.ct.gov
Description
Note: DPH is updating and streamlining the COVID-19 cases, deaths, and testing data. As of 6/27/2022, the data will be published in four tables instead of twelve.

The COVID-19 Cases, Deaths, and Tests by Day dataset contains cases and test data by date of sample submission. The death data are by date of death. This dataset is updated daily and contains information back to the beginning of the pandemic. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Cases-Deaths-and-Tests-by-Day/g9vi-2ahj .

The COVID-19 State Metrics dataset contains over 93 columns of data. This dataset is updated daily and currently contains information starting June 21, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-State-Level-Data/qmgw-5kp6 .

The COVID-19 County Metrics dataset contains 25 columns of data. This dataset is updated daily and currently contains information starting June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-County-Level-Data/ujiq-dy22 .

The COVID-19 Town Metrics dataset contains 16 columns of data. This dataset is updated daily and currently contains information starting June 16, 2022 to the present. The data can be found at https://data.ct.gov/Health-and-Human-Services/COVID-19-Town-Level-Data/icxw-cada . To protect confidentiality, if a town has fewer than 5 cases or positive NAAT tests over the past 7 days, those data will be suppressed.

This dataset contains COVID-19 cases and associated deaths that have been reported among Connecticut residents, broken down by race and ethnicity. All data in this report are preliminary; data for previous dates will be updated as new reports are received and data errors are corrected. Deaths reported to either the Office of the Chief Medical Examiner (OCME) or Department of Public Health (DPH) are included in the COVID-19 update.
The following data show the number of COVID-19 cases and associated deaths per 100,000 population by race and ethnicity. Crude rates represent the total cases or deaths per 100,000 people. Age-adjusted rates consider the age of the person at diagnosis or death when estimating the rate and use a standardized population to provide a fair comparison between population groups with different age distributions. Age-adjustment is important in Connecticut as the median age among the non-Hispanic white population is 47 years, whereas it is 34 years among non-Hispanic blacks, and 29 years among Hispanics. Because most non-Hispanic white residents who died were over 75 years of age, the age-adjusted rates are lower than the unadjusted rates. In contrast, Hispanic residents who died tend to be younger than 75 years of age, which results in higher age-adjusted rates.

The population data used to calculate rates is based on the CT DPH population statistics for 2019, which is available online here: https://portal.ct.gov/DPH/Health-Information-Systems--Reporting/Population/Population-Statistics. Prior to 5/10/2021, the population estimates from 2018 were used. Rates are standardized to the 2000 US Millions Standard population (data available here: https://seer.cancer.gov/stdpopulations/). Standardization was done using 19 age groups (0, 1-4, 5-9, 10-14, ..., 80-84, 85 years and older). More information about direct standardization for age adjustment is available here: https://www.cdc.gov/nchs/data/statnt/statnt06rv.pdf

Categories are mutually exclusive. The category “multiracial” includes people who answered ‘yes’ to more than one race category. Counts may not add up to total case counts as data on race and ethnicity may be missing. Age-adjusted rates are calculated only for groups with more than 20 deaths. Abbreviation: NH=Non-Hispanic. Data on Connecticut deaths were obtained from the Connecticut Deaths Registry maintained by the DPH Office of Vital Records.
Cause of death was determined by a death certifier (e.g., physician, APRN, medical
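
The difference between crude and age-adjusted rates described in this entry can be illustrated with a small sketch of direct standardization. All numbers below are invented, and only three age groups are used rather than the 19 groups and the 2000 US standard population mentioned in the description; the function names are hypothetical.

```python
def crude_rate(deaths, population):
    """Total deaths per 100,000 people, ignoring age structure."""
    return 100_000 * sum(deaths) / sum(population)


def age_adjusted_rate(deaths, population, standard_population):
    """Direct standardization: weight each age-specific rate by the
    share of that age group in the standard population."""
    total_standard = sum(standard_population)
    return 100_000 * sum(
        (d / p) * (s / total_standard)
        for d, p, s in zip(deaths, population, standard_population)
    )


# Hypothetical group with an older age structure than the standard.
deaths = [2, 6, 100]                  # young, middle, old
population = [20_000, 30_000, 50_000]
standard = [400, 400, 200]            # invented standard population weights

crude = crude_rate(deaths, population)
adjusted = age_adjusted_rate(deaths, population, standard)
print(f"crude: {crude:.1f}, age-adjusted: {adjusted:.1f}")
```

Because the invented group is older than the standard population, its age-adjusted rate comes out lower than its crude rate, which is the same direction of effect the description reports for the non-Hispanic white population.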
Table_1_Parental Burnout Assessment (PBA) in Different Hispanic Countries:...
frontiersin.figshare.com
docx
Updated Jun 14, 2023
Cite
Denisse Manrique-Millones; Georgy M. Vasin; Sergio Dominguez-Lara; Rosa Millones-Rivalles; Ricardo T. Ricci; Milagros Abregu Rey; María Josefina Escobar; Daniela Oyarce; Pablo Pérez-Díaz; María Pía Santelices; Claudia Pineda-Marín; Javier Tapia; Mariana Artavia; Maday Valdés Pacheco; María Isabel Miranda; Raquel Sánchez Rodríguez; Clara Isabel Morgades-Bamba; Ainize Peña-Sarrionandia; Fernando Salinas-Quiroz; Paola Silva Cabrera; Moïra Mikolajczak; Isabelle Roskam (2023). Table_1_Parental Burnout Assessment (PBA) in Different Hispanic Countries: An Exploratory Structural Equation Modeling Approach.DOCX [Dataset]. http://doi.org/10.3389/fpsyg.2022.827014.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyg.2022.827014.s001
Dataset updated
Jun 14, 2023
Dataset provided by
Frontiers
Authors
Denisse Manrique-Millones; Georgy M. Vasin; Sergio Dominguez-Lara; Rosa Millones-Rivalles; Ricardo T. Ricci; Milagros Abregu Rey; María Josefina Escobar; Daniela Oyarce; Pablo Pérez-Díaz; María Pía Santelices; Claudia Pineda-Marín; Javier Tapia; Mariana Artavia; Maday Valdés Pacheco; María Isabel Miranda; Raquel Sánchez Rodríguez; Clara Isabel Morgades-Bamba; Ainize Peña-Sarrionandia; Fernando Salinas-Quiroz; Paola Silva Cabrera; Moïra Mikolajczak; Isabelle Roskam
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Parental burnout is a unique and context-specific syndrome resulting from a chronic imbalance of risks over resources in the parenting domain. The current research aims to evaluate the psychometric properties of the Spanish version of the Parental Burnout Assessment (PBA) across Spanish-speaking countries with two consecutive studies. In Study 1, we analyzed the data through a bifactor model within an Exploratory Structural Equation Modeling (ESEM) on the pooled sample of participants (N = 1,979) obtaining good fit indices. We then attained measurement invariance across both gender and countries in a set of nested models with gradually increasing parameter constraints. Latent means comparisons across countries showed that among the participants’ countries, Chile had the highest parental burnout score. Likewise, comparisons across gender showed that mothers displayed higher scores than fathers, as in previous studies. Reliability coefficients were high. In Study 2 (N = 1,171), we tested the relations between parental burnout and three specific consequences, i.e., escape and suicidal ideations, parental neglect, and parental violence toward one’s children. The medium to large associations found provided support for the PBA’s predictive validity. Overall, we concluded that the Spanish version of the PBA has good psychometric properties. The results support its relevance for the assessment of parental burnout among Spanish-speaking parents, offering new opportunities for cross-cultural research in the parenting domain.
Percentage of Hispanic
egis-lacounty.hub.arcgis.com
geohub.lacity.org
+2 more
Updated Dec 22, 2023
Cite
County of Los Angeles (2023). Percentage of Hispanic [Dataset]. https://egis-lacounty.hub.arcgis.com/datasets/percentage-of-hispanic
Dataset updated
Dec 22, 2023
Dataset authored and provided by
County of Los Angeles
Description
For the past several censuses, the Census Bureau has invited people to self-respond before following up in person using census takers. The 2010 Census invited people to self-respond predominately by returning paper questionnaires in the mail. The 2020 Census allows people to self-respond in three ways: online, by phone, or by mail. The 2020 Census self-response rates are self-response rates for current census geographies. These rates are the daily and cumulative self-response rates for all housing units that received invitations to self-respond to the 2020 Census. The 2020 Census self-response rates are available for states, counties, census tracts, congressional districts, towns and townships, consolidated cities, incorporated places, tribal areas, and tribal census tracts.

The self-response rate of Los Angeles County is 65.1% for the 2020 Census, which is slightly lower than the California state rate of 69.6%. More information about these data is available in the Self-Response Rates Map Data and Technical Documentation document associated with the 2020 Self-Response Rates Map, or in the FAQs. An animated comparison of the 2010 and 2020 self-response rates is available at the ESRI site SRR Animated Maps, and Census 2020 SRR data can be explored at the ESRI Demographic site Census 2020 SSR Data.

The following demographic characteristics are included in this data and web maps to visualize their relationships with the Census Self-Response Rate (SRR):

1. Population Density
2. Poverty Rate
3. Median Household income
4. Education Attainment
5. English Speaking Ability
6. Household without Internet Access
7. Non-Hispanic White Population
8. Non-Hispanic African-American Population
9. Non-Hispanic Asian Population
10. Hispanic Population
International Spanish Language Academy
publicschoolreview.com
json, xml
Cite
Public School Review, International Spanish Language Academy [Dataset]. https://www.publicschoolreview.com/international-spanish-language-academy-profile
Available download formats: xml, json
Dataset authored and provided by
Public School Review
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Jan 1, 2009 - Dec 31, 2025
Description
Historical Dataset of International Spanish Language Academy is provided by PublicSchoolReview and contains statistics on the following metrics: Total Students Trends Over Years (2009-2023), Total Classroom Teachers Trends Over Years (2009-2023), Distribution of Students By Grade Trends, Student-Teacher Ratio Comparison Over Years (2009-2023), Asian Student Percentage Comparison Over Years (2009-2023), Hispanic Student Percentage Comparison Over Years (2009-2023), Black Student Percentage Comparison Over Years (2009-2023), White Student Percentage Comparison Over Years (2009-2023), Two or More Races Student Percentage Comparison Over Years (2013-2023), Diversity Score Comparison Over Years (2009-2023), Free Lunch Eligibility Comparison Over Years (2009-2023), Reduced-Price Lunch Eligibility Comparison Over Years (2009-2023), Reading and Language Arts Proficiency Comparison Over Years (2011-2022), Math Proficiency Comparison Over Years (2012-2023), Science Proficiency Comparison Over Years (2021-2022), Overall School Rank Trends Over Years (2012-2023)

