25 datasets found

Spanish Spontaneous Dialogue speech dataset
kaggle.com
zip
Updated Jun 7, 2024
Cite
Frank Wong (2024). Spanish Spontaneous Dialogue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/spanish-spontaneous-dialogue-speech-dataset
Available download formats: zip (93236 bytes)
Dataset updated
Jun 7, 2024
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
Spanish(Spain) Spontaneous Dialogue Telephony speech dataset

Description

Spanish (Spain) spontaneous dialogue telephony speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcriptions include text content, speaker ID, gender, age, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (600 native speakers), which is intended to improve model performance in real, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1234?source=Kaggle

Format

8kHz 8bit, a-law/u-law pcm, mono channel

Content category

Dialogue based on given topics

Recording condition

Low background noise (indoor)

Recording device

Telephony

Country

Spain(ESP)

Language(Region) Code

es-ES

Language

Spanish

Speaker

600 people in total, 49% male and 51% female

Features of annotation

Transcription text, timestamp, speaker ID, gender

Accuracy rate

Word accuracy rate(WAR) 98%

Licensing Information

Commercial License
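
The Format field above lists 8 kHz, 8-bit a-law/u-law PCM. As a rough illustration of what that encoding means, the sketch below expands 8-bit mu-law (u-law) bytes into linear PCM samples using the standard G.711 expansion. It assumes the audio is mu-law rather than a-law and that the raw bytes have already been extracted; the listing does not say which variant or container the files use, and the function names are hypothetical.

```python
def ulaw_to_linear(byte: int) -> int:
    """Decode one 8-bit G.711 mu-law code into a signed linear PCM sample."""
    u = ~byte & 0xFF               # mu-law codes are stored bit-inverted
    sign = u & 0x80                # top bit carries the sign
    exponent = (u >> 4) & 0x07     # 3-bit segment number
    mantissa = u & 0x0F            # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(data: bytes) -> list[int]:
    """Decode a buffer of mu-law bytes (assumed, not confirmed by the listing)."""
    return [ulaw_to_linear(b) for b in data]


# Four illustrative bytes: the two zero codes and the two extremes.
samples = decode_ulaw(bytes([0xFF, 0x7F, 0x00, 0x80]))
print(samples)
```

If the files turn out to be a-law, a different expansion table applies, so it is worth checking the file headers or vendor documentation before decoding.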
In-Car Speech Dataset: Spanish (Mexico)
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). In-Car Speech Dataset: Spanish (Mexico) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/in-car-speech-dataset-spanish-mexico
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Mexico
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Mexican Spanish Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
Speech Data
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Participant Diversity:
- Speakers: 50+ native Spanish speakers from the FutureBeeAI Community.
- Regions: Ensures a balanced representation of Mexico's accents, dialects, and demographics.
- Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
Recording Nature: Scripted wake word and command type of audio recordings.
- Duration: Average duration of 5 to 20 seconds per audio recording.
- Formats: WAV format with mono channels, a bit depth of 16 bits. The dataset contains different data at 16kHz and 48kHz.
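
Because the recordings are described as mono, 16-bit WAV files at either 16kHz or 48kHz, it can be useful to check each file's header before feeding it to a model. The sketch below uses Python's standard `wave` module; the helper names are hypothetical, and the silent in-memory file merely stands in for a real recording.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read channel count, bit depth, sample rate, and duration from a WAV header."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def make_silent_wav(sample_rate: int, seconds: float) -> io.BytesIO:
    """Build a mono 16-bit WAV of silence in memory (placeholder for a dataset file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 2 bytes per sample = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    buffer.seek(0)
    return buffer


info = describe_wav(make_silent_wav(16000, 5.0))
print(info)
```

With real data, `describe_wav` can be called with a file path instead of the in-memory buffer, which makes it easy to group files by sample rate before resampling.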
Dataset Diversity
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
Different Cars: Data collection was carried out in different types and models of cars.
Different Types of Voice Commands:
- Navigational Voice Commands
- Mobile Control Voice Commands
- Car Control Voice Commands
- Multimedia & Entertainment Commands
- General, Question Answer, Search Commands
Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
- Morning
- Afternoon
- Evening
Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
- Noise Level: Silent, Low Noise, Moderate Noise, High Noise
- Parking Location: Indoor, Outdoor
- Car Windows: Open, Closed
- Car AC: On, Off
- Car Engine: On, Off
- Car Movement: Stationary, Moving
Metadata
The dataset provides comprehensive metadata for each audio recording and participant:
Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.
Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.
This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Spanish voice assistant speech recognition models.
License
This Mexican Spanish In-car audio dataset is created by FutureBeeAI and is available for commercial use.
Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS |...
datarade.ai
Updated Jul 11, 2025
Cite
Oxford Languages (2025). Spanish Language Datasets | 1.8M+ Sentences | Translation Data | TTS | Dictionary Display | Translations | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/spanish-language-datasets-1-8m-sentences-nlp-tts-dic-oxford-languages
Available download formats: .json, .xml, .csv, .xls, .txt, .mp3, .wav
Dataset updated
Jul 11, 2025
Dataset authored and provided by
Oxford Languages (https://lexico.com/es)
Area covered
Bolivia (Plurinational State of), Chile, Colombia, Nicaragua, Ecuador, Paraguay, Panama, Costa Rica, Cuba, Honduras
Description
Linguistically annotated Spanish language datasets with headwords, definitions, senses, examples, POS tags, semantic metadata, and usage info. Ideal for dictionary tools, NLP, and TTS model training or fine-tuning.

Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Synonyms and Antonyms Data

Audio Data

Spanish Word List Data

Key Features (approximate numbers):

Spanish Monolingual Dictionary Data

Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Words: 73,000

Senses: 123,000

Example sentences: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is annually reviewed and updated by our in-house team of language experts. Offers significant coverage of the language, providing a large volume of translated words of excellent quality.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from the corpus are well suited to NLP model training and total approximately 20 million words. The sentences offer broad coverage of Spanish-speaking countries, and each is tagged with its country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Updated frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.

About the sample:

The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.

If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
Spanish Spontaneous Dialogue Telephony speech
kaggle.com
zip
Updated Jun 11, 2024
Cite
Frank Wong (2024). Spanish Spontaneous Dialogue Telephony speech [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/spanish-spontaneous-dialogue-telephony-speech/code
Available download formats: zip (215338 bytes)
Dataset updated
Jun 11, 2024
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
88-Hours-Mexican-Spanish-Conversational-Speech-Data-by-Telephone

Description

Spanish (Mexico) spontaneous dialogue telephony speech dataset, collected from dialogues based on given topics. Transcriptions include text content, timestamp, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (122 native speakers), which is intended to improve model performance in real, complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1352?source=Kaggle

Format

8kHz 8bit, a-law/u-law pcm, mono channel

Content category

Dialogue based on given topics

Recording condition

Low background noise (indoor)

Recording device

Telephony

Country

Mexico(MEX)

Language(Region) Code

es-MX

Language

Spanish

Speaker

122 people in total, 53% male and 47% female

Features of annotation

Transcription text, timestamp, speaker ID, gender, noise

Accuracy rate

Word accuracy rate(WAR) 98%

Licensing Information

Commercial License
In-Car Speech Dataset: Bulgarian (Bulgaria)
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). In-Car Speech Dataset: Bulgarian (Bulgaria) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/in-car-speech-dataset-bulgarian-bulgaria
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Bulgaria
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the US Spanish Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
Speech Data
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Participant Diversity:
- Speakers: 50+ native Spanish speakers from the FutureBeeAI Community.
- Regions: Ensures a balanced representation of US accents, dialects, and demographics.
- Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
Recording Nature: Scripted wake word and command type of audio recordings.
- Duration: Average duration of 5 to 20 seconds per audio recording.
- Formats: WAV format with mono channels, a bit depth of 16 bits. The dataset contains different data at 16kHz and 48kHz.
Dataset Diversity
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
Different Cars: Data collection was carried out in different types and models of cars.
Different Types of Voice Commands:
- Navigational Voice Commands
- Mobile Control Voice Commands
- Car Control Voice Commands
- Multimedia & Entertainment Commands
- General, Question Answer, Search Commands
Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
- Morning
- Afternoon
- Evening
Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
- Noise Level: Silent, Low Noise, Moderate Noise, High Noise
- Parking Location: Indoor, Outdoor
- Car Windows: Open, Closed
- Car AC: On, Off
- Car Engine: On, Off
- Car Movement: Stationary, Moving
Metadata
The dataset provides comprehensive metadata for each audio recording and participant:
Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.
Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.
This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Spanish voice assistant speech recognition models.
License
This US Spanish In-car audio dataset is created by FutureBeeAI and is available for commercial use.
In-Car Speech Dataset: Spanish (Argentina)
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). In-Car Speech Dataset: Spanish (Argentina) [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/in-car-speech-dataset-spanish-argentina
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
Argentina
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the Argentinian Spanish Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
Speech Data
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Participant Diversity:
- Speakers: 50+ native Spanish speakers from the FutureBeeAI Community.
- Regions: Ensures a balanced representation of Argentina's accents, dialects, and demographics.
- Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
Recording Nature: Scripted wake word and command type of audio recordings.
- Duration: Average duration of 5 to 20 seconds per audio recording.
- Formats: WAV format with mono channels, a bit depth of 16 bits. The dataset contains different data at 16kHz and 48kHz.
Dataset Diversity
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
Different Cars: Data collection was carried out in different types and models of cars.
Different Types of Voice Commands:
- Navigational Voice Commands
- Mobile Control Voice Commands
- Car Control Voice Commands
- Multimedia & Entertainment Commands
- General, Question Answer, Search Commands
Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
- Morning
- Afternoon
- Evening
Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
- Noise Level: Silent, Low Noise, Moderate Noise, High Noise
- Parking Location: Indoor, Outdoor
- Car Windows: Open, Closed
- Car AC: On, Off
- Car Engine: On, Off
- Car Movement: Stationary, Moving
Metadata
The dataset provides comprehensive metadata for each audio recording and participant:
Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.
Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.
This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Spanish voice assistant speech recognition models.
License
This Argentinian Spanish In-car audio dataset is created by FutureBeeAI and is available for commercial use.
Spanish Housing Dataset: Location, Size, Price,
kaggle.com
zip
Updated Nov 26, 2022
Cite
The Devastator (2022). Spanish Housing Dataset: Location, Size, Price, [Dataset]. https://www.kaggle.com/datasets/thedevastator/spanish-housing-dataset-location-size-price-and/code
Available download formats: zip (45386344 bytes)
Dataset updated
Nov 26, 2022
Authors
The Devastator
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Spanish Housing Dataset: Location, Size, Price, and More!

Now with 100% More Fun!

By [source]

About this dataset

Looking for a place to live in Spain? This dataset contains information about houses in various Spanish provinces that will help you with your search! The data includes information about the houses such as location, size, price, amenities, and more. With this dataset, you can study the housing market in Spain, compare prices and styles of houses across different provinces, or learn more about the features of houses in different parts of the country. So whether you're looking for your dream home or just curious about Spanish real estate, this dataset is a great place to start!

How to use the dataset

The Spanish Housing Dataset contains information about houses in various Spanish provinces. The data includes information about the houses such as location, size, price, amenities, and so on. This dataset can be used to study the housing market in Spain, to compare prices and styles of houses in different provinces, or to find out more about the features of houses in different parts of

Research Ideas

To study the housing market in Spain and compare prices and styles of houses in different provinces

To find out more about the features of houses in different parts of the country

To compare prices and styles of houses in different parts of the province

Acknowledgements

If you use this dataset in your research, please credit the original authors.

Data Source

License

License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

Columns

File: addinfo.csv

| Column name | Description |
|:------------|:------------|
| poblacion | The population of the city where the house is located. (Numeric) |
| source | The source of the data. (Categorical) |

File: links.csv

| Column name | Description |
|:------------|:------------|
| link | The URL of the listing. (String) |
| num_link | The listing's unique identifier. (String) |
| obtention_date | The date on which the listing was collected. (String) |

File: rentas_PV.csv

File: rentas_espanya.csv

| Column name | Description |
|:------------|:------------|
| Número de declaraciones | The number of tax declarations. (Numeric) |

File: zones.csv

| Column name | Description |
|:------------|:------------|
| type | The type of the house. (Categorical) |

File: houses_alava.csv

| Column name | Description |
|:------------|:------------|
| obtention_date | The date on which the listing was collected. (String) |
| ad_description | A description of the house. (String) |
| ad_last_update | The date of the last update to the listing. (String) |
| air_conditioner | An indicator of whether or not the house has air conditioning. (Boolean) |
| balcony | An indicator of whether or not the house has a balcony. (Boolean) |
| bath_num | The number of bathrooms in the house. (Integer) |
| built_in_wardrobe | An indicator of whether or not the house has a built-in wardrobe. (Boolean) |
| chimney | An indicator of whether or not the house has a chimney. (Boolean) |
| construct_date | The date the house was constructed. (String) |
| energetic_certif | The energetic certification of the house. (String) |
| **fl...
Table_1_Parental Burnout Assessment (PBA) in Different Hispanic Countries:...
figshare.com
frontiersin.figshare.com
docx
Updated Jun 14, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Denisse Manrique-Millones; Georgy M. Vasin; Sergio Dominguez-Lara; Rosa Millones-Rivalles; Ricardo T. Ricci; Milagros Abregu Rey; María Josefina Escobar; Daniela Oyarce; Pablo Pérez-Díaz; María Pía Santelices; Claudia Pineda-Marín; Javier Tapia; Mariana Artavia; Maday Valdés Pacheco; María Isabel Miranda; Raquel Sánchez Rodríguez; Clara Isabel Morgades-Bamba; Ainize Peña-Sarrionandia; Fernando Salinas-Quiroz; Paola Silva Cabrera; Moïra Mikolajczak; Isabelle Roskam (2023). Table_1_Parental Burnout Assessment (PBA) in Different Hispanic Countries: An Exploratory Structural Equation Modeling Approach.DOCX [Dataset]. http://doi.org/10.3389/fpsyg.2022.827014.s001
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyg.2022.827014.s001
Dataset updated
Jun 14, 2023
Dataset provided by
Frontiers
Authors
Denisse Manrique-Millones; Georgy M. Vasin; Sergio Dominguez-Lara; Rosa Millones-Rivalles; Ricardo T. Ricci; Milagros Abregu Rey; María Josefina Escobar; Daniela Oyarce; Pablo Pérez-Díaz; María Pía Santelices; Claudia Pineda-Marín; Javier Tapia; Mariana Artavia; Maday Valdés Pacheco; María Isabel Miranda; Raquel Sánchez Rodríguez; Clara Isabel Morgades-Bamba; Ainize Peña-Sarrionandia; Fernando Salinas-Quiroz; Paola Silva Cabrera; Moïra Mikolajczak; Isabelle Roskam
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Parental burnout is a unique and context-specific syndrome resulting from a chronic imbalance of risks over resources in the parenting domain. The current research aims to evaluate the psychometric properties of the Spanish version of the Parental Burnout Assessment (PBA) across Spanish-speaking countries with two consecutive studies. In Study 1, we analyzed the data through a bifactor model within an Exploratory Structural Equation Modeling (ESEM) on the pooled sample of participants (N = 1,979) obtaining good fit indices. We then attained measurement invariance across both gender and countries in a set of nested models with gradually increasing parameter constraints. Latent means comparisons across countries showed that among the participants’ countries, Chile had the highest parental burnout score; likewise, comparisons across gender showed that mothers displayed higher scores than fathers, as shown in previous studies. Reliability coefficients were high. In Study 2 (N = 1,171), we tested the relations between parental burnout and three specific consequences, i.e., escape and suicidal ideations, parental neglect, and parental violence toward one’s children. The medium to large associations found provided support for the PBA’s predictive validity. Overall, we concluded that the Spanish version of the PBA has good psychometric properties. The results support its relevance for the assessment of parental burnout among Spanish-speaking parents, offering new opportunities for cross-cultural research in the parenting domain.
16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM)...
data.nexdata.ai
Updated Aug 3, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nexdata (2024). 16kHz Conversational Speech Data | 35,000 Hours | Large Language Model(LLM) Data | Speech AI Datasets |Multilingual Language Data [Dataset]. https://data.nexdata.ai/products/nexdata-multilingual-conversational-speech-data-16khz-mob-nexdata
Dataset updated
Aug 3, 2024
Dataset authored and provided by
Nexdata
Area covered
Brazil, Syrian Arab Republic, Hong Kong, Ukraine, Pakistan, Egypt, Malaysia, Italy, Switzerland, Bulgaria
Description
Nexdata offers 35,000 hours of off-the-shelf multilingual 16kHz conversational speech data, covering 100+ countries and languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, and Russian.
COVID-19 Tweets : A dataset contaning more than 600k tweets on the novel...
data.niaid.nih.gov
Updated Jan 23, 2021
Cite
Yassine Drias; Habiba Drias (2021). COVID-19 Tweets : A dataset contaning more than 600k tweets on the novel CoronaVirus [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4024176
Dataset updated
Jan 23, 2021
Dataset provided by
LRIA - USTHB
LRIA - University of Algiers
Authors
Yassine Drias; Habiba Drias
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This dataset contains 653 996 tweets related to the Coronavirus topic and highlighted by hashtags such as: #COVID-19, #COVID19, #COVID, #Coronavirus, #NCoV and #Corona. The tweets' crawling period started on the 27th of February and ended on the 25th of March 2020, which is spread over four weeks.

The tweets were generated by 390 458 users from 133 different countries and were written in 61 languages. English was the most used language, with almost 400k tweets, followed by Spanish with around 80k tweets.

The data is stored as a CSV file, where each line represents a tweet. The CSV file provides information on the following fields:

Author: the user who posted the tweet

Recipient: contains the name of the user in case of a reply, otherwise it would have the same value as the previous field

Tweet: the full content of the tweet

Hashtags: the list of hashtags present in the tweet

Language: the language of the tweet

Relationship: gives information on the type of the tweet, whether it is a retweet, a reply, a tweet with a mention, etc.

Location: the country of the author of the tweet, which is unfortunately not always available

Date: the publication date of the tweet

Source: the device or platform used to send the tweet

The dataset can also be used to construct a social graph, since it includes the relations "Replies to", "Retweet", "MentionsInRetweet" and "Mentions".
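
As a minimal sketch of that social-graph idea, the snippet below turns rows into directed edges from Author to Recipient, labelled with the Relationship value. The exact CSV header spellings, the sample rows, the user names, and the "Tweet" label for plain tweets are all assumptions made for illustration; only the field names and the four relation labels come from the description above.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration; the real file's header names are assumed.
SAMPLE_CSV = """Author,Recipient,Relationship
alice,bob,Replies to
carol,alice,Retweet
dave,alice,Mentions
erin,erin,Tweet
"""

# The four relation types named in the dataset description.
EDGE_RELATIONS = {"Replies to", "Retweet", "MentionsInRetweet", "Mentions"}


def build_edges(rows):
    """Return (author, recipient, relation) tuples for rows that link two users."""
    edges = []
    for row in rows:
        is_link = row["Relationship"] in EDGE_RELATIONS
        # Per the field list, Recipient equals Author when there is no reply target.
        if is_link and row["Author"] != row["Recipient"]:
            edges.append((row["Author"], row["Recipient"], row["Relationship"]))
    return edges


edges = build_edges(csv.DictReader(io.StringIO(SAMPLE_CSV)))
in_degree = Counter(target for _, target, _ in edges)
print(edges)
print(in_degree.most_common(1))
```

The resulting edge list can be handed to any graph library; counting incoming edges, as done here, is one simple way to spot frequently addressed accounts.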
Global English Speech with Accent Conversational Dataset — Multi-Region...
datarade.ai
.wav
Updated Jul 21, 2025
+ more versions
Cite
FileMarket (2025). Global English Speech with Accent Conversational Dataset — Multi-Region Validated Speech with Gender, Age & Metadata for AI & NLP Training [Dataset]. https://datarade.ai/data-products/global-english-speech-with-accent-conversational-dataset-mu-filemarket
Available download formats: .wav
Dataset updated
Jul 21, 2025
Dataset authored and provided by
FileMarket
Area covered
Tonga, Montenegro, United States Minor Outlying Islands, Nicaragua, Haiti, Iceland, Cook Islands, Comoros, Bangladesh, Yemen
Description
The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.

Regions and Covered Countries with Primary Spoken Languages:

Africa: South Africa (English, Zulu, Afrikaans, Xhosa); Nigeria (English, Yoruba, Igbo, Hausa); Kenya (English, Swahili); Ghana (English, Twi, Ewe, Ga); Uganda (English, Luganda); Ethiopia (English, Amharic, Oromo)

Central & South America: Mexico (Spanish, English as a second language); Guatemala (Spanish, K'iche', English); El Salvador (Spanish, English); Costa Rica (Spanish, English in Caribbean regions); Colombia (Spanish, English in urban centers); Dominican Republic (Spanish, English in tourist zones); Brazil (Portuguese, English in urban areas); Argentina (Spanish, English among educated speakers)

Southeast Asia & South Asia: Philippines (Filipino, English); Vietnam (Vietnamese, English); Malaysia (Malay, English, Mandarin); Indonesia (Indonesian, Javanese, English); Singapore (English, Mandarin, Malay, Tamil); India (Hindi, English, Bengali, Tamil); Pakistan (Urdu, English, Punjabi)

Europe: United Kingdom (English); Ireland (English, Irish); Germany (German, English); France (French, English); Spain (Spanish, Catalan, English); Italy (Italian, English); Portugal (Portuguese, English)

Oceania: Australia (English); New Zealand (English, Māori); Fiji (English, Fijian)

North America: United States (English, Spanish); Canada (English, French)

Dataset Attributes:
- Conversational English with natural accent variation
- Global coverage with balanced male/female speakers
- Rich speaker metadata: age, gender, country, city
- Average audio length of ~30 minutes per participant
- All samples manually validated for accuracy
- Structured format suitable for machine learning and AI applications

Best suited for:
- NLP model training and evaluation
- Multilingual ASR system development
- Voice assistant and chatbot design
- Accent recognition research
- Voice synthesis and TTS modeling

This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.

👨‍👩‍👧 US Country Demographics

kaggle.com

zip

Updated Aug 14, 2023

Cite

mexwell (2023). 👨‍👩‍👧 US Country Demographics [Dataset]. https://www.kaggle.com/datasets/mexwell/us-country-demographics

Explore at:

Available download formats: zip (343499 bytes)

Dataset updated

Aug 14, 2023

Authors

mexwell

License

http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

Area covered

United States

Description

The following data set contains information about counties in the United States from 2010 through 2019, obtained from the United States Census Bureau. The data include age distributions, education levels, employment statistics, ethnicity percentages, household information, income, and other miscellaneous statistics. (Values are denoted as -1 if the data is not available.)

Data Dictionary

<...

Key	List of...	Comment	Example Value
County	String	County name	`"Abbeville County"`
State	String	State name	`"SC"`
Age.Percent 65 and Older	Float	Estimated percentage of population whose ages are equal or greater than 65 years old are produced for the United States states and counties as well as for the Commonwealth of Puerto Rico and its municipios (county-equivalents for Puerto Rico).	`22.4`
Age.Percent Under 18 Years	Float	Estimated percentage of population whose ages are under 18 years old are produced for the United States states and counties as well as for the Commonwealth of Puerto Rico and its municipios (county-equivalents for Puerto Rico).	`19.8`
Age.Percent Under 5 Years	Float	Estimated percentage of population whose ages are under 5 years old are produced for the United States states and counties as well as for the Commonwealth of Puerto Rico and its municipios (county-equivalents for Puerto Rico).	`4.7`
Education.Bachelor's Degree or Higher	Float	Percentage for the people who attended college but did not receive a degree and people who received an associate's, bachelor's, master's, professional, or doctorate degree. These data include only persons 25 years old and over. The percentages are obtained by dividing the counts of graduates by the total number of persons 25 years old and over. The data is collected from 2015 to 2019.	`15.6`
Education.High School or Higher	Float	Percentage of people whose highest degree was a high school diploma or its equivalent, people who attended college but did not receive a degree, and people who received an associate's, bachelor's, master's, professional, or doctorate degree. These data include only persons 25 years old and over. The percentages are obtained by dividing the counts of graduates by the total number of persons 25 years old and over. The data is collected from 2015 to 2019.	`81.7`
Employment.Nonemployer Establishments	Integer	An establishment is a single physical location at which business is conducted or where services or industrial operations are performed. It is not necessarily identical with a company or enterprise which may consist of one establishment or more. The data was collected from 2018.	`1416`
Ethnicities.American Indian and Alaska Native Alone	Float	Estimated percentage of population having origins in any of the original peoples of North and South America (including Central America) and who maintain tribal affiliation or community attachment. This category includes people who indicate their race as "American Indian or Alaska Native" or report entries such as Navajo, Blackfeet, Inupiat, Yup'ik, or Central American Indian groups or South American Indian groups.	`0.3`
Ethnicities.Asian Alone	Float	Estimated percentage of population having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent, including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam. This includes people who reported detailed Asian responses such as "Asian Indian", "Chinese", "Filipino", "Korean", "Japanese", "Vietnamese", and "Other Asian" or provide other detailed Asian responses.	`0.4`
Ethnicities.Black Alone	Float	Estimated percentage of population having origins in any of the Black racial groups of Africa. It includes people who indicate their race as "Black or African American" or report entries such as African American, Kenyan, Nigerian, or Haitian.	`27.6`
Ethnicities.Hispanic or Latino	Float
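
Because the description says unavailable values are stored as -1, a reader loading the file will probably want to convert that sentinel into a real missing value before computing averages. The following is a minimal sketch using only the Python standard library. The column names are taken from the data dictionary above, but the second sample row ("Example County") is invented for illustration, and the real file's layout has not been verified here.

```python
import csv
import io

MISSING = -1  # sentinel stated in the dataset description


def parse_value(raw):
    """Convert a raw CSV cell to float, mapping the -1 sentinel to None."""
    try:
        value = float(raw)
    except ValueError:
        return raw  # non-numeric columns such as County or State
    return None if value == MISSING else value


def load_rows(handle):
    """Read rows from an open CSV file object and clean every cell."""
    reader = csv.DictReader(handle)
    return [{key: parse_value(cell) for key, cell in row.items()} for row in reader]


# Hypothetical sample: the first row reuses example values from the data
# dictionary, the second row is made up to show the sentinel handling.
sample = io.StringIO(
    "County,State,Age.Percent 65 and Older,Ethnicities.Asian Alone\n"
    "Abbeville County,SC,22.4,0.4\n"
    "Example County,XX,-1,1.2\n"
)
rows = load_rows(sample)
```

With the sentinel mapped to None, any later aggregation has to skip missing cells explicitly, which avoids silently averaging -1 into a percentage column.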

LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...
datarade.ai
Updated Jul 22, 2025
Cite
Oxford Languages (2025). LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data | TTS | Dictionary Display | Translation Data | LATAM Coverage [Dataset]. https://datarade.ai/data-products/latam-data-suite-1-8m-sentences-nlp-tts-dictionary-d-oxford-languages
Explore at:
Available download formats: .json, .xml, .csv, .xls, .mp3, .wav
Dataset updated
Jul 22, 2025
Dataset authored and provided by
Oxford Languages (https://lexico.com/es)
Area covered
Panama, Dominican Republic, Bolivia (Plurinational State of), Colombia, Puerto Rico, Mexico, Ecuador, Uruguay, Spain, Peru
Description
LATAM Data Suite provides high-quality datasets in Spanish, Portuguese, and American English. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

Monolingual and Bilingual Dictionary Data: headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

Sentences: curated examples of real-world usage with contextual annotations.

Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.

Audio Data: native speaker recordings for TTS and pronunciation modeling.

Word Lists: frequency-ranked and thematically grouped lists.

Learn more about the datasets included in the data suite:

Portuguese Monolingual Dictionary Data

Portuguese Bilingual Dictionary Data

Spanish Monolingual Dictionary Data

Spanish Bilingual Dictionary Data

Spanish Sentences Data

Spanish Synonyms and Antonyms Data

Spanish Audio Data

Spanish Word List Data

American English Monolingual Dictionary Data

American English Synonyms and Antonyms Data

American English Pronunciations with Audio

Key Features (approximate numbers):

Portuguese Monolingual Dictionary Data

Our Portuguese monolingual dictionary data covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

Words: 143,600

Senses: 285,500

Example sentences: 69,300

Format: XML format

Delivery: Email (link-based file sharing)

Portuguese Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is reviewed and updated annually by our in-house team of language experts and offers comprehensive coverage of the language, with a substantial volume of high-quality translations spanning both European and Latin American Portuguese varieties.

Translations: 300,000

Senses: 158,000

Example translations: 117,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Monolingual Dictionary Data

Our Spanish monolingual dictionary data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

Words: 73,000

Senses: 123,000

Example sentences: 104,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Bilingual Dictionary Data

The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts and offers significant coverage of the language, with a large volume of high-quality translations.

Translations: 221,300

Senses: 103,500

Example sentences: 74,500

Example translations: 83,800

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Update frequency: annually

Spanish Sentences Data

Spanish sentences retrieved from a corpus, totaling approximately 20 million words, are well suited to NLP model training. The sentences cover a wide range of Spanish-speaking countries and are tagged by country or dialect.

Sentences volume: 1,840,000

Format: XML and JSON formats

Delivery: Email (link-based file sharing) and REST API

Spanish Synonyms and Antonyms Data

This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

Synonyms: 127,700

Antonyms: 9,500

Format: XML format

Delivery: Email (link-based file sharing)

Update frequency: annually

Spanish Audio Data (word-level)

Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.

Audio files: 20,900

Format: XLSX (for index), MP3 and WAV (audio files)

Spanish Word List Data

This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

Wordforms: 450,000

Format: CSV and TXT formats

Delivery: Email (link-based file sharing)

American English Monolingual Dictionary Data

Our American English Monolingual Dictionary Data is the foremost au...
Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An...
frontiersin.figshare.com
figshare.com
docx
Updated Jun 6, 2023
Cite
Fátima Salas-Rodríguez; Sonia Lara; Martín Martínez (2023). Data_Sheet_1_Spanish Version of the Teachers’ Sense of Efficacy Scale: An Adaptation and Validation Study.docx [Dataset]. http://doi.org/10.3389/fpsyg.2021.714145.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyg.2021.714145.s001
Dataset updated
Jun 6, 2023
Dataset provided by
Frontiers
Authors
Fátima Salas-Rodríguez; Sonia Lara; Martín Martínez
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Teachers’ Sense of Efficacy Scale (TSES) has been the most widely used instrument to assess teacher efficacy beliefs. However, no study has been carried out concerning the TSES psychometric properties with teachers in Mexico, the country with the highest number of Spanish-speakers worldwide. The purpose of the present study is to examine the reliability, internal and external validity evidence of the TSES (short form) adapted into Spanish with a sample of 190 primary and secondary Mexican teachers from 25 private schools. Results of construct analysis confirm the three-factor-correlated structure of the original scale. Criterion validity evidence was established between self-efficacy and job satisfaction. Differences in self-efficacy were related to teachers’ gender, years of experience and grade level taught. Some limitations are discussed, and future research directions are recommended.
Spanish conversation smart phone
kaggle.com
zip
Updated Jun 13, 2025
Cite
Appen Limited (2025). Spanish conversation smart phone [Dataset]. https://www.kaggle.com/datasets/appenlimited/spanish-conversation-smart-phone/code
Explore at:
Available download formats: zip (285111724 bytes)
Dataset updated
Jun 13, 2025
Authors
Appen Limited
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
For the complete dataset or more information, please email commercialproduct@appen.com

*** THIS IS A SAMPLE DATABASE ONLY. THE INFORMATION CONTAINED IN THE REST OF THIS README APPLIES TO THE FULL DATABASE. ***

This is a Spanish (EU) conversational database, produced by Appen Butler Hill Pty. Ltd. in 2021.

Appen Butler Hill Pty. Ltd. owns copyright of the database.

The database contains transcription and speech data recorded during 207 sessions. Each of the 207 unique speaker pairs recorded conversations totaling about 60 minutes on average.

Each pair of speakers recorded 4 to 12 conversations on different topics, each lasting roughly 5 to 15 minutes. Speakers were provided with a topic for each conversation.

The recordings have been made using the Appen Mobile smartphone app with the phone positioned between two speakers who are in the same room.

The database contains approximately 223.48 hours of audio data in total.

The directory structure is designed to group data from each pair of speakers in a single folder. Each pair of speakers has been identified with a unique ID (Users_ID). This ID is also the name of each session folder. Within each session folder are the conversations made by the pair of speakers. The file naming structure indicates the language and country, date of recording, session ID (which is different from the speaker pair ID), and the conversational topic.

Directory Structure:

/
+-COPYRIGHT.TXT
+-README.TXT
+-DEMOGRAPHICS.CSV
|
+-AUDIO----------+-1027536----------------+-SPAESP_20210123-224609-0006_Topic_Clothing.WAV
|                |                        +-SPAESP_20210123-224609-0007_Topic_Insurance.WAV
|                |                        .
|                |                        .
|                |                        +-SPAESP_20210123-224609-0017_Topic_Social.WAV
|                |
|                +-1089511----------------+-SPAESP_20210220-140351-0001_Topic_Information.WAV
|                |                        +-SPAESP_20210220-140351-0002_Topic_Insurance.WAV
|                |                        .
|                |                        .
|                |                        +-SPAESP_20210220-140351-0013_Topic_Media.WAV
|                .
|                .
|                .
|                +-980733-----------------+-SPAESP_20210313-013101-0002_Topic_Travel.WAV
|                                         +-SPAESP_20210313-013101-0003_Topic_Insurance.WAV
|                                         .
|                                         .
|                                         +-SPAESP_20210313-013101-0011_Topic_Health.WAV
|
+-TRANSCRIPTION--+-1027536----------------+-SPAESP_20210123-224609-0006_Topic_Clothing.TXT
                 |                        +-SPAESP_20210123-224609-0007_Topic_Insurance.TXT
                 |                        .
                 |                        .
                 |                        +-SPAESP_20210123-224609-0017_Topic_Social.TXT
                 |
                 +-1089511----------------+-SPAESP_20210220-140351-0001_Topic_Information.TXT
                 |                        +-SPAESP_20210220-140351-0002_Topic_Insurance.TXT
                 |                        .
                 |                        .
                 |                        +-SPAESP_20210220-140351-0013_Topic_Media.TXT
                 .
                 .
                 .
                 +-980733-----------------+-SPAESP_20210313-013101-0002_Topic_Travel.TXT
                                          +-SPAESP_20210313-013101-0003_Topic_Insurance.TXT
                                          .
                                          .
                                          +-SPAESP_20210313-013101-0011_Topic_Health.TXT
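
The file names in the tree can be split into fields with a regular expression. This is only a sketch: the field boundaries are inferred from the example names, not from the README's own template, so the group names `locale`, `serial`, and `topic` are assumptions. In particular, the digits after the date are kept as one opaque string because the listing does not say how they divide into time and session ID.

```python
import re
from datetime import datetime

# Pattern inferred from example names such as
# SPAESP_20210123-224609-0006_Topic_Clothing.WAV (assumed layout).
NAME_PATTERN = re.compile(
    r"^(?P<locale>[A-Z]{6})_"      # language + country code, e.g. SPAESP
    r"(?P<date>\d{8})-"            # recording date as YYYYMMDD
    r"(?P<serial>\d+-\d+)_"        # remaining digits, left uninterpreted
    r"Topic_(?P<topic>[^.]+)\."    # conversation topic
    r"(?P<ext>WAV|TXT)$"           # audio or transcription file
)


def parse_name(filename):
    """Split a recording or transcription file name into its fields."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    fields = match.groupdict()
    fields["date"] = datetime.strptime(fields["date"], "%Y%m%d").date()
    return fields


info = parse_name("SPAESP_20210123-224609-0006_Topic_Clothing.WAV")
```

Since the audio and transcription files share the same stem, the parsed fields could also serve as a key for pairing each WAV file with its TXT transcript.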

COPYRIGHT.TXT is a copyright document in ASCII format.

README.TXT is this file. It is an ASCII text file that describes the database.

DEMOGRAPHICS.CSV is an Excel file that contains the following fields:
- Users_ID
- Device_Model
- Device_OS
- Participant_1_Gender
- Participant_1_Age
- Participant_1_Dialect
- Participant_2_Gender
- Participant_2_Age
- Participant_2_Dialect
- Topics
- Environment

CONVERSATION TOPICS: The conversations were spontaneous and were on a variety of generic topics (e.g. news, travel, study etc.). The topics can be found in the Demographics file.

Participants were provided with 12 topics to choose from. They needed to pick at least 4 topics and may skip up to 8 topics.

/AUDIO contains all audio data from the 207 sessions. Audio format is WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz
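
The stated audio format (PCM, 16 bit, mono, 48000 Hz) can be checked with Python's standard `wave` module before the files go into a training pipeline. The sketch below is illustrative only: it tests a synthetic one-second silent clip built in memory, and the helper names `describe_wav` and `matches_readme` are hypothetical, not part of the database.

```python
import io
import wave

# Format stated in the README: 16-bit (2-byte) mono PCM at 48000 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "frame_rate": 48000}


def describe_wav(source):
    """Return channel count, sample width and frame rate of a PCM WAV file."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frame_rate": wav.getframerate(),
        }


def matches_readme(source):
    """True if the file header agrees with the format the README states."""
    return describe_wav(source) == EXPECTED


# Synthetic one-second silent clip standing in for a real recording.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(48000)
    out.writeframes(b"\x00\x00" * 48000)
buffer.seek(0)
print(matches_readme(buffer))
```

For real data, `source` would be a path to one of the files under /AUDIO instead of the in-memory buffer.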

/TRANSCRIPTION contains all transcription data for the 207 sessions.

The Audio and Transcription filenames use the following template:
Data from: Does educational level predict hearing aid self-efficacy in...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Dec 19, 2019
Cite
Fuente, Adrian; Fuentes-López, Eduardo; Luna-Monsalve, Manuel; Valdivia, Gonzalo (2019). Does educational level predict hearing aid self-efficacy in experienced older adult hearing aid users from Latin America? Validation process of the Spanish version of the MARS-HA questionnaire [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000131807
Explore at:
Dataset updated
Dec 19, 2019
Authors
Fuente, Adrian; Fuentes-López, Eduardo; Luna-Monsalve, Manuel; Valdivia, Gonzalo
Area covered
Latin America
Description
Hearing aids are the most common rehabilitation strategy for age-related hearing loss. However, 25% to 50% of older adults fitted with hearing aids do not wear them post-fitting. Hearing aid self-efficacy has been suggested as one of the key factors that may explain adherence to hearing aids in older adults. The primary aim of this study was to determine a possible association between educational level and hearing aid self-efficacy in older adult hearing aid users from a Latin American country (i.e., Chile). The secondary aim was to determine if in this sample of older adults, hearing aid self-efficacy predicted hearing aid adherence as previously suggested by other studies. The MARS-HA (Measure of Audiologic Rehabilitation Self-Efficacy for Hearing Aids) questionnaire was used to measure hearing aid self-efficacy. This questionnaire was initially adapted into Spanish (S-MARS-HA) using forward and backward translations by bilingual English-Spanish speakers. A sample of 252 older adults fitted with hearing aids at a public hospital in Santiago, Chile, was investigated. Educational level was measured as the number of years of formal education. Participants responded to the S-MARS-HA along with questions exploring social support, attitudes in using hearing aids, participation in social events, and vision and joint problems. Hearing aid adherence was investigated with the use of a question from the International Outcome Inventory for Hearing Aids. All these procedures were conducted at the participants’ homes. Pure-tone average (PTA; 500–4000 Hz) in the fitted ear was obtained from the participants’ medical records. Univariate and multivariate regression models were constructed to investigate the association between educational level and hearing aid self-efficacy controlling for the covariates of interest (e.g., social support, attitudes in using hearing aids, PTA). The S-MARS-HA showed an adequate construct validity along with a good reliability. 
Results of the multivariate regression analyses showed that educational level significantly predicted hearing aid self-efficacy. Covariates significantly associated with this outcome included attitudes in using hearing aids and PTA in the fitted ear. Finally, a significant association between hearing aid self-efficacy and adherence to hearing aid use was observed. In conclusion, this study showed a significant association between educational level and hearing aid self-efficacy in older adults from a developing Latin American country. Thus, this variable should be considered when designing and delivering aural rehabilitation programs such as hearing aids to older adults, especially those from developing countries.
Data from: DIHANA corpus
ekoizpen-zientifikoa.ehu.eus
Updated 2021
Cite
Benedí, José Miguel; Lleida, Eduardo; Varona, Amparo (2021). DIHANA corpus [Dataset]. https://ekoizpen-zientifikoa.ehu.eus/documentos/668fc45bb9e7c03b01bdaf01
Explore at:
Dataset updated
2021
Authors
Benedí, José Miguel; Lleida, Eduardo; Varona, Amparo
Description
DIHANA is composed of 900 human-computer dialogues in Spanish. The corpus was acquired with an initial prototype using the Wizard of Oz technique: an operator simulates speech recognition and understanding errors, and the answers are synthesized according to a predefined set of templates. The acquisition was restricted only at the semantic level (i.e., the acquired dialogues relate to a specific task domain) and was not restricted at the lexical or syntactic level (spontaneous speech). This semantic control was provided by the definition of scenarios that the user must accomplish and by the wizard strategy, which defines the behavior of the acquisition system.

The DIHANA task consists of retrieving information about Spanish nationwide trains by telephone. Several types of scenarios were defined in order to control the interaction of the user with the system. A scenario is defined by: an objective, the information needed by the user; a situation, the specific circumstances related to the trip request; and the specific requirements of the trip (type of trip, departure city, destination city, and one or more restrictions).

The DIHANA corpus contains 5.5 hours of spontaneous speech corresponding to 6,278 sentences. In total, 225 speakers (153 males and 72 females) recorded 900 dialogues, resulting in 6,278 user turns. Along with the dialogues (speech signals), their full transcripts are provided, together with a phonetic lexicon containing all the words pronounced in the database. A semantic tagging of the corpus and a labeling of the same corpus in terms of dialogue acts are also provided.

A more detailed description of DIHANA can be found in the "doc" subfolder and in the following papers:
- N. Alcacer, J.M. Benedí, F. Blat, R. Granell, D. Martínez-Hinarejos and F. Torres: "Acquisition and Labelling of a Spontaneous Speech Dialogue Corpus". In proceedings of SPECOM, pages 583-586. Patras (Greece), October 2005.
- J.M. Benedí, E. Lleida, A. Varona, M.J. Castro, I. Galiano, R. Justo, I. López, and A. Miguel: "Design and acquisition of a telephone spontaneous speech dialogue corpus in Spanish: DIHANA". In proceedings of LREC, pages 1636-1639, Genova, Italy, May 2006.

Next we describe the contents of each subfolder:

README: This file.

data: The database: 75 speakers at each of 3 sites (Basque Country, Aragon and Valencian Country), each covering 4 scenarios, for a total of 225 speakers (153 males and 72 females) and 900 dialogues. For each dialogue, the following are provided:
- the speech signal of each user turn (.ul)
- the intermediate (.dis) and final (.xml) transcriptions
- the dialogue act annotation:
  + on the transcription (.dia)
  + on the categorised transcription, without words for each category (.cad)
  + on the categorised transcription, with words for each category (.cwd)

semdata: Semantic tagging of the full corpus in the "data" subfolder, and documentation describing the semantic labeling process, in the "doc" folder.

doc: Various documents (PDF) related to the design and acquisition processes, the annotation format and event statistics.

guides: 5 lists of 45 speakers, which define 5 leave-one-out partitions. Each list can in turn be chosen as the test set, with the other four joined to form the training set. Under the folder corresponding to each speaker are the speech signals and transcriptions for four dialogues, so each partition consists of 720 training dialogues and 180 test dialogues.

software: Various self-commented programs and utilities.
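
The partition scheme described for the "guides" folder (five speaker lists, each held out once as the test set) can be sketched as follows. The speaker IDs here are placeholders, since the real lists ship with the corpus, and the function name `leave_one_out_partitions` is a hypothetical helper rather than part of the corpus software.

```python
DIALOGUES_PER_SPEAKER = 4  # four dialogues per speaker, per the description


def leave_one_out_partitions(speaker_lists):
    """Yield (train_speakers, test_speakers), holding out each list once."""
    for held_out, test in enumerate(speaker_lists):
        train = [
            speaker
            for index, group in enumerate(speaker_lists)
            if index != held_out
            for speaker in group
        ]
        yield train, list(test)


# Placeholder IDs: 5 lists of 45 speakers, standing in for the real guide lists.
speaker_lists = [
    [f"spk{fold}_{n:02d}" for n in range(45)] for fold in range(5)
]

partitions = list(leave_one_out_partitions(speaker_lists))
train, test = partitions[0]
train_dialogues = len(train) * DIALOGUES_PER_SPEAKER  # 180 speakers * 4
test_dialogues = len(test) * DIALOGUES_PER_SPEAKER    # 45 speakers * 4
```

Splitting by speaker list rather than by dialogue keeps every speaker entirely in either the training or the test set, which is what produces the 720 and 180 dialogue counts stated above.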
Hispanic People - Liveness Detection Video Dataset
kaggle.com
zip
Updated Apr 19, 2024
+ more versions
Cite
Unique Data (2024). Hispanic People - Liveness Detection Video Dataset [Dataset]. https://www.kaggle.com/datasets/trainingdatapro/hispanic-people-liveness-detection-video-dataset/code
Explore at:
Available download formats: zip (216247226 bytes)
Dataset updated
Apr 19, 2024
Authors
Unique Data
License
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
Description
Biometric Attack Dataset, Hispanic People

A similar dataset that includes all ethnicities: Anti Spoofing Real Dataset

The dataset for face anti-spoofing and face recognition includes images and videos of Hispanic people: 32,600+ photos and videos of 16,300 people from 20 countries. It helps enhance model performance by providing a wider range of data for a specific ethnic group.

The videos were gathered by capturing faces of genuine individuals presenting spoofs, using facial presentations. Our dataset proposes a novel approach that learns and detects spoofing techniques, extracting features from the genuine facial images to prevent the capturing of such information by fake users.

The dataset contains images and videos of real humans with various resolutions, views, and colors, making it a comprehensive resource for researchers working on anti-spoofing technologies.

People in the dataset

Types of files in the dataset:

photo - selfie of the person

video - real video of the person

Our dataset also explores the use of neural architectures, such as deep neural networks, to facilitate the identification of distinguishing patterns and textures in different regions of the face, increasing the accuracy and generalizability of the anti-spoofing models.

👉 Legally sourced datasets and carefully structured for AI training and model development. Explore samples from our dataset of 95,000+ human images & videos - Full dataset

Metadata for the full dataset:

assignment_id - unique identifier of the media file

worker_id - unique identifier of the person

age - age of the person

true_gender - gender of the person

country - country of the person

video_extension - video extensions in the dataset

video_resolution - video resolution in the dataset

video_duration - video duration in the dataset

video_fps - frames per second for video in the dataset

photo_extension - photo extensions in the dataset

photo_resolution - photo resolution in the dataset

Statistics for the dataset

🧩 This is just an example of the data. Leave a request here to learn more

Content

The dataset consists of:
- files: 10 folders, one per person, each including 1 image and 1 video
- .csv file: contains information about the files and people in the dataset

File with the extension .csv

id: id of the person,

selfie_link: link to access the photo,

video_link: link to access the video,

age: age of the person,

country: country of the person,

gender: gender of the person,

video_extension: video extension,

video_resolution: video resolution,

video_duration: video duration,

video_fps: frames per second for video,

photo_extension: photo extension,

photo_resolution: photo resolution

🚀 You can learn more about our high-quality unique datasets here

keywords: liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, ibeta dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset, hispanic people, hispanic classification, hispanic image dataset
Database on ICD-11 MBND course modes for Spanish-speaking clinicians
figshare.com
xlsx
Updated Nov 13, 2024
Cite
Rebeca Robles (2024). Database on ICD-11 MBND course modes for Spanish-speaking clinicians [Dataset]. http://doi.org/10.6084/m9.figshare.27704220.v1
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.27704220.v1
Dataset updated
Nov 13, 2024
Dataset provided by
Figshare (http://figshare.com/)
Authors
Rebeca Robles
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The Clinical Descriptions and Diagnostic Guidelines (CDDG) for the eleventh version of the WHO's International Classification of Diseases (ICD-11), and Mental, Behavioral and Neurodevelopmental Disorders (MBND) constitute a substantial improvement over ICD-10 MBND CDDG. As part of the efforts to implement ICD-11 MBND CDDG in Spanish-speaking countries through continuing education for health professionals, this study was designed to evaluate the usefulness of a comprehensive online training course and its modalities (synchronous and asynchronous) to increase both the knowledge of and readiness to use this novel, evidence-based diagnostic tool.

METHOD: A sample of Spanish-speaking psychiatrists, psychologists and general practitioners completed pre- and/or post-evaluations of one of the two modalities of ICD-11 MBND CDDG (asynchronous or synchronous). Knowledge of the material was evaluated at the end of the course through an ad hoc multiple-choice questionnaire, and readiness to implement ICD-11 MBND CDDG was evaluated before and after the course using an instrument based on the transtheoretical model developed by Prochaska and Diclemente, consisting of a linear scheme with five stages of change: precontemplation, contemplation, preparation, action, and maintenance.

RESULTS: More women than men, younger health professionals and more clinicians from Mexico than any other country participated in the synchronous than in the asynchronous course. Prior to the course, most participants were at the pre-contemplation stage of readiness to implement the ICD-11 MBND CDDG. By the end of the course, participants reported a moderate level of knowledge of the ICD-11 MBND CDDG (with those in the synchronous course reporting higher levels of knowledge than those in the asynchronous one), while the percentage of clinicians at the preparation and action stages was higher than before the courses (with no differences being observed by course modes).

CONCLUSIONS: Online training proved useful for achieving a moderate level of knowledge of the ICD-11 MBND CDDG and a substantial increase in clinicians’ readiness to implement them as part of their regular professional practice. Whichever course mode is preferred and feasible is recommended for interested Spanish-speaking clinicians.

Budget Share of Food for Spanish Households

kaggle.com

Updated Jul 2, 2023

+ more versions

Cite

Utkarsh Singh (2023). Budget Share of Food for Spanish Households [Dataset]. https://www.kaggle.com/datasets/utkarshx27/budget-share-of-food-for-spanish-households

Dataset updated

Jul 2, 2023

Dataset provided by

Kaggle

Authors

Utkarsh Singh

License

CC0 1.0 Universal https://creativecommons.org/publicdomain/zero/1.0/

Description

number of observations : 23972
observation : households
country : Spain

Column	Description
wfood	percentage of total expenditure which the household has spent on food
totexp	total expenditure of the household
age	age of reference person in the household
size	size of the household
town	size of the town where the household is placed categorized into 5 groups: 1 for small towns, 5 for big ones
sex	sex of reference person (man,woman)
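
As an illustration of how the columns above relate, absolute food spending can be derived from `wfood` and `totexp`. The following is a minimal sketch, not part of the dataset documentation: the sample rows are made up for illustration, the helper names `food_spend` and `load_rows` are hypothetical, and whether `wfood` is stored as a fraction (0 to 1) or as a percentage (0 to 100) is an assumption that should be checked against the actual file.

```python
import csv
import io

# Hypothetical rows for illustration only; these are NOT values from the dataset.
SAMPLE_CSV = """wfood,totexp,age,size,town,sex
0.25,1000000,40,3,2,man
0.50,400000,65,1,5,woman
"""


def food_spend(wfood, totexp, share_is_fraction=True):
    """Absolute food spending implied by a budget share and total expenditure.

    share_is_fraction is an assumption about storage: True means wfood is in
    [0, 1]; False means wfood is in [0, 100] and is divided by 100 first.
    """
    share = wfood if share_is_fraction else wfood / 100.0
    return share * totexp


def load_rows(text):
    """Parse CSV text with the documented column names into typed dicts."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "wfood": float(rec["wfood"]),
            "totexp": float(rec["totexp"]),
            "age": int(rec["age"]),
            "size": int(rec["size"]),
            "town": int(rec["town"]),  # 1 = small town ... 5 = big town
            "sex": rec["sex"],
        })
    return rows


rows = load_rows(SAMPLE_CSV)
spend = [food_spend(r["wfood"], r["totexp"]) for r in rows]
```

If the real file stores `wfood` as a percentage, passing `share_is_fraction=False` gives the same result without changing the rest of the code.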

References: Journal of Applied Econometrics data archive, http://qed.econ.queensu.ca/jae/
