Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Spanish (Spain) Spontaneous Dialogue Telephony speech dataset, collected from dialogues based on given topics and covering 20+ domains. Transcribed with text content, speaker ID, gender, age and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (600 native speakers), which helps model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; all our datasets are GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1234?source=Kaggle
8kHz 8bit, a-law/u-law pcm, mono channel
Dialogue based on given topics
Low background noise (indoor)
Telephony
Spain(ESP)
es-ES
Spanish
600 people in total, 49% male and 51% female
Transcription text, timestamp, speaker ID, gender
Word accuracy rate(WAR) 98%
Commercial License
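The 8-bit a-law/u-law encoding listed above is the companded format used on telephone lines (ITU-T G.711). As a minimal sketch, assuming the audio is available as raw, headerless u-law bytes (the actual file layout is not stated in this listing), each byte can be expanded to a 16-bit linear sample as shown below. The function names are illustrative only, and a-law data would need its own decoder.

```python
def ulaw_byte_to_pcm(byte: int) -> int:
    """Expand one G.711 u-law byte to a signed 16-bit linear PCM value."""
    u = ~byte & 0xFF             # u-law bytes are stored bit-inverted
    sign = u & 0x80              # top bit: sign
    exponent = (u >> 4) & 0x07   # next three bits: segment number
    mantissa = u & 0x0F          # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_bytes_to_pcm(data: bytes) -> list[int]:
    """Decode a whole buffer of u-law bytes (assumed raw, mono)."""
    return [ulaw_byte_to_pcm(b) for b in data]


# 0xFF encodes silence; 0x80 and 0x00 are the positive and negative extremes.
print(ulaw_bytes_to_pcm(bytes([0xFF, 0x80, 0x00])))  # [0, 32124, -32124]
```

At 8 kHz, one second of mono audio is 8,000 such bytes, so the decoded list has 8,000 samples per second of speech.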
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Mexican Spanish Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Participant Diversity:
- Speakers: 50+ native Spanish speakers from the FutureBeeAI Community.
- Regions: Ensures a balanced representation of Mexican accents, dialects, and demographics.
- Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
Recording Nature: Scripted wake word and command type of audio recordings.
- Duration: Average duration of 5 to 20 seconds per audio recording.
- Formats: WAV format, mono channel, 16-bit depth. The dataset contains recordings at both 16 kHz and 48 kHz sample rates.
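Because the recordings come at two sample rates, it can help to read each file's header before feeding audio to a model. The sketch below uses Python's standard `wave` module. `describe_wav` and `make_silent_wav` are hypothetical helper names, and the in-memory silent clip only stands in for a real file from the dataset.

```python
import io
import wave


def describe_wav(source) -> dict:
    """Read basic format facts from a WAV file path or file-like object."""
    with wave.open(source, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,
            "sample_rate": rate,
            "duration_s": frames / rate,
        }


def make_silent_wav(rate: int, seconds: float) -> io.BytesIO:
    """Build a mono 16-bit silent clip in memory (placeholder for real data)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)      # 2 bytes per sample = 16 bits
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buf.seek(0)
    return buf


info = describe_wav(make_silent_wav(16000, 0.5))
print(info)
```

Grouping files by the reported `sample_rate` makes it straightforward to resample the 48 kHz subset, or to keep the two subsets apart.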
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
Different Cars: Data collection was carried out in different types and models of cars.
Different Types of Voice Commands:
- Navigational Voice Commands
- Mobile Control Voice Commands
- Car Control Voice Commands
- Multimedia & Entertainment Commands
- General, Question Answer, Search Commands
Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
- Morning
- Afternoon
- Evening
Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
- Noise Level: Silent, Low Noise, Moderate Noise, High Noise
- Parking Location: Indoor, Outdoor
- Car Windows: Open, Closed
- Car AC: On, Off
- Car Engine: On, Off
- Car Movement: Stationary, Moving
The dataset provides comprehensive metadata for each audio recording and participant:
Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.
Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.
This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Spanish voice assistant speech recognition models.
This Mexican Spanish In-car audio dataset is created by FutureBeeAI and is available for commercial use.
Linguistically annotated Spanish language datasets with headwords, definitions, senses, examples, POS tags, semantic metadata, and usage info. Ideal for dictionary tools, NLP, and TTS model training or fine-tuning.
Our Spanish language datasets are carefully compiled and annotated by language and linguistic experts; you can find them available for licensing:
Key Features (approximate numbers):
Our Spanish monolingual data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.
The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, providing a large volume of translated words of excellent quality.
Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide broad coverage of Spanish-speaking countries and are tagged accordingly with a particular country or dialect.
This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.
Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.
This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.
Use Cases:
We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation, word embedding, and word sense disambiguation (WSD).
If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Oxford.Languages@oup.com to start the conversation.
Pricing:
Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.
Contact our team or email us at Oxford.Languages@oup.com to explore pricing options and discover how our language data can support your goals.
About the sample:
The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.
If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Spanish (Mexico) Spontaneous Dialogue Telephony speech dataset, collected from dialogues based on given topics. Transcribed with text content, timestamp, speaker ID, gender and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (122 native speakers), which helps model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, protecting user privacy and legal rights throughout data collection, storage, and usage; all our datasets are GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1352?source=Kaggle
8kHz 8bit, a-law/u-law pcm, mono channel
Dialogue based on given topics
Low background noise (indoor)
Telephony
Mexico(MEX)
es-MX
Spanish
122 people in total, 53% male and 47% female
Transcription text, timestamp, speaker ID, gender, noise
Word accuracy rate(WAR) 98%
Commercial License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the US Spanish Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Participant Diversity:
- Speakers: 50+ native Spanish speakers from the FutureBeeAI Community.
- Regions: Ensures a balanced representation of US accents, dialects, and demographics.
- Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
Recording Nature: Scripted wake word and command type of audio recordings.
- Duration: Average duration of 5 to 20 seconds per audio recording.
- Formats: WAV format, mono channel, 16-bit depth. The dataset contains recordings at both 16 kHz and 48 kHz sample rates.
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
Different Cars: Data collection was carried out in different types and models of cars.
Different Types of Voice Commands:
- Navigational Voice Commands
- Mobile Control Voice Commands
- Car Control Voice Commands
- Multimedia & Entertainment Commands
- General, Question Answer, Search Commands
Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
- Morning
- Afternoon
- Evening
Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
- Noise Level: Silent, Low Noise, Moderate Noise, High Noise
- Parking Location: Indoor, Outdoor
- Car Windows: Open, Closed
- Car AC: On, Off
- Car Engine: On, Off
- Car Movement: Stationary, Moving
The dataset provides comprehensive metadata for each audio recording and participant:
Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.
Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.
This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Spanish voice assistant speech recognition models.
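As a rough illustration of how such metadata might be used, the sketch below filters recordings by environment variables to build a test subset, for example moving cars with open windows. The `RecordingMeta` class, its field names, and the sample file names are all hypothetical; the actual metadata schema shipped with the dataset may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingMeta:
    # Hypothetical field names modelled on the environment variables listed above.
    file_name: str
    noise_level: str
    car_windows: str
    car_engine: str
    car_movement: str
    sample_rate_hz: int


def select(records, **conditions):
    """Return the records whose fields equal every given condition."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in conditions.items())
    ]


records = [
    RecordingMeta("rec_001.wav", "Low Noise", "Closed", "On", "Stationary", 16000),
    RecordingMeta("rec_002.wav", "High Noise", "Open", "On", "Moving", 48000),
    RecordingMeta("rec_003.wav", "Moderate Noise", "Open", "On", "Moving", 16000),
]

hard_cases = select(records, car_movement="Moving", car_windows="Open")
print([r.file_name for r in hard_cases])  # ['rec_002.wav', 'rec_003.wav']
```

Evaluating a recognizer separately on such slices shows whether accuracy drops under specific noise conditions rather than only on average.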
This US Spanish In-car audio dataset is created by FutureBeeAI and is available for commercial use.
https://www.futurebeeai.com/policies/ai-data-license-agreement
Welcome to the Argentinian Spanish Language In-car Speech Dataset, a comprehensive collection of audio recordings designed to facilitate the development of speech recognition models specifically tailored for in-car environments. This dataset aims to support research and innovation in automotive speech technology, enabling seamless and robust voice interactions within vehicles for drivers and co-passengers.
This dataset comprises over 5,000 high-quality audio recordings collected from various in-car environments. These recordings include scripted wake words and command-type prompts.
Participant Diversity:
- Speakers: 50+ native Spanish speakers from the FutureBeeAI Community.
- Regions: Ensures a balanced representation of Argentinian accents, dialects, and demographics.
- Participant Profile: Participants range from 18 to 70 years old, representing both males and females in a 60:40 ratio, respectively.
Recording Nature: Scripted wake word and command type of audio recordings.
- Duration: Average duration of 5 to 20 seconds per audio recording.
- Formats: WAV format, mono channel, 16-bit depth. The dataset contains recordings at both 16 kHz and 48 kHz sample rates.
Apart from participant diversity, the dataset is diverse in terms of different wake words, voice commands, and recording environments.
Different Automobile Related Wake Words: Hey Mercedes, Hey BMW, Hey Porsche, Hey Volvo, Hey Audi, Hi Genesis, Hey Mini, Hey Toyota, Ok Ford, Hey Hyundai, Ok Honda, Hello Kia, Hey Dodge.
Different Cars: Data collection was carried out in different types and models of cars.
Different Types of Voice Commands:
- Navigational Voice Commands
- Mobile Control Voice Commands
- Car Control Voice Commands
- Multimedia & Entertainment Commands
- General, Question Answer, Search Commands
Recording Time: Participants recorded the given prompts at various times to make the dataset more diverse.
- Morning
- Afternoon
- Evening
Recording Environment: Various recording environments were captured to acquire more realistic data and to make the dataset inclusive of various types of noises. Some of the environment variables are as follows:
- Noise Level: Silent, Low Noise, Moderate Noise, High Noise
- Parking Location: Indoor, Outdoor
- Car Windows: Open, Closed
- Car AC: On, Off
- Car Engine: On, Off
- Car Movement: Stationary, Moving
The dataset provides comprehensive metadata for each audio recording and participant:
Participant Metadata: Unique identifier, age, gender, country, state, district, accent, and dialect.
Other Metadata: Recording transcript, recording environment, device details, sample rate, bit depth, file format, recording time.
This metadata is a powerful tool for understanding and characterizing the data, enabling informed decision-making in the development of Spanish voice assistant speech recognition models.
This Argentinian Spanish In-car audio dataset was created by FutureBeeAI and is available for commercial use.
https://creativecommons.org/publicdomain/zero/1.0/
Looking for a place to live in Spain? This dataset contains information about houses in various Spanish provinces that will help you with your search! The data includes information about the houses such as location, size, price, amenities, and more. With this dataset, you can study the housing market in Spain, compare prices and styles of houses across different provinces, or learn more about the features of houses in different parts of the country. So whether you're looking for your dream home or just curious about Spanish real estate, this dataset is a great place to start!
The Spanish Housing Dataset contains information about houses in various Spanish provinces. The data includes information about the houses such as location, size, price, and amenities. This dataset can be used to study the housing market in Spain, to compare prices and styles of houses in different provinces, or to find out more about the features of houses in different parts of the country:
- To study the housing market in Spain and compare prices and styles of houses in different provinces
- To find out more about the features of houses in different parts of the country
- To compare prices and styles of houses in different parts of the province
If you use this dataset in your research, please credit the original authors.
License
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: addinfo.csv

| Column name | Description |
|:---|:---|
| poblacion | The population of the city where the house is located. (Numeric) |
| source | The source of the data. (Categorical) |

File: links.csv

| Column name | Description |
|:---|:---|
| link | The URL of the listing. (String) |
| num_link | The listing's unique identifier. (String) |
| obtention_date | The date on which the listing was collected. (String) |

File: rentas_PV.csv

File: rentas_espanya.csv

| Column name | Description |
|:---|:---|
| Número de declaraciones | The number of tax declarations. (Numeric) |

File: zones.csv

| Column name | Description |
|:---|:---|
| type | The type of the house. (Categorical) |

File: houses_alava.csv

| Column name | Description |
|:---|:---|
| obtention_date | The date on which the listing was collected. (String) |
| ad_description | A description of the house. (String) |
| ad_last_update | The date of the last update to the listing. (String) |
| air_conditioner | An indicator of whether or not the house has air conditioning. (Boolean) |
| balcony | An indicator of whether or not the house has a balcony. (Boolean) |
| bath_num | The number of bathrooms in the house. (Integer) |
| built_in_wardrobe | An indicator of whether or not the house has a built-in wardrobe. (Boolean) |
| chimney | An indicator of whether or not the house has a chimney. (Boolean) |
| construct_date | The date the house was constructed. (String) |
| energetic_certif | The energetic certification of the house. (String) |
| **fl...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Parental burnout is a unique and context-specific syndrome resulting from a chronic imbalance of risks over resources in the parenting domain. The current research aims to evaluate the psychometric properties of the Spanish version of the Parental Burnout Assessment (PBA) across Spanish-speaking countries with two consecutive studies. In Study 1, we analyzed the data through a bifactor model within an Exploratory Structural Equation Modeling (ESEM) framework on the pooled sample of participants (N = 1,979), obtaining good fit indices. We then attained measurement invariance across both gender and countries in a set of nested models with gradually increasing parameter constraints. Latent means comparisons across countries showed that, among the participants’ countries, Chile had the highest parental burnout score. Likewise, comparisons across gender showed that mothers displayed higher scores than fathers, as in previous studies. Reliability coefficients were high. In Study 2 (N = 1,171), we tested the relations between parental burnout and three specific consequences, i.e., escape and suicidal ideations, parental neglect, and parental violence toward one’s children. The medium to large associations found provided support for the PBA’s predictive validity. Overall, we concluded that the Spanish version of the PBA has good psychometric properties. The results support its relevance for the assessment of parental burnout among Spanish-speaking parents, offering new opportunities for cross-cultural research in the parenting domain.
Nexdata offers 35,000 hours of off-the-shelf multilingual 16 kHz conversational speech data, covering 100+ countries and languages including English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Hindi, and Russian, among others.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains 653 996 tweets related to the Coronavirus topic and highlighted by hashtags such as: #COVID-19, #COVID19, #COVID, #Coronavirus, #NCoV and #Corona. The tweets' crawling period started on the 27th of February and ended on the 25th of March 2020, which is spread over four weeks.
The tweets were generated by 390 458 users from 133 different countries and were written in 61 languages. English was the most used language with almost 400k tweets, followed by Spanish with around 80k tweets.
The data is stored as a CSV file, where each line represents a tweet. The CSV file provides information on the following fields:
Author: the user who posted the tweet
Recipient: contains the name of the user in case of a reply, otherwise it would have the same value as the previous field
Tweet: the full content of the tweet
Hashtags: the list of hashtags present in the tweet
Language: the language of the tweet
Relationship: gives information on the type of the tweet, whether it is a retweet, a reply, a tweet with a mention, etc.
Location: the country of the author of the tweet, which is unfortunately not always available
Date: the publication date of the tweet
Source: the device or platform used to send the tweet
The dataset can also be used to construct a social graph, since it includes the relations "Replies to", "Retweet", "MentionsInRetweet" and "Mentions".
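A minimal sketch of that graph construction follows, using only the standard library. It assumes the CSV header uses exactly the field names listed above (`Author`, `Recipient`, `Relationship`); the sample rows, user names, and the `Tweet` relationship value for plain tweets are made up for illustration.

```python
import csv
import io
from collections import Counter

# The four relations named in the description; other values are ignored.
EDGE_TYPES = {"Replies to", "Retweet", "MentionsInRetweet", "Mentions"}


def build_edges(csv_file) -> Counter:
    """Count directed (author, recipient, relation) edges in the tweet CSV."""
    edges = Counter()
    for row in csv.DictReader(csv_file):
        is_relation = row["Relationship"] in EDGE_TYPES
        # Plain tweets repeat the author as recipient, so skip self-loops.
        if is_relation and row["Author"] != row["Recipient"]:
            edges[(row["Author"], row["Recipient"], row["Relationship"])] += 1
    return edges


# Tiny in-memory stand-in for the real file.
sample = io.StringIO(
    "Author,Recipient,Relationship\n"
    "alice,bob,Replies to\n"
    "carol,alice,Retweet\n"
    "alice,alice,Tweet\n"
    "alice,bob,Replies to\n"
)
edges = build_edges(sample)
for (author, recipient, relation), weight in sorted(edges.items()):
    print(author, "->", recipient, relation, weight)
```

The edge counts can serve as weights if the result is later loaded into a graph library.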
The Global English Accent Conversational NLP Dataset is a comprehensive collection of validated English speech recordings sourced from native and non-native English speakers across key global regions. This dataset is designed for training Natural Language Processing models, conversational AI, Automatic Speech Recognition (ASR), and linguistic research, with a focus on regional accent variation.
Regions and Covered Countries with Primary Spoken Languages:
Africa: South Africa (English, Zulu, Afrikaans, Xhosa); Nigeria (English, Yoruba, Igbo, Hausa); Kenya (English, Swahili); Ghana (English, Twi, Ewe, Ga); Uganda (English, Luganda); Ethiopia (English, Amharic, Oromo)
Central & South America: Mexico (Spanish, English as a second language); Guatemala (Spanish, K'iche', English); El Salvador (Spanish, English); Costa Rica (Spanish, English in Caribbean regions); Colombia (Spanish, English in urban centers); Dominican Republic (Spanish, English in tourist zones); Brazil (Portuguese, English in urban areas); Argentina (Spanish, English among educated speakers)
Southeast Asia & South Asia: Philippines (Filipino, English); Vietnam (Vietnamese, English); Malaysia (Malay, English, Mandarin); Indonesia (Indonesian, Javanese, English); Singapore (English, Mandarin, Malay, Tamil); India (Hindi, English, Bengali, Tamil); Pakistan (Urdu, English, Punjabi)
Europe: United Kingdom (English); Ireland (English, Irish); Germany (German, English); France (French, English); Spain (Spanish, Catalan, English); Italy (Italian, English); Portugal (Portuguese, English)
Oceania: Australia (English); New Zealand (English, Māori); Fiji (English, Fijian)
North America: United States (English, Spanish); Canada (English, French)
Dataset Attributes:
- Conversational English with natural accent variation
- Global coverage with balanced male/female speakers
- Rich speaker metadata: age, gender, country, city
- Average audio length of ~30 minutes per participant
- All samples manually validated for accuracy
- Structured format suitable for machine learning and AI applications
Best suited for:
- NLP model training and evaluation
- Multilingual ASR system development
- Voice assistant and chatbot design
- Accent recognition research
- Voice synthesis and TTS modeling
This dataset ensures global linguistic diversity and delivers high-quality audio for AI developers, researchers, and enterprises working on voice-based applications.
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
The following data set contains information about counties in the United States from 2010 through 2019, obtained from the United States Census Bureau. It covers age distributions, education levels, employment statistics, ethnicity percentages, household information, income, and other miscellaneous statistics. Values are denoted as -1 if the data is not available.
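Because -1 marks missing data, it should be converted to a proper missing value before any averaging. The sketch below is one way to do that with the standard library, assuming the data is delivered as a CSV whose header matches the keys in the table that follows; the second and third sample rows are invented, and `load_rows` and `mean_available` are hypothetical helper names.

```python
import csv
import io

TEXT_COLUMNS = {"County", "State"}


def load_rows(csv_file):
    """Parse rows, turning the -1 sentinel in numeric columns into None."""
    rows = []
    for row in csv.DictReader(csv_file):
        cleaned = {}
        for key, value in row.items():
            if key in TEXT_COLUMNS:
                cleaned[key] = value
            else:
                number = float(value)
                cleaned[key] = None if number == -1 else number
        rows.append(cleaned)
    return rows


def mean_available(rows, column):
    """Average a column over the rows where a value is actually present."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(values) / len(values) if values else None


sample = io.StringIO(
    "County,State,Age.Percent 65 and Older\n"
    "Abbeville County,SC,22.4\n"
    "Example County A,SC,-1\n"
    "Example County B,SC,17.6\n"
)
rows = load_rows(sample)
print(mean_available(rows, "Age.Percent 65 and Older"))
```

Leaving the sentinel in place would pull the average down sharply, since -1 would be counted as a real percentage.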
| Key | List of... | Comment | Example Value |
|---|---|---|---|
| County | String | County name | "Abbeville County" |
| State | String | State name | "SC" |
| Age.Percent 65 and Older | Float | Estimated percentage of the population aged 65 years or older. Estimates are produced for the United States, states, and counties as well as for the Commonwealth of Puerto Rico and its municipios (county-equivalents for Puerto Rico). | 22.4 |
| Age.Percent Under 18 Years | Float | Estimated percentage of the population under 18 years old. Estimates are produced for the United States, states, and counties as well as for the Commonwealth of Puerto Rico and its municipios (county-equivalents for Puerto Rico). | 19.8 |
| Age.Percent Under 5 Years | Float | Estimated percentage of the population under 5 years old. Estimates are produced for the United States, states, and counties as well as for the Commonwealth of Puerto Rico and its municipios (county-equivalents for Puerto Rico). | 4.7 |
| Education.Bachelor's Degree or Higher | Float | Percentage of people who attended college but did not receive a degree and people who received an associate's, bachelor's, master's, professional, or doctorate degree. These data include only persons 25 years old and over. The percentages are obtained by dividing the counts of graduates by the total number of persons 25 years old and over. The data was collected from 2015 to 2019. | 15.6 |
| Education.High School or Higher | Float | Percentage of people whose highest degree was a high school diploma or its equivalent, people who attended college but did not receive a degree, and people who received an associate's, bachelor's, master's, professional, or doctorate degree. These data include only persons 25 years old and over. The percentages are obtained by dividing the counts of graduates by the total number of persons 25 years old and over. The data was collected from 2015 to 2019. | 81.7 |
| Employment.Nonemployer Establishments | Integer | An establishment is a single physical location at which business is conducted or where services or industrial operations are performed. It is not necessarily identical with a company or enterprise, which may consist of one establishment or more. The data was collected in 2018. | 1416 |
| Ethnicities.American Indian and Alaska Native Alone | Float | Estimated percentage of the population having origins in any of the original peoples of North and South America (including Central America) and who maintain tribal affiliation or community attachment. This category includes people who indicate their race as "American Indian or Alaska Native" or report entries such as Navajo, Blackfeet, Inupiat, Yup'ik, or Central American Indian groups or South American Indian groups. | 0.3 |
| Ethnicities.Asian Alone | Float | Estimated percentage of the population having origins in any of the original peoples of the Far East, Southeast Asia, or the Indian subcontinent, including, for example, Cambodia, China, India, Japan, Korea, Malaysia, Pakistan, the Philippine Islands, Thailand, and Vietnam. This includes people who reported detailed Asian responses such as "Asian Indian", "Chinese", "Filipino", "Korean", "Japanese", "Vietnamese", and "Other Asian" or provide other detailed Asian responses. | 0.4 |
| Ethnicities.Black Alone | Float | Estimated percentage of the population having origins in any of the Black racial groups of Africa. It includes people who indicate their race as "Black or African American" or report entries such as African American, Kenyan, Nigerian, or Haitian. | 27.6 |
| Ethnicities.Hispanic or Latino | Float |
LATAM Data Suite provides high-quality datasets in Spanish, Portuguese, and American English. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.
Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
- Monolingual and Bilingual Dictionary Data: featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
- Sentences: curated examples of real-world usage with contextual annotations.
- Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.
- Audio Data: native speaker recordings for TTS and pronunciation modeling.
- Word Lists: frequency-ranked and thematically grouped lists.
Learn more about the datasets included in the data suite:
Key Features (approximate numbers):
Our Portuguese monolingual data covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.
The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is reviewed and updated annually by our in-house team of language experts. It offers comprehensive coverage of the language, providing a substantial volume of translated words of excellent quality that span both European and Latin American Portuguese varieties.
Our Spanish monolingual data offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.
The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts. It offers significant coverage of the language, providing a large volume of translated words of excellent quality.
Spanish sentences retrieved from the corpus are ideal for NLP model training, presenting approximately 20 million words. The sentences provide broad coverage of Spanish-speaking countries and are tagged accordingly with a particular country or dialect.
This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.
Curated word-level audio data for the Spanish language, which covers all varieties of world Spanish, providing rich dialectal diversity in the Spanish language.
This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.
Our American English Monolingual Dictionary Data is the foremost au...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Teachers’ Sense of Efficacy Scale (TSES) has been the most widely used instrument to assess teacher efficacy beliefs. However, no study has been carried out concerning the TSES psychometric properties with teachers in Mexico, the country with the highest number of Spanish-speakers worldwide. The purpose of the present study is to examine the reliability, internal and external validity evidence of the TSES (short form) adapted into Spanish with a sample of 190 primary and secondary Mexican teachers from 25 private schools. Results of construct analysis confirm the three-factor-correlated structure of the original scale. Criterion validity evidence was established between self-efficacy and job satisfaction. Differences in self-efficacy were related to teachers’ gender, years of experience and grade level taught. Some limitations are discussed, and future research directions are recommended.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
For the complete dataset or more information, please email commercialproduct@appen.com
*** THIS IS A SAMPLE DATABASE ONLY. THE INFORMATION CONTAINED IN *** THE REST OF THIS README APPLIES TO THE FULL DATABASE
This is a Spanish (EU) conversational database, produced by Appen Butler Hill Pty. Ltd. in 2021.
Appen Butler Hill Pty. Ltd. owns copyright of the database.
The database contains transcription and speech data recorded during 207 sessions. Each of the 207 unique speaker pairs recorded conversations totalling approximately 60 minutes on average.
Each pair of speakers recorded 4 to 12 conversations of approximately 5 to 15 minutes each, on different topics. Speakers were provided with a topic for each conversation.
The recordings have been made using the Appen Mobile smartphone app with the phone positioned between two speakers who are in the same room.
The database contains approximately 223.48 hours of audio data in total.
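As a quick consistency check, the stated totals can be turned into an average session length. The sketch below uses only the two figures quoted above (223.48 hours and 207 sessions); the function name is made up for illustration. The result comes out a little above the quoted "approximately 60 minutes".

```python
TOTAL_HOURS = 223.48  # total audio stated in the README
SESSIONS = 207        # number of recorded speaker pairs stated in the README


def average_session_minutes(total_hours: float, sessions: int) -> float:
    """Return the mean session length in minutes."""
    return total_hours * 60.0 / sessions


if __name__ == "__main__":
    minutes = average_session_minutes(TOTAL_HOURS, SESSIONS)
    print(f"{minutes:.1f} minutes per session on average")
```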
The directory structure is designed to group data from each pair of speakers in a single folder. Each pair of speakers has been identified with a unique ID (Users_ID). This ID is also the name of each session folder. Within each session folder are the conversations made by the pair of speakers. The file naming structure indicates the language and country, date of recording, session ID (which is different from the speaker pair ID), and the conversational topic.
Directory Structure:
/+-COPYRIGHT.TXT
 +-README.TXT
 +-DEMOGRAPHICS.CSV
 |
 |
 +-AUDIO----------+-1027536----------------+-SPAESP_20210123-224609-0006_Topic_Clothing.WAV
 |                |                        +-SPAESP_20210123-224609-0007_Topic_Insurance.WAV
 |                |                        .
 |                |                        .
 |                |                        +-SPAESP_20210123-224609-0017_Topic_Social.WAV
 |                |
 |                |
 |                +-1089511----------------+-SPAESP_20210220-140351-0001_Topic_Information.WAV
 |                |                        +-SPAESP_20210220-140351-0002_Topic_Insurance.WAV
 |                |                        .
 |                |                        .
 |                |                        +-SPAESP_20210220-140351-0013_Topic_Media.WAV
 |                .
 |                .
 |                .
 |                +-980733-----------------+-SPAESP_20210313-013101-0002_Topic_Travel.WAV
 |                                         +-SPAESP_20210313-013101-0003_Topic_Insurance.WAV
 |                                         .
 |                                         .
 |                                         +-SPAESP_20210313-013101-0011_Topic_Health.WAV
 |
 |
 |
 +-TRANSCRIPTION--+-1027536----------------+-SPAESP_20210123-224609-0006_Topic_Clothing.TXT
 |                |                        +-SPAESP_20210123-224609-0007_Topic_Insurance.TXT
 |                |                        .
 |                |                        .
 |                |                        +-SPAESP_20210123-224609-0017_Topic_Social.TXT
 |                |
 |                |
 |                +-1089511----------------+-SPAESP_20210220-140351-0001_Topic_Information.TXT
 |                |                        +-SPAESP_20210220-140351-0002_Topic_Insurance.TXT
 |                |                        .
 |                |                        .
 |                |                        +-SPAESP_20210220-140351-0013_Topic_Media.TXT
 |                .
 |                .
 |                .
 |                +-980733-----------------+-SPAESP_20210313-013101-0002_Topic_Travel.TXT
 |                                         +-SPAESP_20210313-013101-0003_Topic_Insurance.TXT
 |                                         .
 |                                         .
 |                                         +-SPAESP_20210313-013101-0011_Topic_Health.TXT
 |
COPYRIGHT.TXT is a copyright document in ASCII format.
README.TXT is this file. It is an ASCII text file that describes the database.
DEMOGRAPHICS.CSV is a comma-separated values (CSV) file, which can be opened in Excel, that contains the following fields:
- Users_ID
- Device_Model
- Device_OS
- Participant_1_Gender
- Participant_1_Age
- Participant_1_Dialect
- Participant_2_Gender
- Participant_2_Age
- Participant_2_Dialect
- Topics
- Environment
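A minimal sketch of loading the demographics file with Python's standard `csv` module. It assumes, without confirmation from the README, that the file is comma-delimited, UTF-8 encoded, and has a header row spelling the field names exactly as listed above; `load_demographics` is a name chosen here for illustration.

```python
import csv

# Field names as listed in the README; the exact header spelling is an assumption.
EXPECTED_FIELDS = [
    "Users_ID", "Device_Model", "Device_OS",
    "Participant_1_Gender", "Participant_1_Age", "Participant_1_Dialect",
    "Participant_2_Gender", "Participant_2_Age", "Participant_2_Dialect",
    "Topics", "Environment",
]


def load_demographics(path):
    """Return a dict mapping each Users_ID to its row (a dict of strings)."""
    # "utf-8-sig" also tolerates a byte-order mark, which Excel exports often add.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in EXPECTED_FIELDS if name not in header]
        if missing:
            raise ValueError(f"missing expected columns: {missing}")
        return {row["Users_ID"]: row for row in reader}
```

Because Users_ID is also the session folder name, the returned dictionary can be used to look up speaker metadata for any folder under /AUDIO or /TRANSCRIPTION.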
CONVERSATION TOPICS: The conversations were spontaneous and were on a variety of generic topics (e.g. news, travel, study etc.). The topics can be found in the Demographics file.
Participants were provided with 12 topics to choose from. They had to pick at least 4 topics and could skip up to 8.
/AUDIO contains all audio data from the 207 sessions. Audio format is WAVE audio, Microsoft PCM, 16 bit, mono 48000 Hz
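The stated audio format can be checked per file with Python's standard `wave` module. This is a sketch only: `describe_wav` and `matches_readme` are illustrative names, and the expected values are taken directly from the format line above (16 bit means 2 bytes per sample).

```python
import wave

# Format stated in the README: 16-bit PCM, mono, 48000 Hz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 48000}


def describe_wav(path):
    """Read the header of a PCM WAVE file and return its basic properties."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": wav.getnframes() / rate,
        }


def matches_readme(info):
    """True when the file has the format described in the README."""
    return all(info[key] == value for key, value in EXPECTED.items())
```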
/TRANSCRIPTION contains all transcription data for the 207 sessions.
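Since /AUDIO and /TRANSCRIPTION mirror each other in the directory structure shown earlier, each recording can be matched to its transcript by session folder and file stem. The sketch below assumes that mirroring holds for every file and that extensions are upper-case as in the listing; `pair_recordings` is a hypothetical helper name.

```python
from pathlib import Path


def pair_recordings(root):
    """Return (audio_path, transcript_path_or_None) for every .WAV under AUDIO."""
    root = Path(root)
    pairs = []
    # One sub-folder per speaker pair (Users_ID), one .WAV per conversation.
    for wav in sorted((root / "AUDIO").glob("*/*.WAV")):
        txt = root / "TRANSCRIPTION" / wav.parent.name / (wav.stem + ".TXT")
        pairs.append((wav, txt if txt.exists() else None))
    return pairs
```

A `None` in the second position flags an audio file with no matching transcript, which is worth checking before training on the data.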
The Audio and Transcription filenames use the following template:
Hearing aids are the most common rehabilitation strategy for age-related hearing loss. However, 25% to 50% of older adults fitted with hearing aids do not wear them post-fitting. Hearing aid self-efficacy has been suggested as one of the key factors that may explain adherence to hearing aids in older adults. The primary aim of this study was to determine a possible association between educational level and hearing aid self-efficacy in older adult hearing aid users from a Latin American country (i.e., Chile). The secondary aim was to determine if in this sample of older adults, hearing aid self-efficacy predicted hearing aid adherence as previously suggested by other studies. The MARS-HA (Measure of Audiologic Rehabilitation Self-Efficacy for Hearing Aids) questionnaire was used to measure hearing aid self-efficacy. This questionnaire was initially adapted into Spanish (S-MARS-HA) using forward and backward translations by bilingual English-Spanish speakers. A sample of 252 older adults fitted with hearing aids at a public hospital in Santiago, Chile, was investigated. Educational level was measured as the number of years of formal education. Participants responded to the S-MARS-HA along with questions exploring social support, attitudes in using hearing aids, participation in social events, and vision and joint problems. Hearing aid adherence was investigated with the use of a question from the International Outcome Inventory for Hearing Aids. All these procedures were conducted at the participants’ homes. Pure-tone average (PTA; 500–4000 Hz) in the fitted ear was obtained from the participants’ medical records. Univariate and multivariate regression models were constructed to investigate the association between educational level and hearing aid self-efficacy controlling for the covariates of interest (e.g., social support, attitudes in using hearing aids, PTA). The S-MARS-HA showed adequate construct validity along with good reliability.
Results of the multivariate regression analyses showed that educational level significantly predicted hearing aid self-efficacy. Covariates significantly associated with this outcome included attitudes in using hearing aids and PTA in the fitted ear. Finally, a significant association between hearing aid self-efficacy and adherence to hearing aid use was observed. In conclusion, this study showed a significant association between educational level and hearing aid self-efficacy in older adults from a developing Latin American country. Thus, this variable should be considered when designing and delivering aural rehabilitation programs such as hearing aids to older adults, especially those from developing countries.
DIHANA is composed of 900 human-computer dialogues in Spanish. The DIHANA corpus was acquired by means of an initial prototype using the Wizard of Oz technique: the operator simulates speech recognition and understanding errors, and the answers are synthesized according to a predefined set of templates. The acquisition was restricted only at the semantic level (i.e., the acquired dialogues are related to a specific task domain) and was not restricted at the lexical or syntactic level (spontaneous speech). In the acquisition process, this semantic control was provided by the definition of scenarios that the user must accomplish and by the wizard strategy, which defines the behavior of the acquisition system.
The DIHANA task consists of the retrieval of information about Spanish nationwide trains by telephone. Several types of scenarios were defined in order to control the interaction of the user with the system. A scenario is defined by: an objective, the information needed by the user; a situation, the specific circumstances related to the trip request; and the specific requirements of the trip: type of trip, departure city, destination city, and one or more restrictions.
The DIHANA corpus contains 5.5 hours of spontaneous speech corresponding to 6,278 sentences. In total, 225 speakers (153 males and 72 females) recorded 900 dialogues, resulting in 6,278 user turns. Along with the dialogues (speech signals), their full transcripts are provided, together with a phonetic lexicon containing all the words pronounced in the database. In addition, a semantic tagging of the corpus and a labeling of the same corpus in terms of dialogue acts are provided.
A more detailed description of DIHANA can be found in the "doc" subfolder and in the following papers:
- N. Alcacer, J.M. Benedí, F. Blat, R. Granell, D. Martínez-Hinarejos and F. Torres: "Acquisition and Labelling of a Spontaneous Speech Dialogue Corpus". In proceedings of SPECOM, pages 583-586. Patras (Greece), October 2005.
- J.M. Benedí, E. Lleida, A. Varona, M.J. Castro, I. Galiano, R. Justo, I. López, and A. Miguel: "Design and acquisition of a telephone spontaneous speech dialogue corpus in spanish: DIHANA". In proceedings of LREC, pages 1636-1639, Genova, Italy, May 2006.
Next we describe the contents of each subfolder:
README: This file.
data: The database: 75 speakers from each of 3 sites (Basque Country, Aragon and Valencian Country) for 4 scenarios, making a total of 225 speakers (153 males and 72 females) with 900 dialogues. For each dialogue the following are provided:
- the speech signal of each user turn (.ul)
- the intermediate (.dis) and final (.xml) transcriptions
- the Dialogue Act annotation:
  + Dialogue Act annotation on transcription (.dia)
  + Dialogue Act annotation on categorised transcription, without words for each category (.cad)
  + Dialogue Act annotation on categorised transcription, with words for each category (.cwd)
semdata: Semantic tagging of the full corpus, in the "data" subfolder, and documentation describing the semantic labeling process, in the "doc" folder.
doc: Various documents (PDF) related to the design and acquisition processes, the annotation format and event statistics.
guides: 5 lists of 45 speakers, which account for 5 leave-one-out partitions. Any one of the lists can be chosen as the test set, with the other four joined to form the training set. Under the folder corresponding to each speaker, the speech signals and transcriptions corresponding to four dialogues can be found, so each partition consists of 720 training dialogues and 180 test dialogues.
software: Various self-commented programs and utilities.
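The partition scheme described under "guides" can be sketched as follows. How the five speaker lists are stored on disk is not described here, so the sketch takes them as in-memory lists, and the speaker IDs in the demo are placeholders rather than real DIHANA identifiers.

```python
from typing import Iterator, List, Sequence, Tuple

DIALOGUES_PER_SPEAKER = 4  # four dialogues per speaker, as stated above


def leave_one_list_out(
    speaker_lists: Sequence[Sequence[str]],
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_speakers, test_speakers), holding out one list per fold."""
    for held_out, test_list in enumerate(speaker_lists):
        train = [
            speaker
            for index, group in enumerate(speaker_lists)
            if index != held_out
            for speaker in group
        ]
        yield train, list(test_list)


if __name__ == "__main__":
    # Placeholder IDs only; the real lists come from the "guides" subfolder.
    lists = [[f"list{g}_spk{i:02d}" for i in range(45)] for g in range(5)]
    for fold, (train, test) in enumerate(leave_one_list_out(lists), start=1):
        print(
            f"fold {fold}: "
            f"{len(train) * DIALOGUES_PER_SPEAKER} training dialogues, "
            f"{len(test) * DIALOGUES_PER_SPEAKER} test dialogues"
        )
```

With 5 lists of 45 speakers, each fold has 180 training speakers and 45 test speakers, which reproduces the 720 and 180 dialogue counts given in the description.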
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The dataset for face anti-spoofing and face recognition includes images and videos of Hispanic people: 32,600+ photos and videos of 16,300 people from 20 countries. The dataset helps enhance model performance by providing a wider range of data for a specific ethnic group.
The videos were gathered by capturing faces of genuine individuals presenting spoofs, using facial presentations. Our dataset proposes a novel approach that learns and detects spoofing techniques, extracting features from the genuine facial images to prevent the capturing of such information by fake users.
The dataset contains images and videos of real humans with various resolutions, views, and colors, making it a comprehensive resource for researchers working on anti-spoofing technologies.
Our dataset also explores the use of neural architectures, such as deep neural networks, to facilitate the identification of distinguishing patterns and textures in different regions of the face, increasing the accuracy and generalizability of the anti-spoofing models.
The dataset consists of:
- files - includes 10 folders, one per person, each containing 1 image and 1 video
- .csv file - contains information about the files and people in the dataset
keywords: liveness detection systems, liveness detection dataset, biometric dataset, biometric data dataset, biometric system attacks, anti-spoofing dataset, face liveness detection, deep learning dataset, face spoofing database, face anti-spoofing, ibeta dataset, face anti spoofing, large-scale face anti spoofing, rich annotations anti spoofing dataset, hispanic people, hispanic classification, hispanic image dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Clinical Descriptions and Diagnostic Guidelines (CDDG) for the eleventh version of the WHO's International Classification of Diseases (ICD-11), and Mental, Behavioral and Neurodevelopmental Disorders (MBND) constitute a substantial improvement over ICD-10 MBND CDDG. As part of the efforts to implement ICD-11 MBND CDDG in Spanish-speaking countries through continuing education for health professionals, this study was designed to evaluate the usefulness of a comprehensive online training course and its modalities (synchronous and asynchronous) to increase both the knowledge of and readiness to use this novel, evidence-based diagnostic tool. METHOD: A sample of Spanish-speaking psychiatrists, psychologists and general practitioners completed pre- and/or post-evaluations of one of the two modalities of ICD-11 MBND CDDG (asynchronous or synchronous). Knowledge of the material was evaluated at the end of the course through an ad hoc multiple-choice questionnaire, and readiness to implement ICD-11 MBND CDDG was evaluated before and after the course using an instrument based on the transtheoretical model developed by Prochaska and DiClemente, consisting of a linear scheme with five stages of change: precontemplation, contemplation, preparation, action, and maintenance. RESULTS: More women than men, younger health professionals and more clinicians from Mexico than any other country participated in the synchronous than in the asynchronous course. Prior to the course, most participants were at the precontemplation stage of readiness to implement the ICD-11 MBND CDDG. By the end of the course, participants reported a moderate level of knowledge of the ICD-11 MBND CDDG (with those in the synchronous course reporting higher levels of knowledge than those in the asynchronous one), while the percentage of clinicians at the preparation and action stages was higher than before the courses (with no differences being observed by course modes).
CONCLUSIONS: Online training proved useful for achieving a moderate level of knowledge of the ICD-11 MBND CDDG and a substantial increase in clinicians’ readiness to implement them as part of their regular professional practice. Whichever course mode is preferred and feasible is recommended for interested Spanish-speaking clinicians.
https://creativecommons.org/publicdomain/zero/1.0/
number of observations : 23972
observation : households
country : Spain
| Column | Description |
|---|---|
| wfood | percentage of total expenditure which the household has spent on food |
| totexp | total expenditure of the household |
| age | age of reference person in the household |
| size | size of the household |
| town | size of the town where the household is placed categorized into 5 groups: 1 for small towns, 5 for big ones |
| sex | sex of reference person (man,woman) |
References: Journal of Applied Econometrics data archive: http://qed.econ.queensu.ca/jae/
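As an illustration of how the columns above might be used, the sketch below averages the food share (wfood) within each town-size group. It assumes the rows have already been read into dictionaries keyed by the column names in the table (for example with `csv.DictReader`); the file name and storage format are not stated here, and `mean_food_share_by_town` is a name made up for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_food_share_by_town(rows):
    """Average the wfood column within each town-size group (1 to 5)."""
    shares = defaultdict(list)
    for row in rows:
        shares[int(row["town"])].append(float(row["wfood"]))
    # Sort by town code so small towns (1) come before big ones (5).
    return {town: mean(values) for town, values in sorted(shares.items())}
```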