Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset comes from Exploring the Construction of Chinese-English Terminology Knowledge Base in the Humanities and Social Sciences: Theory and Methods, published by Nanjing University Press, a book written to meet the needs of two-way Chinese-English information exchange and research in the humanities and social sciences. To construct a high-quality dataset that improves large language model performance and covers more disciplinary categories in the humanities and social sciences, data expansion and feature extraction were used to preprocess the collected data, yielding a corpus of high-quality Chinese-English cross-referenced terminology spanning different disciplinary categories. The collected Chinese and English data were combined to construct a Chinese-English bidirectional dataset. For the instruction fine-tuning experiments, a variety of instruction prompts were designed; since different prompts significantly affect the model's output, the final choice was to incorporate the humanities and social sciences discipline into the structural attribute "instruction" as the prompt for instruction fine-tuning.
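A minimal sketch of what one record in such a bidirectional terminology dataset might look like, assuming a standard instruction/input/output schema; the field names and the example term are illustrative, not taken from the published dataset:

import json

# Illustrative record: the discipline category is carried in the
# "instruction" attribute, as described above. Schema and values
# are assumptions, not the dataset's actual format.
record = {
    "instruction": "Linguistics: translate this term from Chinese to English.",
    "input": "语料库",
    "output": "corpus",
}
print(json.dumps(record, ensure_ascii=False, indent=2))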
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
English (America) real-world casual conversation and monologue speech dataset, covering self-media, conversation, live streams, lectures, variety shows, and more, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle
16 kHz, 16-bit, WAV, mono channel;
Including self-media, conversation, live streams, lectures, variety shows, etc.;
Low background noise;
America (USA);
en-US;
English;
Transcription text, timestamp, speaker ID, gender.
Sentence Accuracy Rate (SAR): 95%
Commercial License
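As a quick sanity check against the format stated above (16 kHz, 16-bit, mono WAV), one might run a short Python script; the file name is hypothetical:

import wave

# Verify a clip matches the stated format: 16 kHz, 16-bit, mono WAV.
with wave.open("sample_clip.wav", "rb") as w:  # hypothetical file name
    assert w.getframerate() == 16000, "expected 16 kHz sample rate"
    assert w.getsampwidth() == 2, "expected 16-bit samples"
    assert w.getnchannels() == 1, "expected mono audio"
    print(f"duration: {w.getnframes() / w.getframerate():.2f} s")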
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wiktionary Data on Hugging Face Datasets
wiktionary-data is a sub-data extraction of the English Wiktionary that currently supports the following languages:
Deutsch - German
Latinum - Latin
Ἑλληνική - Ancient Greek
한국어 - Korean
𐎠𐎼𐎹 - Old Persian
𒀝𒅗𒁺𒌑(𒌝) - Akkadian
Elamite
संस्कृतम् - Sanskrit, or Classical Sanskrit
wiktionary-data was originally a sub-module of wilhelm-graphdb. As the dataset grew, I noticed a wave of more exciting potential; this… See the full description on the dataset page: https://huggingface.co/datasets/paion-data/wiktionary-data.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Hispanic English Dataset: High-Quality Hispanic English Call-Center and Podcast Dataset for AI & Speech Models. Includes Call-Center Data and Podcast Data.
Timeseries data from 'English Bay, AK' (noaa_nos_co_ops_9462641)
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure.
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level English usage.
Every chat instance is accompanied by structured metadata; a hypothetical record shape is sketched after this description.
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
All chat records pass through a rigorous QA process to maintain consistency and accuracy.
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is ideal for training and evaluating a wide range of text-based AI systems.
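For illustration, a single chat record with its metadata might be shaped as follows; the schema and field names are assumptions based on the attributes described above, not FutureBeeAI's published format:

# Hypothetical shape of one chat transcript with structured metadata.
# All field names and values are illustrative only.
chat_record = {
    "chat_id": "en_general_000123",
    "topic": "weekend plans",
    "participants": [
        {"speaker_id": "A", "gender": "female", "age_range": "25-34"},
        {"speaker_id": "B", "gender": "male", "age_range": "18-24"},
    ],
    "messages": [
        {"speaker_id": "A", "text": "hey, any plans for saturday?"},
        {"speaker_id": "B", "text": "not yet, maybe a hike if it's sunny"},
    ],
}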
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability for City of Seattle Council Districts, Comprehensive Plan Growth Areas, and Community Reporting Areas. The table includes B16004 (Age by Language Spoken at Home by Ability to Speak English) and C16002 (Household Language by Household Limited English-Speaking Status). Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment. Table created for and used in the Neighborhood Profiles application.
Vintage: 2023. ACS Tables: B16004, C16002. Data downloaded from: Census Bureau's Explore Census Data.
The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates.
This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.
Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error, which can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
Data Processing Notes: Boundaries come from the US Census TIGER geodatabases, specifically the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER: water bodies and rivers of 50 million square meters or larger (mid to large sized water bodies) are erased from the tract-level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles, erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters). The States layer contains 52 records: all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (census tracts beginning with 99). Percentages, derived counts, and associated margins of error are calculated values (identifiable by the "_calc_" stub in the field name) and abide by the specifications defined by the American Community Survey. Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555..., which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
Either no sample observations or too few sample observations were available to compute a standard error, and thus the margin of error; a statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution; a statistical test is not appropriate.
The estimate is controlled; a statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
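A sketch of the sentinel handling described above, using pandas; because the exact sentinel codes are truncated in the text ("-4444...", "-5555..."), the constants and the file name below are placeholders:

import pandas as pd

NULL_SENTINELS = [-4444]  # placeholder for the "-4444..." family (set to null)
ZERO_SENTINELS = [-5555]  # placeholder for the "-5555..." family (set to zero)

df = pd.read_csv("acs_b16004_block_groups.csv")  # hypothetical file name
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].replace({s: 0 for s in ZERO_SENTINELS})
df[num_cols] = df[num_cols].mask(df[num_cols].isin(NULL_SENTINELS))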
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
French/English parallel texts for training translation models: over 22.5 million sentences in French and English. The dataset was created by Chris Callison-Burch, who crawled millions of web pages, used a set of simple heuristics to transform French URLs into English URLs, and assumed that these paired documents are translations of each other. This is the main dataset of the 2015 Workshop on Statistical Machine Translation (WMT 2015) and can be used for machine translation and language models. Refer to the paper here: http://www.statmt.org/wmt15/pdf/WMT01.pdf
@InProceedings{bojar-EtAl:2015:WMT,
author = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Haddow, Barry and Huck, Matthias and Hokamp, Chris and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Scarton, Carolina and Specia, Lucia and Turchi, Marco},
title = {Findings of the 2015 Workshop on Statistical Machine Translation},
booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
month = {September},
year = {2015},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {1--46},
url = {http://aclweb.org/anthology/W15-3001}
}
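A minimal sketch of how such parallel text is typically consumed, assuming the common WMT convention of two line-aligned files with one sentence per line; the file names are illustrative:

# Read aligned French/English sentence pairs from two line-aligned files.
def read_parallel(fr_path: str, en_path: str):
    with open(fr_path, encoding="utf-8") as fr, \
         open(en_path, encoding="utf-8") as en:
        for fr_line, en_line in zip(fr, en):
            yield fr_line.strip(), en_line.strip()

for fr_sent, en_sent in read_parallel("corpus.fr", "corpus.en"):  # hypothetical names
    print(fr_sent, "=>", en_sent)
    break  # show only the first pair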
This directory contains a large-scale Hindi-English code-mixed corpus collected from Twitter between 2010 and 2022. We have removed identifiers, including tweet author IDs, to anonymize the dataset. Additionally, we have calculated the code-mixing index (CMI) and identified the language of each text (Hindi, English, or Hindi-English code-mixed).
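One common definition of the code-mixing index is that of Das and Gambäck (2014): CMI = 100 * (1 - max_i(w_i) / (n - u)), where w_i counts the tokens in language i, n is the total token count, and u counts language-independent tokens. Whether this exact variant was used for this corpus is an assumption; a sketch in Python:

from collections import Counter

def cmi(lang_tags):
    """Code-mixing index from per-token language tags.
    Tags "hi"/"en" are language labels; "other" marks
    language-independent tokens (tag names are illustrative)."""
    counts = Counter(lang_tags)
    indep = counts.pop("other", 0)
    n = len(lang_tags)
    if n == indep:  # all tokens language-independent
        return 0.0
    return 100.0 * (1 - max(counts.values()) / (n - indep))

print(cmi(["hi", "hi", "en", "en", "en", "other"]))  # 40.0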
The m-aliabbas1/tiny-english-asr-sample-data dataset, hosted on Hugging Face and contributed by the HF Datasets community.
The Office for Civil Rights (OCR), U.S. Department of Education developed these materials in response to requests from school districts for a reference tool to assist them through the process of developing a comprehensive English language proficiency or English language learners (ELL) program.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains four types of neural language models trained on a large historical dataset of books in English published between 1760 and 1900, comprising ~5.1 billion tokens. The language model architectures include static models (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT.
GitHub repository: https://github.com/Living-with-machines/histLM
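A sketch of loading the listed model types with their usual toolkits (gensim for the static models, Hugging Face transformers for BERT); all paths are placeholders, so consult the histLM repository for the actual artefact names:

from gensim.models import FastText, Word2Vec
from transformers import AutoModel, AutoTokenizer

w2v = Word2Vec.load("histLM_word2vec.model")   # placeholder path
ft = FastText.load("histLM_fasttext.model")    # placeholder path
tok = AutoTokenizer.from_pretrained("path/to/histLM_bert")  # placeholder
bert = AutoModel.from_pretrained("path/to/histLM_bert")     # placeholder

# Nearest neighbours in the historical embedding space.
print(w2v.wv.most_similar("machine", topn=5))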
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Boston English Dataset: High-Quality Boston English Call-Center, General Conversation, and Podcast Dataset for AI & Speech Models. Includes Call-Center Data, General Conversation Data, and Podcast Data.
https://creativecommons.org/publicdomain/zero/1.0/
Original data from Predict Future Sales (Kaggle competition). item_categories.csv, shops.csv, and items.csv were translated from Russian to English for easier feature engineering and reference.
Item descriptions and shop names were translated from Russian to English. items.csv - supplemental information about the items/products. item_categories.csv - supplemental information about the item categories. shops.csv - supplemental information about the shops.
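For reference, the translated lookup files can be joined onto the sales records with pandas, assuming the original competition's sales_train.csv sits alongside the translated files:

import pandas as pd

sales = pd.read_csv("sales_train.csv")      # from the original competition
items = pd.read_csv("items.csv")            # translated item names
cats = pd.read_csv("item_categories.csv")   # translated category names
shops = pd.read_csv("shops.csv")            # translated shop names

sales = (sales.merge(items, on="item_id")
              .merge(cats, on="item_category_id")
              .merge(shops, on="shop_id"))
print(sales[["item_name", "item_category_name", "shop_name"]].head())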
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/IFMZJYhttps://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/IFMZJY
The Corpus of Historical American English (COHA) was created by Mark Davies and is the largest structured corpus of historical English. It is related to other corpora from English-Corpora.org, which are the most widely used corpora of English and offer unparalleled insight into variation in English. COHA contains more than 475 million words of text from the 1820s-2010s (making it 50-100 times as large as other comparable historical corpora of English), and the corpus is balanced by genre, decade by decade. The creation of the corpus was funded by a grant from the National Endowment for the Humanities (NEH) from 2008 to 2010.
14,511 images of English handwriting OCR data. The text carriers are A4 paper, lined paper, English paper, etc. The capture device is a cellphone, and the collection angle is eye level. The dataset content includes English compositions, poetry, prose, news, stories, etc. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as English handwriting OCR.
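For illustration, one line-level annotation might be shaped as follows: a quadrilateral given as four (x, y) corner points plus the transcription. The actual schema of this dataset is not published here, so all names and values are hypothetical:

annotation = {
    "image": "handwriting_00042.jpg",  # hypothetical file name
    "lines": [
        {
            # Four corner points of the quadrilateral bounding box.
            "quad": [[102, 88], [954, 91], [953, 142], [101, 139]],
            "text": "The quick brown fox jumps over the lazy dog",
        },
    ],
}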
English (Canada) scripted monologue smartphone speech dataset, collected as monologues based on given scripts, covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (466 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.
Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
Monolingual and Bilingual Dictionary Data: headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
Sentence Corpora: curated examples of real-world usage with contextual annotations for training and evaluation.
Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.
Audio Data: native speaker recordings for speech recognition, TTS, and pronunciation modeling.
Word Lists: frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.
Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
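For illustration, a monolingual dictionary entry exported as JSON might carry the fields named above (headword, POS, senses, examples); the suite's actual export schema may differ:

# Hypothetical JSON shape of a monolingual dictionary entry.
entry = {
    "headword": "run",
    "pos": "verb",
    "senses": [
        {
            "definition": "move at a speed faster than a walk",
            "examples": ["the dog ran across the road"],
        },
    ],
}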
If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.
Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.
Arabic Monolingual Dictionary Data: 66,500 headwords | 98,700 senses | 70,000 examples.
Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 translation sentences.
Arabic Synonyms and Antonyms Data: 55,100 synonyms.
British English Monolingual Dictionary Data: 146,000 headwords | 230,000 senses | 149,000 examples.
British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms
British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.
Catalan Monolingual Dictionary Data: 29,800 headwords | 47,400 senses | 25,600 examples.
Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 translation sentences.
Croatian Monolingual Dictionary Data: 129,600 headwords | 164,760 senses | 34,630 examples.
Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 translation sentences.
Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 translation sentences.
Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 translation sentences.
French Monolingual Dictionary Data: 42,000 headwords | 56,000 senses | 43,000 examples.
French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 translation sentences.
German Monolingual Dictionary Data: 85,500 headwords | 78,000 senses | 55,000 examples.
German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 translation sentences.
German Word List Data: 338,000 wordforms.
Greek Bilingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 translation sentences.
Hebrew Monolingual Dictionary Data: 85,600 headwords | 104,100 senses | 94,000 examples.
Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 translation sentences.
Hungarian Monolingual Dictionary Data: 90,500 headwords | 155,300 senses | 42,500 examples.
Italian Monolingual Dictionary Data: 102,500 headwords | 231,580 senses | 48,200 examples.
Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 translation sentences.
Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.
Latvian Monolingual Dictionary Data: 36,000 headwords | 43,600 senses | 73,600 examples.
Persian Bilingual Dictionary Data: 30,660 translations | 19,780 senses | 30,660 translation sentences.
Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 translation sentences.
Portuguese Monolingual Dictionary Data: 143,600 headwords | 285,500 senses | 69,300 examples.
Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 translation sentences.
Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.
Romanian Monolingual Dictionary Data: 66,900 headwords | 113,500 senses | 2,700 examples.
Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 translation sentences.
Russian Monolingual Dictionary Data: 65,950 headwords | 57,500 senses | 51,900 examples.
Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 translation sentences.
Slovak Bilingual Dictionary Data: 254,300 translations | 172,100 senses | 85,000 translation sentences.
Spanish Monolingual Dictionary Data: 73,000 headwords | 123,000 senses | 104,000 examples.
Spanish Bilingu...
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The UK English Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 606 adult UK English speakers (325 males, 281 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place), and consisting of about 195 hours of audio data.
2) The second set comprises the recordings of 51 child UK English speakers (14 boys, 37 girls), recorded over 4 microphone channels in 1 recording environment (children's room), and consisting of about 9 hours of audio data.
This database is partitioned into 31 DVDs (first set) and 4 DVDs (second set). The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications. Each of the four speech channels is recorded at 16 kHz, 16-bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data: 6 noise recordings; the "silence word" recording.
Free spontaneous items (adults only): 5 minutes (session time) of free spontaneous, rich-context items (story telling), from an open number of spontaneous topics out of a set of 30 topics.
17 elicited spontaneous items (adults only): 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language.
Read speech: 30 phonetically rich sentences uttered by adults and 60 uttered by children; 5 phonetically rich words (adults only); 4 isolated digits; 1 isolated digit sequence; 4 connected digit sequences; 1 telephone number; 3 natural numbers; 1 money amount; 2 time phrases (T1: analogue, T2: digital); 3 dates (D1: analogue, D2: relative and general date, D3: digital); 3 letter sequences; 1 proper name; 2 city or street names; 2 questions; 2 special keyboard characters; 1 Web address; 1 email address.
208 application-specific words and phrases per session (adults); 74 toy commands, 14 phone commands, and 34 general commands (children).
The following age distribution has been obtained:
Adults: 321 speakers are between 16 and 30, 182 speakers are between 31 and 45, 103 speakers are over 46.
Children: all 51 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
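Because the signal files are described above as headerless 16 kHz, 16-bit little-endian PCM, they can be read directly; the dtype follows the description's "unsigned integers", and the file name is hypothetical:

import numpy as np

samples = np.fromfile("SA001C00.UA0", dtype="<u2")  # hypothetical file name
print(f"{len(samples)} samples, {len(samples) / 16000:.2f} s of audio")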