Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.
See https://github.com/mdoering/wikipedia-dwca for details.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The dataset comes from Exploring the Construction of Chinese-English Terminology Knowledge Base in the Humanities and Social Sciences: Theory and Methods, published by Nanjing University Press, a book written to meet the needs of two-way Chinese-English information exchange and research in the humanities and social sciences. To construct a high-quality dataset that improves large language model performance and covers more disciplinary categories in the humanities and social sciences, data expansion and feature extraction were used to preprocess the collected data, yielding a corpus of high-quality Chinese-English cross-referenced terminology spanning different disciplinary categories. The collected Chinese and English data were combined to construct a Chinese-English bidirectional dataset. For the instruction fine-tuning experiments, a variety of instruction prompts were designed; since different prompts significantly affect the model's output, the final choice was to incorporate the humanities and social sciences discipline into the structural attribute "instruction" as the prompt for instruction fine-tuning.
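A minimal sketch of what one record in such a bidirectional terminology dataset might look like, assuming a standard instruction/input/output schema; the field names and the example term are illustrative, not taken from the published dataset:

import json

# Illustrative record: the discipline category is carried in the
# "instruction" attribute, as described above. Schema and values
# are assumptions, not the dataset's actual format.
record = {
    "instruction": "Linguistics: translate this term from Chinese to English.",
    "input": "语料库",
    "output": "corpus",
}
print(json.dumps(record, ensure_ascii=False, indent=2))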
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
English (America) real-world casual conversation and monologue speech dataset, covering self-media, conversation, live streams, lectures, variety shows, and more, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle
16 kHz, 16-bit, WAV, mono channel;
Including self-media, conversation, live streams, lectures, variety shows, etc.;
Low background noise;
America (USA);
en-US;
English;
Transcription text, timestamp, speaker ID, gender.
Sentence Accuracy Rate (SAR): 95%
Commercial License
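As a quick sanity check against the format stated above (16 kHz, 16-bit, mono WAV), one might run a short Python script; the file name is hypothetical:

import wave

# Verify a clip matches the stated format: 16 kHz, 16-bit, mono WAV.
with wave.open("sample_clip.wav", "rb") as w:  # hypothetical file name
    assert w.getframerate() == 16000, "expected 16 kHz sample rate"
    assert w.getsampwidth() == 2, "expected 16-bit samples"
    assert w.getnchannels() == 1, "expected mono audio"
    print(f"duration: {w.getnframes() / w.getframerate():.2f} s")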
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Wiktionary Data on Hugging Face Datasets
wiktionary-data is a sub-data extraction of the English Wiktionary that currently supports the following languages:
Deutsch - German
Latinum - Latin
Ἑλληνική - Ancient Greek
한국어 - Korean
𐎠𐎼𐎹 - Old Persian
𒀝𒅗𒁺𒌑(𒌝) - Akkadian
Elamite
संस्कृतम् - Sanskrit, or Classical Sanskrit
wiktionary-data was originally a sub-module of wilhelm-graphdb. As the dataset grew, I noticed a wave of more exciting potential; this… See the full description on the dataset page: https://huggingface.co/datasets/paion-data/wiktionary-data.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Hispanic English Dataset: High-Quality Hispanic English Call-Center and Podcast Dataset for AI & Speech Models. Includes Call-Center Data and Podcast Data.
Timeseries data from 'English Bay, AK' (noaa_nos_co_ops_9462641)
https://www.futurebeeai.com/policies/ai-data-license-agreement
The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.
This dataset includes over 15,000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure.
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Chats reflect informal, native-level English usage.
Every chat instance is accompanied by structured metadata; a hypothetical record shape is sketched after this description.
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
All chat records pass through a rigorous QA process to maintain consistency and accuracy.
This ensures a clean, reliable dataset ready for high-performance AI model training.
This dataset is ideal for training and evaluating a wide range of text-based AI systems.
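For illustration, a single chat record with its metadata might be shaped as follows; the schema and field names are assumptions based on the attributes described above, not FutureBeeAI's published format:

# Hypothetical shape of one chat transcript with structured metadata.
# All field names and values are illustrative only.
chat_record = {
    "chat_id": "en_general_000123",
    "topic": "weekend plans",
    "participants": [
        {"speaker_id": "A", "gender": "female", "age_range": "25-34"},
        {"speaker_id": "B", "gender": "male", "age_range": "18-24"},
    ],
    "messages": [
        {"speaker_id": "A", "text": "hey, any plans for saturday?"},
        {"speaker_id": "B", "text": "not yet, maybe a hike if it's sunny"},
    ],
}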
Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability for City of Seattle Council Districts, Comprehensive Plan Growth Areas, and Community Reporting Areas. The table includes B16004 (Age by Language Spoken at Home by Ability to Speak English) and C16002 (Household Language by Household Limited English-Speaking Status). Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment. Table created for and used in the Neighborhood Profiles application.
Vintage: 2023. ACS Tables: B16004, C16002. Data downloaded from: Census Bureau's Explore Census Data.
The United States Census Bureau's American Community Survey (ACS): About the Survey; Geography & ACS; Technical Documentation; News & Updates.
This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.
Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error, which can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.
Data Processing Notes: Boundaries come from the US Census TIGER geodatabases, specifically the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER: water bodies and rivers of 50 million square meters or larger (mid to large sized water bodies) are erased from the tract-level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles, erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters). The States layer contains 52 records: all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (census tracts beginning with 99). Percentages, derived counts, and associated margins of error are calculated values (identifiable by the "_calc_" stub in the field name) and abide by the specifications defined by the American Community Survey. Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.
Negative values (e.g., -4444...) have been set to null, with the exception of -5555..., which has been set to zero. These negative values exist in the raw API data to indicate the following situations:
Either no sample observations or too few sample observations were available to compute a standard error, and thus the margin of error; a statistical test is not appropriate.
Either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution.
The median falls in the lowest interval of an open-ended distribution, or in the upper interval of an open-ended distribution; a statistical test is not appropriate.
The estimate is controlled; a statistical test for sampling variability is not appropriate.
The data for this geographic area cannot be displayed because the number of sample cases is too small.
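A sketch of the sentinel handling described above, using pandas; because the exact sentinel codes are truncated in the text ("-4444...", "-5555..."), the constants and the file name below are placeholders:

import pandas as pd

NULL_SENTINELS = [-4444]  # placeholder for the "-4444..." family (set to null)
ZERO_SENTINELS = [-5555]  # placeholder for the "-5555..." family (set to zero)

df = pd.read_csv("acs_b16004_block_groups.csv")  # hypothetical file name
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].replace({s: 0 for s in ZERO_SENTINELS})
df[num_cols] = df[num_cols].mask(df[num_cols].isin(NULL_SENTINELS))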
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
French/English parallel texts for training translation models: over 22.5 million sentences in French and English. The dataset was created by Chris Callison-Burch, who crawled millions of web pages, used a set of simple heuristics to transform French URLs into English URLs, and assumed that these paired documents are translations of each other. This is the main dataset of the 2015 Workshop on Statistical Machine Translation (WMT 2015) and can be used for machine translation and language models. Refer to the paper here: http://www.statmt.org/wmt15/pdf/WMT01.pdf
@InProceedings{bojar-EtAl:2015:WMT,
author = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Haddow, Barry and Huck, Matthias and Hokamp, Chris and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Scarton, Carolina and Specia, Lucia and Turchi, Marco},
title = {Findings of the 2015 Workshop on Statistical Machine Translation},
booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
month = {September},
year = {2015},
address = {Lisbon, Portugal},
publisher = {Association for Computational Linguistics},
pages = {1--46},
url = {http://aclweb.org/anthology/W15-3001}
}
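A minimal sketch of how such parallel text is typically consumed, assuming the common WMT convention of two line-aligned files with one sentence per line; the file names are illustrative:

# Read aligned French/English sentence pairs from two line-aligned files.
def read_parallel(fr_path: str, en_path: str):
    with open(fr_path, encoding="utf-8") as fr, \
         open(en_path, encoding="utf-8") as en:
        for fr_line, en_line in zip(fr, en):
            yield fr_line.strip(), en_line.strip()

for fr_sent, en_sent in read_parallel("corpus.fr", "corpus.en"):  # hypothetical names
    print(fr_sent, "=>", en_sent)
    break  # show only the first pair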
This directory contains a large-scale Hindi-English code-mixed corpus collected from Twitter between 2010 and 2022. We have removed identifiers, including tweet author IDs, to anonymize the dataset. Additionally, we have calculated the code-mixing index (CMI) and identified the language of each text (Hindi, English, or Hindi-English code-mixed).
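One common definition of the code-mixing index is that of Das and Gambäck (2014): CMI = 100 * (1 - max_i(w_i) / (n - u)), where w_i counts the tokens in language i, n is the total token count, and u counts language-independent tokens. Whether this exact variant was used for this corpus is an assumption; a sketch in Python:

from collections import Counter

def cmi(lang_tags):
    """Code-mixing index from per-token language tags.
    Tags "hi"/"en" are language labels; "other" marks
    language-independent tokens (tag names are illustrative)."""
    counts = Counter(lang_tags)
    indep = counts.pop("other", 0)
    n = len(lang_tags)
    if n == indep:  # all tokens language-independent
        return 0.0
    return 100.0 * (1 - max(counts.values()) / (n - indep))

print(cmi(["hi", "hi", "en", "en", "en", "other"]))  # 40.0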
The m-aliabbas1/tiny-english-asr-sample-data dataset, hosted on Hugging Face and contributed by the HF Datasets community.
The Office for Civil Rights (OCR), U.S. Department of Education developed these materials in response to requests from school districts for a reference tool to assist them through the process of developing a comprehensive English language proficiency or English language learners (ELL) program.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains four types of neural language models trained on a large historical dataset of books in English published between 1760 and 1900, comprising ~5.1 billion tokens. The language model architectures include static models (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT.
GitHub repository: https://github.com/Living-with-machines/histLM
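A sketch of loading the listed model types with their usual toolkits (gensim for the static models, Hugging Face transformers for BERT); all paths are placeholders, so consult the histLM repository for the actual artefact names:

from gensim.models import FastText, Word2Vec
from transformers import AutoModel, AutoTokenizer

w2v = Word2Vec.load("histLM_word2vec.model")   # placeholder path
ft = FastText.load("histLM_fasttext.model")    # placeholder path
tok = AutoTokenizer.from_pretrained("path/to/histLM_bert")  # placeholder
bert = AutoModel.from_pretrained("path/to/histLM_bert")     # placeholder

# Nearest neighbours in the historical embedding space.
print(w2v.wv.most_similar("machine", topn=5))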
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Boston English Dataset: High-Quality Boston English Call-Center, General Conversation, and Podcast Dataset for AI & Speech Models. Includes Call-Center Data, General Conversation Data, and Podcast Data.
https://creativecommons.org/publicdomain/zero/1.0/
Original data from Predict Future Sales (Kaggle competition). item_categories.csv, shops.csv, and items.csv were translated from Russian to English for easier feature engineering and reference.
Item descriptions and shop names were translated from Russian to English. items.csv - supplemental information about the items/products. item_categories.csv - supplemental information about the item categories. shops.csv - supplemental information about the shops.
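For reference, the translated lookup files can be joined onto the sales records with pandas, assuming the original competition's sales_train.csv sits alongside the translated files:

import pandas as pd

sales = pd.read_csv("sales_train.csv")      # from the original competition
items = pd.read_csv("items.csv")            # translated item names
cats = pd.read_csv("item_categories.csv")   # translated category names
shops = pd.read_csv("shops.csv")            # translated shop names

sales = (sales.merge(items, on="item_id")
              .merge(cats, on="item_category_id")
              .merge(shops, on="shop_id"))
print(sales[["item_name", "item_category_name", "shop_name"]].head())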
https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/IFMZJYhttps://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/IFMZJY
The Corpus of Historical American English (COHA) was created by Mark Davies and is the largest structured corpus of historical English. It is related to other corpora from English-Corpora.org, which are the most widely used corpora of English and offer unparalleled insight into variation in English. COHA contains more than 475 million words of text from the 1820s-2010s (making it 50-100 times as large as other comparable historical corpora of English), and the corpus is balanced by genre, decade by decade. The creation of the corpus was funded by a grant from the National Endowment for the Humanities (NEH) from 2008 to 2010.
14,511 images of English handwriting OCR data. The text carriers are A4 paper, lined paper, English paper, etc. The capture device is a cellphone, and the collection angle is eye level. The dataset content includes English compositions, poetry, prose, news, stories, etc. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as English handwriting OCR.
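For illustration, one line-level annotation might be shaped as follows: a quadrilateral given as four (x, y) corner points plus the transcription. The actual schema of this dataset is not published here, so all names and values are hypothetical:

annotation = {
    "image": "handwriting_00042.jpg",  # hypothetical file name
    "lines": [
        {
            # Four corner points of the quadrilateral bounding box.
            "quad": [[102, 88], [954, 91], [953, 142], [101, 139]],
            "text": "The quick brown fox jumps over the lazy dog",
        },
    ],
}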
English (Canada) scripted monologue smartphone speech dataset, collected as monologues based on given scripts, covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from a large and geographically diverse pool of speakers (466 people in total), enhancing model performance in real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; our datasets are GDPR, CCPA, and PIPL compliant.
Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:
Monolingual and Bilingual Dictionary Data: headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.
Sentence Corpora: curated examples of real-world usage with contextual annotations for training and evaluation.
Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.
Audio Data: native speaker recordings for speech recognition, TTS, and pronunciation modeling.
Word Lists: frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.
Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
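For illustration, a monolingual dictionary entry exported as JSON might carry the fields named above (headword, POS, senses, examples); the suite's actual export schema may differ:

# Hypothetical JSON shape of a monolingual dictionary entry.
entry = {
    "headword": "run",
    "pos": "verb",
    "senses": [
        {
            "definition": "move at a speed faster than a walk",
            "examples": ["the dog ran across the road"],
        },
    ],
}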
If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.
Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.
Arabic Monolingual Dictionary Data: 66,500 headwords | 98,700 senses | 70,000 examples.
Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 translation sentences.
Arabic Synonyms and Antonyms Data: 55,100 synonyms.
British English Monolingual Dictionary Data: 146,000 headwords | 230,000 senses | 149,000 examples.
British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms
British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.
Catalan Monolingual Dictionary Data: 29,800 headwords | 47,400 senses | 25,600 examples.
Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 translation sentences.
Croatian Monolingual Dictionary Data: 129,600 headwords | 164,760 senses | 34,630 examples.
Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 translation sentences.
Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 translation sentences.
Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 translation sentences.
French Monolingual Dictionary Data: 42,000 headwords | 56,000 senses | 43,000 examples.
French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 translation sentences.
German Monolingual Dictionary Data: 85,500 headwords | 78,000 senses | 55,000 examples.
German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 translation sentences.
German Word List Data: 338,000 wordforms.
Greek Bilingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 translation sentences.
Hebrew Monolingual Dictionary Data: 85,600 headwords | 104,100 senses | 94,000 examples.
Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 translation sentences.
Hungarian Monolingual Dictionary Data: 90,500 headwords | 155,300 senses | 42,500 examples.
Italian Monolingual Dictionary Data: 102,500 headwords | 231,580 senses | 48,200 examples.
Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 translation sentences.
Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.
Latvian Monolingual Dictionary Data: 36,000 headwords | 43,600 senses | 73,600 examples.
Persian Bilingual Dictionary Data: 30,660 translations | 19,780 senses | 30,660 translation sentences.
Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 translation sentences.
Portuguese Monolingual Dictionary Data: 143,600 headwords | 285,500 senses | 69,300 examples.
Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 translation sentences.
Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.
Romanian Monolingual Dictionary Data: 66,900 headwords | 113,500 senses | 2,700 examples.
Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 translation sentences.
Russian Monolingual Dictionary Data: 65,950 headwords | 57,500 senses | 51,900 examples.
Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 translation sentences.
Slovak Bilingual Dictionary Data: 254,300 translations | 172,100 senses | 85,000 translation sentences.
Spanish Monolingual Dictionary Data: 73,000 headwords | 123,000 senses | 104,000 examples.
Spanish Bilingu...
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf
https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf
The UK English Speecon database is divided into 2 sets:
1) The first set comprises the recordings of 606 adult UK English speakers (325 males, 281 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place), and consisting of about 195 hours of audio data.
2) The second set comprises the recordings of 51 child UK English speakers (14 boys, 37 girls), recorded over 4 microphone channels in 1 recording environment (children's room), and consisting of about 9 hours of audio data.
This database is partitioned into 31 DVDs (first set) and 4 DVDs (second set). The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications. Each of the four speech channels is recorded at 16 kHz, 16-bit, uncompressed unsigned integers in Intel format (lo-hi byte order). To each signal file corresponds an ASCII SAM label file which contains the relevant descriptive information.
Each speaker uttered the following items (over 290 items for adults and over 210 items for children):
Calibration data: 6 noise recordings; the "silence word" recording.
Free spontaneous items (adults only): 5 minutes (session time) of free spontaneous, rich-context items (story telling), from an open number of spontaneous topics out of a set of 30 topics.
17 elicited spontaneous items (adults only): 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language.
Read speech: 30 phonetically rich sentences uttered by adults and 60 uttered by children; 5 phonetically rich words (adults only); 4 isolated digits; 1 isolated digit sequence; 4 connected digit sequences; 1 telephone number; 3 natural numbers; 1 money amount; 2 time phrases (T1: analogue, T2: digital); 3 dates (D1: analogue, D2: relative and general date, D3: digital); 3 letter sequences; 1 proper name; 2 city or street names; 2 questions; 2 special keyboard characters; 1 Web address; 1 email address.
208 application-specific words and phrases per session (adults); 74 toy commands, 14 phone commands, and 34 general commands (children).
The following age distribution has been obtained:
Adults: 321 speakers are between 16 and 30, 182 speakers are between 31 and 45, 103 speakers are over 46.
Children: all 51 speakers are between 11 and 14.
A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
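Because the signal files are described above as headerless 16 kHz, 16-bit little-endian PCM, they can be read directly; the dtype follows the description's "unsigned integers", and the file name is hypothetical:

import numpy as np

samples = np.fromfile("SA001C00.UA0", dtype="<u2")  # hypothetical file name
print(f"{len(samples)} samples, {len(samples) / 16000:.2f} s of audio")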