100+ datasets found
  1. Data from: English Wikipedia - Species Pages

    • gbif.org
    • demo.gbif.org
    Updated Aug 23, 2022
    + more versions
    Cite
    Markus Döring (2022). English Wikipedia - Species Pages [Dataset]. http://doi.org/10.15468/c3kkgh
    Dataset updated
    Aug 23, 2022
    Dataset provided by
    Wikimedia Foundation (http://www.wikimedia.org/)
    Global Biodiversity Information Facility (https://www.gbif.org/)
    Authors
    Markus Döring
    Description

    Species pages extracted from the English Wikipedia article XML dump from 2022-08-02. Multimedia, vernacular names and textual descriptions are extracted, but only pages with a taxobox or speciesbox template are recognized.

    See https://github.com/mdoering/wikipedia-dwca for details.
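
    The extraction is driven by the presence of a taxobox or speciesbox template in the page wikitext. As a minimal, illustrative sketch (not the wikipedia-dwca logic itself), pages carrying such a template can be filtered out of a MediaWiki XML dump as below; the dump filename is a placeholder and the regex check is a simplification.

    import re
    import xml.etree.ElementTree as ET

    DUMP_PATH = "enwiki-20220802-pages-articles.xml"   # placeholder; real dumps are bz2-compressed
    SPECIES_TEMPLATE = re.compile(r"\{\{\s*(speciesbox|taxobox)\b", re.IGNORECASE)

    def species_pages(path):
        """Yield (title, wikitext) for pages containing a Taxobox or Speciesbox template."""
        for _, elem in ET.iterparse(path, events=("end",)):
            if elem.tag.rsplit("}", 1)[-1] != "page":   # drop the MediaWiki export namespace
                continue
            title = elem.findtext("./{*}title", default="")
            text = elem.findtext("./{*}revision/{*}text", default="") or ""
            if SPECIES_TEMPLATE.search(text):
                yield title, text
            elem.clear()                                # keep memory bounded on large dumps

    if __name__ == "__main__":
        for title, _ in species_pages(DUMP_PATH):
            print(title)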

  2. Chinese-English Translation Dataset for Humanities and Social Sciences

    • scidb.cn
    Updated Dec 12, 2024
    Cite
    sun guang yao (2024). Chinese-English Translation Dataset for Humanities and Social Sciences [Dataset]. http://doi.org/10.57760/sciencedb.j00133.00372
    Dataset updated
    Dec 12, 2024
    Dataset provided by
    Science Data Bank
    Authors
    sun guang yao
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The dataset comes from Exploring the Construction of Chinese-English Terminology Knowledge Base in the Humanities and Social Sciences: Theory and Methods, published by Nanjing University Press, which was written to meet the needs of two-way Chinese-English information exchange and research in the humanities and social sciences. To construct a high-quality dataset that improves the performance of large language models and covers more disciplinary categories in the humanities and social sciences, the collected data were preprocessed with data expansion and feature extraction, yielding a corpus of high-quality Chinese-English terminology cross-references spanning different disciplinary categories. The Chinese and English data were then combined into a bidirectional Chinese-English dataset. For the instruction fine-tuning experiments, a variety of instruction prompts were designed; since different prompts significantly affect the model's output, the final choice was to incorporate the humanities and social sciences discipline into the structural attribute "instruction" as the prompt for instruction fine-tuning.
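
    The listing does not give the exact prompt template, so the record below is only a hypothetical sketch of how a discipline label might be folded into the "instruction" attribute of a fine-tuning example; the field names and the sample term pair are assumptions.

    import json

    def make_record(discipline, zh_term, en_term):
        """Build one illustrative instruction-tuning example with the discipline
        embedded in the 'instruction' attribute, as the description suggests."""
        return {
            "instruction": f"Translate the following {discipline} term from Chinese to English.",
            "input": zh_term,
            "output": en_term,
        }

    record = make_record("sociology", "社会分层", "social stratification")
    print(json.dumps(record, ensure_ascii=False, indent=2))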

  3. English Conversation and Monologue speech dataset

    • kaggle.com
    Updated Jun 7, 2024
    Cite
    Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    English(America) Real-world Casual Conversation and Monologue speech dataset

    English (America) Real-world Casual Conversation and Monologue speech dataset, covering self-media, conversation, livestreams, lectures, variety shows, and more, mirroring real-world interactions. Transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers, enhancing model performance in real and complex tasks, and has been quality-tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; the dataset is GDPR, CCPA, and PIPL compliant. For more details, please refer to: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

    Format

    16kHz, 16 bit, wav, mono channel;

    Content category

    Including self-media, conversation, live, lecture, variety-show, etc;

    Recording environment

    Low background noise;

    Country

    America(USA);

    Language(Region) Code

    en-US;

    Language

    English;

    Features of annotation

    Transcription text, timestamp, speaker ID, gender.

    Accuracy Rate

    Sentence Accuracy Rate (SAR) 95%

    Licensing Information

    Commercial License
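
    A minimal sketch for verifying that an audio file matches the stated format (16 kHz, 16-bit, mono WAV), using only the Python standard library; the filename is a placeholder.

    import wave

    def check_format(path, expect_rate=16000, expect_width=2, expect_channels=1):
        """Check a WAV file against the 16 kHz / 16-bit / mono format stated above."""
        with wave.open(path, "rb") as wav:
            ok = (wav.getframerate() == expect_rate
                  and wav.getsampwidth() == expect_width
                  and wav.getnchannels() == expect_channels)
            duration = wav.getnframes() / wav.getframerate()
        return ok, duration

    ok, seconds = check_format("sample_0001.wav")   # placeholder filename
    print(f"format ok: {ok}, duration: {seconds:.1f}s")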

  4. wiktionary-data

    • huggingface.co
    Updated Nov 26, 2024
    + more versions
    Cite
    Paion Data (2024). wiktionary-data [Dataset]. https://huggingface.co/datasets/paion-data/wiktionary-data
    Dataset updated
    Nov 26, 2024
    Dataset authored and provided by
    Paion Data
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Wiktionary Data on Hugging Face Datasets

    wiktionary-data is a sub-data extraction of the English Wiktionary that currently supports the following languages:

    Deutsch - German
    Latinum - Latin
    Ἑλληνική - Ancient Greek
    한국어 - Korean
    𐎠𐎼𐎹 - Old Persian
    𒀝𒅗𒁺𒌑(𒌝) - Akkadian
    Elamite
    संस्कृतम् - Sanskrit (Classical Sanskrit)

    wiktionary-data was originally a sub-module of wilhelm-graphdb. As the dataset was getting bigger, I noticed a wave of more exciting potentials this… See the full description on the dataset page: https://huggingface.co/datasets/paion-data/wiktionary-data.
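
    A minimal sketch for pulling the data with the Hugging Face datasets library; whether a configuration name is required, and which splits exist, are assumptions to be checked against the dataset page.

    from datasets import load_dataset

    # If the dataset defines multiple configurations, pass the config name as the
    # second argument; the bare call below assumes a default config and a "train" split.
    ds = load_dataset("paion-data/wiktionary-data", split="train")

    print(ds)        # features and number of rows
    print(ds[0])     # first record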

  5. Hispanic English Dataset

    • hmn.shaip.com
    Updated Aug 8, 2024
    + more versions
    Cite
    Shaip (2024). Hispanic English Dataset [Dataset]. https://hmn.shaip.com/offerings/speech-data-catalog/hispanic-english-english-dataset/
    Dataset updated
    Aug 8, 2024
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    High-Quality Hispanic English Call-Center and Podcast Dataset for AI & Speech Models, covering call-center data and podcast data.

  6. English Bay, AK

    • catalog.data.gov
    • data.ioos.us
    Updated Aug 27, 2025
    Cite
    NOAA Center for Operational Oceanographic Products and Services (CO-OPS) (Point of Contact) (2025). English Bay, AK [Dataset]. https://catalog.data.gov/dataset/english-bay-ak2
    Dataset updated
    Aug 27, 2025
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    Area covered
    English Bay
    Description

    Timeseries data from 'English Bay, AK' (noaa_nos_co_ops_9462641)
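
    The underlying time series for this station is served by the NOAA CO-OPS data API. The sketch below requests water levels for station 9462641 (the ID embedded in the dataset identifier noaa_nos_co_ops_9462641); the product, datum, and date range are illustrative assumptions.

    import requests

    URL = "https://api.tidesandcurrents.noaa.gov/api/prod/datagetter"
    params = {
        "station": "9462641",        # English Bay, AK
        "product": "water_level",    # assumed product; others include air_temperature, wind, ...
        "datum": "MLLW",
        "begin_date": "20240101",
        "end_date": "20240102",
        "time_zone": "gmt",
        "units": "metric",
        "format": "json",
    }

    resp = requests.get(URL, params=params, timeout=30)
    resp.raise_for_status()
    for obs in resp.json().get("data", []):
        print(obs["t"], obs["v"])    # timestamp and observed value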

  7. English Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    + more versions
    Cite
    FutureBee AI (2022). English Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    AI Data License Agreement: https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native English speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level English usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
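
    The exact delivery schema is not published in this listing, so the record below is only a hypothetical sketch of the structure described above (full chat, topic tag, metadata block); every field name is an assumption.

    import json

    chat_record = {
        "topic": "Food and cooking",
        "metadata": {
            "participant_age": [27, 31],
            "gender": ["female", "male"],
            "country_region": "United States",
            "chat_domain": "General",
            "dialect": "General American",
        },
        "turns": [
            {"speaker": "A", "text": "Tried that new ramen place yet?"},
            {"speaker": "B", "text": "Not yet! Worth the hype?"},
        ],
    }

    print(json.dumps(chat_record, indent=2))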

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  8. Languages and English Ability - Seattle Neighborhoods

    • data-seattlecitygis.opendata.arcgis.com
    • data.seattle.gov
    • +4more
    Updated Feb 22, 2024
    + more versions
    Cite
    City of Seattle ArcGIS Online (2024). Languages and English Ability - Seattle Neighborhoods [Dataset]. https://data-seattlecitygis.opendata.arcgis.com/datasets/SeattleCityGIS::languages-and-english-ability-seattle-neighborhoods
    Dataset updated
    Feb 22, 2024
    Dataset authored and provided by
    City of Seattle ArcGIS Online
    Area covered
    Seattle
    Description

    Table from the American Community Survey (ACS) 5-year series on languages spoken and English ability related topics for City of Seattle Council Districts, Comprehensive Plan Growth Areas and Community Reporting Areas. Table includes B16004 Age by Language Spoken at Home by Ability to Speak English and C16002 Household Language by Household Limited English-Speaking Status. Data is pulled from block group tables for the most recent ACS vintage and summarized to the neighborhoods based on block group assignment. Table created for and used in the Neighborhood Profiles application.

    Vintages: 2023. ACS Table(s): B16004, C16002. Data downloaded from: Census Bureau's Explore Census Data. See the United States Census Bureau's American Community Survey (ACS) resources: About the Survey, Geography & ACS, Technical Documentation, News & Updates.

    This ready-to-use layer can be used within ArcGIS Pro, ArcGIS Online, its configurable apps, dashboards, Story Maps, custom apps, and mobile apps. Data can also be exported for offline workflows. Please cite the Census and ACS when using this data.

    Data Note from the Census: Data are based on a sample and are subject to sampling variability. The degree of uncertainty for an estimate arising from sampling variability is represented through the use of a margin of error. The value shown here is the 90 percent margin of error. The margin of error can be interpreted as providing a 90 percent probability that the interval defined by the estimate minus the margin of error and the estimate plus the margin of error (the lower and upper confidence bounds) contains the true value. In addition to sampling variability, the ACS estimates are subject to nonsampling error (for a discussion of nonsampling variability, see Accuracy of the Data). The effect of nonsampling error is not represented in these tables.

    Data Processing Notes: Boundaries come from the US Census TIGER geodatabases, specifically the National Sub-State Geography Database (named tlgdb_(year)_a_us_substategeo.gdb). Boundaries are updated at the same time as the data updates (annually), and the boundary vintage appropriately matches the data vintage as specified by the Census. These are Census boundaries with water and/or coastlines erased for cartographic and mapping purposes. For census tracts, the water cutouts are derived from a subset of the 2020 Areal Hydrography boundaries offered by TIGER. Water bodies and rivers which are 50 million square meters or larger (mid to large sized water bodies) are erased from the tract level boundaries, as well as additional important features. For state and county boundaries, the water and coastlines are derived from the coastlines of the 2020 500k TIGER Cartographic Boundary Shapefiles. These are erased to more accurately portray the coastlines and Great Lakes. The original AWATER and ALAND fields are still available as attributes within the data table (units are square meters). The States layer contains 52 records: all US states, Washington D.C., and Puerto Rico. Census tracts with no population that occur in areas of water, such as oceans, are removed from this data service (Census Tracts beginning with 99). Percentages and derived counts, and associated margins of error, are calculated values (identifiable by the "_calc_" stub in the field name) and abide by the specifications defined by the American Community Survey. Field alias names were created based on the Table Shells file available from the American Community Survey Summary File Documentation page.

    Negative values (e.g., -4444...) have been set to null, with the exception of -5555..., which has been set to zero. These negative values exist in the raw API data to indicate the following situations: the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error, and thus the margin of error, so a statistical test is not appropriate; either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest or upper interval of an open-ended distribution; the median falls in the lowest or upper interval of an open-ended distribution, so a statistical test is not appropriate; the estimate is controlled, so a statistical test for sampling variability is not appropriate; or the data for this geographic area cannot be displayed because the number of sample cases is too small.
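
    A small sketch of the sentinel handling described above, assuming the layer has been exported to CSV; the filename, column names, and the -4000 threshold used to spot sentinel codes are assumptions.

    import numpy as np
    import pandas as pd

    df = pd.read_csv("languages_english_ability_seattle.csv")   # placeholder export

    def clean_sentinels(col: pd.Series) -> pd.Series:
        """-5555... sentinels become 0; other large negative sentinels become null."""
        if not pd.api.types.is_numeric_dtype(col):
            return col
        as_text = col.astype("string")
        is_5555 = as_text.str.startswith("-5555", na=False)
        is_other_sentinel = (col < -4000) & ~is_5555
        return col.mask(is_5555, 0).mask(is_other_sentinel, np.nan)

    df = df.apply(clean_sentinels)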

  9. Data from: English-French Translation Dataset

    • kaggle.com
    zip
    Updated Feb 9, 2021
    + more versions
    Cite
    Dhruvil Dave (2021). English-French Translation Dataset [Dataset]. http://doi.org/10.34740/kaggle/dsv/1926230
    Available download formats: zip (2,731,897,777 bytes)
    Dataset updated
    Feb 9, 2021
    Authors
    Dhruvil Dave
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    French/English parallel texts for training translation models: over 22.5 million sentences in French and English. The dataset was created by Chris Callison-Burch, who crawled millions of web pages, used a set of simple heuristics to transform French URLs into English URLs, and assumed that the paired documents are translations of each other. This is the main dataset of the 2015 Workshop on Statistical Machine Translation (WMT15) and can be used for machine translation and language models. Refer to the paper here: http://www.statmt.org/wmt15/pdf/WMT01.pdf
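
    A minimal sketch of loading the sentence pairs with pandas; the file name and column names are assumptions to be checked against the Kaggle files.

    import pandas as pd

    # Assumed layout: a CSV with one English and one French column.
    pairs = pd.read_csv("en-fr.csv", usecols=["en", "fr"], nrows=100_000).dropna()

    print(len(pairs), "sentence pairs loaded")
    print(pairs.iloc[0]["en"], "->", pairs.iloc[0]["fr"])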

    Citation

    @InProceedings{bojar-EtAl:2015:WMT,
     author  = {Bojar, Ond\v{r}ej and Chatterjee, Rajen and Federmann, Christian and Haddow, Barry and Huck, Matthias and Hokamp, Chris and Koehn, Philipp and Logacheva, Varvara and Monz, Christof and Negri, Matteo and Post, Matt and Scarton, Carolina and Specia, Lucia and Turchi, Marco},
     title   = {Findings of the 2015 Workshop on Statistical Machine Translation},
     booktitle = {Proceedings of the Tenth Workshop on Statistical Machine Translation},
     month   = {September},
     year   = {2015},
     address  = {Lisbon, Portugal},
     publisher = {Association for Computational Linguistics},
     pages   = {1--46},
     url    = {http://aclweb.org/anthology/W15-3001}
    }
    

    Image Credits: Unsplash - chriskaridis

  10. Replication Data for: Hindi-English code-mixed Twitter dataset

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Feb 7, 2024
    Cite
    Anonymous (2024). Replication Data for: Hindi-English code-mixed Twitter dataset [Dataset]. http://doi.org/10.7910/DVN/BIUUW4
    Dataset updated
    Feb 7, 2024
    Dataset provided by
    Harvard Dataverse
    Authors
    Anonymous
    Description

    This directory contains a large-scale Hindi-English code-mixed corpus collected from Twitter between 2010 and 2022. Identifiers have been removed to anonymize the dataset, including the tweet author IDs. Additionally, we have calculated the code-mixing index (CMI) and identified the language of each text (Hindi, English, or Hindi-English code-mixed).
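
    The listing does not spell out how CMI is computed; a commonly used formulation (Das and Gambäck) scores an utterance as 100 * (1 - max_i w_i / (n - u)), where w_i counts the tokens of each language, n is the total token count, and u the count of language-independent tokens. A sketch under that assumption:

    from collections import Counter

    def cmi(token_langs):
        """Code-mixing index for one utterance, given per-token language tags.
        Tags like 'hi' and 'en'; language-independent tokens tagged 'univ'."""
        n = len(token_langs)
        counts = Counter(t for t in token_langs if t != "univ")
        u = n - sum(counts.values())
        if not counts or n == u:
            return 0.0
        return 100.0 * (1.0 - max(counts.values()) / (n - u))

    print(cmi(["hi", "hi", "en", "hi", "univ", "en"]))   # mixed tweet -> 40.0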

  11. tiny-english-asr-sample-data

    • huggingface.co
    Updated Jul 25, 2025
    Cite
    Muhammad Ali Abbas (2025). tiny-english-asr-sample-data [Dataset]. https://huggingface.co/datasets/m-aliabbas1/tiny-english-asr-sample-data
    Dataset updated
    Jul 25, 2025
    Authors
    Muhammad Ali Abbas
    Description

    m-aliabbas1/tiny-english-asr-sample-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  12. Programs for English Language Learners (OCR)

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Aug 13, 2023
    Cite
    Office for Civil Rights (OCR) (2023). Programs for English Language Learners (OCR) [Dataset]. https://catalog.data.gov/dataset/programs-for-english-language-learners-ocr
    Dataset updated
    Aug 13, 2023
    Dataset provided by
    Office for Civil Rights (OCR)
    Description

    The Office for Civil Rights (OCR), U.S. Department of Education developed these materials in response to requests from school districts for a reference tool to assist them through the process of developing a comprehensive English language proficiency or English language learners (ELL) program.

  13. Data from: Neural Language Models for Nineteenth-Century English (dataset; language model zoo)

    • data.niaid.nih.gov
    Updated May 23, 2021
    Cite
    Beelen, Kaspar (2021). Neural Language Models for Nineteenth-Century English (dataset; language model zoo) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4779090
    Dataset updated
    May 23, 2021
    Dataset provided by
    Coll Ardanuy, Mariona
    Beelen, Kaspar
    Hosseini, Kasra
    Colavizza, Giovanni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains four types of neural language models trained on a large historical dataset of books in English, published between 1760-1900 and comprised of ~5.1 billion tokens. The language model architectures include static (word2vec and fastText) and contextualized models (BERT and Flair). For each architecture, we trained a model instance using the whole dataset. Additionally, we trained separate instances on text published before 1850 for the two static models, and four instances considering different time slices for BERT.

    Github repository: https://github.com/Living-with-machines/histLM
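
    A minimal sketch of querying one of the static models with gensim, assuming the word2vec instances are shipped in gensim's native format; the file name is a placeholder, the actual names are listed in the histLM repository.

    from gensim.models import Word2Vec

    model = Word2Vec.load("histLM_word2vec_1760_1900.model")   # placeholder path

    # Nearest neighbours in the historical embedding space.
    for word, score in model.wv.most_similar("machine", topn=5):
        print(f"{word}\t{score:.3f}")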

  14. Boston English Dataset

    • zu.shaip.com
    Updated Jun 15, 2023
    Cite
    Shaip (2023). I-Boston English Dataset [Dataset]. https://zu.shaip.com/offerings/speech-data-catalog/boston-english-dataset/
    Dataset updated
    Jun 15, 2023
    Dataset authored and provided by
    Shaip
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    Boston
    Description

    High-Quality Boston English Call-Center, General Conversation, and Podcast Dataset for AI & Speech Models, covering call-center data, general conversation data, and podcast data.

  15. Predict Future Sales (translated to English)

    • kaggle.com
    Updated Nov 24, 2020
    Cite
    YWenLin (2020). Predict Future Sales (translated to English) [Dataset]. https://www.kaggle.com/datasets/ywhenlyn/predict-future-sales-translated-to-english/versions/2
    Dataset updated
    Nov 24, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    YWenLin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original data from Predict Future Sales (Kaggle competition). item_categories.csv, shops.csv, and items.csv were translated from Russian to English for easier feature engineering and reference.

    File Information

    Translated item descriptions and shop names from Russian to English.
    items.csv - supplemental information about the items/products.
    item_categories.csv - supplemental information about the item categories.
    shops.csv - supplemental information about the shops.

    Column Description

    • ID - an Id that represents a (Shop, Item) tuple within the test set
    • shop_id - unique identifier of a shop
    • item_id - unique identifier of a product
    • item_name - name of item
    • shop_name - name of shop
    • item_category_name - name of item category
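
    A short sketch joining the translated lookup tables onto the item table; the join key item_category_id follows the original competition files and should be verified against the translated CSVs.

    import pandas as pd

    items = pd.read_csv("items.csv")                   # item_name, item_id, item_category_id
    categories = pd.read_csv("item_categories.csv")    # item_category_name, item_category_id
    shops = pd.read_csv("shops.csv")                   # shop_name, shop_id

    # Attach the English category name to every item.
    items_en = items.merge(categories, on="item_category_id", how="left")
    print(items_en[["item_id", "item_name", "item_category_name"]].head())
    print(shops.head())
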
  16. Data from: The Corpus of Historical American English (COHA)

    • dataverse.harvard.edu
    Updated Apr 24, 2025
    Cite
    Mark Davies (2025). The Corpus of Historical American English (COHA) [Dataset]. http://doi.org/10.7910/DVN/IFMZJY
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Mark Davies
    License

    Custom Dataverse license: https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/IFMZJY

    Description

    The Corpus of Historical American English (COHA) was created by Mark Davies, and it is the largest structured corpus of historical English. It is related to other corpora from English-Corpora.org, which are the most widely used corpora of English and which offer unparalleled insight into variation in English. COHA contains more than 475 million words of text from the 1820s-2010s (which makes it 50-100 times as large as other comparable historical corpora of English) and the corpus is balanced by genre decade by decade. The creation of the corpus results from a grant from the National Endowment for the Humanities (NEH) from 2008-2010.

  17. 14,511 Images English Handwriting OCR Data

    • nexdata.ai
    • m.nexdata.ai
    Updated Sep 29, 2023
    Cite
    Nexdata (2023). 14,511 Images English Handwriting OCR Data [Dataset]. https://www.nexdata.ai/datasets/ocr/1215
    Dataset updated
    Sep 29, 2023
    Dataset provided by
    nexdata technology inc
    Nexdata
    Authors
    Nexdata
    Variables measured
    Device, Accuracy, Data size, Data format, Data content, Photographic angle, Collecting environment, Population distribution, Nationality distribution
    Description

    14,511 Images English Handwriting OCR Data. The text carriers are A4 paper, lined paper, English paper, etc. The capture device is a cellphone, and the collection angle is eye level. The dataset content includes English compositions, poetry, prose, news, stories, etc. For annotation, line-level quadrilateral bounding boxes and transcriptions of the texts were annotated. The dataset can be used for tasks such as English handwriting OCR.
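
    The listing names line-level quadrilateral boxes plus transcriptions but no concrete schema, so the record below is a purely hypothetical sketch of one annotated text line; all keys are assumptions.

    import json

    line_annotation = {
        "image": "IMG_0001.jpg",
        "quad": [[112, 340], [958, 352], [955, 408], [110, 396]],  # four corner points
        "transcription": "The quick brown fox jumps over the lazy dog.",
    }

    print(json.dumps(line_annotation, indent=2))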

  18. 207 Hours – Canadian Speaking English Speech Data by Mobile Phone

    • nexdata.ai
    Updated Nov 21, 2023
    Cite
    Nexdata (2023). 207 Hours – Canadian Speaking English Speech Data by Mobile Phone [Dataset]. https://www.nexdata.ai/datasets/speechrecog/1047
    Dataset updated
    Nov 21, 2023
    Dataset authored and provided by
    Nexdata
    Area covered
    Canada
    Variables measured
    Format, Country, Speaker, Language, Accuracy Rate, Content category, Recording device, Recording condition
    Description

    English (Canada) Scripted Monologue Smartphone speech dataset, collected from monologues based on given scripts, covering the generic domain, human-machine interaction, smart home command and control, in-car command and control, numbers, and other domains. Transcribed with text content and other attributes. The dataset was collected from an extensive and geographically diverse pool of speakers (466 people in total), enhancing model performance in real and complex tasks, and has been quality-tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; the dataset is GDPR, CCPA, and PIPL compliant.

  19. EMEA Data Suite | 3.3M Translations | 1.9M Words | 23 Languages | Natural Language Processing (NLP) Data | Translation Data | TTS | EMEA Coverage

    • datarade.ai
    Updated Aug 8, 2025
    Cite
    Oxford Languages (2025). EMEA Data Suite | 3.3M Translations | 1.9M Words | 23 Languages | Natural Language Processing (NLP) Data | Translation Data | TTS | EMEA Coverage [Dataset]. https://datarade.ai/data-products/emea-data-suite-3-3m-translations-1-9m-words-23-languag-oxford-languages
    Available download formats: .csv, .json, .mp3, .txt, .wav, .xls, .xml
    Dataset updated
    Aug 8, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Syrian Arab Republic, Burundi, Uganda, Israel, Seychelles, Spain, Romania, Central African Republic, Bosnia and Herzegovina, Morocco
    Description

    Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data: headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Sentence Corpora: curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data: native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists: frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
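
    Record schemas differ per dataset, so the snippet below is only a hypothetical sketch of walking a bilingual dictionary entry delivered as JSON; all field names are assumptions.

    import json

    entry = json.loads("""
    {
      "headword": "bright",
      "pos": "adjective",
      "senses": [
        {"definition": "giving out much light",
         "translations": [{"lang": "es", "text": "brillante"}],
         "examples": ["a bright lamp"]}
      ]
    }
    """)

    for sense in entry["senses"]:
        for tr in sense["translations"]:
            print(f'{entry["headword"]} ({entry["pos"]}) -> {tr["lang"]}: {tr["text"]}')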

    If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Arabic Monolingual Dictionary Data: 66,500 headwords | 98,700 senses | 70,000 examples.

    2. Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 translation sentences.

    3. Arabic Synonyms and Antonyms Data: 55,100 synonyms.

    4. British English Monolingual Dictionary Data: 146,000 headwords | 230,000 senses | 149,000 examples.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) |180,000 audio files.

    7. Catalan Monolingual Dictionary Data: 29,800 headwords | 47,400 senses | 25,600 examples.

    8. Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 translation sentences.

    9. Croatian Monolingual Dictionary Data: 129,600 headwords | 164,760 senses | 34,630 examples.

    10. Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 translation sentences.

    11. Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 translation sentences.

    12. Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 translation sentences.

    13. French Monolingual Dictionary Data: 42,000 headwords | 56,000 senses | 43,000 examples.

    14. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 translation sentences.

    15. German Monolingual Dictionary Data: 85,500 headwords | 78,000 senses | 55,000 examples.

    16. German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 translation sentences.

    17. German Word List Data: 338,000 wordforms.

    18. Greek Monolingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 translation sentences.

    19. Hebrew Monolingual Dictionary Data: 85,600 headwords | 104,100 senses | 94,000 examples.

    20. Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 translation sentences.

    21. Hungarian Monolingual Dictionary Data: 90,500 headwords | 155,300 senses | 42,500 examples.

    22. Italian Monolingual Dictionary Data: 102,500 headwords | 231,580 senses | 48,200 examples.

    23. Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 translation sentences.

    24. Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.

    25. Latvian Monolingual Dictionary Data: 36,000 headwords | 43,600 senses | 73,600 examples.

    26. Persian Bilingual Dictionary Data: 30,660 translations | 19,780 senses | 30,660 translation sentences.

    27. Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 translation sentences.

    28. Portuguese Monolingual Dictionary Data: 143,600 headwords | 285,500 senses | 69,300 examples.

    29. Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 translation sentences.

    30. Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.

    31. Romanian Monolingual Dictionary Data: 66,900 headwords | 113,500 senses | 2,700 examples.

    32. Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 translation sentences.

    33. Russian Monolingual Dictionary Data: 65,950 headwords | 57,500 senses | 51,900 examples.

    34. Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 translation sentences.

    35. Slovak Bilingual Dictionary Data: 254,300 translations | 172,100 senses | 85,000 translation sentences.

    36. Spanish Monolingual Dictionary Data: 73,000 headwords | 123,000 senses | 104,000 examples.

    37. Spanish Bilingu...

  20. UK English Speecon database

    • catalogue.elra.info
    Updated Feb 22, 2007
    + more versions
    Cite
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency) (2007). UK English Speecon database [Dataset]. https://catalogue.elra.info/en-us/repository/browse/ELRA-S0215/
    Dataset updated
    Feb 22, 2007
    Dataset provided by
    ELRA (European Language Resources Association) and its operational body ELDA (Evaluations and Language resources Distribution Agency)
    ELRA (European Language Resources Association)
    License

    ELRA End User licence: https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_END_USER.pdf

    ELRA VAR licence: https://catalogue.elra.info/static/from_media/metashare/licences/ELRA_VAR.pdf

    Area covered
    United Kingdom
    Description

    The UK English Speecon database is divided into 2 sets: 1) The first set comprises the recordings of 606 adult UK English speakers (325 males, 281 females), recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place), and consisting of about 195 hours of audio data. 2) The second set comprises the recordings of 51 child UK English speakers (14 boys, 37 girls), recorded over 4 microphone channels in 1 recording environment (children's room), and consisting of about 9 hours of audio data. The database is partitioned into 31 DVDs (first set) and 4 DVDs (second set).

    The speech databases made within the Speecon project were validated by SPEX, the Netherlands, to assess their compliance with the Speecon format and content specifications.

    Each of the four speech channels is recorded at 16 kHz, 16 bit, uncompressed unsigned integers in Intel format (lo-hi byte order). Each signal file is accompanied by an ASCII SAM label file which contains the relevant descriptive information.

    Each speaker uttered the following items (over 290 items for adults and over 210 items for children):

    Calibration data: 6 noise recordings and the "silence word" recording.

    Free spontaneous items (adults only): 5 minutes (session time) of free spontaneous, rich-context items (storytelling), drawn from an open number of spontaneous topics out of a set of 30 topics.

    17 elicited spontaneous items (adults only): 3 dates, 2 times, 3 proper names, 2 city names, 1 letter sequence, 2 answers to questions, 3 telephone numbers, 1 language.

    Read speech: 30 phonetically rich sentences uttered by adults and 60 uttered by children; 5 phonetically rich words (adults only); 4 isolated digits; 1 isolated digit sequence; 4 connected digit sequences; 1 telephone number; 3 natural numbers; 1 money amount; 2 time phrases (T1: analogue, T2: digital); 3 dates (D1: analogue, D2: relative and general date, D3: digital); 3 letter sequences; 1 proper name; 2 city or street names; 2 questions; 2 special keyboard characters; 1 Web address; 1 email address.

    Application-specific items: 208 application-specific words and phrases per session (adults); 74 toy commands, 14 phone commands and 34 general commands (children).

    The following age distribution has been obtained: Adults: 321 speakers are between 16 and 30, 182 speakers are between 31 and 45, 103 speakers are over 46. Children: all 51 speakers are between 11 and 14.

    A pronunciation lexicon with a phonemic transcription in SAMPA is also included.
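
    Each signal file is paired with an ASCII SAM label file of "MNEMONIC: value" lines; the parser below is a generic sketch of reading such a file into a dict and does not target the exact Speecon mnemonic set. The filename and encoding are assumptions.

    def read_sam_labels(path, encoding="latin-1"):
        """Parse an ASCII SAM label file of 'KEY: value' lines into a dict.
        Repeated keys are collected into lists."""
        labels = {}
        with open(path, encoding=encoding) as fh:
            for line in fh:
                if ":" not in line:
                    continue
                key, value = (part.strip() for part in line.split(":", 1))
                if key in labels:
                    previous = labels[key]
                    labels[key] = previous + [value] if isinstance(previous, list) else [previous, value]
                else:
                    labels[key] = value
        return labels

    print(read_sam_labels("SA001C00.SAM"))   # placeholder filename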
