100+ datasets found

Data from: ColloCaid Sample Data
figshare.com
openresearch.surrey.ac.uk
zip
Updated May 30, 2023
Cite
Ana Frankenberg-Garcia; Geraint Paul Rees; Robert Lew (2023). ColloCaid Sample Data [Dataset]. http://doi.org/10.6084/m9.figshare.13028207.v2
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.13028207.v2
Dataset updated
May 30, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Ana Frankenberg-Garcia; Geraint Paul Rees; Robert Lew
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COLLOCAID SAMPLE DATA

The ColloCaid Sample Data comprises approximately 2% of the ColloCaid lexical database. The sample covers 692 strong academic English collocations (LogDice >5.0) for 16 core academic lemmas used as collocation bases (or nodes): 5 nouns, 5 verbs, and 6 adjectives. The selection aims to give an overview of the range of data included in the full dataset. This includes collocations with bases classified with more than one part-of-speech tag (e.g. DEBATE, INDIVIDUAL), polysemous collocation bases giving rise to distinct collocation patterns (e.g. CODE), as well as collocation bases that evoke a very large and a very small number of collocations. The strongest eight lexical collocations listed for each base are enriched with three different curated example sentences adapted from corpora of expert academic English writing.

COLLOCAID LEXICAL DATA 1.1

The full ColloCaid lexical dataset consists of:
• 572 core academic English lemmas (311 nouns, 184 verbs and 77 adjectives)
• 32,645 academic collocations with the above lemmas
• 29,028 example sentences of collocations in context

Further information at http://www.collocaid.uk/
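
The LogDice >5.0 threshold mentioned above can be illustrated with a minimal sketch. It assumes the standard definition of the logDice association score, 14 + log2(2·f_xy / (f_x + f_y)); the listing does not say exactly how ColloCaid computed its scores, and the function name and frequency counts below are hypothetical.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return the logDice score for a collocation.

    f_xy: co-occurrence frequency of base and collocate
    f_x:  corpus frequency of the base (node)
    f_y:  corpus frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    # Standard logDice formula; the theoretical maximum is 14.
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts, for illustration only (not taken from ColloCaid).
score = log_dice(f_xy=50, f_x=2000, f_y=1500)
is_strong = score > 5.0  # threshold quoted in the dataset description
print(round(score, 2), is_strong)
```

Because logDice depends only on the three frequencies and not on corpus size, scores of this kind are comparable across corpora of different sizes.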
text-clustering-example-data
huggingface.co
Updated Nov 20, 2024
Cite
Jacob Moore (2024). text-clustering-example-data [Dataset]. https://huggingface.co/datasets/billingsmoore/text-clustering-example-data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Nov 20, 2024
Authors
Jacob Moore
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Dataset Card for Dataset Name

This dataset consists of 925 sentences in English paired with a broad topic descriptor for use as example data in product demonstrations or student projects.

Curated by: billingsmoore Language(s) (NLP): English License: Apache License 2.0

Direct Use

This data can be loaded using the following Python code:

from datasets import load_dataset

ds = load_dataset('billingsmoore/text-clustering-example-data')

It can then be clustered using the… See the full description on the dataset page: https://huggingface.co/datasets/billingsmoore/text-clustering-example-data.
Central Statistical Office Dataset
live.european-language-grid.eu
data.europa.eu
xml
Updated Sep 9, 2022
Cite
(2022). Central Statistical Office Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18867
Explore at:
Available download formats: xml
Dataset updated
Sep 9, 2022
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
Two Polish-English publications of the Polish Central Statistical Office in the XLIFF format: 1. "Statistical Yearbook of the Republic of Poland 2015" is the main summary publication of the Central Statistical Office, including a comprehensive set of statistical data describing the condition of the natural environment, the socio-economic and demographic situation of Poland, and its position in Europe and in the world. 2. "Women in Poland" contains statistical information regarding women's place and participation in socio-economic life of the country including international comparisons. The texts were aligned at the level of translation segments (mostly sentences and short paragraphs) and manually verified.
South English, IA Population Breakdown by Gender and Age Dataset: Male and...
neilsberg.com
csv, json
Updated Feb 24, 2025
Cite
Neilsberg Research (2025). South English, IA Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e200d9ce-f25d-11ef-8c1b-3860777c1fe6/
Explore at:
Available download formats: csv, json
Dataset updated
Feb 24, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
South English
Variables measured
Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the population of South English by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for South English. The dataset can be used to understand the population distribution of South English by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in South English. Additionally, it can be used to see how the gender ratio changes from the youngest to the oldest age group, and to compare the male-to-female ratio across age groups in South English.

Key observations

Largest age group (population): Male # 45-49 years (24) | Female # 65-69 years (13). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over

Scope of gender:

Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

Variables / Data Columns

Age Group: This column displays the age group for the South English population analysis. There are 18 expected values, defined above in the age groups section.

Population (Male): This column shows the male population of South English.

Population (Female): This column shows the female population of South English.

Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in South English for each age group.
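
The Gender Ratio column described above (males per 100 females) can be reproduced from the two population columns. The following is a minimal sketch; the function name and the sample counts are hypothetical and are not taken from the dataset.

```python
from typing import Optional


def gender_ratio(males: int, females: int) -> Optional[float]:
    """Return the number of males per 100 females, or None if undefined."""
    if males < 0 or females < 0:
        raise ValueError("population counts cannot be negative")
    if females == 0:
        # The ratio is undefined when an age group has no female population,
        # which can happen in very small places.
        return None
    return 100.0 * males / females


# Hypothetical counts for one age group, for illustration only.
print(gender_ratio(24, 12))  # 200.0 males per 100 females
print(gender_ratio(5, 0))    # None: ratio is undefined
```

Returning None for an empty female group, rather than raising an error, is a design choice made here so that a whole table of age groups can be processed in one pass.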

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for South English Population by Gender.
English Conversation and Monologue speech dataset
kaggle.com
Updated Jun 7, 2024
Cite
Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jun 7, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
English(America) Real-world Casual Conversation and Monologue speech dataset

Description

The English (America) Real-world Casual Conversation and Monologue speech dataset covers self-media, conversation, live, lecture, variety-show, and other content, mirroring real-world interactions. It is transcribed with text content, speaker ID, gender, and other attributes. The data was collected from a large and geographically diverse pool of speakers, which helps model performance in real and complex tasks. Quality has been tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring that user privacy and legal rights are maintained throughout data collection, storage, and usage; all our datasets comply with GDPR, CCPA, and PIPL. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

Format

16kHz, 16 bit, wav, mono channel;

Content category

Including self-media, conversation, live, lecture, variety-show, etc;

Recording environment

Low background noise;

Country

America(USA);

Language(Region) Code

en-US;

Language

English;

Features of annotation

Transcription text, timestamp, speaker ID, gender.

Accuracy Rate

Sentence Accuracy Rate (SAR) 95%

Licensing Information

Commercial License
tiny-english-asr-sample-data
huggingface.co
Updated Jul 25, 2025
Cite
Muhammad Ali Abbas (2025). tiny-english-asr-sample-data [Dataset]. https://huggingface.co/datasets/m-aliabbas1/tiny-english-asr-sample-data
Explore at:
Dataset updated
Jul 25, 2025
Authors
Muhammad Ali Abbas
Description
m-aliabbas1/tiny-english-asr-sample-data dataset hosted on Hugging Face and contributed by the HF Datasets community
English Human-Human Chat Dataset for Conversational AI & NLP
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). English Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-general-domain-conversation-text-dataset
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Dataset funded by
FutureBeeAI
Description
Introduction
The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.
Conversational Text Data
This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.
•Words per Chat: 300–700
•Turns per Chat: Up to 50 dialogue turns
•Contributors: 200 native English speakers from the FutureBeeAI Crowd Community
•Format: TXT, DOCS, JSON or CSV (customizable)
•Structure: Each record contains the full chat, topic tag, and metadata block

Diversity and Domain Coverage
Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:
•Music, books, and movies
•Health and wellness
•Children and parenting
•Family life and relationships
•Food and cooking
•Education and studying
•Festivals and traditions
•Environment and daily life
•Internet and tech usage
•Childhood memories and casual chatting
This diversity ensures the dataset is useful across multiple NLP and language understanding applications.
Linguistic Authenticity
Chats reflect informal, native-level English usage with:
•Colloquial expressions and local dialect influence
•Domain-relevant terminology
•Language-specific grammar, phrasing, and sentence flow
•Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
•Representation of different writing styles and input quirks to ensure training data realism
Metadata
Every chat instance is accompanied by structured metadata, which includes:
•Participant Age
•Gender
•Country/Region
•Chat Domain
•Chat Topic
•Dialect
This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
Data Quality Assurance
All chat records pass through a rigorous QA process to maintain consistency and accuracy:
•Manual review for content completeness
•Format checks for chat turns and metadata
•Linguistic verification by native speakers
•Removal of inappropriate or unusable samples
This ensures a clean, reliable dataset ready for high-performance AI model training.
Applications
This dataset is ideal for training and evaluating a wide range of text-based AI systems:
•Conversational AI / Chatbots
•Smart assistants and voicebots
indic-instruct-data-v0.1
huggingface.co
Updated Jan 26, 2024
Cite
AI4Bharat (2024). indic-instruct-data-v0.1 [Dataset]. https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jan 26, 2024
Dataset authored and provided by
AI4Bharat
Description
Indic Instruct Data v0.1

A collection of different instruction datasets spanning English and Hindi languages. The collection consists of:

•Anudesh
•wikiHow
•Flan v2 (67k sample subset)
•Dolly
•Anthropic-HHH (5k sample subset)
•OpenAssistant v1
•LymSys-Chat (50k sample subset)

We translate the English subset of specific datasets using IndicTrans2 (Gala et al., 2023). The chrF++ scores of the back-translated example and the corresponding example are provided for quality assessment of the… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1.
British English Language Datasets | 150+ Years of Research | Natural...
datarade.ai
Updated Jul 30, 2025
Cite
Oxford Languages (2025). British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage [Dataset]. https://datarade.ai/data-products/british-english-language-datasets-150-years-of-research-oxford-languages
Explore at:
Available download formats: .csv, .json, .mp3, .wav, .xls, .xml
Dataset updated
Jul 30, 2025
Dataset authored and provided by
Oxford Languages (https://www.lexico.com/)
Area covered
United Kingdom
Description
Our British English language datasets are meticulously curated and annotated by experienced linguists and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The following British English datasets are available for license:

British English Monolingual Dictionary Data

British English Synonyms and Antonyms Data

British English Pronunciations with Audio

Key Features (approximate numbers):

British English Monolingual Dictionary Data

Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

Headwords: 146,000

Senses: 230,000

Sentence examples: 149,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Update frequency: twice a year

British English Synonyms and Antonyms Data

This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

Synonyms: 600,000

Antonyms: 22,000

Usage Examples: 39,000

Format: XML and JSON format

Delivery: Email (link-based file sharing)

Update frequency: annually

British English Pronunciations with audio (word-level)

This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

Transcriptions (IPA): 250,000

Audio files: 180,000

Format: XLSX (for transcriptions), MP3 and WAV (audio files)

Update frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.
North English, IA Population Breakdown by Gender and Age Dataset: Male and...
neilsberg.com
csv, json
Updated Feb 24, 2025
Cite
Neilsberg Research (2025). North English, IA Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e1f56ec4-f25d-11ef-8c1b-3860777c1fe6/
Explore at:
Available download formats: csv, json
Dataset updated
Feb 24, 2025
Dataset authored and provided by
Neilsberg Research
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
North English, Iowa
Variables measured
Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
Measurement technique
The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
Dataset funded by
Neilsberg Research
Description
About this dataset

Context

The dataset tabulates the population of North English by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for North English. The dataset can be used to understand the population distribution of North English by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in North English. Additionally, it can be used to see how the gender ratio changes from the youngest to the oldest age group, and to compare the male-to-female ratio across age groups in North English.

Key observations

Largest age group (population): Male # 5-9 years (51) | Female # 10-14 years (81). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Content

When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

Age groups:

Under 5 years

5 to 9 years

10 to 14 years

15 to 19 years

20 to 24 years

25 to 29 years

30 to 34 years

35 to 39 years

40 to 44 years

45 to 49 years

50 to 54 years

55 to 59 years

60 to 64 years

65 to 69 years

70 to 74 years

75 to 79 years

80 to 84 years

85 years and over

Scope of gender:

Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

Variables / Data Columns

Age Group: This column displays the age group for the North English population analysis. There are 18 expected values, defined above in the age groups section.

Population (Male): This column shows the male population of North English.

Population (Female): This column shows the female population of North English.

Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in North English for each age group.

Good to know

Margin of Error

Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

Custom data

If you need custom data for any of your research projects, reports, or presentations, you can contact our research staff at research@neilsberg.com to assess the feasibility of a custom tabulation on a fee-for-service basis.

Inspiration

The Neilsberg Research Team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

Recommended for further research

This dataset is a part of the main dataset for North English Population by Gender.
English and maths
gov.uk
Updated Nov 28, 2019
Cite
Department for Education (2019). English and maths [Dataset]. https://www.gov.uk/government/statistical-data-sets/fe-data-library-skills-for-life
Explore at:
Dataset updated
Nov 28, 2019
Dataset provided by
GOV.UK (http://gov.uk/)
Authors
Department for Education
Description
English and maths (formerly Skills for Life) qualifications are designed to give people the reading, writing, maths and communication skills they need in everyday life, to operate effectively in work and to help them succeed on other training courses.

These data provide information on participation and achievements for English and maths qualifications and are broken down into a number of key reports.

Can’t find what you’re looking for?

If you need help finding data please refer to the table finder tool to search for specific breakdowns available for FE statistics.

Current data

English and maths data tool for participation and achievements 2018/19: https://assets.publishing.service.gov.uk/media/5f0c5c923a6f4003935c2c6f/201819-Nov_EandM_Part_and_Achieve.xlsx

MS Excel Spreadsheet, 10.9 MB

This file may not be suitable for users of assistive technology.

Request an accessible format.

If you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format, please email alternative.formats@education.gov.uk. Please tell us what format you need. It will help us if you say what assistive technology you use.

Archive

English and maths data tool for participation and achievements 2014/15 to 2017: https://assets.publishing.service.gov.uk/media/5c17d7dce5274a46824303c3/English_maths_geography_tool_achievements_participation_201415_to_201718.xlsx
American English General Conversation Speech Dataset for ASR
futurebeeai.com
wav
Updated Aug 1, 2022
Cite
FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
United States
Dataset funded by
FutureBeeAI
Description
Introduction
Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.
Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.
Speech Data
The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.
•Participant Diversity:
•Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
•Regions: Representing various states of the United States of America to ensure dialectal diversity and demographic balance.
•Demographics: A balanced gender ratio (60% male, 40% female) with participant ages ranging from 18 to 70 years.
•Recording Details:
•Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
•Duration: Each conversation ranges from 15 to 60 minutes.
•Audio Format: Stereo WAV files, 16-bit depth, recorded at 16kHz sample rate.
•Environment: Quiet, echo-free settings with no background noise.

Topic Diversity
The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.
•Sample Topics Include:
•Family & Relationships
•Food & Recipes
•Education & Career
•Healthcare Discussions
•Social Issues
•Technology & Gadgets
•Travel & Local Culture
•Shopping & Marketplace Experiences, and many more.
Transcription
Each audio file is paired with a human-verified, verbatim transcription available in JSON format.
•Transcription Highlights:
•Speaker-segmented dialogues
•Time-coded utterances
•Non-speech elements (pauses, laughter, etc.)
•High transcription accuracy, achieved through double QA pass, average WER < 5%
These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
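
The "average WER < 5%" figure above refers to word error rate. As a minimal sketch, assuming the standard definition (word-level substitutions, deletions, and insertions divided by the number of reference words), it can be computed as follows. The listing does not state how FutureBeeAI normalizes text before scoring, so the whitespace tokenization and the function name here are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words gives a WER of 0.25.
print(word_error_rate("a b c d", "a x c d"))
```

In practice, both strings are usually lowercased and stripped of punctuation before scoring, since those choices can change the result noticeably.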
Metadata
The dataset comes with granular metadata for both speakers and recordings:
•Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
•Recording Metadata: Topic, duration, audio format, device type, and sample rate.

Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.
Usage and Applications
This dataset is a versatile resource for multiple English speech and language AI applications:
•ASR Development: Train accurate speech-to-text systems for US English.
•Voice Assistants: Build smart assistants capable of understanding natural American conversations.

Replication Data for: A Three-Year Mixed Methods Study of Undergraduates’...
dataverse.no
dataverse.azure.uit.no
Updated Oct 8, 2024
Cite
Ellen Nierenberg (2024). Replication Data for: A Three-Year Mixed Methods Study of Undergraduates’ Information Literacy Development: Knowing, Doing, and Feeling [Dataset]. http://doi.org/10.18710/SK0R1N
Explore at:
Available download formats: txt(21865), txt(19475), csv(55030), txt(14751), txt(26578), txt(16861), txt(28211), pdf(107685), pdf(657212), txt(12082), txt(16243), text/x-fixed-field(55030), pdf(65240), txt(8172), pdf(634629), txt(31896), application/x-spss-sav(51476), txt(4141), pdf(91121), application/x-spss-sav(31612), txt(35011), txt(23981), text/x-fixed-field(15653), txt(25369), txt(17935), csv(15653)
Unique identifier
https://doi.org/10.18710/SK0R1N
Dataset updated
Oct 8, 2024
Dataset provided by
DataverseNO
Authors
Ellen Nierenberg; Ellen Nierenberg
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Time period covered
Aug 8, 2019 - Jun 10, 2022
Area covered
Norway
Description
This data set contains the replication data and supplements for the article "Knowing, Doing, and Feeling: A three-year, mixed-methods study of undergraduates’ information literacy development." The survey data is from two samples:
- cross-sectional sample (different students at the same point in time)
- longitudinal sample (the same students at different points in time)
Surveys were distributed via Qualtrics during the students' first and sixth semesters. Quantitative and qualitative data were collected and used to describe students' IL development over 3 years. Statistics from the quantitative data were analyzed in SPSS. The qualitative data was coded and analyzed thematically in NVivo. The qualitative, textual data is from semi-structured interviews with sixth-semester students in psychology at UiT, both focus groups and individual interviews. All data were collected as part of the contact author's PhD research on information literacy (IL) at UiT.
The following files are included in this data set:
1. A README file which explains the quantitative data files. (2 file formats: .txt, .pdf)
2. The consent form for participants (in Norwegian). (2 file formats: .txt, .pdf)
3. Six data files with survey results from UiT psychology undergraduate students for the cross-sectional (n=209) and longitudinal (n=56) samples, in 3 formats (.dat, .csv, .sav). The data was collected in Qualtrics from fall 2019 to fall 2022.
4. Interview guide for 3 focus group interviews. File format: .txt
5. Interview guides for 7 individual interviews - first round (n=4) and second round (n=3). File format: .txt
6. The 21-item IL test (Tromsø Information Literacy Test = TILT), in English and Norwegian. TILT is used for assessing students' knowledge of three aspects of IL: evaluating sources, using sources, and seeking information. The test is multiple choice, with four alternative answers for each item. This test is a "KNOW-measure," intended to measure what students know about information literacy. (2 file formats: .txt, .pdf)
7. Survey questions related to interest - specifically students' interest in being or becoming information literate - in 3 parts (all in English and Norwegian): a) information and questions about the 4 phases of interest; b) interest questionnaire with 26 items in 7 subscales (Tromsø Interest Questionnaire - TRIQ); c) survey questions about IL and interest, need, and intent. (2 file formats: .txt, .pdf)
8. Information about the assignment-based measures used to measure what students do in practice when evaluating and using sources. Students were evaluated with these measures in their first and sixth semesters. (2 file formats: .txt, .pdf)
9. The Norwegian Centre for Research Data's (NSD) 2019 assessment of the notification form for personal data for the PhD research project. In Norwegian. (Format: .pdf)
Handwriting OCR Data of Japanese and Korean
kaggle.com
Updated Oct 13, 2023
Frank Wong (2023). Handwriting OCR Data of Japanese and Korean [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/handwriting-ocr-data-of-japanese-and-korean/data
Dataset updated
Oct 13, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Frank Wong
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
License information was derived automatically
Description
This dataset was collected from 100 subjects: 50 Japanese, 49 Koreans and 1 Afghan. The corpus differs between subjects. The data diversity includes multiple cellphone models and different corpora. This dataset can be used for tasks such as handwriting OCR of Japanese and Korean. For more details, please visit: https://www.nexdata.ai/datasets/ocr/127?source=Kaggle

Specifications

Data size	100 people; 22,163 handwriting pieces in total, at least 159 handwriting pieces per subject
Nationality distribution	50 Japanese, 49 Koreans and 1 Afghan
Gender distribution	males
Age distribution	mostly young and middle-aged people
Data diversity	multiple cellphone models, different corpus
Device	cellphone
Data format	.json
Annotation content	text content, age, nationality, trace of handwriting
Accuracy	annotation accuracy of at least 95%

Get the Dataset

This is just an example of the data. To access more sample data or request the price, contact us at info@nexdata.ai
OpenSeek-Pretrain-Data-Examples
huggingface.co
Updated May 29, 2025
Beijing Academy of Artificial Intelligence (2025). OpenSeek-Pretrain-Data-Examples [Dataset]. https://huggingface.co/datasets/BAAI/OpenSeek-Pretrain-Data-Examples
Dataset updated
May 29, 2025
Dataset authored and provided by
Beijing Academy of Artificial Intelligence
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
OpenSeek Pretraining Dataset v1.0 (Sample Release)

We have released a portion of the sampled data from the OpenSeek Pretraining Dataset v1.0, primarily including Chinese and English Common Crawl (CC) datasets. Additional domain-specific datasets will be provided in future updates.

📌 Dataset Sources

English CC dataset: Mainly sourced from the Nemotron-CC dataset.
Chinese CC dataset: Followed the Nemotron-CC data pipeline, based on aggregated open-source Chinese datasets.…
See the full description on the dataset page: https://huggingface.co/datasets/BAAI/OpenSeek-Pretrain-Data-Examples.
British English Scripted Monologue Speech Data for Healthcare
futurebeeai.com
wav
Updated Aug 1, 2022
FutureBee AI (2022). British English Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-english-uk
Explore at:
Available download formats: wav
Dataset updated
Aug 1, 2022
Dataset provided by
FutureBeeAI
Authors
FutureBee AI
License
https://www.futurebeeai.com/policies/ai-data-license-agreement
Area covered
United Kingdom
Dataset funded by
FutureBeeAI
Description
Introduction
Introducing the UK English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.
Speech Data
This dataset includes over 6,000 high-quality scripted audio prompts recorded in UK English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.
•Participant Diversity
•Speakers: 60 native UK English speakers.
•Regional Balance: Participants are sourced from multiple regions across the United Kingdom, reflecting diverse dialects and linguistic traits.
•Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
•Recording Specifications
•Nature of Recordings: Scripted monologues based on healthcare-related use cases.
•Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
•Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
•Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.
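
The recording specifications above (mono, 16-bit, 8 kHz or 16 kHz, 5 to 30 seconds) can be checked per file with Python's standard `wave` module. The following is a minimal sketch, not part of the dataset's tooling: the helper names `describe_wav` and `matches_spec` are hypothetical, and the clip is synthetic silence generated in memory so the example runs without any dataset files.

```python
import io
import wave

ALLOWED_RATES = {8000, 16000}  # sample rates listed in the specification


def describe_wav(source) -> dict:
    """Read the header of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "sample_rate": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def matches_spec(info: dict) -> bool:
    """True when the clip is mono, 16-bit, 8/16 kHz and 5-30 seconds long."""
    return (
        info["channels"] == 1
        and info["bit_depth"] == 16
        and info["sample_rate"] in ALLOWED_RATES
        and 5 <= info["duration_s"] <= 30
    )


# Synthetic stand-in for a dataset clip: 6 seconds of 16 kHz mono silence.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)      # 2 bytes = 16 bits
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000 * 6)
buffer.seek(0)

info = describe_wav(buffer)
print(info, matches_spec(info))
```

Passing a file path instead of the in-memory buffer to `describe_wav` would apply the same check to a downloaded clip.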

Topic Coverage
The prompts span a broad range of healthcare-specific interactions, such as:
•Patient check-in and follow-up communication
•Appointment booking and cancellation dialogues
•Insurance and regulatory support queries
•Medication, test results, and consultation discussions
•General health tips and wellness advice
•Emergency and urgent care communication
•Technical support for patient portals and apps
•Domain-specific scripted statements and FAQs
Contextual Depth
To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:
•Names: Gender- and region-appropriate United Kingdom names
•Addresses: Varied local address formats spoken naturally
•Dates & Times: References to appointment dates, times, follow-ups, and schedules
•Medical Terminology: Common medical procedures, symptoms, and treatment references
•Numbers & Measurements: Health data like dosages, vitals, and test result values
•Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers
These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.
Transcription
Every audio recording is accompanied by a verbatim, manually verified transcription.
•Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
•Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.
simple-wiki
huggingface.co
Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
Dataset authored and provided by
Embedding Training Data
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Dataset Card for "simple-wiki"

Dataset Summary

This dataset contains pairs of equivalent sentences obtained from Wikipedia.

Supported Tasks

Sentence Transformers training; useful for semantic search and sentence similarity.

Languages

English.

Dataset Structure

Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" and a list with the sentences as "value". {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
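
As an illustration of the structure described above, the sketch below reads records whose only key is "set" and turns each one into a sentence pair. Only the "set" key comes from the dataset card; the JSON Lines layout, the helper name `read_pairs`, and the example sentences are assumptions made for this example.

```python
import json

# Hypothetical records; the real sentences come from the dataset itself.
raw_lines = [
    '{"set": ["The cat sat on the mat.", "A cat was sitting on the mat."]}',
    '{"set": ["He works as a physician.", "He is a doctor."]}',
]


def read_pairs(lines):
    """Yield (sentence_a, sentence_b) tuples from JSON records keyed by "set"."""
    for line in lines:
        record = json.loads(line)
        sentences = record["set"]
        if len(sentences) != 2:
            raise ValueError(f"expected a pair, got {len(sentences)} sentences")
        yield tuple(sentences)


pairs = list(read_pairs(raw_lines))
for sentence_a, sentence_b in pairs:
    print(sentence_a, "<->", sentence_b)
```

Pairs in this shape are a common input for training sentence-similarity models, which is the use the card lists under Supported Tasks.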
Data from: #PraCegoVer dataset
data.niaid.nih.gov
Updated Jan 19, 2023
Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
Dataset updated
Jan 19, 2023
Dataset provided by
Esther Luna Colombini
Sandra Avila
Gabriel Oliveira dos Santos
Description
Automatically describing images in natural-language sentences is essential to the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

Dataset Structure

The PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json contains a list of JSON objects with the attributes:

user: anonymized user that made the post;

filename: image file name;

raw_caption: raw caption;

caption: clean caption;

date: post date.

Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. We also provide a sample with five instances, so users can download it to get an overview of the dataset before downloading it completely.
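
To make the structure concrete, the sketch below parses a list of objects with the five attributes listed above and maps each image path to its clean caption. The attribute names come from the description; the record values, the date format, the `images` directory name, and the helper `index_by_filename` are hypothetical.

```python
import json
from pathlib import Path

EXPECTED_KEYS = {"user", "filename", "raw_caption", "caption", "date"}

# Hypothetical stand-in for the contents of dataset.json.
sample_json = """
[
  {"user": "user_001", "filename": "img_0001.jpg",
   "raw_caption": "#PraCegoVer Foto de um cachorro na praia.",
   "caption": "Foto de um cachorro na praia.", "date": "2020-01-15"},
  {"user": "user_002", "filename": "img_0002.jpg",
   "raw_caption": "#PraCegoVer Imagem de um bolo de chocolate.",
   "caption": "Imagem de um bolo de chocolate.", "date": "2020-02-03"}
]
"""


def index_by_filename(records, image_dir):
    """Map each image path to its clean caption, validating the attributes."""
    index = {}
    for record in records:
        missing = EXPECTED_KEYS - record.keys()
        if missing:
            raise KeyError(f"record is missing attributes: {sorted(missing)}")
        index[Path(image_dir) / record["filename"]] = record["caption"]
    return index


records = json.loads(sample_json)
captions = index_by_filename(records, "images")
for path, caption in captions.items():
    print(path, "->", caption)
```

With the real file, `records` would instead come from `json.load` on an open handle to dataset.json.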

Download Instructions

If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

cat images.tar.gz.part* > images.tar.gz
tar -xzvf images.tar.gz

Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

python download_dataset.py --access_token=

Hate Speech Detection curated Dataset🤬

kaggle.com

Updated Dec 22, 2023

Wendyellé A. Alban NYANTUDRE (2023). Hate Speech Detection curated Dataset🤬 [Dataset]. https://www.kaggle.com/datasets/waalbannyantudre/hate-speech-detection-curated-dataset/data

Dataset updated

Dec 22, 2023

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Wendyellé A. Alban NYANTUDRE

License

Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically

Description

-**About this Data :** Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. An extensive dataset containing emoticons, emojis, hashtags, slang, and contractions is required to detect hate speech on social media based on current trends. This dataset contains hate speech sentences in English, divided into two classes: one representing hateful content and the other representing non-hateful content.

	Specifications table
Subject	Natural Language Processing - NLP
Specific subject area	A curated dataset comprising emojis, emoticons, and contractions bundled into two classes, hateful and non-hateful, to detect hate speech in text.
Type of data	Text
Data format	Annotated, Analysed, Filtered Data
Data Article	A curated dataset for hate speech detection on social media text
Data source location	https://data.mendeley.com/datasets/9sxpkmm8xn/1

-**Value of this Data :**
1. This dataset is useful for training machine learning models to identify hate speech in social media text. It reflects current social media trends and the modern ways of writing hateful text, using emojis, emoticons, or slang. It will help social media managers, administrators, or companies develop automatic systems to filter out hateful content on social media by identifying a text and categorizing it as hateful or non-hateful speech.
2. Deep Learning (DL) and Natural Language Processing (NLP) practitioners can be the target beneficiaries, as this dataset can be used for detecting hateful speech through DL and NLP techniques. The samples are composed of text sentences and labels belonging to two categories: "0" for non-hateful and "1" for hateful.
3. Additionally, this data set can be used as a benchmark data set to detect hate speech.
4. The data set is neutralized so that anyone can use it: it doesn't include any entities or names that could have an impact on, or cause cyber harm to, the user who generated the content. Researchers can take advantage of the pre-processed dataset for their projects as it maintains and follows the policy guidelines.
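
The two-class labelling described in item 2 can be handled as in the sketch below, which maps the numeric labels to class names and counts the class balance. The label meanings (0 = non-hateful, 1 = hateful) come from the description; the rows, including the placeholder text for the hateful class, are invented, and the dataset's actual column layout is not specified here.

```python
from collections import Counter

LABEL_NAMES = {0: "non-hateful", 1: "hateful"}

# Invented (text, label) rows; the hateful text is replaced by a placeholder.
rows = [
    ("Have a great day everyone", 0),
    ("<hateful example omitted>", 1),
    ("Loved the match last night", 0),
]

# Count how many samples fall into each class.
class_counts = Counter(LABEL_NAMES[label] for _, label in rows)
for name, count in class_counts.items():
    print(f"{name}: {count}")
```

Checking the class balance this way before training helps decide whether resampling or class weighting is needed.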

English Longitudinal Study of Ageing: Waves 0-11, 1998-2024
beta.ukdataservice.ac.uk
Updated 2025
J. Banks; G. David Batty; J. Breedvelt; K. Coughlin; Crawford, R., Institute For Fiscal Studies (IFS); M. Marmot; J. Nazroo; Oldfield, Z., Institute For Fiscal Studies (IFS); N. Steel; A. Steptoe; M. Wood; P. Zaninotto (2025). English Longitudinal Study of Ageing: Waves 0-11, 1998-2024 [Dataset]. http://doi.org/10.5255/ukda-sn-5050-32
Unique identifier
https://doi.org/10.5255/ukda-sn-5050-32
Dataset updated
2025
Dataset provided by
UK Data Service (https://ukdataservice.ac.uk/)
datacite
Authors
J. Banks; G. David Batty; J. Breedvelt; K. Coughlin; Crawford, R., Institute For Fiscal Studies (IFS); M. Marmot; J. Nazroo; Oldfield, Z., Institute For Fiscal Studies (IFS); N. Steel; A. Steptoe; M. Wood; P. Zaninotto
Description
The English Longitudinal Study of Ageing (ELSA) is a longitudinal survey of ageing and quality of life among older people that explores the dynamic relationships between health and functioning, social networks and participation, and economic position as people plan for, move into and progress beyond retirement. The main objectives of ELSA are to:

construct waves of accessible and well-documented panel data;
provide these data in a convenient and timely fashion to the scientific and policy research community;
describe health trajectories, disability and healthy life expectancy in a representative sample of the English population aged 50 and over;
examine the relationship between economic position and health;
investigate the determinants of economic position in older age;
describe the timing of retirement and post-retirement labour market activity; and
understand the relationships between social support, household structure and the transfer of assets.

Further information may be found on the ELSA project website (https://www.elsa-project.ac.uk/) or the NatCen Social Research: ELSA web pages.

Wave 11 data has been deposited - May 2025

For the 45th edition (May 2025) ELSA Wave 11 core and pension grid data and documentation were deposited. Users should note this dataset version does not contain the survey weights. A version with the survey weights along with IFS and financial derived datasets will be deposited in due course. In the meantime, more information about the data collection or the data collected during this wave of ELSA can be found in the Wave 11 Technical Report or the User Guide.

Health conditions research with ELSA - June 2021

The ELSA Data team have found some issues with historical data measuring health conditions. If you are intending to do any analysis looking at the following health conditions, then please read the ELSA User Guide or if you still have questions contact elsadata@natcen.ac.uk for advice on how you should approach your analysis. The affected conditions are: eye conditions (glaucoma; diabetic eye disease; macular degeneration; cataract), CVD conditions (high blood pressure; angina; heart attack; Congestive Heart Failure; heart murmur; abnormal heart rhythm; diabetes; stroke; high cholesterol; other heart trouble) and chronic health conditions (chronic lung disease; asthma; arthritis; osteoporosis; cancer; Parkinson's Disease; emotional, nervous or psychiatric problems; Alzheimer's Disease; dementia; malignant blood disorder; multiple sclerosis or motor neurone disease).

For information on obtaining data from ELSA that are not held at the UKDS, see the ELSA Genetic data access and Accessing ELSA data webpages.

Wave 10 Health data
Users should note that in Wave 10, the health section of the ELSA questionnaire was revised and all respondents were asked anew about their health conditions, rather than following the prior approach of asking those who had taken part in past waves to confirm previously recorded conditions. For this reason, the health conditions feed-forward data was not archived for Wave 10, as it was in previous waves.

Harmonized dataset:

Users of the Harmonized dataset who prefer to use the Stata version will need access to Stata MP software, as the version G3 file contains 11,779 variables (the limit for the standard Stata 'Intercooled' version is 2,047).

ELSA COVID-19 study:
A separate ad-hoc study conducted with ELSA respondents, measuring the socio-economic effects/psychological impact of the lockdown on the aged 50+ population of England, is also available under SN 8688, English Longitudinal Study of Ageing COVID-19 Study.
