100+ datasets found
  1. Data from: ColloCaid Sample Data

    • figshare.com
    • openresearch.surrey.ac.uk
    zip
    Updated May 30, 2023
    Cite
    Ana Frankenberg-Garcia; Geraint Paul Rees; Robert Lew (2023). ColloCaid Sample Data [Dataset]. http://doi.org/10.6084/m9.figshare.13028207.v2
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Ana Frankenberg-Garcia; Geraint Paul Rees; Robert Lew
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COLLOCAID SAMPLE DATA

    The ColloCaid Sample Data comprises approximately 2% of the ColloCaid lexical database. The sample covers 692 strong academic English collocations (LogDice >5.0) for 16 core academic lemmas used as collocation bases (or nodes): 5 nouns, 5 verbs, and 6 adjectives. The selection aims to give an overview of the range of data included in the full dataset. This includes collocations with bases classified with more than one part-of-speech tag (e.g. DEBATE, INDIVIDUAL), polysemous collocation bases giving rise to distinct collocation patterns (e.g. CODE), as well as collocation bases that evoke a very large and a very small number of collocations. The strongest eight lexical collocations listed for each base are enriched with three different curated example sentences adapted from corpora of expert academic English writing.

    COLLOCAID LEXICAL DATA 1.1

    The full ColloCaid lexical dataset consists of:

    • 572 core academic English lemmas (311 nouns, 184 verbs and 77 adjectives)
    • 32,645 academic collocations with the above lemmas
    • 29,028 example sentences of collocations in context
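    The LogDice threshold mentioned above can be illustrated with a short sketch. It assumes the commonly cited definition of the score, logDice = 14 + log2(2·f_xy / (f_x + f_y)); the description does not say which tool or corpus produced the ColloCaid scores, and the frequencies below are made-up illustration values, not ColloCaid data.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Return the logDice association score for a word pair.

    f_xy: co-occurrence frequency of base and collocate
    f_x:  corpus frequency of the base (node)
    f_y:  corpus frequency of the collocate
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical frequencies, chosen only to show the threshold check.
score = log_dice(f_xy=120, f_x=4_000, f_y=60_000)
is_strong = score > 5.0  # the cut-off quoted in the description
```

    The theoretical maximum is 14, reached when the two words only ever occur together, which is why a cut-off such as 5.0 can be applied regardless of corpus size.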

    Further information at http://www.collocaid.uk/

  2. text-clustering-example-data

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Jacob Moore (2024). text-clustering-example-data [Dataset]. https://huggingface.co/datasets/billingsmoore/text-clustering-example-data
    Dataset updated
    Nov 20, 2024
    Authors
    Jacob Moore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset consists of 925 sentences in English paired with a broad topic descriptor for use as example data in product demonstrations or student projects.

    Curated by: billingsmoore
    Language(s) (NLP): English
    License: Apache License 2.0

    Direct Use

    This data can be loaded using the following Python code:

        from datasets import load_dataset

        ds = load_dataset('billingsmoore/text-clustering-example-data')

    It can then be clustered using the… See the full description on the dataset page: https://huggingface.co/datasets/billingsmoore/text-clustering-example-data.

  3. Central Statistical Office Dataset

    • live.european-language-grid.eu
    • data.europa.eu
    xml
    Updated Sep 9, 2022
    Cite
    (2022). Central Statistical Office Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/18867
    Available download formats: xml
    Dataset updated
    Sep 9, 2022
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Two Polish-English publications of the Polish Central Statistical Office in the XLIFF format: 1. "Statistical Yearbook of the Republic of Poland 2015" is the main summary publication of the Central Statistical Office, including a comprehensive set of statistical data describing the condition of the natural environment, the socio-economic and demographic situation of Poland, and its position in Europe and in the world. 2. "Women in Poland" contains statistical information regarding women's place and participation in socio-economic life of the country including international comparisons. The texts were aligned at the level of translation segments (mostly sentences and short paragraphs) and manually verified.
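    As a rough illustration of how aligned segments in an XLIFF file can be read, the sketch below parses source/target pairs with Python's standard library. It assumes the XLIFF 1.2 layout (`trans-unit` elements with `source` and `target` children in the 1.2 namespace); the description does not state the XLIFF version, and the embedded sample segments are invented for the example, not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in XLIFF 1.2 layout (assumed, not from the dataset).
SAMPLE_XLIFF = """
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file source-language="pl" target-language="en" datatype="plaintext" original="sample">
    <body>
      <trans-unit id="1">
        <source>Kobiety w Polsce</source>
        <target>Women in Poland</target>
      </trans-unit>
      <trans-unit id="2">
        <source>Rocznik Statystyczny</source>
        <target>Statistical Yearbook</target>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def read_segments(xliff_text: str) -> list[tuple[str, str]]:
    """Return (source, target) text pairs for every trans-unit."""
    root = ET.fromstring(xliff_text.strip())
    pairs = []
    for unit in root.iterfind(".//x:trans-unit", NS):
        source = unit.findtext("x:source", default="", namespaces=NS)
        target = unit.findtext("x:target", default="", namespaces=NS)
        pairs.append((source.strip(), target.strip()))
    return pairs


segments = read_segments(SAMPLE_XLIFF)
```

    If the files turn out to use XLIFF 2.x, the namespace and element names differ and the paths above would need adjusting.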

  4. South English, IA Population Breakdown by Gender and Age Dataset: Male and...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    Cite
    Neilsberg Research (2025). South English, IA Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e200d9ce-f25d-11ef-8c1b-3860777c1fe6/
    Available download formats: csv, json
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    South English
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of South English by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for South English. The dataset can be used to understand the population distribution of South English by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in South English. Additionally, it can be used to see how the gender ratio changes from the youngest to the oldest age group, and to compare the male-to-female ratio across age groups for South English.

    Key observations

    Largest age group (population): Male # 45-49 years (24) | Female # 65-69 years (13). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender:

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the South English population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population of South English in each age group.
    • Population (Female): This column shows the female population of South English in each age group.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in South English for each age group.
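    The Gender Ratio column described above is a simple derived value (males per 100 females). A minimal sketch of that arithmetic follows; the function name and the sample counts are hypothetical and are not rows from the dataset.

```python
from typing import Optional


def gender_ratio(males: int, females: int) -> Optional[float]:
    """Males per 100 females, or None when there are no females to divide by."""
    if females == 0:
        return None
    return round(100 * males / females, 1)


# Hypothetical age-group counts, for illustration only.
ratio = gender_ratio(males=24, females=16)   # 150.0
undefined = gender_ratio(males=5, females=0)  # None
```

    Returning None for an empty female population is a design choice of this sketch; in small places such as this one, some age groups may well have zero estimated residents of one sex.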

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for South English Population by Gender.

  5. English Conversation and Monologue speech dataset

    • kaggle.com
    Updated Jun 7, 2024
    Cite
    Frank Wong (2024). English Conversation and Monologue speech dataset [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/english-real-world-speech-dataset
    Dataset updated
    Jun 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    English(America) Real-world Casual Conversation and Monologue speech dataset

    Description

    English (America) real-world casual conversation and monologue speech dataset covering self-media, conversation, live, lecture, variety-show, and similar content that mirrors real-world interactions. Recordings are transcribed with text content, speaker ID, gender, and other attributes. The dataset was collected from a large and geographically diverse pool of speakers, which is intended to improve model performance on real and complex tasks. Quality tested by various AI companies. We strictly adhere to data protection regulations and privacy standards, ensuring the maintenance of user privacy and legal rights throughout the data collection, storage, and usage processes; our datasets are all GDPR, CCPA, and PIPL compliant. For more details, please refer to the link: https://www.nexdata.ai/datasets/speechrecog/1115?source=Kaggle

    Format

    16kHz, 16 bit, wav, mono channel;

    Content category

    Including self-media, conversation, live, lecture, variety-show, etc;

    Recording environment

    Low background noise;

    Country

    America(USA);

    Language(Region) Code

    en-US;

    Language

    English;

    Features of annotation

    Transcription text, timestamp, speaker ID, gender.

    Accuracy Rate

    Sentence Accuracy Rate (SAR) 95%

    Licensing Information

    Commercial License

  6. tiny-english-asr-sample-data

    • huggingface.co
    Updated Jul 25, 2025
    Cite
    Muhammad Ali Abbas (2025). tiny-english-asr-sample-data [Dataset]. https://huggingface.co/datasets/m-aliabbas1/tiny-english-asr-sample-data
    Dataset updated
    Jul 25, 2025
    Authors
    Muhammad Ali Abbas
    Description

    m-aliabbas1/tiny-english-asr-sample-data dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. English Human-Human Chat Dataset for Conversational AI & NLP

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). English Human-Human Chat Dataset for Conversational AI & NLP [Dataset]. https://www.futurebeeai.com/dataset/text-dataset/english-general-domain-conversation-text-dataset
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    The English General Domain Chat Dataset is a high-quality, text-based dataset designed to train and evaluate conversational AI, NLP models, and smart assistants in real-world English usage. Collected through FutureBeeAI’s trusted crowd community, this dataset reflects natural, native-level English conversations covering a broad spectrum of everyday topics.

    Conversational Text Data

    This dataset includes over 15000 chat transcripts, each featuring free-flowing dialogue between two native English speakers. The conversations are spontaneous, context-rich, and mimic informal, real-life texting behavior.

    Words per Chat: 300–700
    Turns per Chat: Up to 50 dialogue turns
    Contributors: 200 native English speakers from the FutureBeeAI Crowd Community
    Format: TXT, DOCS, JSON or CSV (customizable)
    Structure: Each record contains the full chat, topic tag, and metadata block

    Diversity and Domain Coverage

    Conversations span a wide variety of general-domain topics to ensure comprehensive model exposure:

    Music, books, and movies
    Health and wellness
    Children and parenting
    Family life and relationships
    Food and cooking
    Education and studying
    Festivals and traditions
    Environment and daily life
    Internet and tech usage
    Childhood memories and casual chatting

    This diversity ensures the dataset is useful across multiple NLP and language understanding applications.

    Linguistic Authenticity

    Chats reflect informal, native-level English usage with:

    Colloquial expressions and local dialect influence
    Domain-relevant terminology
    Language-specific grammar, phrasing, and sentence flow
    Inclusion of realistic details such as names, phone numbers, email addresses, locations, dates, times, local currencies, and culturally grounded references
    Representation of different writing styles and input quirks to ensure training data realism

    Metadata

    Every chat instance is accompanied by structured metadata, which includes:

    Participant Age
    Gender
    Country/Region
    Chat Domain
    Chat Topic
    Dialect

    This metadata supports model filtering, demographic-specific evaluation, and more controlled fine-tuning workflows.
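    To make the filtering use case concrete, here is a small sketch of selecting chats by metadata. The record layout, field names, and sample rows are hypothetical: the description lists the metadata categories but does not publish a schema, so the delivered files may be organized differently.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ChatRecord:
    # Hypothetical field names modelled on the metadata categories listed above.
    chat: str
    topic: str
    participant_age: int
    gender: str
    country: str
    dialect: str


def select_records(
    records: Iterable[ChatRecord],
    *,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
    dialect: Optional[str] = None,
) -> List[ChatRecord]:
    """Keep only the records that satisfy every filter that was supplied."""
    selected = []
    for record in records:
        if min_age is not None and record.participant_age < min_age:
            continue
        if max_age is not None and record.participant_age > max_age:
            continue
        if dialect is not None and record.dialect != dialect:
            continue
        selected.append(record)
    return selected


# Invented sample rows, for illustration only.
SAMPLE = [
    ChatRecord("A: hi\nB: hello", "Food and cooking", 24, "female", "US", "American English"),
    ChatRecord("A: hey\nB: hiya", "Music, books, and movies", 52, "male", "UK", "British English"),
]
under_30 = select_records(SAMPLE, max_age=30)
```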

    Data Quality Assurance

    All chat records pass through a rigorous QA process to maintain consistency and accuracy:

    Manual review for content completeness
    Format checks for chat turns and metadata
    Linguistic verification by native speakers
    Removal of inappropriate or unusable samples

    This ensures a clean, reliable dataset ready for high-performance AI model training.

    Applications

    This dataset is ideal for training and evaluating a wide range of text-based AI systems:

    Conversational AI / Chatbots
    Smart assistants and voicebots

  8. indic-instruct-data-v0.1

    • huggingface.co
    Updated Jan 26, 2024
    Cite
    AI4Bharat (2024). indic-instruct-data-v0.1 [Dataset]. https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1
    Dataset updated
    Jan 26, 2024
    Dataset authored and provided by
    AI4Bharat
    Description

    Indic Instruct Data v0.1

    A collection of different instruction datasets spanning English and Hindi languages. The collection consists of:

    • Anudesh
    • wikiHow
    • Flan v2 (67k sample subset)
    • Dolly
    • Anthropic-HHH (5k sample subset)
    • OpenAssistant v1
    • LymSys-Chat (50k sample subset)

    We translate the English subset of specific datasets using IndicTrans2 (Gala et al., 2023). The chrF++ scores of the back-translated example and the corresponding example are provided for quality assessment of the… See the full description on the dataset page: https://huggingface.co/datasets/ai4bharat/indic-instruct-data-v0.1.

  9. British English Language Datasets | 150+ Years of Research | Natural...

    • datarade.ai
    Updated Jul 30, 2025
    Cite
    Oxford Languages (2025). British English Language Datasets | 150+ Years of Research | Natural Language Processing (NLP) Data | LLMs | TTS | Dictionary Display | EU Coverage [Dataset]. https://datarade.ai/data-products/british-english-language-datasets-150-years-of-research-oxford-languages
    Available download formats: .csv, .json, .mp3, .wav, .xls, .xml
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Oxford Languages (https://www.lexico.com/)
    Area covered
    United Kingdom
    Description

    Our British English language datasets are meticulously curated and annotated by experienced linguists and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The following datasets in British English are available for license:

    1. British English Monolingual Dictionary Data
    2. British English Synonyms and Antonyms Data
    3. British English Pronunciations with Audio

    Key Features (approximate numbers):

    1. British English Monolingual Dictionary Data

    Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British English variant of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

    • Headwords: 146,000
    • Senses: 230,000
    • Sentence examples: 149,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: twice a year

    2. British English Synonyms and Antonyms Data

    This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

    • Synonyms: 600,000
    • Antonyms: 22,000
    • Usage Examples: 39,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing)
    • Update frequency: annually

    3. British English Pronunciations with audio (word-level)

    This dataset provides IPA transcriptions and mapped audio files for words in contemporary British English, with a focus on UK speaker usage. It includes syllabified transcriptions, variant spellings, part-of-speech tags, and pronunciation group identifiers. Audio files are supplied separately and linked where available – ideal for TTS, ASR, and pronunciation modeling.

    • Transcriptions (IPA): 250,000
    • Audio files: 180,000
    • Format: XLSX (for transcriptions), MP3 and WAV (audio files)
    • Update frequency: annually

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

  10. North English, IA Population Breakdown by Gender and Age Dataset: Male and...

    • neilsberg.com
    csv, json
    Updated Feb 24, 2025
    Cite
    Neilsberg Research (2025). North English, IA Population Breakdown by Gender and Age Dataset: Male and Female Population Distribution Across 18 Age Groups // 2025 Edition [Dataset]. https://www.neilsberg.com/research/datasets/e1f56ec4-f25d-11ef-8c1b-3860777c1fe6/
    Available download formats: csv, json
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Neilsberg Research
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    North English, Iowa
    Variables measured
    Male and Female Population Under 5 Years, Male and Female Population over 85 years, Male and Female Population Between 5 and 9 years, Male and Female Population Between 10 and 14 years, Male and Female Population Between 15 and 19 years, Male and Female Population Between 20 and 24 years, Male and Female Population Between 25 and 29 years, Male and Female Population Between 30 and 34 years, Male and Female Population Between 35 and 39 years, Male and Female Population Between 40 and 44 years, and 8 more
    Measurement technique
    The data presented in this dataset is derived from the latest U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates. To measure the three variables, namely (a) Population (Male), (b) Population (Female), and (c) Gender Ratio (Males per 100 Females), we initially analyzed and categorized the data for each of the gender classifications (biological sex) reported by the US Census Bureau across 18 age groups, ranging from under 5 years to 85 years and above. These age groups are described above in the variables section. For further information regarding these estimates, please feel free to reach out to us via email at research@neilsberg.com.
    Dataset funded by
    Neilsberg Research
    Description
    About this dataset

    Context

    The dataset tabulates the population of North English by gender across 18 age groups. It lists the male and female population in each age group along with the gender ratio for North English. The dataset can be used to understand the population distribution of North English by gender and age. For example, using this dataset, we can identify the largest age group for both men and women in North English. Additionally, it can be used to see how the gender ratio changes from the youngest to the oldest age group, and to compare the male-to-female ratio across age groups for North English.

    Key observations

    Largest age group (population): Male # 5-9 years (51) | Female # 10-14 years (81). Source: U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Content

    When available, the data consists of estimates from the U.S. Census Bureau American Community Survey (ACS) 2019-2023 5-Year Estimates.

    Age groups:

    • Under 5 years
    • 5 to 9 years
    • 10 to 14 years
    • 15 to 19 years
    • 20 to 24 years
    • 25 to 29 years
    • 30 to 34 years
    • 35 to 39 years
    • 40 to 44 years
    • 45 to 49 years
    • 50 to 54 years
    • 55 to 59 years
    • 60 to 64 years
    • 65 to 69 years
    • 70 to 74 years
    • 75 to 79 years
    • 80 to 84 years
    • 85 years and over

    Scope of gender:

    Please note that the American Community Survey asks a question about the respondent's current sex, but not about gender, sexual orientation, or sex at birth. The question is intended to capture data for biological sex, not gender. Respondents are expected to answer either Male or Female. Our research and this dataset mirror the data reported as Male and Female for gender distribution analysis.

    Variables / Data Columns

    • Age Group: This column displays the age group for the North English population analysis. There are 18 expected values, defined above in the age groups section.
    • Population (Male): This column shows the male population of North English in each age group.
    • Population (Female): This column shows the female population of North English in each age group.
    • Gender Ratio: Also known as the sex ratio, this column displays the number of males per 100 females in North English for each age group.

    Good to know

    Margin of Error

    Data in the dataset are based on estimates and are subject to sampling variability and thus a margin of error. Neilsberg Research recommends using caution when presenting these estimates in your research.

    Custom data

    If you need custom data for a research project, report, or presentation, you can contact our research staff at research@neilsberg.com to discuss the feasibility of a custom tabulation on a fee-for-service basis.

    Inspiration

    The Neilsberg Research team curates, analyzes, and publishes demographic and economic data from a variety of public and proprietary sources, each of which often includes multiple surveys and programs. The large majority of Neilsberg Research's aggregated datasets and insights are made available for free download at https://www.neilsberg.com/research/.

    Recommended for further research

    This dataset is part of the main dataset for North English Population by Gender.

  11. English and maths

    • gov.uk
    Updated Nov 28, 2019
    Cite
    Department for Education (2019). English and maths [Dataset]. https://www.gov.uk/government/statistical-data-sets/fe-data-library-skills-for-life
    Dataset updated
    Nov 28, 2019
    Dataset provided by
    GOV.UK (http://gov.uk/)
    Authors
    Department for Education
    Description

    English and maths (formerly Skills for Life) qualifications are designed to give people the reading, writing, maths and communication skills they need in everyday life, to operate effectively in work and to help them succeed on other training courses.

    These data provide information on participation and achievements for English and maths qualifications and are broken down into a number of key reports.

    Can’t find what you’re looking for?

    If you need help finding data please refer to the table finder tool to search for specific breakdowns available for FE statistics.

    Current data

    English and maths data tool for participation and achievements 2018/19: https://assets.publishing.service.gov.uk/media/5f0c5c923a6f4003935c2c6f/201819-Nov_EandM_Part_and_Achieve.xlsx

    MS Excel Spreadsheet, 10.9 MB

    This file may not be suitable for users of assistive technology.

    Request an accessible format.

    If you use assistive technology (such as a screen reader) and need a version of this document in a more accessible format, please email alternative.formats@education.gov.uk. Please tell us what format you need. It will help us if you say what assistive technology you use.


    Archive

  12. American English General Conversation Speech Dataset for ASR

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). American English General Conversation Speech Dataset for ASR [Dataset]. https://www.futurebeeai.com/dataset/speech-dataset/general-conversation-english-usa
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United States
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Welcome to the US English General Conversation Speech Dataset — a rich, linguistically diverse corpus purpose-built to accelerate the development of English speech technologies. This dataset is designed to train and fine-tune ASR systems, spoken language understanding models, and generative voice AI tailored to real-world US English communication.

    Curated by FutureBeeAI, this 30-hour dataset offers unscripted, spontaneous two-speaker conversations across a wide array of real-life topics. It enables researchers, AI developers, and voice-first product teams to build robust, production-grade English speech models that understand and respond to authentic American accents and dialects.

    Speech Data

    The dataset comprises 30 hours of high-quality audio, featuring natural, free-flowing dialogue between native speakers of US English. These sessions range from informal daily talks to deeper, topic-specific discussions, ensuring variability and context richness for diverse use cases.

    Participant Diversity:
    Speakers: 60 verified native US English speakers from FutureBeeAI’s contributor community.
    Regions: Representing various states of the United States of America to ensure dialectal diversity and demographic balance.
    Demographics: A gender split of 60% male and 40% female, with participant ages ranging from 18 to 70 years.
    Recording Details:
    Conversation Style: Unscripted, spontaneous peer-to-peer dialogues.
    Duration: Each conversation ranges from 15 to 60 minutes.
    Audio Format: Stereo WAV files, 16-bit depth, recorded at a 16 kHz sample rate.
    Environment: Quiet, echo-free settings with no background noise.

    Topic Diversity

    The dataset spans a wide variety of everyday and domain-relevant themes. This topic diversity ensures the resulting models are adaptable to broad speech contexts.

    Sample Topics Include:
    Family & Relationships
    Food & Recipes
    Education & Career
    Healthcare Discussions
    Social Issues
    Technology & Gadgets
    Travel & Local Culture
    Shopping & Marketplace Experiences, and many more.

    Transcription

    Each audio file is paired with a human-verified, verbatim transcription available in JSON format.

    Transcription Highlights:
    Speaker-segmented dialogues
    Time-coded utterances
    Non-speech elements (pauses, laughter, etc.)
    High transcription accuracy, achieved through double QA pass, average WER < 5%

    These transcriptions are production-ready, enabling seamless integration into ASR model pipelines or conversational AI workflows.
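
    The WER figure quoted above can be checked on any reference/hypothesis transcript pair. The following is a minimal sketch, assuming the standard definition of word error rate (word-level edit distance divided by the number of reference words). The function name `word_error_rate` and the sample sentences are illustrative only and are not part of the dataset or its tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not taken from the dataset): one dropped word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

    Real evaluations usually normalise case and punctuation before scoring; that step is left out here.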

    Metadata

    The dataset comes with granular metadata for both speakers and recordings:

    Speaker Metadata: Age, gender, accent, dialect, state/province, and participant ID.
    Recording Metadata: Topic, duration, audio format, device type, and sample rate.

    Such metadata helps developers fine-tune model training and supports use-case-specific filtering or demographic analysis.

    Usage and Applications

    This dataset is a versatile resource for multiple English speech and language AI applications:

    ASR Development: Train accurate speech-to-text systems for US English.
    Voice Assistants: Build smart assistants capable of understanding natural American conversations.

  13. Replication Data for: A Three-Year Mixed Methods Study of Undergraduates’...

    • dataverse.no
    • dataverse.azure.uit.no
    • +1more
    Updated Oct 8, 2024
    Cite
    Ellen Nierenberg; Ellen Nierenberg (2024). Replication Data for: A Three-Year Mixed Methods Study of Undergraduates’ Information Literacy Development: Knowing, Doing, and Feeling [Dataset]. http://doi.org/10.18710/SK0R1N
    Explore at:
    Available download formats: txt(21865), txt(19475), csv(55030), txt(14751), txt(26578), txt(16861), txt(28211), pdf(107685), pdf(657212), txt(12082), txt(16243), text/x-fixed-field(55030), pdf(65240), txt(8172), pdf(634629), txt(31896), application/x-spss-sav(51476), txt(4141), pdf(91121), application/x-spss-sav(31612), txt(35011), txt(23981), text/x-fixed-field(15653), txt(25369), txt(17935), csv(15653)
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    DataverseNO
    Authors
    Ellen Nierenberg; Ellen Nierenberg
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Aug 8, 2019 - Jun 10, 2022
    Area covered
    Norway
    Description

    This data set contains the replication data and supplements for the article "Knowing, Doing, and Feeling: A three-year, mixed-methods study of undergraduates’ information literacy development." The survey data is from two samples:

    - cross-sectional sample (different students at the same point in time)
    - longitudinal sample (the same students at different points in time)

    Surveys were distributed via Qualtrics during the students' first and sixth semesters. Quantitative and qualitative data were collected and used to describe students' IL development over 3 years. Statistics from the quantitative data were analyzed in SPSS. The qualitative data was coded and analyzed thematically in NVivo. The qualitative, textual data is from semi-structured interviews with sixth-semester students in psychology at UiT, both focus groups and individual interviews. All data were collected as part of the contact author's PhD research on information literacy (IL) at UiT.

    The following files are included in this data set:

    1. A README file which explains the quantitative data files. (2 file formats: .txt, .pdf)
    2. The consent form for participants (in Norwegian). (2 file formats: .txt, .pdf)
    3. Six data files with survey results from UiT psychology undergraduate students for the cross-sectional (n=209) and longitudinal (n=56) samples, in 3 formats (.dat, .csv, .sav). The data was collected in Qualtrics from fall 2019 to fall 2022.
    4. Interview guide for 3 focus group interviews. File format: .txt
    5. Interview guides for 7 individual interviews - first round (n=4) and second round (n=3). File format: .txt
    6. The 21-item IL test (Tromsø Information Literacy Test = TILT), in English and Norwegian. TILT is used for assessing students' knowledge of three aspects of IL: evaluating sources, using sources, and seeking information. The test is multiple choice, with four alternative answers for each item. This test is a "KNOW-measure," intended to measure what students know about information literacy. (2 file formats: .txt, .pdf)
    7. Survey questions related to interest - specifically students' interest in being or becoming information literate - in 3 parts (all in English and Norwegian): a) information and questions about the 4 phases of interest; b) interest questionnaire with 26 items in 7 subscales (Tromsø Interest Questionnaire - TRIQ); c) Survey questions about IL and interest, need, and intent. (2 file formats: .txt, .pdf)
    8. Information about the assignment-based measures used to measure what students do in practice when evaluating and using sources. Students were evaluated with these measures in their first and sixth semesters. (2 file formats: .txt, .pdf)
    9. The Norwegian Centre for Research Data's (NSD) 2019 assessment of the notification form for personal data for the PhD research project. In Norwegian. (Format: .pdf)

  14. Handwriting OCR Data of Japanese and Korean

    • kaggle.com
    Updated Oct 13, 2023
    Cite
    Frank Wong (2023). Handwriting OCR Data of Japanese and Korean [Dataset]. https://www.kaggle.com/datasets/nexdatafrank/handwriting-ocr-data-of-japanese-and-korean/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Frank Wong
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    This dataset was collected from 100 subjects including 50 Japanese, 49 Koreans and 1 Afghan. The corpus differs between subjects. The data diversity includes multiple cellphone models and different corpora. This dataset can be used for tasks such as handwriting OCR of Japanese and Korean. For more details, please visit: https://www.nexdata.ai/datasets/ocr/127?source=Kaggle

    Specifications

    Data size: 100 people; 22,163 handwriting pieces in total, with at least 159 pieces for each subject
    Nationality distribution: 50 Japanese, 49 Koreans and 1 Afghan
    Gender distribution: males
    Age distribution: mostly young and middle-aged people
    Data diversity: multiple cellphone models, different corpora
    Device: cellphone
    Data format: .json
    Annotation content: text content, age, nationality, trace of handwriting
    Accuracy: the annotation accuracy is not less than 95%

    Get the Dataset

    This is just an example of the data. To access more sample data or request the price, contact us at info@nexdata.ai

  15. OpenSeek-Pretrain-Data-Examples

    • huggingface.co
    Updated May 29, 2025
    Cite
    Beijing Academy of Artificial Intelligence (2025). OpenSeek-Pretrain-Data-Examples [Dataset]. https://huggingface.co/datasets/BAAI/OpenSeek-Pretrain-Data-Examples
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 29, 2025
    Dataset authored and provided by
    Beijing Academy of Artificial Intelligence
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    OpenSeek Pretraining Dataset v1.0 (Sample Release)

    We have released a portion of the sampled data from the OpenSeek Pretraining Dataset v1.0, primarily including Chinese and English Common Crawl (CC) datasets. Additional domain-specific datasets will be provided in future updates.

      📌 Dataset Sources
    

    English CC dataset: Mainly sourced from the Nemotron-CC dataset. Chinese CC dataset: Followed the Nemotron-CC data pipeline, based on aggregated open-source Chinese datasets.… See the full description on the dataset page: https://huggingface.co/datasets/BAAI/OpenSeek-Pretrain-Data-Examples.

  16. British English Scripted Monologue Speech Data for Healthcare

    • futurebeeai.com
    wav
    Updated Aug 1, 2022
    Cite
    FutureBee AI (2022). British English Scripted Monologue Speech Data for Healthcare [Dataset]. https://www.futurebeeai.com/dataset/monologue-speech-dataset/healthcare-scripted-speech-monologues-english-uk
    Explore at:
    Available download formats: wav
    Dataset updated
    Aug 1, 2022
    Dataset provided by
    FutureBeeAI
    Authors
    FutureBee AI
    License

    https://www.futurebeeai.com/policies/ai-data-license-agreement

    Area covered
    United Kingdom
    Dataset funded by
    FutureBeeAI
    Description

    Introduction

    Introducing the UK English Scripted Monologue Speech Dataset for the Healthcare Domain, a voice dataset built to accelerate the development and deployment of English language automatic speech recognition (ASR) systems, with a sharp focus on real-world healthcare interactions.

    Speech Data

    This dataset includes over 6,000 high-quality scripted audio prompts recorded in UK English, representing typical voice interactions found in the healthcare industry. The data is tailored for use in voice technology systems that power virtual assistants, patient-facing AI tools, and intelligent customer service platforms.

    Participant Diversity
    Speakers: 60 native UK English speakers.
    Regional Balance: Participants are sourced from multiple regions across the United Kingdom, reflecting diverse dialects and linguistic traits.
    Demographics: Includes a mix of male and female participants (60:40 ratio), aged between 18 and 70 years.
    Recording Specifications
    Nature of Recordings: Scripted monologues based on healthcare-related use cases.
    Duration: Each clip ranges from 5 to 30 seconds, offering short, context-rich speech samples.
    Audio Format: WAV files recorded in mono, with 16-bit depth and sample rates of 8 kHz and 16 kHz.
    Environment: Clean and echo-free spaces ensure clear and noise-free audio capture.

    Topic Coverage

    The prompts span a broad range of healthcare-specific interactions, such as:

    Patient check-in and follow-up communication
    Appointment booking and cancellation dialogues
    Insurance and regulatory support queries
    Medication, test results, and consultation discussions
    General health tips and wellness advice
    Emergency and urgent care communication
    Technical support for patient portals and apps
    Domain-specific scripted statements and FAQs

    Contextual Depth

    To maximize authenticity, the prompts integrate linguistic elements and healthcare-specific terms such as:

    Names: Gender- and region-appropriate United Kingdom names
    Addresses: Varied local address formats spoken naturally
    Dates & Times: References to appointment dates, times, follow-ups, and schedules
    Medical Terminology: Common medical procedures, symptoms, and treatment references
    Numbers & Measurements: Health data like dosages, vitals, and test result values
    Healthcare Institutions: Names of clinics, hospitals, and diagnostic centers

    These elements make the dataset exceptionally suited for training AI systems to understand and respond to natural healthcare-related speech patterns.

    Transcription

    Every audio recording is accompanied by a verbatim, manually verified transcription.

    Content: The transcription mirrors the exact scripted prompt recorded by the speaker.
    Format: Files are delivered in plain text (.TXT) format with consistent naming conventions for seamless integration.

  17. simple-wiki

    • huggingface.co
    Cite
    Embedding Training Data, simple-wiki [Dataset]. https://huggingface.co/datasets/embedding-data/simple-wiki
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset authored and provided by
    Embedding Training Data
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset Card for "simple-wiki"

      Dataset Summary
    

    This dataset contains pairs of equivalent sentences obtained from Wikipedia.

      Supported Tasks
    

    Sentence Transformers training; useful for semantic search and sentence similarity.

      Languages
    

    English.

      Dataset Structure
    

    Each example in the dataset contains pairs of equivalent sentences and is formatted as a dictionary with the key "set" and a list with the sentences as "value". {"set":… See the full description on the dataset page: https://huggingface.co/datasets/embedding-data/simple-wiki.
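
    The record shape described above (a dictionary with the key "set" holding a list of equivalent sentences) can be read as follows. This is a minimal sketch under stated assumptions: the one-JSON-object-per-line storage, the helper name `read_pairs`, and the sample sentences are hypothetical and are not taken from the dataset page.

```python
import json

# Hypothetical records in the documented shape: {"set": [sentence, sentence]}.
# The sentences are invented for illustration.
sample_lines = [
    '{"set": ["The cat sat on the mat.", "A cat was sitting on the mat."]}',
    '{"set": ["He left early.", "He went away early."]}',
]


def read_pairs(lines):
    """Yield (sentence_a, sentence_b) tuples from JSON-lines records."""
    for line in lines:
        sentences = json.loads(line)["set"]
        if len(sentences) != 2:
            raise ValueError(f"expected a pair, got {len(sentences)} sentences")
        yield tuple(sentences)


pairs = list(read_pairs(sample_lines))
```

    Pairs in this form are what sentence-similarity training loops typically consume, one positive pair per example.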

  18. Data from: #PraCegoVer dataset

    • data.niaid.nih.gov
    Updated Jan 19, 2023
    Cite
    Sandra Avila (2023). #PraCegoVer dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5710561
    Explore at:
    Dataset updated
    Jan 19, 2023
    Dataset provided by
    Esther Luna Colombini
    Sandra Avila
    Gabriel Oliveira dos Santos
    Description

    Automatically describing images using natural sentences is essential to the inclusion of visually impaired people on the Internet. Although there are many datasets in the literature, most of them contain only English captions, whereas datasets with captions described in other languages are scarce.

    PraCegoVer arose on the Internet, stimulating users from social media to publish images, tag #PraCegoVer and add a short description of their content. Inspired by this movement, we have proposed the #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images.

    PraCegoVer has 533,523 pairs with images and captions described in Portuguese collected from more than 14 thousand different profiles. Also, the average caption length in #PraCegoVer is 39.3 words and the standard deviation is 29.7.

    Dataset Structure

    The PraCegoVer dataset is composed of the main file dataset.json and a collection of compressed files named images.tar.gz.partX containing the images. The file dataset.json comprises a list of JSON objects with the attributes:

    user: anonymized user that made the post;

    filename: image file name;

    raw_caption: raw caption;

    caption: clean caption;

    date: post date.

    Each instance in dataset.json is associated with exactly one image in the images directory, whose file name is given by the attribute filename. Also, we provide a sample with five instances, so the users can download the sample to get an overview of the dataset before downloading it completely.
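
    The structure above maps directly onto a short loader. The following is a minimal sketch, assuming only what the description states (dataset.json is a list of JSON objects with the attributes filename and caption, among others). The helper names `load_instances` and `mean_caption_length` are illustrative and are not part of the PraCegoVer repository.

```python
import json
from pathlib import Path
from statistics import mean


def load_instances(dataset_path, images_dir):
    """Return (image_path, caption) pairs from a dataset.json-style file."""
    with open(dataset_path, encoding="utf-8") as f:
        records = json.load(f)  # a list of JSON objects
    # Each record points at exactly one image through its "filename" attribute.
    return [(Path(images_dir) / r["filename"], r["caption"]) for r in records]


def mean_caption_length(pairs):
    """Average caption length in whitespace-separated words."""
    return mean(len(caption.split()) for _, caption in pairs)
```

    Splitting on whitespace is only one possible word count, so the result may differ slightly from the average reported in the description.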

    Download Instructions

    If you just want to have an overview of the dataset structure, you can download sample.tar.gz. But, if you want to use the dataset, or any of its subsets (63k and 173k), you must download all the files and run the following commands to uncompress and join the files:

    cat images.tar.gz.part* > images.tar.gz
    tar -xzvf images.tar.gz
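
    Where a shell is not available, the same join-and-extract step can be done with the Python standard library. This is a sketch rather than the project's own tooling: the function name `join_and_extract` and its parameters are hypothetical, and it assumes the part files sort correctly in lexicographic order, as the shell glob does.

```python
import shutil
import tarfile
from pathlib import Path


def join_and_extract(parts_dir, out_dir, archive_name="images.tar.gz"):
    """Concatenate <archive_name>.part* files, then unpack the joined archive."""
    parts_dir, out_dir = Path(parts_dir), Path(out_dir)
    archive = parts_dir / archive_name
    # Sorted lexicographically, mirroring `cat images.tar.gz.part*`.
    parts = sorted(parts_dir.glob(archive_name + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no {archive_name}.part* files in {parts_dir}")
    with open(archive, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)  # streams, so large parts are fine
    out_dir.mkdir(parents=True, exist_ok=True)
    # Only extract archives from a source you trust.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
    return archive
```

    A call such as `join_and_extract("downloads", "images")` would write the joined archive next to the parts and unpack it into `images`; both directory names are placeholders.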

    Alternatively, you can download the entire dataset from the terminal using the python script download_dataset.py available in PraCegoVer repository. In this case, first, you have to download the script and create an access token here. Then, you can run the following command to download and uncompress the image files:

    python download_dataset.py --access_token=

  19. Hate Speech Detection curated Dataset🤬

    • kaggle.com
    Updated Dec 22, 2023
    Cite
    Wendyellé A. Alban NYANTUDRE (2023). Hate Speech Detection curated Dataset🤬 [Dataset]. https://www.kaggle.com/datasets/waalbannyantudre/hate-speech-detection-curated-dataset/data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 22, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wendyellé A. Alban NYANTUDRE
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    About this Data: Social media platforms have become the most prominent medium for spreading hate speech, primarily through hateful textual content. An extensive dataset containing emoticons, emojis, hashtags, slang, and contractions is required to detect hate speech on social media based on current trends. This dataset contains hate speech sentences in English, divided into two classes, one representing hateful content and the other representing non-hateful content.

    Specifications table
    Subject: Natural Language Processing - NLP
    Specific subject area: A curated dataset comprising emojis, emoticons, and contractions bundled into two classes, hateful and non-hateful, to detect hate speech in text.
    Type of data: Text
    Data format: Annotated, Analysed, Filtered Data
    Data Article: A curated dataset for hate speech detection on social media text
    Data source location: https://data.mendeley.com/datasets/9sxpkmm8xn/1

    Value of this Data:
    1. This dataset is useful for training machine learning models to identify hate speech on social media in text. It reflects current social media trends and the modern ways of writing hateful text, using emojis, emoticons, or slang. It will help social media managers, administrators, or companies develop automatic systems to filter out hateful content on social media by identifying a text and categorizing it as hateful or non-hateful speech.
    2. Deep Learning (DL) and Natural Language Processing (NLP) practitioners can be the target beneficiaries, as this dataset can be used for detecting hateful speech through DL and NLP techniques. Here the samples are composed of text sentences and labels belonging to two categories: "0" for non-hateful and "1" for hateful.
    3. Additionally, this data set can be used as a benchmark data set to detect hate speech.
    4. The data set is neutralized in such a way that it can be used by anyone, as it doesn't include any entities or names which can have an impact or cyber harm on the user that generated the content. Researchers can take advantage of the pre-processed dataset for their projects as it maintains and follows the policy guidelines.
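
    Since the labels are the two categories "0" (non-hateful) and "1" (hateful), a quick class-balance check is a sensible first step before training. The sketch below is illustrative only: the column names `text` and `label` and the placeholder rows are assumptions, since the description does not state the file layout.

```python
import csv
import io
from collections import Counter

# Placeholder rows in an assumed two-column layout; not real dataset content.
sample_csv = (
    "text,label\n"
    "have a nice day,0\n"
    "placeholder hateful sentence,1\n"
    "see you soon,0\n"
)


def label_counts(csv_file):
    """Count rows per label, where 0 = non-hateful and 1 = hateful."""
    reader = csv.DictReader(csv_file)
    return Counter(int(row["label"]) for row in reader)


counts = label_counts(io.StringIO(sample_csv))
```

    With the real file, the same function can be passed an open file handle in place of the in-memory sample.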

  20. English Longitudinal Study of Ageing: Waves 0-11, 1998-2024

    • beta.ukdataservice.ac.uk
    Updated 2025
    Cite
    J. Banks; G. David Batty; J. Breedvelt; K. Coughlin; Crawford, R., Institute For Fiscal Studies (IFS); M. Marmot; J. Nazroo; Oldfield, Z., Institute For Fiscal Studies (IFS); N. Steel; A. Steptoe; M. Wood; P. Zaninotto (2025). English Longitudinal Study of Ageing: Waves 0-11, 1998-2024 [Dataset]. http://doi.org/10.5255/ukda-sn-5050-32
    Explore at:
    Dataset updated
    2025
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    datacite
    Authors
    J. Banks; G. David Batty; J. Breedvelt; K. Coughlin; Crawford, R., Institute For Fiscal Studies (IFS); M. Marmot; J. Nazroo; Oldfield, Z., Institute For Fiscal Studies (IFS); N. Steel; A. Steptoe; M. Wood; P. Zaninotto
    Description

    The English Longitudinal Study of Ageing (ELSA) is a longitudinal survey of ageing and quality of life among older people that explores the dynamic relationships between health and functioning, social networks and participation, and economic position as people plan for, move into and progress beyond retirement. The main objectives of ELSA are to:

    • construct waves of accessible and well-documented panel data;
    • provide these data in a convenient and timely fashion to the scientific and policy research community;
    • describe health trajectories, disability and healthy life expectancy in a representative sample of the English population aged 50 and over;
    • examine the relationship between economic position and health;
    • investigate the determinants of economic position in older age;
    • describe the timing of retirement and post-retirement labour market activity; and
    • understand the relationships between social support, household structure and the transfer of assets.

    Further information may be found on the ELSA project website (https://www.elsa-project.ac.uk/) or the Natcen Social Research: ELSA web pages.

    Wave 11 data has been deposited - May 2025

    For the 45th edition (May 2025) ELSA Wave 11 core and pension grid data and documentation were deposited. Users should note this dataset version does not contain the survey weights. A version with the survey weights along with IFS and financial derived datasets will be deposited in due course. In the meantime, more information about the data collection or the data collected during this wave of ELSA can be found in the Wave 11 Technical Report or the User Guide.

    Health conditions research with ELSA - June 2021

    The ELSA Data team have found some issues with historical data measuring health conditions. If you are intending to do any analysis looking at the following health conditions, then please read the ELSA User Guide or if you still have questions contact elsadata@natcen.ac.uk for advice on how you should approach your analysis. The affected conditions are: eye conditions (glaucoma; diabetic eye disease; macular degeneration; cataract), CVD conditions (high blood pressure; angina; heart attack; Congestive Heart Failure; heart murmur; abnormal heart rhythm; diabetes; stroke; high cholesterol; other heart trouble) and chronic health conditions (chronic lung disease; asthma; arthritis; osteoporosis; cancer; Parkinson's Disease; emotional, nervous or psychiatric problems; Alzheimer's Disease; dementia; malignant blood disorder; multiple sclerosis or motor neurone disease).

    For information on obtaining data from ELSA that are not held at the UKDS, see the ELSA Genetic data access and Accessing ELSA data webpages.

    Wave 10 Health data
    Users should note that in Wave 10, the health section of the ELSA questionnaire has been revised and all respondents were asked anew about their health conditions, rather than following the prior approach of asking those who had taken part in the past waves to confirm previously recorded conditions. Due to this reason, the health conditions feed-forward data was not archived for Wave 10, as was done in previous waves.

    Harmonized dataset:

    Users of the Harmonized dataset who prefer to use the Stata version will need access to Stata MP software, as the version G3 file contains 11,779 variables (the limit for the standard Stata 'Intercooled' version is 2,047).

    ELSA COVID-19 study:
    A separate ad-hoc study conducted with ELSA respondents, measuring the socio-economic effects/psychological impact of the lockdown on the aged 50+ population of England, is also available under SN 8688, English Longitudinal Study of Ageing COVID-19 Study.
