100+ datasets found
  1. NLP Research Papers Dataset

    • kaggle.com
    zip
    Updated May 1, 2024
    Cite
    Subham Surana (2024). NLP Research Papers Dataset [Dataset]. https://www.kaggle.com/datasets/subhamjain/natural-language-processing-research-papers
    Available download formats: zip (1074694 bytes)
    Dataset updated
    May 1, 2024
    Authors
    Subham Surana
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The dataset appears to be a collection of NLP research papers, with the full text available in the "article" column, abstract summaries in the "abstract" column, and information about different sections in the "section_names" column. Researchers and practitioners in the field of natural language processing can use this dataset for various tasks, including text summarization, document classification, and analysis of research paper structures.

    Data Fields

    Here's a short description of the Natural Language Processing Research Papers dataset:

    1. Article: likely contains the full text of each research paper related to Natural Language Processing (NLP). Each entry represents the entire body of a specific research article.
    2. Abstract: likely contains the abstracts of the NLP research papers. The abstract provides a concise summary of the paper, highlighting its key objectives, methods, and findings.
    3. Section Names: probably contains the section headings within each research paper, such as Introduction, Methodology, Results, and Conclusion. This information can be useful for structuring and organizing the content of the papers.

    File Description

    Content Overview: The dataset is valuable for researchers, students, and practitioners in the field of Natural Language Processing. File format: CSV.
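
    A minimal loading sketch with pandas, assuming the zip has been downloaded from Kaggle and extracted to a local file named nlp_papers.csv (the file name is hypothetical; the column names follow the description above):

    ```
    import pandas as pd

    # Hypothetical local file name; download and extract the Kaggle zip first.
    df = pd.read_csv("nlp_papers.csv")

    # Columns described above: full text, abstract summary, and section headings.
    print(df.columns.tolist())  # expected: ['article', 'abstract', 'section_names']
    print(df[["article", "abstract"]].head(3))

    # Rough length statistics of the full texts, useful before summarization work.
    print(df["article"].str.split().str.len().describe())
    ```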

  2. Labeled NLP Prompts For Prompt Classification

    • kaggle.com
    zip
    Updated Aug 12, 2025
    Cite
    h9800 (2025). Labeled NLP Prompts For Prompt Classification [Dataset]. https://www.kaggle.com/datasets/has9800/labeled-nlp-prompts-for-prompt-classification
    Available download formats: zip (778886 bytes)
    Dataset updated
    Aug 12, 2025
    Authors
    h9800
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    As part of a project, I created synthetic data for classifying NLP prompts, to determine whether a prompt should be routed to a RAG wrapper around the LLM or sent to the model directly. I used Grok-4 Expert to outline the prompt-generation constraints, which produced ~50,000 labelled prompts.

    The data is synthesized using the following characteristics:

    Balance: Exactly 50/50 split to prevent class imbalance issues during training (we can add class weights if needed later).

    Variety: Drawn from multiple categories to build robustness:

    Diversity Enhancers: Include variations in length (short vs. long), ambiguity (e.g., edge cases like "Tell me about AI ethics" which might lean non-RAG but could border factual), languages (mostly English, but a few multilingual for realism), and styles (questions, commands, statements). Add noise like typos or informal phrasing to mimic users.

    Adversarial Prompts (10%): Break patterns, e.g., "Latest news on Hogwarts?" (non-RAG despite "latest"), "Solve x^2 using recent math breakthroughs" (RAG despite "solve").

    Edge Cases: ~40% should be ambiguous to test the classifier's threshold (e.g., "What's the weather like?"—could be current/dynamic needing RAG, but we'll label based on rules like "requires real-time data").

    Reduced Predictable Keywords: Minimized "latest", "current" in RAG prompts; use varied phrasing (e.g., "What’s Tesla’s deal now?" vs. "Current Tesla stock?").

    Labeling Logic: I'll use CoT internally—e.g., if the prompt demands verifiable, updatable, or sourced info not inherent to an LLM's static knowledge, label 1; else 0. This mirrors our gateway's decision process.

    Human-Like Text: Increased noise probability to 30% with slang ("yo", "gimme", "rn", "lol", "uhh") and conversational fillers (e.g., "you know?"). Added to ~30% of prompts for realism, mimicking user queries like those on X or forums.

    Reduced Leakage: Varied templates to avoid predictable RAG triggers (e.g., "What’s the deal with {topic}?" instead of "Current {topic} policies"). Included adversarial prompts (e.g., "Latest news on Narnia?"—non-RAG).

    The data labels

    The data is labeled for my specific use case; below, I explain the labelling in more depth:

    A) Needs RAG (1): Factual/dynamic queries requiring external sources—e.g., historical events ("What caused the fall of the Roman Empire?"), medical advice ("Symptoms of diabetes?"), political updates ("Who won the 2024 US election?"), changing facts ("Current CEO of Apple?"), science with sources ("Latest research on quantum computing?"), finance ("S&P 500 return in 2023?"), or specific retrieval needs ("Key points from Tesla's Q2 2024 earnings call?").

    B) Doesn't Need RAG (0): Self-contained or generative—e.g., math ("Solve x^2 + 5x + 6 = 0"), opinions ("Best programming language for beginners?"), creative ("Write a haiku about summer"), logic ("If all cats are mammals, and Whiskers is a cat, what is Whiskers?"), timeless concepts ("Explain recursion in programming"), or hypothetical scenarios ("Design a workout plan for beginners").
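
    As a rough illustration of how these labels could be used, the sketch below trains a simple TF-IDF + logistic regression router on the CSV; the file name labeled_prompts.csv and the column names prompt and label are assumptions, so check the actual header after downloading.

    ```
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Assumed file and column names; adjust to the actual CSV in the Kaggle zip.
    df = pd.read_csv("labeled_prompts.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["prompt"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
    )

    # 1 = route to the RAG wrapper, 0 = send directly to the model.
    router = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        LogisticRegression(max_iter=1000),
    )
    router.fit(X_train, y_train)
    print(classification_report(y_test, router.predict(X_test)))
    ```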

    Not sure whether anyone will find this useful, but if you do, please comment below with your ideas.

  3. Portuguese Language Datasets | 300K Translations | Natural Language...

    • datarade.ai
    .json, .xml
    Updated Jul 11, 2025
    Cite
    Oxford Languages (2025). Portuguese Language Datasets | 300K Translations | Natural Language Processing (NLP) Data | Dictionary Display | Translation | EU & LATAM Coverage [Dataset]. https://datarade.ai/data-products/portuguese-language-datasets-140k-words-300k-translations-oxford-languages
    Available download formats: .json, .xml
    Dataset updated
    Jul 11, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Angola, Sao Tome and Principe, Guinea-Bissau, Brazil, Mozambique, Portugal, Timor-Leste, Cabo Verde, Macao
    Description

    Comprehensive Portuguese language datasets with linguistic annotations, including headwords, definitions, word senses, usage examples, part-of-speech (POS) tags, semantic metadata, and contextual usage details. Perfect for powering dictionary platforms, NLP, AI models, and translation systems.

    Our Portuguese language datasets are carefully compiled and annotated by language and linguistic experts. The following Portuguese datasets are available for license:

    1. Portuguese Monolingual Dictionary Data
    2. Portuguese Bilingual Dictionary Data

    Key Features (approximate numbers):

    1. Portuguese Monolingual Dictionary Data

    Our Portuguese monolingual dictionary data covers both EU and LATAM varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

    • Words: 143,600
    • Senses: 285,500
    • Example sentences: 69,300
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    2. Portuguese Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is reviewed and updated annually by our in-house team of language experts, and offers comprehensive coverage of the language, providing a substantial volume of high-quality translated words spanning both EU and LATAM Portuguese varieties.

    • Translations: 300,000
    • Senses: 158,000
    • Example translations: 117,800
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include Natural Language Processing (NLP) applications, TTS, dictionary display tools, games, translations, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals.

    About the sample:

    The samples offer a brief overview of one or two language datasets (monolingual and/or bilingual dictionary data). To help you explore the structure and features of our dataset, we provide a sample in CSV format for preview purposes only.

    If you need the complete original sample or more details about any dataset, please contact us (Growth.OL@oup.com) to request access or further information.

  4. American English Language Datasets | 150+ Years of Research | Textual Data |...

    • datarade.ai
    Updated Jul 29, 2025
    Cite
    Oxford Languages (2025). American English Language Datasets | 150+ Years of Research | Textual Data | Audio Data | Natural Language Processing (NLP) Data | US English Coverage [Dataset]. https://datarade.ai/data-products/american-english-language-datasets-150-years-of-research-oxford-languages
    Available download formats: .json, .xml, .csv, .xls, .mp3, .wav
    Dataset updated
    Jul 29, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    United States
    Description

    Derived from over 150 years of lexical research, this comprehensive collection of textual and audio data focused on American English provides linguistically annotated material. Ideal for NLP applications, LLM training and/or fine-tuning, as well as educational and game apps.

    One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The following American English datasets are available for license:

    1. American English Monolingual Dictionary Data
    2. American English Synonyms and Antonyms Data
    3. American English Pronunciations with Audio

    Key Features (approximate numbers):

    1. American English Monolingual Dictionary Data

    Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.

    • Headwords: 140,000
    • Senses: 222,000
    • Sentence examples: 140,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    2. American English Synonyms and Antonyms Data

    The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

    • Synonyms: 600,000
    • Antonyms: 22,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    3. American English Pronunciations with Audio (word-level)

    This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.

    • Transcriptions (IPA): 250,000
    • Audio files: 180,000
    • Format: XLSX (for transcriptions), MP3 and WAV (audio files)
    • Updated frequency: annually

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. Please note that some datasets may have rights restrictions. Contact us for more information.

    About the sample:

    To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.

    Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.

  5. British English Language Datasets | 150+ Years of Research | Audio Data |...

    • datarade.ai
    Updated Jul 30, 2025
    Cite
    Oxford Languages (2025). British English Language Datasets | 150+ Years of Research | Audio Data | Natural Language Processing (NLP) Data | EU Coverage [Dataset]. https://datarade.ai/data-products/british-english-language-datasets-150-years-of-research-oxford-languages
    Available download formats: .json, .xml, .csv, .xls, .mp3, .wav
    Dataset updated
    Jul 30, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    United Kingdom
    Description

    Derived from over 150 years of lexical research, this comprehensive collection of textual and audio data focused on British English provides linguistically annotated material. Ideal for NLP applications, Machine Learning (ML), LLM training and/or fine-tuning, as well as educational and game apps.

    Our British English language datasets are meticulously curated and annotated by experienced linguists and language experts, ensuring exceptional accuracy, consistency, and linguistic depth. The following British English datasets are available for license:

    1. British English Monolingual Dictionary Data
    2. British English Synonyms and Antonyms Data
    3. British English Pronunciations with Audio

    Key Features (approximate numbers):

    1. British English Monolingual Dictionary Data

    Our British English monolingual dataset delivers clear, reliable definitions and authentic usage examples, featuring a high volume of headwords and in-depth coverage of the British variety of English. As one of the world’s most authoritative lexical resources, it’s trusted by leading academic, AI, and language technology organizations.

    • Headwords: 146,000
    • Senses: 230,000
    • Sentence examples: 149,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: twice a year
    2. British English Synonyms and Antonyms Data

    This British English language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for NLP tasks such as semantic search, word sense disambiguation, and language generation.

    • Synonyms: 600,000
    • Antonyms: 22,000
    • Usage Examples: 39,000
    • Format: XML and JSON format
    • Delivery: Email (link-based file sharing)
    • Updated frequency: annually
    3. British English Pronunciations with Audio (word-level)

    This dataset provides IPA transcriptions and clean audio data in contemporary British English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.

    • Transcriptions (IPA): 250,000
    • Audio files: 180,000
    • Format: XLSX (for transcriptions), MP3 and WAV (audio files)
    • Updated frequency: annually

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:  

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs. 

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. 

    Please note that some datasets may have rights restrictions. Contact us for more information. 

    About the sample: 

    To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.  

    Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details. 

  6. Spam Classification for Basic NLP

    • kaggle.com
    zip
    Updated Mar 8, 2021
    Cite
    Chandramouli Naidu (2021). Spam Classification for Basic NLP [Dataset]. https://www.kaggle.com/chandramoulinaidu/spam-classification-for-basic-nlp
    Available download formats: zip (5962068 bytes)
    Dataset updated
    Mar 8, 2021
    Authors
    Chandramouli Naidu
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset consists of raw mail messages, which are suitable for NLP pre-processing steps such as tokenizing, removing stop words, stemming, and parsing HTML tags. These steps are important for anyone getting started with NLP. The dataset also works well with standard NLP tooling such as vectorizers.
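
    A minimal pre-processing sketch along those lines, using NLTK and the standard library on a single raw message (the sample message is made up; apply the same function to the messages in the download):

    ```
    import re
    from html import unescape

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # Tokenizer/stopword resources; newer NLTK versions use "punkt_tab" instead of "punkt".
    for resource in ("punkt", "punkt_tab", "stopwords"):
        nltk.download(resource, quiet=True)

    def preprocess(raw_message):
        """Strip HTML tags, tokenize, drop stop words, and stem a raw mail message."""
        text = unescape(re.sub(r"<[^>]+>", " ", raw_message))  # remove HTML tags
        tokens = word_tokenize(text.lower())                   # tokenize
        stops = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

    print(preprocess("<p>Congratulations! You have <b>won</b> a free prize.</p>"))
    ```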

  7. Design2Code-hf

    • huggingface.co
    Updated Mar 6, 2024
    Cite
    Social And Language Technology Lab (2024). Design2Code-hf [Dataset]. http://doi.org/10.57967/hf/2412
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 6, 2024
    Dataset authored and provided by
    Social And Language Technology Lab
    License

    https://choosealicense.com/licenses/odc-by/

    Description

    This dataset consists of 484 webpages from the C4 validation set, serving the purpose of testing multimodal LLMs on converting visual designs into code implementations. See the dataset in the raw files format here. Note that all images in these webpages are replaced by a placeholder image (rick.jpg). Please refer to our project page and our paper for more information.

  8. Dataset for Generation of multiple true false questions

    • zenodo.org
    zip
    Updated Nov 8, 2022
    Cite
    Regina Kasakowskij; Thomas Kasakowskij; Niels Seidel (2022). Dataset for Generation of multiple true false questions [Dataset]. http://doi.org/10.5281/zenodo.7303300
    Available download formats: zip
    Dataset updated
    Nov 8, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Regina Kasakowskij; Thomas Kasakowskij; Niels Seidel
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Generation of multiple true-false questions

    This project provides a Natural Language Pipeline for processing German Textbook sections as an input generating Multiple True-False Questions using GPT2.

    Assessments are an important part of the learning cycle and enable the development and promotion of competencies. However, the manual creation of assessments is very time-consuming. Therefore, the number of tasks in learning systems is often limited. In this repository, we provide an algorithm that can automatically generate an arbitrary number of German True False statements from a textbook using the GPT-2 model. The algorithm was evaluated with a selection of textbook chapters from four academic disciplines (see `data` folder) and rated by individual domain experts. One-third of the generated MTF Questions are suitable for learning. The algorithm provides instructors with an easier way to create assessments on chapters of textbooks to test factual knowledge.

    As a type of Multiple-Choice question, Multiple True False (MTF) Questions are, among other question types, a simple and efficient way to objectively test factual knowledge. The learner is challenged to distinguish between true and false statements. MTF questions can be presented differently, e.g. by locating a true statement from a series of false statements, identifying false statements among a list of true statements, or separately evaluating each statement as either true or false. Learners must evaluate each statement individually because a question stem can contain both incorrect and correct statements. Thus, MTF Questions as a machine-gradable format have the potential to identify learners’ misconceptions and knowledge gaps.

    Example MTF question:

    Check the correct statements:

    [ ] All trees have green leaves.

    [ ] Trees grow towards the sky.

    [ ] Leaves can fall from a tree.

    Features

    - generation of false statements

    - automatic selection of true statements

    - selection of an arbitrary similarity for true and false statements as well as the number of false statements

    - generating false statements by adding or deleting negations, as well as using a German GPT-2

    Setup

    Installation

    1. Create a new environment: `conda create -n mtfenv python=3.9`

    2. Activate the environment: `conda activate mtfenv`

    3. Install dependencies using anaconda:

    ```

    conda install -y -c conda-forge pdfplumber

    conda install -y -c conda-forge nltk

    conda install -y -c conda-forge pypdf2

    conda install -y -c conda-forge pylatexenc

    conda install -y -c conda-forge packaging

    conda install -y -c conda-forge transformers

    conda install -y -c conda-forge essential_generators

    conda install -y -c conda-forge xlsxwriter

    ```

    4. Download the spaCy model: `python3.9 -m spacy download de_core_news_lg`

    Getting started

    After installation, you can execute the bash script `bash run.sh` in the terminal to compile MTF questions for the provided textbook chapters.

    To create MTF questions for your own texts use the following command:

    `python3 main.py --answers 1 --similarity 0.66 --input ./`

    The parameter `answers` indicates how many false answers should be generated.

    By configuring the parameter `similarity` you can determine what portion of a sentence should remain the same. The remaining portion will be extracted and used to generate a false part of the sentence.

    History and roadmap

    * Outlook third iteration: Automatic augmentation of text chapters with generated questions

    * Second iteration: Generation of multiple true-false questions with improved text summarizer and German GPT2 sentence generator

    * First iteration: Generation of multiple true false questions in the Bachelor thesis of Mirjam Wiemeler

    Publications, citations, license

    Publications

    • Kasakowskij, R., Kasakowskij, T. & Seidel, N., (2022). Generation of Multiple True False Questions. In: Henning, P. A., Striewe, M. & Wölfel, M. (Hrsg.), 20. Fachtagung Bildungstechnologien (DELFI). Bonn: Gesellschaft für Informatik e.V.. (S. 147-152). DOI: [10.18420/delfi2022-026](https://dl.gi.de/handle/20.500.12116/38826)

    Citation of the Dataset

    The source code and data are maintained at GitHub: https://github.com/D2L2/multiple-true-false-question-generation

    Contact

    • Regina Kasakowskij (M.A.) - regina.kasakowskij@fernuni-hagen.de
    • Dr. Niels Seidel - niels.seidel@fernuni-hagen.de

    License Distributed under the MIT License. See [LICENSE.txt](https://gitlab.pi6.fernuni-hagen.de/la-diva/adaptive-assessment/generationofmultipletruefalsequestions/-/blob/master/LICENSE.txt) for more information.

    Acknowledgments This research was supported by CATALPA - Center of Advanced Technology for Assisted Learning and Predictive Analytics of the FernUniversität in Hagen, Germany.

    This project was carried out as part of research in the CATALPA project [LA DIVA](https://www.fernuni-hagen.de/forschung/schwerpunkte/catalpa/forschung/projekte/la-diva.shtml)

  9. Datasets of "An Automatically Generated Annotated Corpus for Albanian Named...

    • data.niaid.nih.gov
    Updated Nov 21, 2022
    Cite
    Hoxha, Klesti; Baxhaku, Artur (2022). Datasets of "An Automatically Generated Annotated Corpus for Albanian Named Entity Recognition" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7339198
    Dataset updated
    Nov 21, 2022
    Dataset provided by
    University of Tirana
    Authors
    Hoxha, Klesti; Baxhaku, Artur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an Albanian named entities annotation corpus generated automatically (silver-standard) from Wikipedia and WikiData. It is offered in Apache OpenNLP annotation format.
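
    For readers unfamiliar with the Apache OpenNLP name-finder annotation format, the snippet below sketches its general shape and one way to pull out the marked spans; the sentence and the entity type label are illustrative examples, not taken from the corpus.

    ```
    import re

    # OpenNLP name-finder training data marks entities inline, one sentence per line:
    #   <START:type> token(s) <END>
    # The sentence and the "person" type below are made-up examples.
    line = "<START:person> Ismail Kadare <END> lindi në Gjirokastër ."

    # Extract (type, text) pairs from one annotated line.
    spans = re.findall(r"<START:([\w.-]+)>\s*(.*?)\s*<END>", line)
    print(spans)  # [('person', 'Ismail Kadare')]
    ```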

    Details of the generation approach may be found in the respective published paper: https://doi.org/10.2478/cait-2018-0009

    Attached are also the files that were used for generating the Albanian named entities gazetteer and the gazetteer itself in JSON format.

  10. LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data...

    • datarade.ai
    Updated Jul 22, 2025
    Cite
    Oxford Languages (2025). LATAM Data Suite | 1.8M+ Sentences | Natural Language Processing (NLP) Data | TTS | Dictionary Display | Translation Data | LATAM Coverage [Dataset]. https://datarade.ai/data-products/latam-data-suite-1-8m-sentences-nlp-tts-dictionary-d-oxford-languages
    Available download formats: .json, .xml, .csv, .xls, .mp3, .wav
    Dataset updated
    Jul 22, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Peru, Bolivia (Plurinational State of), Panama, Uruguay, Colombia, Spain, Mexico, Ecuador, Dominican Republic, Puerto Rico
    Description

    LATAM Data Suite provides high-quality datasets in Spanish, Portuguese, and American English. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the LATAM Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Sentences Curated examples of real-world usage with contextual annotations.

    • Synonyms & Antonyms Lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data Native speaker recordings for TTS and pronunciation modeling.

    • Word Lists Frequency-ranked and thematically grouped lists.

    Learn more about the datasets included in the data suite:

    1. Portuguese Monolingual Dictionary Data
    2. Portuguese Bilingual Dictionary Data
    3. Spanish Monolingual Dictionary Data
    4. Spanish Bilingual Dictionary Data
    5. Spanish Sentences Data
    6. Spanish Synonyms and Antonyms Data
    7. Spanish Audio Data
    8. Spanish Word List Data
    9. American English Monolingual Dictionary Data
    10. American English Synonyms and Antonyms Data
    11. American English Pronunciations with Audio

    Key Features (approximate numbers):

    1. Portuguese Monolingual Dictionary Data

    Our Portuguese monolingual dictionary data covers both European and Latin American varieties, featuring clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Portuguese language.

    • Words: 143,600
    • Senses: 285,500
    • Example sentences: 69,300
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    2. Portuguese Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Portuguese and from Portuguese to English. It is reviewed and updated annually by our in-house team of language experts, and offers comprehensive coverage of the language, providing a substantial volume of high-quality translated words spanning both European and Latin American Portuguese varieties.

    • Translations: 300,000
    • Senses: 158,000
    • Example translations: 117,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    3. Spanish Monolingual Dictionary Data

    Our Spanish monolingual dictionary data reliably offers clear definitions and examples, a large volume of headwords, and comprehensive coverage of the Spanish language.

    • Words: 73,000
    • Senses: 123,000
    • Example sentences: 104,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    4. Spanish Bilingual Dictionary Data

    The bilingual data provides translations in both directions, from English to Spanish and from Spanish to English. It is reviewed and updated annually by our in-house team of language experts, and offers significant coverage of the language, providing a large volume of high-quality translated words.

    • Translations: 221,300
    • Senses: 103,500
    • Example sentences: 74,500
    • Example translations: 83,800
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    • Updated frequency: annually
    5. Spanish Sentences Data

    Spanish sentences retrieved from corpora are ideal for NLP model training, comprising approximately 20 million words. The sentences provide broad coverage of Spanish-speaking countries and are tagged to a particular country or dialect accordingly.

    • Sentences volume: 1,840,000
    • Format: XML and JSON formats
    • Delivery: Email (link-based file sharing) and REST API
    6. Spanish Synonyms and Antonyms Data

    This Spanish language dataset offers a rich collection of synonyms and antonyms, accompanied by detailed definitions and part-of-speech (POS) annotations, making it a comprehensive resource for building linguistically aware AI systems and language technologies.

    • Synonyms: 127,700
    • Antonyms: 9,500
    • Format: XML format
    • Delivery: Email (link-based file sharing)
    • Updated frequency: annually
    7. Spanish Audio Data (word-level)

    Curated word-level audio data for the Spanish language, covering all varieties of world Spanish and providing rich dialectal diversity.

    • Audio files: 20,900
    • Format: XLSX (for index), MP3 and WAV (audio files)
    8. Spanish Word List Data

    This language data contains a carefully curated and comprehensive list of 450,000 Spanish words.

    • Wordforms: 450,000
    • Format: CSV and TXT formats
    • Delivery: Email (link-based file sharing)
    9. American English Monolingual Dictionary Data

    Our American English Monolingual Dictionary Data is the foremost au...

  11. Smoking NLP Challenge Data

    • dknet.org
    • neuinfo.org
    • +2 more
    Updated Jan 29, 2022
    Cite
    (2022). Smoking NLP Challenge Data [Dataset]. http://identifiers.org/RRID:SCR_008644
    Dataset updated
    Jan 29, 2022
    Description

    The data for the smoking challenge consisted exclusively of discharge summaries from Partners HealthCare, which were preprocessed, converted into XML format, and separated into training and test sets. i2b2 is a data warehouse containing clinical data on over 150k patients, including outpatient DX, lab results, medications, and inpatient procedures; ETL processes were authored to pull data from EMR and finance systems. Institutional review boards of Partners HealthCare approved the challenge and the data preparation process. The data were annotated by pulmonologists and classified patients into Past Smokers, Current Smokers, Smokers, Non-smokers, and Unknown; second-hand smokers were considered non-smokers. Other institutions involved include the Massachusetts Institute of Technology and the State University of New York at Albany.

    i2b2 is a passionate advocate for the potential of existing clinical information to yield insights that can directly impact healthcare improvement. In our many use cases (Driving Biology Projects) it has become increasingly obvious that the value locked in unstructured text is essential to the success of our mission. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1. A similar set of notes from the Second i2b2 Challenge will be released on the one-year anniversary of that Challenge (November 2010).

  12. Data from: Using natural language processing to analyze unstructured...

    • tandf.figshare.com
    docx
    Updated Apr 8, 2024
    Cite
    Jin-Ah Sim; Xiaolei Huang; Madeline R. Horan; Justin N. Baker; I-Chan Huang (2024). Using natural language processing to analyze unstructured patient-reported outcomes data derived from electronic health records for cancer populations: a systematic review [Dataset]. http://doi.org/10.6084/m9.figshare.25341516.v1
    Available download formats: docx
    Dataset updated
    Apr 8, 2024
    Dataset provided by
    Taylor & Francis (https://taylorandfrancis.com/)
    Authors
    Jin-Ah Sim; Xiaolei Huang; Madeline R. Horan; Justin N. Baker; I-Chan Huang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Patient-reported outcomes (PROs; symptoms, functional status, quality-of-life) expressed in the ‘free-text’ or ‘unstructured’ format within clinical notes from electronic health records (EHRs) offer valuable insights beyond biological and clinical data for medical decision-making. However, a comprehensive assessment of utilizing natural language processing (NLP) coupled with machine learning (ML) methods to analyze unstructured PROs and their clinical implementation for individuals affected by cancer remains lacking. This study aimed to systematically review published studies that used NLP techniques to extract and analyze PROs in clinical narratives from EHRs for cancer populations. We examined the types of NLP (with and without ML) techniques and platforms for data processing, analysis, and clinical applications. Utilizing NLP methods offers a valuable approach for processing and analyzing unstructured PROs among cancer patients and survivors. These techniques encompass a broad range of applications, such as extracting or recognizing PROs, categorizing, characterizing, or grouping PROs, predicting or stratifying risk for unfavorable clinical results, and evaluating connections between PROs and adverse clinical outcomes. The employment of NLP techniques is advantageous in converting substantial volumes of unstructured PRO data within EHRs into practical clinical utilities for individuals with cancer.

  13. Europe PMC Full Text Corpus

    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre (2023). Europe PMC Full Text Corpus [Dataset]. http://doi.org/10.6084/m9.figshare.22848380.v2
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Santosh Tirunagari; Xiao Yang; Shyamasree Saha; Aravind Venkatesan; Vid Vartak; Johanna McEntyre
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains the Europe PMC full text corpus, a collection of 300 articles from the Europe PMC Open Access subset. Each article contains 3 core entity types, manually annotated by curators: Gene/Protein, Disease and Organism.

    Corpus Directory Structure

    annotations/: contains annotations of the 300 full-text articles in the Europe PMC corpus. Annotations are provided in 3 different formats.

    hypothesis/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    GROUP0/: contains raw manual annotations made by curator GROUP0.
    GROUP1/: contains raw manual annotations made by curator GROUP1.
    GROUP2/: contains raw manual annotations made by curator GROUP2.

    IOB/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, which is in Inside–Outside–Beginning tagging format.
    dev/: contains IOB-format annotations of 45 articles, intended to be used as a dev set in machine learning tasks.
    test/: contains IOB-format annotations of 45 articles, intended to be used as a test set in machine learning tasks.
    train/: contains IOB-format annotations of 210 articles, intended to be used as a training set in machine learning tasks.

    JSON/: contains automatically extracted annotations using raw manual annotations in hypothesis/csv/, in JSON format.
    README.md: a detailed description of all the annotation formats.

    articles/: contains the full-text articles annotated in Europe PMC corpus.

    Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
    XML/: contains XML articles directly fetched using the Europe PMC Article Restful API.
    README.md: a detailed description of the sentencising and fetching of XML articles.

    docs/: contains related documents that were used for generating the corpus.

    Annotation guideline.pdf: annotation guideline provided to curators to assist the manual annotation.
    demo to molecular conenctions.pdf: annotation platform guideline provided to curators to help them get familiar with the Hypothes.is platform.
    Training set development.pdf: initial document that details the paper selection procedures.

    pilot/: contains annotations and articles that were used in a pilot study.

    annotations/csv/: contains raw annotations fetched from the annotation platform Hypothes.is in comma-separated values (CSV) format.
    articles/: contains the full-text articles annotated in the pilot study.

     Sentencised/: contains XML articles whose text has been split into sentences using the Europe PMC sentenciser.
     XML/: contains XML articles directly fetched using Europe PMC Article Restful API.
    

    README.md: a detailed description of the sentencising and fetching of XML articles.

    src/: source code for cleaning annotations and generating IOB files

    metrics/ner_metrics.py: Python script containing SemEval evaluation metrics.
    annotations.py: Python script used to extract annotations from raw Hypothes.is annotations.
    generate_IOB_dataset.py: Python script used to convert JSON-format annotations to IOB tagging format.
    generate_json_dataset.py: Python script used to extract annotations to JSON format.
    hypothesis.py: Python script used to fetch raw Hypothes.is annotations.
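
    A small helper for reading one of the IOB files into (token, tag) pairs, assuming the usual CoNLL-style layout of one token and tag per line with blank lines between sentences; the exact column layout and file names should be checked against README.md.

    ```
    from pathlib import Path

    def read_iob(path):
        """Parse a CoNLL/IOB-style file into sentences of (token, tag) pairs."""
        sentences, current = [], []
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if not line.strip():          # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split()
            current.append((cols[0], cols[-1]))  # first column = token, last = tag
        if current:
            sentences.append(current)
        return sentences

    # Hypothetical path inside the extracted corpus.
    # sentences = read_iob("annotations/IOB/train/some_article.iob")
    ```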

    License

    CC BY

    Feedback

    For any comment, question, and suggestion, please contact us through helpdesk@europepmc.org or Europe PMC contact page.

  14. NLP Benchmarking Data for Intent and Entity

    • kaggle.com
    zip
    Updated Apr 30, 2020
    Cite
    Joydeb Mondal (2020). NLP Benchmarking Data for Intent and Entity [Dataset]. https://www.kaggle.com/joydeb28/nlp-benchmarking-data-for-intent-and-entity
    Available download formats: zip (364900 bytes)
    Dataset updated
    Apr 30, 2020
    Authors
    Joydeb Mondal
    Description

    Files

    There are two directories, one for training and one for validation. Each of them contains all the files listed below:

    1. AddToPlaylist.json
    2. PlayMusic.json
    3. SearchScreeningEvent.json
    4. BookRestaurant.json
    5. RateBook.json
    6. GetWeather.json
    7. SearchCreativeWork.json

    Data Format

    There are 7 intents. The data is in JSON format, where each entity is also tagged.

    Usage

    This dataset can be used for both intent classification and NER.
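
    A loading sketch under the assumption that each JSON file follows the common Snips-benchmark layout, where the top-level key is the intent name and each utterance is a list of text chunks, some of which carry an entity tag; the file path and keys should be verified against the actual files, since the exact schema is not documented here.

    ```
    import json

    # Hypothetical path; adjust to wherever the training directory was extracted.
    with open("train/AddToPlaylist.json", encoding="utf-8") as f:
        data = json.load(f)

    intent = next(iter(data))  # e.g. "AddToPlaylist"
    for utterance in data[intent][:3]:
        text = "".join(chunk["text"] for chunk in utterance["data"])
        entities = [(chunk["text"], chunk["entity"])
                    for chunk in utterance["data"] if "entity" in chunk]
        print(intent, "|", text, "|", entities)
    ```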

  15. MESINESP: Medical Semantic Indexing in Spanish - Train dataset

    • data-staging.niaid.nih.gov
    • live.european-language-grid.eu
    • +1 more
    Updated Nov 5, 2022
    + more versions
    Cite
    Rana, Ankush; Gonzalez-Agirre, Aitor; Miranda-Escalada, Antonio; Krallinger, Martin (2022). MESINESP: Medical Semantic Indexing in Spanish - Train dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3826491
    Dataset updated
    Nov 5, 2022
    Dataset provided by
    Barcelona Supercomputing Center
    Authors
    Rana, Ankush; Gonzalez-Agirre, Aitor; Miranda-Escalada, Antonio; Krallinger, Martin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Please use the MESINESP2 corpus (the second edition of the shared-task) since it has a higher level of curation, quality and is organized by document type (scientific articles, patents and clinical trials).

    INTRODUCTION:

    The Mesinesp (Spanish BioASQ track, see https://temu.bsc.es/mesinesp) training set has a total of 369,368 records.

    The training dataset contains all records from LILACS and IBECS databases at the Virtual Health Library (VHL) with a non-empty abstract written in Spanish. The URL used to retrieve records is as follows: http://pesquisa.bvsalud.org/portal/?output=xml&lang=es&sort=YEAR_DESC&format=abstract&filter[db][]=LILACS&filter[db][]=IBECS&q=&index=tw&

    We have filtered out empty abstracts and non-Spanish abstracts.

    The training dataset was crawled on 10/22/2019. This means that the data is a snapshot of that moment and that may change over time. In fact, it is very likely that the data will undergo minor changes as the different databases that make up LILACS and IBECS may add or modify the indexes.

    ZIP STRUCTURE:

    The training data sets contain 369,368 records from 26,609 different journals. Two different data sets are distributed as described below:

    • Original Train set with 369,368 records that also include the qualifiers, as retrieved from VHL.
    • Pre-processed Train set with the 318,658 records with at least one DeCS code and with no qualifiers.

    STATISTICS:

    Abstracts’ length (measured in characters): min 12, avg 1140.41, median 1094, max 9428

    Number of DeCS codes per file: min 1, avg 8.12, median 7, max 53

    CORPUS FORMAT:

    The training data sets are distributed as a JSON file with the following format:

    { "articles": [ { "id": "Id of the article", "title": "Title of the article", "abstractText": "Content of the abstract", "journal": "Name of the journal", "year": 2018, "db": "Name of the database", "decsCodes": [ "code1", "code2", "code3" ] } ] }

    Note that the decsCodes field lists the DeCS IDs assigned to a record in the source data. Since the original XML data contain descriptors (not codes), we provide a DeCS conversion table (https://temu.bsc.es/mesinesp/wp-content/uploads/2019/12/DeCS.2019.v5.tsv.zip) with:

    • DeCs codes
    • Preferred descriptor (the label used in the European DeCs 2019 set)
    • List of synonyms (the descriptors and synonyms from both European and Latin Spanish DeCs 2019 data sets, separated by pipes)

    For more details on the Latin and European Spanish DeCs codes see: http://decs.bvs.br and http://decses.bvsalud.org/ respectively.
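
    A short sketch showing how the JSON above and the DeCS conversion table might be combined; the local file names and the TSV column order (code, preferred descriptor, synonyms) are assumptions based on the description, so adjust them to the downloaded files.

    ```
    import csv
    import json

    # Assumed local file names after download and extraction.
    with open("mesinesp_train.json", encoding="utf-8") as f:
        articles = json.load(f)["articles"]

    # Assumed column order: DeCS code, preferred descriptor, pipe-separated synonyms.
    decs_labels = {}
    with open("DeCS.2019.v5.tsv", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            decs_labels[row[0]] = row[1]

    sample = articles[0]
    print(sample["title"], sample["journal"], sample["year"])
    print([decs_labels.get(code, code) for code in sample["decsCodes"]])
    ```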

    Please cite: Krallinger M, Krithara A, Nentidis A, Paliouras G, Villegas M. BioASQ at CLEF2020: Large-Scale Biomedical Semantic Indexing and Question Answering. In European Conference on Information Retrieval 2020 Apr 14 (pp. 550-556). Springer, Cham.

    Copyright (c) 2020 Secretaría de Estado de Digitalización e Inteligencia Artificial

  16. Navigating News Narratives: A Media Bias Analysis Dataset

    • data-staging.niaid.nih.gov
    Updated Nov 8, 2023
    Cite
    Raza, Shaina (2023). Navigating News Narratives: A Media Bias Analysis Dataset [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_10037860
    Dataset updated
    Nov 8, 2023
    Dataset provided by
    Vector Institute
    Authors
    Raza, Shaina
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The prevalence of bias in the news media has become a critical issue, affecting public perception on a range of important topics such as political views, health, insurance, resource distributions, religion, race, age, gender, occupation, and climate change. The media has a moral responsibility to ensure accurate information dissemination and to increase awareness about important issues and the potential risks associated with them. This highlights the need for a solution that can help mitigate the spread of false or misleading information and restore public trust in the media.

    Data description: This is a dataset for news media bias covering different dimensions of bias: political, hate speech, toxicity, sexism, ageism, gender identity, gender discrimination, race/ethnicity, climate change, occupation, and spirituality, which makes it a unique contribution. The dataset used for this project does not contain any personally identifiable information (PII).

    Data Format: The format of the data is:

    • ID: numeric unique identifier
    • Text: main content
    • Dimension: categorical descriptor of the text
    • Biased_Words: list of words considered biased
    • Aspect: specific topic within the text
    • Label: bias True/False value
    • Aggregate Label: calculated through multiple weighted formulae

    Annotation Scheme: The annotation scheme is based on active learning, which is Manual Labeling --> Semi-Supervised Learning --> Human Verification (an iterative process).
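
    The field list above maps directly onto a tabular load; a minimal sketch, assuming the data ships as a single CSV (the file name is hypothetical, and label values may be stored as booleans or strings):

    ```
    import pandas as pd

    # Hypothetical file name; column names follow the field list above.
    df = pd.read_csv("media_bias_dataset.csv")

    # Distribution of bias labels per dimension (political, hate speech, climate change, ...).
    print(df.groupby(["Dimension", "Label"]).size().unstack(fill_value=0))

    # Inspect the flagged words for a few biased rows (adjust the comparison if the
    # Label column is stored as strings such as "True"/"False").
    biased = df[df["Label"] == True]
    print(biased[["Text", "Biased_Words", "Aspect"]].head())
    ```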

    Bias Label: indicates the presence/absence of bias (e.g., no bias, mild, strong).
    Words/Phrases Level Biases: identify specific biased words/phrases.
    Subjective Bias (Aspect): captures biases related to content aspects.

    List of datasets used: We curated different news categories (climate crisis news summaries, occupational, spiritual/faith, general) using RSS to capture different dimensions of news media bias. The annotation is performed using active learning to label each sentence (neutral / slightly biased / highly biased) and to pick biased words from the news. We also utilize publicly available data from the following sources (our attribution to others):

    • MBIC (media bias): Spinde, Timo, Lada Rudnitckaia, Kanishka Sinha, Felix Hamborg, Bela Gipp, and Karsten Donnay. "MBIC--A Media Bias Annotation Dataset Including Annotator Characteristics." arXiv preprint arXiv:2105.11910 (2021). https://zenodo.org/records/4474336
    • Hyperpartisan news: Kiesel, Johannes, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. "SemEval-2019 Task 4: Hyperpartisan News Detection." In Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 829-839. 2019. https://huggingface.co/datasets/hyperpartisan_news_detection
    • Toxic comment classification: Adams, C.J., Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, Nithum, and Will Cukierski. 2017. "Toxic Comment Classification Challenge." Kaggle. https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge
    • Jigsaw Unintended Bias: Adams, C.J., Daniel Borkan, Inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum. 2019. "Jigsaw Unintended Bias in Toxicity Classification." Kaggle. https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
    • Age Bias: Díaz, Mark, Isaac Johnson, Amanda Lazar, Anne Marie Piper, and Darren Gergle. "Addressing age-related bias in sentiment analysis." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2018. Age Bias Training and Testing Data - Age Bias and Sentiment Analysis Dataverse (harvard.edu)
    • Multi-dimensional news (Ukraine): Färber, Michael, Victoria Burkard, Adam Jatowt, and Sora Lim. "A multidimensional dataset based on crowdsourcing for analyzing and detecting news bias." In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 3007-3014. 2020. https://zenodo.org/records/3885351#.ZF0KoxHMLtV
    • Social biases: Sap, Maarten, Saadia Gabriel, Lianhui Qin, Dan Jurafsky, Noah A. Smith, and Yejin Choi. "Social bias frames: Reasoning about social and power implications of language." arXiv preprint arXiv:1911.03891 (2019). https://maartensap.com/social-bias-frames/

    Goal of this dataset: We want to offer open and free access to the dataset, ensuring a wide reach to researchers and AI practitioners across the world. The dataset should be user-friendly, and uploading and accessing the data should be straightforward, to facilitate usage. If you use this dataset, please cite us: Navigating News Narratives: A Media Bias Analysis Dataset © 2023 by Shaina Raza, Vector Institute, licensed under CC BY-NC 4.0.

  17. Ingredients Dataset – 18K+ Product Records with Ingredients Data from...

    • crawlfeeds.com
    csv, zip
    Updated Aug 20, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Crawl Feeds (2025). Ingredients Dataset – 18K+ Product Records with Ingredients Data from Beauty, Pets, Groceries & Health (CSV for AI & NLP) [Dataset]. https://crawlfeeds.com/datasets/ingredients-dataset-18k-product-records-with-ingredients-data-from-beauty-pets-groceries-health-csv-for-ai-nlp
    Available download formats: csv, zip
    Dataset updated
    Aug 20, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policy

    Description

    The Ingredients Dataset (18K+ records) provides a high-quality, structured collection of product information with detailed ingredients data. Covering a wide variety of categories including beauty, pet care, groceries, and health products, this dataset is designed to power AI, NLP, and machine learning applications that require domain-specific knowledge of consumer products.

    Why This Dataset Matters

    In today’s data-driven economy, access to structured and clean datasets is critical for building intelligent systems. For industries like healthcare, beauty, food-tech, and retail, the ability to analyze product ingredients enables deeper insights, including:

    • Identifying allergens or harmful substances

    • Comparing ingredient similarities across brands

    • Training LLMs and NLP models for better understanding of consumer products

    • Supporting regulatory compliance and labeling standards

    • Enhancing recommendation engines for personalized shopping

    This dataset bridges the gap between raw, unstructured product data and actionable information by providing well-organized CSV files with fields that are easy to integrate into your workflows.

    Dataset Coverage

    The 18,000+ product records span several consumer categories:

    • 🛍 Beauty & Personal Care – cosmetics, skincare, haircare products with full ingredient transparency

    • 🐾 Pet Supplies – pet food and wellness products with detailed formulations

    • 🥫 Groceries & Packaged Foods – snacks, beverages, pantry staples with structured ingredients lists

    • 💊 Health & Wellness – supplements, vitamins, and healthcare products with nutritional components

    By including multiple categories, this dataset allows cross-domain analysis and model training that reflects real-world product diversity.

    Key Features

    • 📂 18,000+ records with structured ingredient fields

    • 🧾 Covers beauty, pet care, groceries, and health products

    • 📊 Delivered in CSV format, ready to use for analytics or machine learning

    • 🏷 Includes categories and breadcrumbs for taxonomy and classification

    • 🔎 Useful for AI, NLP, LLM fine-tuning, allergen detection, and product recommendation systems

    Use Cases

    1. AI & NLP Training – fine-tune LLMs on structured ingredients data for food, beauty, and healthcare applications.

    2. Retail Analytics – analyze consumer product composition across categories to inform pricing, positioning, and product launches.

    3. Food & Health Research – detect allergens, evaluate ingredient safety, and study nutritional compositions.

    4. Recommendation Engines – build smarter product recommendation systems for e-commerce platforms.

    5. Regulatory & Compliance Tools – ensure products meet industry and government standards through ingredient validation.

    Why Choose This Dataset

    Unlike generic product feeds, this dataset emphasizes ingredient transparency across multiple categories. With 18K+ records, it strikes a balance between being comprehensive and affordable, making it suitable for startups, researchers, and enterprise teams looking to experiment with product intelligence.

    Note: Each record includes a url (main page) and a buy_url (purchase page). Records are based on the buy_url to ensure unique, product-level data.
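
    As a practical starting point, here is a minimal loading sketch, assuming the delivered CSV is named ingredients_dataset.csv and exposes an ingredients text column; only the url and buy_url fields are confirmed above, so every other name is illustrative:

      # Minimal sketch: load the ingredients CSV, de-duplicate on buy_url,
      # and flag records whose ingredient text mentions common allergens.
      # Column names other than url/buy_url are assumptions, not confirmed by the vendor.
      import pandas as pd

      df = pd.read_csv("ingredients_dataset.csv")   # hypothetical file name

      # Records are keyed on buy_url, so any duplicates can be dropped on that field.
      df = df.drop_duplicates(subset="buy_url")

      # Simple allergen keyword scan over an assumed "ingredients" text column.
      ALLERGENS = ["peanut", "soy", "gluten", "milk", "egg"]

      def flag_allergens(text):
          text = str(text).lower()
          return [a for a in ALLERGENS if a in text]

      df["allergen_flags"] = df["ingredients"].apply(flag_allergens)
      print(df[["buy_url", "allergen_flags"]].head())

    A keyword scan like this is only a rough filter; production allergen detection would need curated synonym lists and ingredient normalization.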

  18. instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j

    • huggingface.co
    Updated Mar 16, 2023
    Cite
    NLP Cloud (2023). instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j [Dataset]. https://huggingface.co/datasets/nlpcloud/instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 16, 2023
    Dataset authored and provided by
    NLP Cloud
    License

    https://choosealicense.com/licenses/gpl-3.0/https://choosealicense.com/licenses/gpl-3.0/

    Description

    This dataset is an adaptation of the Stanford Alpaca dataset, intended to turn a text-generation model like GPT-J into an "instruct" model. The original data was slightly reworked to match the GPT-J fine-tuning format used with Mesh Transformer JAX on TPUs.
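
    As a rough illustration, the dataset can be pulled straight from the Hugging Face Hub with the datasets library. The column names and the prompt template below are assumptions (inspect ds.column_names for the real schema), and the layout shown is just one plausible instruct-style format, not necessarily the one used for the Mesh Transformer JAX fine-tuning:

      # Sketch: load the instruct dataset from the Hugging Face Hub and build one prompt.
      # The "instruction"/"response" field names and the "train" split are assumptions;
      # print ds.column_names to confirm the actual schema before relying on it.
      from datasets import load_dataset

      ds = load_dataset(
          "nlpcloud/instructions-dataset-adapted-from-stanford-alpaca-for-gpt-j",
          split="train",
      )
      print(ds.column_names)

      example = ds[0]
      prompt = (
          f"Instruction: {example.get('instruction', '')}\n"
          f"Response: {example.get('response', '')}"
      )
      print(prompt)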

  19. Unlocking User Sentiment: The App Store Reviews Dataset

    • crawlfeeds.com
    json, zip
    Updated Jun 20, 2025
    Cite
    Crawl Feeds (2025). Unlocking User Sentiment: The App Store Reviews Dataset [Dataset]. https://crawlfeeds.com/datasets/app-store-reviews-dataset
    Explore at:
    Available download formats: json, zip
    Dataset updated
    Jun 20, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    This dataset offers a focused and invaluable window into user perceptions and experiences with applications listed on the Apple App Store. It is a vital resource for app developers, product managers, market analysts, and anyone seeking to understand the direct voice of the customer in the dynamic mobile app ecosystem.

    Dataset Specifications:

    • Investment: $45.0
    • Status: Published and immediately available.
    • Category: Ratings and Reviews Data
    • Format: Compressed ZIP archive containing JSON files, ensuring easy integration into your analytical tools and platforms.
    • Volume: Comprises 10,000 unique app reviews, providing a robust sample for qualitative and quantitative analysis of user feedback.
    • Timeliness: Last crawled date not specified, so the recency of the reviews is currently unknown.

    Richness of Detail (11 Comprehensive Fields):

    Each record in this dataset provides a detailed breakdown of a single App Store review, enabling multi-dimensional analysis (a short loading sketch follows this field list):

    1. Review Content:

      • review: The full text of the user's written feedback, crucial for Natural Language Processing (NLP) to extract themes, sentiment, and common keywords.
      • title: The title given to the review by the user, often summarizing their main point.
      • isEdited: A boolean flag indicating whether the review has been edited by the user since its initial submission. This can be important for tracking evolving sentiment or understanding user behavior.
    2. Reviewer & Rating Information:

      • username: The public username of the reviewer, allowing for analysis of engagement patterns from specific users (though not personally identifiable).
      • rating: The star rating (typically 1-5) given by the user, providing a quantifiable measure of satisfaction.
    3. App & Origin Context:

      • app_name: The name of the application being reviewed.
      • app_id: A unique identifier for the application within the App Store, enabling direct linking to app details or other datasets.
      • country: The country of the App Store storefront where the review was left, allowing for geographic segmentation of feedback.
    4. Metadata & Timestamps:

      • _id: A unique identifier for the specific review record in the dataset.
      • crawled_at: The timestamp indicating when this particular review record was collected by the data provider (Crawl Feeds).
      • date: The original date the review was posted by the user on the App Store.
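
    As referenced above, here is a minimal loading sketch, assuming the ZIP has been extracted to a single JSON file containing an array of review objects; the field names match the list above, but the file name and layout are assumptions:

      # Sketch: read the extracted review records and summarise ratings by country.
      # Field names (rating, country, isEdited, ...) follow the documented schema above;
      # the file name and JSON layout (one array of objects) are assumptions.
      import json
      from collections import Counter, defaultdict

      with open("app_store_reviews.json", encoding="utf-8") as f:
          reviews = json.load(f)   # assumed: a JSON array of review objects

      ratings_by_country = defaultdict(list)
      for r in reviews:
          ratings_by_country[r.get("country", "unknown")].append(r.get("rating", 0))

      for country, ratings in sorted(ratings_by_country.items()):
          print(f"{country}: n={len(ratings)}, avg rating={sum(ratings) / len(ratings):.2f}")

      # Distribution of edited vs. unedited reviews
      print(Counter(r.get("isEdited", False) for r in reviews))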

    Expanded Use Cases & Analytical Applications:

    This dataset is a goldmine for understanding what users truly think and feel about mobile applications. Here's how it can be leveraged:

    • Product Development & Improvement:

      • Bug Detection & Prioritization: Analyze negative review text to identify recurring technical issues, crashes, or bugs, allowing developers to prioritize fixes based on user impact.
      • Feature Requests & Roadmap Prioritization: Extract feature suggestions from positive and neutral review text to inform future product roadmap decisions and develop features users actively desire.
      • User Experience (UX) Enhancement: Understand pain points related to app design, navigation, and overall usability by analyzing common complaints in the review field.
      • Version Impact Analysis: If integrated with app version data, track changes in rating and sentiment after new app updates to assess the effectiveness of bug fixes or new features.
    • Market Research & Competitive Intelligence:

      • Competitor Benchmarking: Analyze reviews of competitor apps (if included or combined with similar datasets) to identify their strengths, weaknesses, and user expectations within a specific app category.
      • Market Gap Identification: Discover unmet user needs or features that users desire but are not adequately provided by existing apps.
      • Niche Opportunities: Identify specific use cases or user segments that are underserved based on recurring feedback.
    • Marketing & App Store Optimization (ASO):

      • Sentiment Analysis: Perform sentiment analysis on the review and title fields to gauge overall user satisfaction, pinpoint specific positive and negative aspects, and track sentiment shifts over time.
      • Keyword Optimization: Identify frequently used keywords and phrases in reviews to optimize app store listings, improving discoverability and search ranking.
      • Messaging Refinement: Understand how users describe and use the app in their own words, which can inform marketing copy and advertising campaigns.
      • Reputation Management: Monitor rating trends and identify critical reviews quickly to facilitate timely responses and proactive customer engagement.
    • Academic & Data Science Research:

      • Natural Language Processing (NLP): The review and title fields are excellent for training and testing NLP models for sentiment analysis, topic modeling, named entity recognition, and text summarization (a small sentiment-scoring sketch follows this list).
      • User Behavior Analysis: Study patterns in rating distribution, isEdited status, and date to understand user engagement and feedback cycles.
      • Cross-Country Comparisons: Analyze country-specific reviews to understand regional differences in app perception, feature preferences, or cultural nuances in feedback.
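
    As one concrete take on the NLP use case noted above, review text can be scored with an off-the-shelf sentiment model; the default model pulled by the transformers pipeline is used here purely as an example, and the sample strings are illustrative rather than taken from the dataset:

      # Sketch: off-the-shelf sentiment scoring with the Hugging Face transformers pipeline.
      # In practice the "review" field from the dataset would be fed in instead of these samples.
      from transformers import pipeline

      sentiment = pipeline("sentiment-analysis")   # downloads a small default English model

      sample_reviews = [
          "Love the new update, the app feels much faster now.",
          "Crashes every time I open the camera feature, please fix.",
      ]
      for text, result in zip(sample_reviews, sentiment(sample_reviews, truncation=True)):
          print(result["label"], round(result["score"], 3), "-", text)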

    This App Store Reviews dataset provides a direct, unfiltered conduit to understanding user needs and ultimately driving better app performance and greater user satisfaction. Its structured format and granular detail make it an indispensable asset for data-driven decision-making in the mobile app industry.

  20. Fox News dataset is for analyzing media trends and narratives

    • crawlfeeds.com
    csv, zip
    Updated May 19, 2025
    Cite
    Crawl Feeds (2025). Fox News dataset is for analyzing media trends and narratives [Dataset]. https://crawlfeeds.com/datasets/fox-news-dataset
    Explore at:
    Available download formats: zip, csv
    Dataset updated
    May 19, 2025
    Dataset authored and provided by
    Crawl Feeds
    License

    https://crawlfeeds.com/privacy_policyhttps://crawlfeeds.com/privacy_policy

    Description

    The Fox News Dataset is a comprehensive collection of over 1 million news articles, offering an unparalleled resource for analyzing media narratives, public discourse, and political trends. Covering articles up to the year 2023, this dataset is a treasure trove for researchers, analysts, and businesses interested in gaining deeper insights into the topics and trends covered by Fox News.

    Key Features of the Fox News Dataset

    • Extensive Coverage: Contains more than 1 million articles spanning various topics and events up to 2023.
    • Research-Ready: Perfect for text classification, natural language processing (NLP), and other research purposes.
    • Format: Provided in CSV format for seamless integration into analytical and research tools.

    Why Use This Dataset?

    This large dataset is ideal for:

    • Text Classification: Develop machine learning models to classify and categorize news content.
    • Natural Language Processing (NLP): Conduct sentiment analysis, keyword extraction, or topic modeling (see the sketch after this list).
    • Media and Political Research: Analyze media narratives, public opinion, and political trends reflected in Fox News articles.
    • Trend Analysis: Identify shifts in public discourse and media focus over time.
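
    Because the corpus exceeds a million rows, it is usually easier to stream the CSV in chunks; the sketch below assumes a hypothetical content column and file name, since the actual header of the delivered file is not documented here:

      # Sketch: stream the large CSV in chunks and count articles mentioning a keyword.
      # The file name and the "content" column are assumptions; adjust to the real header.
      import pandas as pd

      keyword = "election"
      matches = 0

      for chunk in pd.read_csv("fox_news_articles.csv", chunksize=100_000):
          text = chunk["content"].fillna("").str.lower()
          matches += int(text.str.contains(keyword, regex=False).sum())

      print(f"Articles mentioning '{keyword}': {matches}")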

    Explore More News Datasets

    Discover additional resources for your research needs by visiting our news dataset collection. These datasets are tailored to support diverse analytical applications, including sentiment analysis and trend modeling.

    The Fox News Dataset is a must-have for anyone interested in exploring large-scale media data and leveraging it for advanced analysis. Ready to dive into this wealth of information? Download the dataset now in CSV format and start uncovering the stories behind the headlines.
