100+ datasets found
  1. Tutorial Package for: Text as Data in Economic Analysis

    • dataverse.nl
    Updated Jun 26, 2025
    Cite
    Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun (2025). Tutorial Package for: Text as Data in Economic Analysis [Dataset]. http://doi.org/10.34894/KNDZ9T
    Explore at:
    text/markdown(148), bin(493802528), text/markdown(405), csv(6678744), application/x-ipynb+json(56525), text/markdown(136), csv(8712017), txt(1706), text/x-python(3800), text/markdown(131), txt(194), text/markdown(179), csv(89054804), bin(43909246), csv(1600), xlsx(10436), bin(952), text/markdown(1743)
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    DataverseNL
    Authors
    Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2002 - May 31, 2023
    Dataset funded by
    Institute for New Economic Thinking
    Deutsche Forschungsgemeinschaft (403041268-TRR 266)
    Description

    This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.
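
    As a flavor of the vocabulary-building methods the package explores, here is a minimal dictionary-based sketch in Python; the risk vocabulary and the transcript line are invented placeholders, not taken from the package itself:

    import re
    from collections import Counter

    def vocabulary_share(text, vocabulary):
        """Fraction of tokens in `text` that belong to `vocabulary`."""
        tokens = re.findall(r"[a-z']+", text.lower())
        if not tokens:
            return 0.0
        counts = Counter(tokens)
        return sum(counts[term] for term in vocabulary) / len(tokens)

    # Hypothetical "geopolitical risk" vocabulary and transcript snippet.
    risk_terms = {"war", "sanctions", "conflict", "embargo"}
    transcript = "Management flagged sanctions exposure and supply conflict risks."
    print(vocabulary_share(transcript, risk_terms))  # 0.25

    Scores like this, computed per transcript and per quarter, are the kind of firm-level exposure measure the article builds from conference-call text.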

  2. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.nasa.gov
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining
    Description: This is the dataset used for the SIAM 2007 Text Mining competition, which focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports documenting one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the publicly available Aviation Safety Reporting System (ASRS) dataset.
    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.
    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file, and each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself.
    Anomalies/Faults: This is a document category classification problem.
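
    A minimal parsing sketch in Python for the tilde-separated format described above; the file name "train.txt" is an assumption for illustration:

    def read_asrs_documents(path):
        """Map document number -> report text for a one-document-per-line file."""
        docs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc_id, sep, text = line.rstrip("\n").partition("~")
                if sep:  # skip malformed lines that have no tilde
                    docs[int(doc_id)] = text
        return docs

    docs = read_asrs_documents("train.txt")  # hypothetical file name
    print(len(docs), next(iter(docs.items())))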

  3. American English Language Datasets | 150+ Years of Research | Textual Data |...

    • datarade.ai
    Updated Jul 29, 2025
    Cite
    Oxford Languages (2025). American English Language Datasets | 150+ Years of Research | Textual Data | Audio Data | Natural Language Processing (NLP) Data | US English Coverage [Dataset]. https://datarade.ai/data-products/american-english-language-datasets-150-years-of-research-oxford-languages
    Explore at:
    .json, .xml, .csv, .xls, .mp3, .wav
    Dataset updated
    Jul 29, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    United States
    Description

    Derived from over 150 years of lexical research, these comprehensive textual and audio datasets provide linguistically annotated American English data, ideal for NLP applications, LLM training and/or fine-tuning, and educational and game apps.

    One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The following American English datasets are available for license:

    1. American English Monolingual Dictionary Data
    2. American English Synonyms and Antonyms Data
    3. American English Pronunciations with Audio

    Key Features (approximate numbers):

    1. American English Monolingual Dictionary Data

    Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.

    • Headwords: 140,000
    • Senses: 222,000
    • Sentence examples: 140,000
    • Format: XML and JSON
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually

    2. American English Synonyms and Antonyms Data

    The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

    • Synonyms: 600,000
    • Antonyms: 22,000
    • Format: XML and JSON
    • Delivery: Email (link-based file sharing) and REST API
    • Update frequency: annually

    3. American English Pronunciations with Audio (word-level)

    This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.

    • Transcriptions (IPA): 250,000
    • Audio files: 180,000
    • Format: XLSX (for transcriptions), MP3 and WAV (audio files)
    • Update frequency: annually
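
    Since the dictionary data ships as XML and JSON, here is a minimal reading sketch in Python; the record layout below is purely illustrative and is not Oxford Languages' actual schema:

    import json

    # Hypothetical entry shape: headword, POS, and a list of senses.
    entry_json = """
    {
      "headword": "color",
      "pos": "noun",
      "senses": [
        {"definition": "the property of reflecting light of a particular wavelength",
         "examples": ["the colors of the rainbow"]}
      ]
    }
    """

    entry = json.loads(entry_json)
    for sense in entry["senses"]:
        print(entry["headword"], f'({entry["pos"]}):', sense["definition"])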

    Use Cases:

    We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, machine translation, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).

    If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

    Pricing:

    Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

    Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. Please note that some datasets may have rights restrictions. Contact us for more information.

    About the sample:

    To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.

    Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.

  4. Toy Data for Text Processing

    • kaggle.com
    zip
    Updated Aug 19, 2022
    Cite
    Olga Belitskaya (2022). Toy Data for Text Processing [Dataset]. https://www.kaggle.com/datasets/olgabelitskaya/toy-data-for-text-processing
    Explore at:
    zip(830548 bytes)
    Dataset updated
    Aug 19, 2022
    Authors
    Olga Belitskaya
    Description

    Context

    The main idea is collecting data for text experiments.

    Content

    Files with .txt, .pdf, and similar formats. The source of information will be indicated for each file in its description.

    Acknowledgments

    Many thanks for the user comments.

    Inspiration

    Exercises, exercises, and... yes... exercises again. Is it something new in text processing?

  5. example text data word & CSV format

    • kaggle.com
    zip
    Updated Apr 14, 2025
    Cite
    nohafathi (2025). example text data word & CSV format [Dataset]. https://www.kaggle.com/datasets/nohaaf/example-text-data/discussion?sort=undefined
    Explore at:
    zip(10243 bytes)
    Dataset updated
    Apr 14, 2025
    Authors
    nohafathi
    Description

    Dataset

    This dataset was created by nohafathi


  6. text-classification-dataset-example

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Chien-Wei Chang (2024). text-classification-dataset-example [Dataset]. https://huggingface.co/datasets/cwchang/text-classification-dataset-example
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2024
    Authors
    Chien-Wei Chang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    cwchang/text-classification-dataset-example dataset hosted on Hugging Face and contributed by the HF Datasets community

  7. Text Script Analytics Code for Automatic Video Generation

    • data.mendeley.com
    Updated Aug 22, 2025
    Cite
    gaganpreet gagan (2025). Text Script Analytics Code for Automatic Video Generation [Dataset]. http://doi.org/10.17632/kgngzzs5c8.5
    Explore at:
    Dataset updated
    Aug 22, 2025
    Authors
    gaganpreet gagan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This Python notebook (research work) provides a comprehensive solution for text analysis and hint extraction, useful for composing computational scenes from input text.

    It includes a collection of functions that can be used to preprocess textual data, extract information such as characters, relationships, emotions, dates, times, addresses, locations, purposes, and hints from the text.

    Key Features:

    • Preprocessing Collected Data: The notebook offers preprocessing capabilities to remove unwanted strings, normalize text data, and prepare it for further analysis.
    • Character Extraction: Functions to extract characters from the text, count the number of characters, and determine the number of male and female characters.
    • Relationship Extraction: Functions to calculate possible relationships among characters and extract the relationship names.
    • Dominant Emotion Extraction: A function to extract the dominant emotion from the text.
    • Date and Time Extraction: Functions to extract dates and times from the text, including handling phrases like "before," "after," "in the morning," and "in the evening."
    • Address and Location Extraction: Functions to extract addresses and locations from the text, including identifying specific places like offices, homes, rooms, or bathrooms.
    • Purpose Extraction: Functions to extract the purpose of the text.
    • Hint Collection: The ability to collect hints from the text based on specific keywords or phrases.
    • Sample Implementations: Sample Python code is provided for each function, demonstrating how to use it effectively.

    This notebook serves as a valuable resource for text analysis tasks, assisting in extracting essential information and hints from textual data. It can be used in various domains such as natural language processing, sentiment analysis, and information retrieval. The code is well-documented and can be easily integrated into existing projects or workflows.
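
    A minimal, self-contained sketch in the spirit of the date-and-time extraction step; the notebook's own function names and logic are not reproduced here, and this regex-based approach is an assumption:

    import re

    # Matches clock times ("9:30 am") and the time-of-day phrases named above.
    TIME_PATTERN = re.compile(
        r"\b(\d{1,2}(?::\d{2})?\s?(?:am|pm)|in the morning|in the evening)\b",
        re.IGNORECASE,
    )

    def extract_time_hints(text):
        """Return time-of-day phrases found in the script text."""
        return TIME_PATTERN.findall(text)

    print(extract_time_hints("They met at 9:30 am and again in the evening."))
    # ['9:30 am', 'in the evening']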

  8. HTMLmetadata HTML formatted text files describing samples and spectra,...

    • catalog.data.gov
    • datasets.ai
    Updated Oct 22, 2025
    Cite
    U.S. Geological Survey (2025). HTMLmetadata HTML formatted text files describing samples and spectra, including photos [Dataset]. https://catalog.data.gov/dataset/htmlmetadata-html-formatted-text-files-describing-samples-and-spectra-including-photos
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    U.S. Geological Survey
    Description

    HTMLmetadata: text files in HTML format containing metadata about samples and spectra. Also included in the zip file are folders containing information linked to from the HTML files, including:

    • README: contains an HTML version of the USGS Data Series publication, linked to this data release, that describes this spectral library (Kokaly and others, 2017), as well as an HTML version of the release notes.
    • photo_images: contains full-resolution images of photos of samples and field sites.
    • photo_thumbs: contains low-resolution thumbnail versions of photos of samples and field sites.

    GENERAL LIBRARY DESCRIPTION

    This data release provides the U.S. Geological Survey (USGS) Spectral Library Version 7 and all related documents. The library contains spectra measured with laboratory, field, and airborne spectrometers. The instruments used cover wavelengths from the ultraviolet to the far infrared (0.2 to 200 microns). Laboratory samples of specific minerals, plants, chemical compounds, and man-made materials were measured. In many cases, samples were purified, so that unique spectral features of a material can be related to its chemical structure. These spectro-chemical links are important for interpreting remotely sensed data collected in the field or from an aircraft or spacecraft. The library also contains physically constructed as well as mathematically computed mixtures. Measurements of rocks, soils, and natural mixtures of minerals have also been made with laboratory and field spectrometers. Spectra of plant components and vegetation plots, comprising many plant types and species with varying backgrounds, are also in the library. Measurements by airborne spectrometers are included for forested vegetation plots, in which the trees are too tall for measurement by a field spectrometer. The related USGS Data Series publication, "USGS Spectral Library Version 7", describes the instruments used, metadata descriptions of spectra and samples, and possible artifacts in the spectral measurements (Kokaly and others, 2017).

    Four different spectrometer types were used to measure spectra in the library: (1) Beckman™ 5270, covering the spectral range 0.2 to 3 µm; (2) standard, high-resolution (hi-res), and high-resolution Next Generation (hi-resNG) models of ASD field-portable spectrometers, covering the range 0.35 to 2.5 µm; (3) Nicolet™ Fourier Transform Infra-Red (FTIR) interferometer spectrometers, covering the range from about 1.12 to 216 µm; and (4) the NASA Airborne Visible/Infra-Red Imaging Spectrometer (AVIRIS), covering the range 0.37 to 2.5 µm. Two fundamental spectrometer characteristics significant for interpreting and utilizing spectral measurements are sampling position (the wavelength position of each spectrometer channel) and bandpass (a parameter describing the wavelength interval over which each channel in a spectrometer is sensitive). Bandpass is typically reported as the Full Width at Half Maximum (FWHM) response at each channel (in wavelength units, for example nm or micron). The linked publication (Kokaly and others, 2017) includes a comparison plot of the various spectrometers used to measure the data in this release. Data for the sampling positions and the bandpass values (for each channel in the spectrometers) are included in this data release, both as separate data records in the SPECPR files and as separate wavelength and bandpass files in American Standard Code for Information Interchange (ASCII) text format.

    Spectra are provided in ASCII text format (files with a .txt extension). In the ASCII files, deleted channels (bad bands) are indicated by a value of -1.23e34. Metadata descriptions of samples, field areas, spectral measurements, and results from supporting material analyses, such as XRD, are provided in HyperText Markup Language (HTML) formatted ASCII text files (files with an .html extension). In addition, Graphics Interchange Format (GIF) images of plots of spectra are provided: for each spectrum, a plot with wavelength in microns on the x-axis, and, for spectra measured on the Nicolet spectrometer, an additional GIF image with wavenumber on the x-axis. Data are also provided in SPECtrum Processing Routines (SPECPR) format (Clark, 1993), which packages spectra and associated metadata descriptions into a single file (see Kokaly and others, 2017, for additional details on the SPECPR format and freely available software that can read SPECPR files). The data measured on the source spectrometers are denoted by the "splib07a" tag in filenames. In addition to the original measurements, the spectra have been convolved and resampled to different spectrometer and multispectral sensor characteristics. The following list gives the identifying tag for the measured and convolved libraries and briefly describes each:

    • splib07a – the SPECPR file containing the spectra measured on the Beckman, ASD, Nicolet, and AVIRIS spectrometers, provided with their original sampling positions (wavelengths) and bandpass values. The prefix "splib07a_" begins the ASCII and GIF files pertaining to the measured spectra.
    • splib07b – the SPECPR file containing a modified version of the original measurements. The results from using spectral convolution to convert measurements to other spectrometer characteristics can be improved by oversampling (increasing sample density); splib07b is therefore an oversampled version of the library, computed using simple cubic-spline interpolation to produce spectra with a fine sampling interval (and thus a higher number of channels) for the Beckman and AVIRIS measurements. The spectra in this version of the library are the data used to create the convolved and resampled versions of the library. The prefix "splib07b_" begins the corresponding ASCII and GIF files.
    • s07_ASD – the SPECPR file containing the spectral library measurements convolved to standard-resolution ASD full-range spectrometer characteristics, using the standard reported wavelengths of the ASD spectrometers used by the USGS (2151 channels, starting at 350 nm and increasing in 1 nm increments). The bandpass values of each channel were determined by comparing measurements of reference materials made on ASD spectrometers with measurements of the same materials on higher-resolution spectrometers (the procedure is described in Kokaly, 2011, and discussed in Kokaly and Skidmore, 2015, and Kokaly and others, 2017). The prefix "s07ASD_" begins the corresponding ASCII and GIF files.
    • s07_AV95 through s07_AV01, s07_AV05, and s07_AV06 – SPECPR files containing the spectral library measurements convolved to AVIRIS-Classic spectral characteristics determined in the years 1995 through 2001, 2005, and 2006, respectively (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefixes "s07_AV95_" through "s07_AV06_" begin the corresponding ASCII and GIF files.
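
    Since deleted channels are flagged with -1.23e34 in the ASCII spectra, here is a minimal sketch in Python of reading one spectrum and masking bad bands; the single-column layout with one header row and the file name are assumptions, as the exact file layout varies:

    import numpy as np

    def load_spectrum(path, header_rows=1):
        values = np.loadtxt(path, skiprows=header_rows)
        # Deleted channels ("bad bands") are flagged as -1.23e34 in the release.
        return np.ma.masked_where(values < -1e33, values)

    spectrum = load_spectrum("splib07a_example.txt")  # hypothetical file name
    print(spectrum.count(), "valid channels of", spectrum.size)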

  9. AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and...

    • datarade.ai
    Updated Dec 18, 2024
    Cite
    MealMe (2024). AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites [Dataset]. https://datarade.ai/data-products/ai-training-data-annotated-checkout-flows-for-retail-resta-mealme
    Explore at:
    Dataset updated
    Dec 18, 2024
    Dataset authored and provided by
    MealMe
    Area covered
    United States of America
    Description

    AI Training Data | Annotated Checkout Flows for Retail, Restaurant, and Marketplace Websites

    Overview

    Unlock the next generation of agentic commerce and automated shopping experiences with this comprehensive dataset of meticulously annotated checkout flows, sourced directly from leading retail, restaurant, and marketplace websites. Designed for developers, researchers, and AI labs building large language models (LLMs) and agentic systems capable of online purchasing, this dataset captures the real-world complexity of digital transactions—from cart initiation to final payment.

    Key Features

    Breadth of Coverage: Over 10,000 unique checkout journeys across hundreds of top e-commerce, food delivery, and service platforms, including but not limited to Walmart, Target, Kroger, Whole Foods, Uber Eats, Instacart, Shopify-powered sites, and more.

    Actionable Annotation: Every flow is broken down into granular, step-by-step actions, complete with timestamped events, UI context, form field details, validation logic, and response feedback. Each step includes:

    Page state (URL, DOM snapshot, and metadata)

    User actions (clicks, taps, text input, dropdown selection, checkbox/radio interactions)

    System responses (AJAX calls, error/success messages, cart/price updates)

    Authentication and account linking steps where applicable

    Payment entry (card, wallet, alternative methods)

    Order review and confirmation

    Multi-Vertical, Real-World Data: Flows sourced from a wide variety of verticals and real consumer environments, not just demo stores or test accounts. Includes complex cases such as multi-item carts, promo codes, loyalty integration, and split payments.

    Structured for Machine Learning: Delivered in standard formats (JSONL, CSV, or your preferred schema), with every event mapped to action types, page features, and expected outcomes. Optional HAR files and raw network request logs provide an extra layer of technical fidelity for action modeling and RLHF pipelines (a loading sketch follows this feature list).

    Rich Context for LLMs and Agents: Every annotation includes both human-readable and model-consumable descriptions:

    “What the user did” (natural language)

    “What the system did in response”

    “What a successful action should look like”

    Error/edge case coverage (invalid forms, out-of-stock (OOS) items, address/payment errors)

    Privacy-Safe & Compliant: All flows are depersonalized and scrubbed of PII. Sensitive fields (like credit card numbers, user addresses, and login credentials) are replaced with realistic but synthetic data, ensuring compliance with privacy regulations.

    Each flow tracks the user journey from cart to payment to confirmation, including:

    Adding/removing items

    Applying coupons or promo codes

    Selecting shipping/delivery options

    Account creation, login, or guest checkout

    Inputting payment details (card, wallet, Buy Now Pay Later)

    Handling validation errors or OOS scenarios

    Order review and final placement

    Confirmation page capture (including order summary details)
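
    A minimal ingestion sketch in Python for the JSONL event logs mentioned above; the field name "action_type" and the file name are assumptions about the schema, chosen for illustration rather than taken from the product documentation:

    import json
    from collections import Counter

    def iter_flow_events(path):
        """Yield one event dict per line of a JSONL export."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    # Tally events per action type across one hypothetical export file.
    action_counts = Counter(
        event.get("action_type", "unknown")
        for event in iter_flow_events("checkout_flows.jsonl")
    )
    print(action_counts.most_common(5))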

    Why This Dataset?

    Building LLMs, agentic shopping bots, or e-commerce automation tools demands more than just page screenshots or API logs. You need deeply contextualized, action-oriented data that reflects how real users interact with the complex, ever-changing UIs of digital commerce. Our dataset uniquely captures:

    The full intent-action-outcome loop

    Dynamic UI changes, modals, validation, and error handling

    Nuances of cart modification, bundle pricing, delivery constraints, and multi-vendor checkouts

    Mobile vs. desktop variations

    Diverse merchant tech stacks (custom, Shopify, Magento, BigCommerce, native apps, etc.)

    Use Cases

    LLM Fine-Tuning: Teach models to reason through step-by-step transaction flows, infer next-best-actions, and generate robust, context-sensitive prompts for real-world ordering.

    Agentic Shopping Bots: Train agents to navigate web/mobile checkouts autonomously, handle edge cases, and complete real purchases on behalf of users.

    Action Model & RLHF Training: Provide reinforcement learning pipelines with ground truth “what happens if I do X?” data across hundreds of real merchants.

    UI/UX Research & Synthetic User Studies: Identify friction points, bottlenecks, and drop-offs in modern checkout design by replaying flows and testing interventions.

    Automated QA & Regression Testing: Use realistic flows as test cases for new features or third-party integrations.

    What’s Included

    10,000+ annotated checkout flows (retail, restaurant, marketplace)

    Step-by-step event logs with metadata, DOM, and network context

    Natural language explanations for each step and transition

    All flows are depersonalized and privacy-compliant

    Example scripts for ingesting, parsing, and analyzing the dataset

    Flexible licensing for research or commercial use

    Sample Categories Covered

    Grocery delivery (Instacart, Walmart, Kroger, Target, etc.)

    Restaurant takeout/delivery (Ub...

  10. Data from: Example text

    • kaggle.com
    zip
    Updated Sep 15, 2023
    Cite
    tiansz (2023). Example text [Dataset]. https://www.kaggle.com/datasets/tiansztianszs/example-text
    Explore at:
    zip(662979 bytes)
    Dataset updated
    Sep 15, 2023
    Authors
    tiansz
    Description

    Dataset

    This dataset was created by tiansz


  11. Example Files to Accompany the Text Book Data Analysis: an Introduction,...

    • harmonydata.ac.uk
    Cite
    Example Files to Accompany the Text Book Data Analysis: an Introduction, 1961-1992 [Dataset]. http://doi.org/10.5255/UKDA-SN-3208-1
    Explore at:
    Description

    These data are to be used in conjunction with Data Analysis: An Introduction by B. Nolan, available at booksellers.

  12. Replication Data for: Active Learning Approaches for Labeling Text: Review...

    • dataverse.harvard.edu
    • dataone.org
    Updated Dec 11, 2019
    Cite
    Blake Miller; Fridolin Linder; Walter Mebane (2019). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Blake Miller; Fridolin Linder; Walter Mebane
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the amount of data required to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions under which active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need under random sampling (or 'passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures, and find that even with low reliability, active learning performs more efficiently than random sampling.

  13. Textual Entailment Dataset

    • live.european-language-grid.eu
    • huggingface.co
    xml
    Updated Dec 2, 2021
    Cite
    Yahoo! Labs (2021). Textual Entailment Dataset [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/8121
    Explore at:
    xml
    Dataset updated
    Dec 2, 2021
    Dataset provided by
    University of Rome "Tor Vergata"
    Yahoo! Labs
    Sapienza University of Rome
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    The Textual Entailment dataset contains 800 pairs of Italian sentences, extracted from Wikipedia and annotated for the presence of textual entailment. A pair consists of T (for text) and H (for hypothesis). Textual entailment is defined as a directional relationship between such pairs: the hypothesis must be fully entailed by the text.

    The dataset has been created and used for the Textual Entailment Task (http://www.evalita.it/2009/tasks/te), organised as part of the EVALITA 2009 evaluation campaign (http://www.evalita.it/2009).

    The development and test data each consist of 400 examples of pairs, equally divided into positive and negative examples.

  14. Artificial Intelligence (AI) Text Generator Market Analysis North America,...

    • technavio.com
    pdf
    Updated Jul 12, 2024
    Cite
    Technavio (2024). Artificial Intelligence (AI) Text Generator Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, China, India, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/ai-text-generator-market-analysis
    Explore at:
    pdf
    Dataset updated
    Jul 12, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2024 - 2028
    Area covered
    United States
    Description


    Artificial Intelligence Text Generator Market Size 2024-2028

    The artificial intelligence (AI) text generator market size is forecast to increase by USD 908.2 million at a CAGR of 21.22% between 2023 and 2028.
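
    A back-of-the-envelope check in Python of the headline figures, assuming the 21.22% CAGR compounds over the five years 2023-2028 and applies to total market size; the implied base-year size is derived here, not reported by Technavio:

    cagr, years, increment = 0.2122, 5, 908.2              # increment in USD million
    growth_factor = (1 + cagr) ** years                    # ~2.618 over 2023-2028
    implied_2023_size = increment / (growth_factor - 1)    # ~561 USD million
    implied_2028_size = implied_2023_size * growth_factor  # ~1470 USD million
    print(round(implied_2023_size, 1), round(implied_2028_size, 1))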

    The market is experiencing significant growth due to several key trends. One of these trends is the increasing popularity of AI generators in various sectors, including education for e-learning applications. Another trend is the growing importance of speech-to-text technology, which is becoming increasingly essential for improving productivity and accessibility. However, data privacy and security concerns remain a challenge for the market, as generators process and store vast amounts of sensitive information. It is crucial for market participants to address these concerns through strong data security measures and transparent data handling practices to ensure customer trust and compliance with regulations. Overall, the AI generator market is poised for continued growth as it offers significant benefits in terms of efficiency, accuracy, and accessibility.
    

    What will be the Size of the Artificial Intelligence (AI) Text Generator Market During the Forecast Period?


    The market is experiencing significant growth as businesses and organizations seek to automate content creation across various industries. Driven by technological advancements in machine learning (ML) and natural language processing, AI generators are increasingly being adopted for downstream applications in sectors such as education, manufacturing, and e-commerce. 
    Moreover, these systems enable the creation of personalized content for global audiences in multiple languages, providing a competitive edge for businesses in an interconnected Internet economy. However, responsible AI practices are crucial to mitigate risks associated with biased content, misinformation, misuse, and potential misrepresentation.
    

    How is this Artificial Intelligence (AI) Text Generator Industry segmented and which is the largest segment?

    The artificial intelligence (AI) text generator industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Component
    • Solution
    • Service

    Application
    • Text to text
    • Speech to text
    • Image/video to text

    Geography
    • North America (US)
    • Europe (Germany, UK)
    • APAC (China, India)
    • South America
    • Middle East and Africa
    

    By Component Insights

    The solution segment is estimated to witness significant growth during the forecast period.
    

    Artificial Intelligence (AI) text generators have gained significant traction in various industries due to their efficiency and cost-effectiveness in content creation. These solutions utilize machine learning algorithms, such as deep neural networks, to analyze and learn from vast datasets of human-written text. By predicting the most probable word or sequence of words based on patterns and relationships identified in the training data, AI generators produce personalized content for multiple languages and global audiences. Applications span industries including education, manufacturing, e-commerce, and entertainment & media. In the education industry, AI generators assist in creating personalized learning materials.


    The solution segment was valued at USD 184.50 million in 2018 and showed a gradual increase during the forecast period.

    Regional Analysis

    North America is estimated to contribute 33% to the growth of the global market during the forecast period.
    

    Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.


    The North American market holds the largest share, driven by the region's technological advancements and increasing adoption of AI across industries. AI text generators are increasingly utilized for content creation, customer service, virtual assistants, and chatbots, catering to the growing demand for high-quality, personalized content in sectors such as e-commerce and digital marketing. Moreover, the presence of tech giants like Google, Microsoft, and Amazon in North America, which are investing significantly in AI and machine learning, further fuels market growth. AI generators employ machine learning algorithms, deep neural networks, and natural language processing to generate content in multiple languages for global audiences.

    Market Dynamics

    Our researchers analyzed the data with 2023 as the base year, along with the key drivers, trends, and challenges.

  15. MultiSocial

    • zenodo.org
    • data.niaid.nih.gov
    Updated Aug 20, 2025
    Cite
    Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
    Explore at:
    Dataset updated
    Aug 20, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Dominik Macko; Jakub Kopal; Robert Moro; Ivan Srba
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiSocial is a dataset (described in a paper) for benchmarking multilingual (22 languages) machine-generated text detection in the social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written; approximately the same number was generated by each of 7 multilingual large language models using 3 iterations of paraphrasing. The dataset has been anonymized to minimize the amount of sensitive data by hiding email addresses, usernames, and phone numbers.

    If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

    Disclaimer

    Due to the nature of the data sources (described below), the dataset may contain harmful, disinformative, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% on WhatsApp to 10% on Twitter). Although we used data sources of older date (lower probability of including machine-generated texts), the labeling of human-written text might not be 100% accurate. The anonymization procedure might not have successfully hidden all sensitive/personal content; thus, use the data cautiously (if you feel affected by such content, report the issues to dpo[at]kinit.sk). The intended use is for non-commercial research purposes only.

    Data Source

    The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

    1. Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

    2. Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022), combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

    3. Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

    4. Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

    5. WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

    From these datasets, we pseudo-randomly sampled up to 1300 texts per platform for each of the selected 22 languages (detected using a combination of automated approaches): up to 300 for the test split and the remaining up to 1000 for the train split, where available. This process resulted in 61,592 human-written texts, which were further filtered based on character content and length, leaving about 58k human-written texts.

    The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

    The dataset has the following fields:

    • 'text' - a text sample,

    • 'label' - 0 for human-written text, 1 for machine-generated text,

    • 'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

    • 'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

    • 'language' - the ISO 639-1 language code identifying the detected language of the given text,

    • 'length' - word count of the given text,

    • 'source' - a string identifying the source dataset / platform of the given text,

    • 'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.
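
    A minimal filtering sketch in Python using the fields listed above; the flat CSV export and the file name "multisocial.csv" are assumptions, since the Zenodo record's actual file layout may differ:

    import pandas as pd

    df = pd.read_csv("multisocial.csv")  # hypothetical file name

    # Human-written English test-split texts without flagged noise.
    subset = df[
        (df["label"] == 0)
        & (df["language"] == "en")
        & (df["split"] == "test")
        & (df["potential_noise"] == 0)
    ]
    print(len(subset), subset["source"].value_counts().to_dict())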


  16. EMEA Data Suite | 3.3M Translations | 1.9M Words | 22 Languages | Natural...

    • datarade.ai
    Updated Aug 8, 2025
    Cite
    Oxford Languages (2025). EMEA Data Suite | 3.3M Translations | 1.9M Words | 22 Languages | Natural Language Processing (NLP) Data | Translation Data | TTS | EMEA Coverage [Dataset]. https://datarade.ai/data-products/emea-data-suite-3-3m-translations-1-9m-words-23-languag-oxford-languages
    Explore at:
    .json, .xml, .csv, .xls, .txt, .mp3, .wav
    Dataset updated
    Aug 8, 2025
    Dataset authored and provided by
    Oxford Languages (https://lexico.com/es)
    Area covered
    Uganda, Syrian Arab Republic, Burundi, Israel, Romania, Seychelles, Central African Republic, Spain, Bosnia and Herzegovina, Morocco
    Description

    EMEA Data Suite offers 43 high-quality language datasets covering 23 languages spoken in the region. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the EMEA Data Suite. Compiled and annotated by language and linguistic experts, this suite offers high-quality resources tailored to your needs. This suite includes:

    • Monolingual and Bilingual Dictionary Data: headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Sentence Corpora: curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms: lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data: native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists: frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.

    If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Arabic Monolingual Dictionary Data: 66,500 words | 98,700 senses | 70,000 example sentences.

    2. Arabic Bilingual Dictionary Data: 116,600 translations | 88,300 senses | 74,700 example translations.

    3. Arabic Synonyms and Antonyms Data: 55,100 synonyms.

    4. British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.

    7. Catalan Monolingual Dictionary Data: 29,800 words | 47,400 senses | 25,600 example sentences.

    8. Catalan Bilingual Dictionary Data: 76,800 translations | 109,350 senses | 26,900 example translations.

    9. Croatian Monolingual Dictionary Data: 129,600 words | 164,760 senses | 34,630 example sentences.

    10. Croatian Bilingual Dictionary Data: 100,700 translations | 91,600 senses | 10,180 example translations.

    11. Czech Bilingual Dictionary Data: 426,473 translations | 199,800 senses | 95,000 example translations.

    12. Danish Bilingual Dictionary Data: 129,000 translations | 91,500 senses | 23,000 example translations.

    13. French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.

    14. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.

    15. German Monolingual Dictionary Data: 85,500 words | 78,000 senses | 55,000 example sentences.

    16. German Bilingual Dictionary Data: 393,000 translations | 207,500 senses | 129,500 example translations.

    17. German Word List Data: 338,000 wordforms.

    18. Greek Monolingual Dictionary Data: 47,800 translations | 46,309 senses | 2,388 example sentences.

    19. Hebrew Monolingual Dictionary Data: 85,600 words | 104,100 senses | 94,000 example sentences.

    20. Hebrew Bilingual Dictionary Data: 67,000 translations | 49,000 senses | 19,500 example translations.

    21. Hungarian Monolingual Dictionary Data: 90,500 words | 155,300 senses | 42,500 example sentences.

    22. Italian Monolingual Dictionary Data: 102,500 words | 231,580 senses | 48,200 example sentences.

    23. Italian Bilingual Dictionary Data: 492,000 translations | 251,600 senses | 157,100 example translations.

    24. Italian Synonyms and Antonyms Data: 197,000 synonyms | 62,000 antonyms.

    25. Latvian Monolingual Dictionary Data: 36,000 words | 43,600 senses | 73,600 example sentences.

    26. Polish Bilingual Dictionary Data: 287,400 translations | 216,900 senses | 19,800 example translations.

    27. Portuguese Monolingual Dictionary Data: 143,600 words | 285,500 senses | 69,300 example sentences.

    28. Portuguese Bilingual Dictionary Data: 300,000 translations | 158,000 senses | 117,800 example translations.

    29. Portuguese Synonyms and Antonyms Data: 196,000 synonyms | 90,000 antonyms.

    30. Romanian Monolingual Dictionary Data: 66,900 words | 113,500 senses | 2,700 example sentences.

    31. Romanian Bilingual Dictionary Data: 77,500 translations | 63,870 senses | 33,730 example translations.

    32. Russian Monolingual Dictionary Data: 65,950 words | 57,500 senses | 51,900 example sentences.

    33. Russian Bilingual Dictionary Data: 230,100 translations | 122,200 senses | 69,600 example translations.

    34. Slovak Bilingual Dictionary Dat...

  17. HW1-aug-text-dataset

    • huggingface.co
    Updated Oct 9, 2025
    Cite
    Jennifer Evans (2025). HW1-aug-text-dataset [Dataset]. https://huggingface.co/datasets/jennifee/HW1-aug-text-dataset
    Explore at:
    Dataset updated
    Oct 9, 2025
    Authors
    Jennifer Evans
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for Book Text Data

    This dataset provides text-based reviews for fiction and nonfiction books.

      Dataset Details

      Dataset Description
    

    For a selection of books on my bookshelf, I collected some text data. I selected 15 fiction and 15 nonfiction books. I then wrote three reviews for each book to create the first 90 examples, and then I wrote 5 hypothetical fiction book reviews and 5 hypothetical nonfiction book reviews. These reviews were collected… See the full description on the dataset page: https://huggingface.co/datasets/jennifee/HW1-aug-text-dataset.

  18. Replication Data for: A Three-Year Mixed Methods Study of Undergraduates’...

    • dataverse.no
    • dataverse.azure.uit.no
    Updated Oct 8, 2024
    Cite
    Ellen Nierenberg (2024). Replication Data for: A Three-Year Mixed Methods Study of Undergraduates’ Information Literacy Development: Knowing, Doing, and Feeling [Dataset]. http://doi.org/10.18710/SK0R1N
    Explore at:
    txt(21865), txt(19475), csv(55030), txt(14751), txt(26578), txt(16861), txt(28211), pdf(107685), pdf(657212), txt(12082), txt(16243), text/x-fixed-field(55030), pdf(65240), txt(8172), pdf(634629), txt(31896), application/x-spss-sav(51476), txt(4141), pdf(91121), application/x-spss-sav(31612), txt(35011), txt(23981), text/x-fixed-field(15653), txt(25369), txt(17935), csv(15653)
    Dataset updated
    Oct 8, 2024
    Dataset provided by
    DataverseNO
    Authors
    Ellen Nierenberg
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Time period covered
    Aug 8, 2019 - Jun 10, 2022
    Area covered
    Norway
    Description

    This data set contains the replication data and supplements for the article "Knowing, Doing, and Feeling: A Three-Year, Mixed-Methods Study of Undergraduates’ Information Literacy Development." The survey data is from two samples:

    • cross-sectional sample (different students at the same point in time)
    • longitudinal sample (the same students at different points in time)

    Surveys were distributed via Qualtrics during the students' first and sixth semesters. Quantitative and qualitative data were collected and used to describe students' information literacy (IL) development over 3 years. Statistics from the quantitative data were analyzed in SPSS. The qualitative data was coded and analyzed thematically in NVivo. The qualitative, textual data is from semi-structured interviews with sixth-semester students in psychology at UiT, both focus groups and individual interviews. All data were collected as part of the contact author's PhD research on IL at UiT.

    The following files are included in this data set:

    1. A README file which explains the quantitative data files. (2 file formats: .txt, .pdf)
    2. The consent form for participants (in Norwegian). (2 file formats: .txt, .pdf)
    3. Six data files with survey results from UiT psychology undergraduate students for the cross-sectional (n=209) and longitudinal (n=56) samples, in 3 formats (.dat, .csv, .sav). The data was collected in Qualtrics from fall 2019 to fall 2022.
    4. Interview guide for 3 focus group interviews. (File format: .txt)
    5. Interview guides for 7 individual interviews, first round (n=4) and second round (n=3). (File format: .txt)
    6. The 21-item IL test (Tromsø Information Literacy Test = TILT), in English and Norwegian. TILT is used for assessing students' knowledge of three aspects of IL: evaluating sources, using sources, and seeking information. The test is multiple choice, with four alternative answers for each item. This test is a "KNOW-measure," intended to measure what students know about information literacy. (2 file formats: .txt, .pdf)
    7. Survey questions related to interest, specifically students' interest in being or becoming information literate, in 3 parts (all in English and Norwegian): a) information and questions about the 4 phases of interest; b) interest questionnaire with 26 items in 7 subscales (Tromsø Interest Questionnaire = TRIQ); c) survey questions about IL and interest, need, and intent. (2 file formats: .txt, .pdf)
    8. Information about the assignment-based measures used to measure what students do in practice when evaluating and using sources. Students were evaluated with these measures in their first and sixth semesters. (2 file formats: .txt, .pdf)
    9. The Norwegian Centre for Research Data's (NSD) 2019 assessment of the notification form for personal data for the PhD research project. In Norwegian. (Format: .pdf)

  19. Datasets for Sentiment Analysis

    • zenodo.org
    csv
    Updated Dec 10, 2023
    Cite
    Julie R. Campos Arias (2023). Datasets for Sentiment Analysis [Dataset]. http://doi.org/10.5281/zenodo.10157504
    Explore at:
    csv
    Available download formats
    Dataset updated
    Dec 10, 2023
    Dataset provided by
    Zenodo: http://zenodo.org/
    Authors
    Julie R. Campos Arias
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository was created for my Master's thesis in Computational Intelligence and Internet of Things at the University of Córdoba, Spain. It stores the datasets used in the studies that served as research material for the thesis, as well as the datasets used in its experimental part.

    The datasets are specified below, along with details of their references, authors, and download sources.

    ----------- STS-Gold Dataset ----------------

    The dataset consists of 2,026 tweets. The file has three columns: id (the unique identifier), polarity (the polarity index of the text), and tweet (the tweet text).

    Reference: Saif, H., Fernandez, M., He, Y., & Alani, H. (2013). Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold.

    File name: sts_gold_tweet.csv
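
    As a quick sanity check, the file can be loaded and the label distribution inspected. A minimal sketch, assuming the three-column layout described above (the delimiter may need adjusting to match the file):

    # Minimal sketch: load STS-Gold and inspect the polarity distribution.
    # Assumes the columns described above: id, polarity, tweet.
    import pandas as pd

    df = pd.read_csv("sts_gold_tweet.csv", sep=";")  # adjust sep if needed
    print(len(df))                        # expected: 2026 tweets
    print(df["polarity"].value_counts())  # class balance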

    ----------- Amazon Sales Dataset ----------------

    This dataset contains ratings and reviews for more than 1,000 Amazon products, as listed on the official Amazon website. The data was scraped in January 2023.

    Owner: Karkavelraja J., Postgraduate student at Puducherry Technological University (Puducherry, Puducherry, India)

    Features:

    • product_id - Product ID
    • product_name - Name of the Product
    • category - Category of the Product
    • discounted_price - Discounted Price of the Product
    • actual_price - Actual Price of the Product
    • discount_percentage - Percentage of Discount for the Product
    • rating - Rating of the Product
    • rating_count - Number of people who voted for the Amazon rating
    • about_product - Description about the Product
    • user_id - ID of the user who wrote review for the Product
    • user_name - Name of the user who wrote review for the Product
    • review_id - ID of the user review
    • review_title - Short review
    • review_content - Long review
    • img_link - Image Link of the Product
    • product_link - Official Website Link of the Product

    License: CC BY-NC-SA 4.0

    File name: amazon.csv

    ----------- Rotten Tomatoes Reviews Dataset ----------------

    This rating-inference dataset is a sentiment classification dataset containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. On average, these reviews are 21 words long. The first 5,331 rows contain only negative samples and the last 5,331 rows contain only positive samples, so the data should be shuffled before use.

    This data was collected from https://www.cs.cornell.edu/people/pabo/movie-review-data/ as a .txt file and converted into a .csv file. The file consists of 2 columns: reviews and labels (1 for fresh (good) and 0 for rotten (bad)).

    Reference: Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

    File name: data_rt.csv
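
    Because the rows are ordered by class, shuffling before any train/test split is essential. A minimal sketch, assuming the two-column layout described above:

    # Minimal sketch: load data_rt.csv and shuffle, since the first half
    # is all-negative and the second half all-positive.
    import pandas as pd

    df = pd.read_csv("data_rt.csv")
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle
    print(df["labels"].value_counts())  # expected: 5331 of each class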

    ----------- Preprocessed Dataset Sentiment Analysis ----------------

    Preprocessed Amazon product review data for the Gen3EcoDot (Alexa), scraped entirely from amazon.in. Reviews were stemmed and lemmatized using NLTK, and sentiment labels were generated using TextBlob polarity scores.

    The file consists of 4 columns: index, review (stemmed and lemmatized review using nltk), polarity (score) and division (categorical label generated using polarity score).

    DOI: 10.34740/kaggle/dsv/3877817

    Citation: @misc{pradeesh arumadi_2022, title={Preprocessed Dataset Sentiment Analysis}, url={https://www.kaggle.com/dsv/3877817}, DOI={10.34740/KAGGLE/DSV/3877817}, publisher={Kaggle}, author={Pradeesh Arumadi}, year={2022} }

    This dataset was used in the experimental phase of my research.

    File name: EcoPreprocessed.csv
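
    The labeling step can be approximated as follows. This is a sketch only: the polarity cut-offs are illustrative assumptions, not the dataset author's documented thresholds.

    # Sketch of the labeling approach described above: TextBlob polarity
    # scores binned into categories. The cut-offs are assumptions.
    from textblob import TextBlob  # pip install textblob

    def divide(text: str) -> str:
        polarity = TextBlob(text).sentiment.polarity  # in [-1.0, 1.0]
        if polarity > 0:
            return "positive"
        if polarity < 0:
            return "negative"
        return "neutral"

    print(divide("The speaker sounds great"))  # positive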

    ----------- Amazon Earphones Reviews ----------------

    This dataset consists of 9,930 Amazon reviews and star ratings for the 10 latest (as of mid-2019) Bluetooth earphone devices, intended for learning how to train machine-learning models for sentiment analysis.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 5 columns: ReviewTitle, ReviewBody, ReviewStar, Product, and division (manually added; a categorical label generated using the ReviewStar score).

    License: U.S. Government Works

    Source: www.amazon.in

    File name (original): AllProductReviews.csv (contains 14337 reviews)

    File name (edited, used for my research): AllProductReviews2.csv (contains 9930 reviews)
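
    The manually added division column maps star ratings to categories. The mapping below is an illustrative assumption of how such a label can be derived, not the author's documented rule:

    # Illustrative sketch: deriving a categorical "division" label from
    # ReviewStar. The 1-2 / 3 / 4-5 cut-offs are assumptions.
    import pandas as pd

    df = pd.read_csv("AllProductReviews2.csv")
    df["division"] = pd.cut(df["ReviewStar"], bins=[0, 2, 3, 5],
                            labels=["negative", "neutral", "positive"])
    print(df["division"].value_counts())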

    ----------- Amazon Musical Instruments Reviews ----------------

    This dataset contains 7137 comments/reviews of different musical instruments coming from Amazon.

    This dataset was employed in the experimental phase of my research. To align it with the objectives of my study, certain reviews were excluded from the original dataset, and an additional column was incorporated into this dataset.

    The file consists of 10 columns: reviewerID, asin (ID of the product), reviewerName, helpful (helpfulness rating of the review), reviewText, overall (rating of the product), summary (summary of the review), unixReviewTime (time of the review, in Unix time), reviewTime (raw time of the review), and division (manually added; a categorical label generated using the overall score).

    Source: http://jmcauley.ucsd.edu/data/amazon/

    File name (original): Musical_instruments_reviews.csv (contains 10261 reviews)

    File name (edited, used for my research): Musical_instruments_reviews2.csv (contains 7137 reviews)
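
    The unixReviewTime column stores epoch seconds, which pandas can convert directly. A minimal sketch, with the same caveat that the division cut-offs are assumptions:

    # Minimal sketch: parse the Unix timestamps and derive a categorical
    # label from the "overall" star rating. Cut-offs are assumptions.
    import pandas as pd

    df = pd.read_csv("Musical_instruments_reviews2.csv")
    df["reviewDate"] = pd.to_datetime(df["unixReviewTime"], unit="s")
    df["division"] = pd.cut(df["overall"], bins=[0, 2, 3, 5],
                            labels=["negative", "neutral", "positive"])
    print(df[["reviewDate", "overall", "division"]].head())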

  20. APAC Data Suite | 4M+ Translations | 1.6M+ Words | Natural Language...

    • datarade.ai
    Updated Oct 1, 2025
    Cite
    Oxford Languages (2025). APAC Data Suite | 4M+ Translations | 1.6M+ Words | Natural Language Processing Data | Dictionary Display | Translations | APAC Coverage [Dataset]. https://datarade.ai/data-products/apac-data-suite-4m-translations-1-6m-words-natural-la-oxford-languages
    Explore at:
    .json, .xml, .csv, .txt, .mp3, .wav
    Available download formats
    Dataset updated
    Oct 1, 2025
    Dataset authored and provided by
    Oxford Languages: https://lexico.com/es
    Area covered
    Marshall Islands, China, Vietnam, Papua New Guinea, Australia, Kiribati, Fiji, Thailand, Taiwan, Philippines
    Description

    APAC Data Suite offers high-quality language datasets. Ideal for NLP, AI, LLMs, translation, and education, it combines linguistic depth and regional authenticity to power scalable, multilingual language technologies.

    Discover our expertly curated language datasets in the APAC Data Suite. Compiled and annotated by language and linguistics experts, the suite is tailored to your needs and includes:

    • Monolingual and Bilingual Dictionary Data
      Featuring headwords, definitions, word senses, part-of-speech (POS) tags, and semantic metadata.

    • Semi-bilingual Dictionary Data
      Each entry features a headword with definitions and/or usage examples in Language 1, followed by a translation of the headword and/or definition in Language 2, enabling efficient cross-lingual mapping.

    • Sentence Corpora
      Curated examples of real-world usage with contextual annotations for training and evaluation.

    • Synonyms & Antonyms
      Lexical relations to support semantic search, paraphrasing, and language understanding.

    • Audio Data
      Native speaker recordings for speech recognition, TTS, and pronunciation modeling.

    • Word Lists
      Frequency-ranked and thematically grouped lists for vocabulary training and NLP tasks. The word list data can cover one language or two, such as Tamil words with English translations.

    Each language may contain one or more types of language data. Depending on the dataset, we can provide these in formats such as XML, JSON, TXT, XLSX, CSV, WAV, MP3, and more. Delivery is currently available via email (link-based sharing) or REST API.
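
    Entries delivered as JSON can be consumed programmatically. The sketch below parses one hypothetical entry; the field names are assumptions based on the features listed above (headword, POS, senses, translations), not Oxford Languages' actual delivery schema.

    # Sketch: parsing one hypothetical dictionary entry. The JSON shape
    # is an assumption, not the vendor's actual schema.
    import json

    entry_json = """
    {
      "headword": "run",
      "pos": "verb",
      "senses": [
        {"definition": "move at a speed faster than a walk",
         "translations": ["courir"]}
      ]
    }
    """

    entry = json.loads(entry_json)
    for sense in entry["senses"]:
        print(entry["headword"], entry["pos"], "->", ", ".join(sense["translations"]))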

    If you require more information about a specific dataset, please contact us at Growth.OL@oup.com.

    Below are the different types of datasets available for each language, along with their key features and approximate metrics. If you have any questions or require additional assistance, please don't hesitate to contact us.

    1. Assamese Semi-bilingual Dictionary Data: 72,200 words | 83,700 senses | 83,800 translations.

    2. Bengali Bilingual Dictionary Data: 161,400 translations | 71,600 senses.

    3. Bengali Semi-bilingual Dictionary Data: 28,300 words | 37,700 senses | 62,300 translations.

    4. British English Monolingual Dictionary Data: 146,000 words | 230,000 senses | 149,000 example sentences.

    5. British English Synonyms and Antonyms Data: 600,000 synonyms | 22,000 antonyms.

    6. British English Pronunciations with Audio: 250,000 transcriptions (IPA) | 180,000 audio files.

    7. French Monolingual Dictionary Data: 42,000 words | 56,000 senses | 43,000 example sentences.

    8. French Bilingual Dictionary Data: 380,000 translations | 199,000 senses | 146,000 example translations.

    9. Gujarati Monolingual Dictionary Data: 91,800 words | 131,500 senses.

    10. Gujarati Bilingual Dictionary Data: 171,800 translations | 158,200 senses.

    11. Hindi Monolingual Dictionary Data: 46,200 words | 112,700 senses.

    12. Hindi Bilingual Dictionary Data: 263,400 translations | 208,100 senses | 18,600 example translations.

    13. Hindi Synonyms and Antonyms Dictionary Data: 478,100 synonyms | 18,800 antonyms.

    14. Hindi Sentence Data: 216,000 sentences.

    15. Hindi Audio data: 68,000 audio files.

    16. Indonesian Bilingual Dictionary Data: 36,000 translations | 23,700 senses | 12,700 example translations.

    17. Indonesian Monolingual Dictionary Data: 120,000 words | 140,000 senses | 30,000 example sentences.

    18. Korean Monolingual Dictionary Data: 596,100 words | 386,600 senses | 91,700 example sentences.

    19. Korean Bilingual Dictionary Data: 952,500 translations | 449,700 senses | 227,800 example translations.

    20. Mandarin Chinese (simplified) Monolingual Dictionary Data: 81,300 words | 162,400 senses | 80,700 example sentences.

    21. Mandarin Chinese (traditional) Monolingual Dictionary Data: 60,100 words | 144,700 senses | 29,900 example sentences.

    22. Mandarin Chinese (simplified) Bilingual Dictionary Data: 367,600 translations | 204,500 senses | 150,900 example translations.

    23. Mandarin Chinese (traditional) Bilingual Dictionary Data: 215,600 translations | 202,800 senses | 149,700 example translations.

    24. Mandarin Chinese (simplified) Synonyms and Antonyms Data: 3,800 synonyms | 3,180 antonyms.

    25. Malay Bilingual Dictionary Data: 106,100 translations | 53,500 senses.

    26. Malay Monolingual Dictionary Data: 39,800 words | 40,600 senses | 21,100 example sentences.

    27. Malayalam Monolingual Dictionary Data: 91,300 words | 159,200 senses.

    28. Malayalam Bilingual Word List Data: 76,200 translation pairs.

    29. Marathi Bilingual Dictionary Data: 45,400 translations | 32,800 senses | 3,600 example translations.

    30. Nepali Bilingual Dictionary Data: 350,000 translations | 264,200 senses | 1,300 example translations.

    31. New Zealand English Monolingual Dictionary Data: 100,000 words.

    32. Odia Semi-bilingual Dictionary Data: 30,700 words | 69,300 senses | 69,200 translations.

    33. Punjabi ...
