100+ datasets found

D
Tutorial Package for: Text as Data in Economic Analysis
dataverse.nl
Updated Jun 26, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun; Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun (2025). Tutorial Package for: Text as Data in Economic Analysis [Dataset]. http://doi.org/10.34894/KNDZ9T
Explore at:
text/markdown(148), bin(493802528), text/markdown(405), csv(6678744), application/x-ipynb+json(56525), text/markdown(136), csv(8712017), txt(1706), text/x-python(3800), text/markdown(131), txt(194), text/markdown(179), csv(89054804), bin(43909246), csv(1600), xlsx(10436), bin(952), text/markdown(1743)Available download formats
Unique identifier
https://doi.org/10.34894/KNDZ9T
Dataset updated
Jun 26, 2025
Dataset provided by
DataverseNL
Authors
Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun; Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun
License
Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
Time period covered
Jan 1, 2002 - May 31, 2023
Dataset funded by
Institute for New Economic Thinking
Deutsche Forschungsgemeinschaft (403041268-TRR 266)
Description
This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.
example text data word & CSV format
kaggle.com
zip
Updated Apr 14, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
nohafathi (2025). example text data word & CSV format [Dataset]. https://www.kaggle.com/datasets/nohaaf/example-text-data/discussion?sort=undefined
Explore at:
zip(10243 bytes)Available download formats
Dataset updated
Apr 14, 2025
Authors
nohafathi
Description
Dataset

This dataset was created by nohafathi

Contents
d
SIAM 2007 Text Mining Competition dataset
catalog.data.gov
data.nasa.gov
+1more
Updated Apr 11, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
Explore at:
Dataset updated
Apr 11, 2025
Dataset provided by
Dashlink
Description
Subject Area: Text Mining Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available. How Data Was Acquired: The data for this competition came from human generated reports on incidents that occurred during a flight. Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set will be contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number and a tilde separats the document number from the text itself. Anomalies/Faults: This is a document category classification problem.
h
text-classification-dataset-example
huggingface.co
Updated Feb 7, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Chien-Wei Chang (2024). text-classification-dataset-example [Dataset]. https://huggingface.co/datasets/cwchang/text-classification-dataset-example
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
Feb 7, 2024
Authors
Chien-Wei Chang
License
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
cwchang/text-classification-dataset-example dataset hosted on Hugging Face and contributed by the HF Datasets community
Data from: Example text
kaggle.com
zip
Updated Sep 15, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
tiansz (2023). Example text [Dataset]. https://www.kaggle.com/datasets/tiansztianszs/example-text
Explore at:
zip(662979 bytes)Available download formats
Dataset updated
Sep 15, 2023
Authors
tiansz
Description
Dataset

This dataset was created by tiansz

Contents
Text Document Classification Dataset
kaggle.com
zip
Updated Dec 4, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
Explore at:
zip(1941393 bytes)Available download formats
Dataset updated
Dec 4, 2023
Authors
sunil thite
Description
This is text document classification dataset which contains 2225 text data and five categories of documents. Five categories are politics, sport, tech, entertainment and business. We can use this dataset for documents classification and document clustering.

About Dataset - Dataset contains two features text and label. - No. of Rows : 2225 - No. of Columns : 2

Text: It contains different categories of text data Label: It contains labels for five different categories : 0,1,2,3,4

Politics = 0

Sport = 1

Technology = 2

Entertainment =3

Business = 4
Text files from Gutenberg database
zenodo.org
zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Antonis Michalas; Antonis Michalas (2020). Text files from Gutenberg database [Dataset]. http://doi.org/10.5281/zenodo.3360392
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.3360392
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Antonis Michalas; Antonis Michalas
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Text files of different size and structure. More precisely, we selected random data from the Gutenberg dataset.

This artefact contains five different datasets with random text files (i.e. e-books in .txt format) from the Gutenberg database. The datasets that we selected ranged from text files with a total size of 184MB to a set of text files with a total size of 1.7GB.

More precisely, the following datasets can be found in this package:

184MB

357MB

670MB

1GB

1.7GB

In our case, we used this dataset to perform extensive experiments on regarding the performance of a Symmetric Searchable Encryption scheme. However, this dataset can be used to measure the performance of any algorithm that is parsing documents, extracting keywords, creates dictionaries etc.
sample-text-dataset
kaggle.com
zip
Updated May 7, 2018
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manish Kumar Mishra (2018). sample-text-dataset [Dataset]. https://www.kaggle.com/manish341/sample-text-file
Explore at:
zip(233 bytes)Available download formats
Dataset updated
May 7, 2018
Authors
Manish Kumar Mishra
Description
Dataset

This dataset was created by Manish Kumar Mishra

Contents
d
HTMLmetadata HTML formatted text files describing samples and spectra,...
catalog.data.gov
datasets.ai
+1more
Updated Oct 22, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). HTMLmetadata HTML formatted text files describing samples and spectra, including photos [Dataset]. https://catalog.data.gov/dataset/htmlmetadata-html-formatted-text-files-describing-samples-and-spectra-including-photos
Explore at:
Dataset updated
Oct 22, 2025
Dataset provided by
U.S. Geological Survey
Description
HTMLmetadata Text files in HTML-format containing metadata about samples and spectra. Also included in the zip file are folders containing information linked to from the HTML files, including: - README: contains a HTML version of the USGS Data Series publication, linked to this data release, that describes this spectral library (Kokaly and others, 2017). The folder also contains an HTML version of the release notes. - photo_images: contains full resolution images of photos of samples and field sites. - photo_thumbs: contains low-resolution thumbnail versions of photos of samples and field sites. GENERAL LIBRARY DESCRIPTION This data release provides the U.S. Geological Survey (USGS) Spectral Library Version 7 and all related documents. The library contains spectra measured with laboratory, field, and airborne spectrometers. The instruments used cover wavelengths from the ultraviolet to the far infrared (0.2 to 200 microns). Laboratory samples of specific minerals, plants, chemical compounds, and man-made materials were measured. In many cases, samples were purified, so that unique spectral features of a material can be related to its chemical structure. These spectro-chemical links are important for interpreting remotely sensed data collected in the field or from an aircraft or spacecraft. This library also contains physically-constructed as well as mathematically-computed mixtures. Measurements of rocks, soils, and natural mixtures of minerals have also been made with laboratory and field spectrometers. Spectra of plant components and vegetation plots, comprising many plant types and species with varying backgrounds, are also in this library. Measurements by airborne spectrometers are included for forested vegetation plots, in which the trees are too tall for measurement by a field spectrometer. The related U.S. Geological Survey Data Series publication, "USGS Spectral Library Version 7", describes the instruments used, metadata descriptions of spectra and samples, and possible artifacts in the spectral measurements (Kokaly and others, 2017). Four different spectrometer types were used to measure spectra in the library: (1) Beckman™ 5270 covering the spectral range 0.2 to 3 µm, (2) standard, high resolution (hi-res), and high-resolution Next Generation (hi-resNG) models of ASD field portable spectrometers covering the range from 0.35 to 2.5 µm, (3) Nicolet™ Fourier Transform Infra-Red (FTIR) interferometer spectrometers covering the range from about 1.12 to 216 µm, and (4) the NASA Airborne Visible/Infra-Red Imaging Spectrometer AVIRIS, covering the range 0.37 to 2.5 µm. Two fundamental spectrometer characteristics significant for interpreting and utilizing spectral measurements are sampling position (the wavelength position of each spectrometer channel) and bandpass (a parameter describing the wavelength interval over which each channel in a spectrometer is sensitive). Bandpass is typically reported as the Full Width at Half Maximum (FWHM) response at each channel (in wavelength units, for example nm or micron). The linked publication (Kokaly and others, 2017), includes a comparison plot of the various spectrometers used to measure the data in this release. Data for the sampling positions and the bandpass values (for each channel in the spectrometers) are included in this data release. These data are in the SPECPR files, as separate data records, and in the American Standard Code for Information Interchange (ASCII) text files, as separate files for wavelength and bandpass. Spectra are provided in files of ASCII text format (files with a .txt file extension). In the ASCII files, deleted channels (bad bands) are indicated by a value of -1.23e34. Metadata descriptions of samples, field areas, spectral measurements, and results from supporting material analyses – such as XRD – are provided in HyperText Markup Language HTML formatted ASCII text files (files with .html file extension). In addition, Graphics Interchange Format (GIF) images of plots of spectra are provided. For each spectrum a plot with wavelength in microns on the x-axis is provided. For spectra measured on the Nicolet spectrometer, an additional GIF image with wavenumber on the x-axis is provided. Data are also provided in SPECtrum Processing Routines (SPECPR) format (Clark, 1993) which packages spectra and associated metadata descriptions into a single file (see the linked publication, Kokaly and others, 2017, for additional details on the SPECPR format and freely-available software than can be used to read files in SPECPR format). The data measured on the source spectrometers are denoted by the “splib07a” tag in filenames. In addition to providing the original measurements, the spectra have been convolved and resampled to different spectrometer and multispectral sensor characteristics. The following list specifies the identifying tag for the measured and convolved libraries and gives brief descriptions of the sensors. splib07a – this is the name of the SPECPR file containing the spectra measured on the Beckman, ASD, Nicolet and AVIRIS spectrometers. The data are provided with their original sampling positions (wavelengths) and bandpass values. The prefix “splib07a_” is at the beginning of the ASCII and GIF files pertaining to the measured spectra. splib07b – this is the name of the SPECPR file containing a modified version of the original measurements. The results from using spectral convolution to convert measurements to other spectrometer characteristics can be improved by oversampling (increasing sample density). Thus, splib07b is an oversampled version of the library, computed using simple cubic-spline interpolation to produce spectra with fine sampling interval (therefore a higher number of channels) for Beckman and AVIRIS measurements. The spectra in this version of the library are the data used to create the convolved and resampled versions of the library. The prefix “splib07b_” is at the beginning of the ASCII and GIF files pertaining to the oversampled spectra. s07_ASD – this is the name of the SPECPR file containing the spectral library measurements convolved to standard resolution ASD full range spectrometer characteristics. The standard reported wavelengths of the ASD spectrometers used by the USGS were used (2151 channels with wavelength positions starting at 350 nm and increasing in 1 nm increments). The bandpass values of each channel were determined by comparing measurements of reference materials made on ASD spectrometers in comparison to measurements made of the same materials on higher resolution spectrometers (the procedure is described in Kokaly, 2011, and discussed in Kokaly and Skidmore, 2015, and Kokaly and others, 2017). The prefix “s07ASD_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV95 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1995 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV95_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV96 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1996 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV96_” is at the beginning of the ASCII, and GIF files. s07_AV97 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1997 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV97_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV98 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1998 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV98_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV99 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 1999 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV99_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV00 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 2000 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV00_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV01 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 2001 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV01_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV05 – this is the name of the SPECPR file containing the spectral library measurements convolved to AVIRIS-Classic with spectral characteristics determined in the year 2005 (wavelength and bandpass values for the 224 channels provided with AVIRIS data by NASA/JPL). The prefix “s07_AV05_” is at the beginning of the ASCII and GIF files pertaining to this spectrometer. s07_AV06 – this is the name of the SPECPR file containing the spectral library measurements convolved to
feedback prize: line-by-line text dataset
kaggle.com
zip
Updated Dec 16, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicholas Broad (2021). feedback prize: line-by-line text dataset [Dataset]. https://www.kaggle.com/datasets/nbroad/feedback-prize-linebyline-text-dataset
Explore at:
zip(12238059 bytes)Available download formats
Dataset updated
Dec 16, 2021
Authors
Nicholas Broad
Description
If you would like to pre-train a model on the text data in the Feedback Prize competition, here you go! It is one text file with a single sentence per line. Duplicate lines and lines with fewer than 20 characters are removed.

Code to recreate here: https://www.kaggle.com/nbroad/line-by-line-dataset

Use it in transformers scripts using the --line_by_line flag.

run_mlm.py

run_mlm_no_trainer.py
h
Example Files to Accompany the Text Book Data Analysis: an Introduction,...
harmonydata.ac.uk
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Example Files to Accompany the Text Book Data Analysis: an Introduction, 1961-1992 [Dataset]. http://doi.org/10.5255/UKDA-SN-3208-1
Explore at:
Unique identifier
https://doi.org/10.5255/UKDA-SN-3208-1
Description
These data are to be used in conjunction with Data Analysis : An Introduction by B. Nolan, available at booksellers.
m
Text Script Analytics Code for Automatic Video Generation
data.mendeley.com
Updated Aug 22, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
gaganpreet gagan (2025). Text Script Analytics Code for Automatic Video Generation [Dataset]. http://doi.org/10.17632/kgngzzs5c8.5
Explore at:
Unique identifier
https://doi.org/10.17632/kgngzzs5c8.5
Dataset updated
Aug 22, 2025
Authors
gaganpreet gagan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This Python notebook (research work) provides a comprehensive solution for text analysis and hint extraction that will be useful for making computational scenes using input text .

It includes a collection of functions that can be used to preprocess textual data, extract information such as characters, relationships, emotions, dates, times, addresses, locations, purposes, and hints from the text.

Key Features:

Preprocessing Collected Data: The notebook offers preprocessing capabilities to remove unwanted strings, normalize text data, and prepare it for further analysis. Character Extraction: The notebook includes functions to extract characters from the text, count the number of characters, and determine the number of male and female characters. Relationship Extraction: Functions are provided to calculate possible relationships among characters and extract the relationship names. Dominant Emotion Extraction: The notebook includes a function to extract the dominant emotion from the text. Date and Time Extraction: Functions are available to extract dates and times from the text, including handling phrases like "before," "after," "in the morning," and "in the evening." Address and Location Extraction: The notebook provides functions to extract addresses and locations from the text, including identifying specific places like offices, homes, rooms, or bathrooms. Purpose Extraction: Functions are included to extract the purpose of the text. Hint Collection: The notebook offers the ability to collect hints from the text based on specific keywords or phrases. Sample Implementations: Sample Python code is provided for each function, demonstrating how to use them effectively. This notebook serves as a valuable resource for text analysis tasks, assisting in extracting essential information and hints from textual data. It can be used in various domains such as natural language processing, sentiment analysis, and information retrieval. The code is well-documented and can be easily integrated into existing projects or workflows.
EP full-text data for text analytics
data.europa.eu
csv
Updated Oct 15, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
European Patent Office (2021). EP full-text data for text analytics [Dataset]. https://data.europa.eu/88u/dataset/https-www-epo-org-searching-for-patents-data-bulk-data-sets-text-analytics-dataset
Explore at:
csvAvailable download formats
Dataset updated
Oct 15, 2021
Dataset authored and provided by
European Patent Officehttp://www.epo.org/
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A bulk data set consisting of XML-tagged titles, abstracts, descriptions, claims and search reports of European Patent (EP) publications, designed to facilitate natural language processing work.
h
2025-24679-hw1-text-dataset-mkarthik
huggingface.co
Updated Oct 2, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Madhav Karthikeyakannan (2025). 2025-24679-hw1-text-dataset-mkarthik [Dataset]. https://huggingface.co/datasets/madhavkarthi/2025-24679-hw1-text-dataset-mkarthik
Explore at:
Dataset updated
Oct 2, 2025
Authors
Madhav Karthikeyakannan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Dataset Card for Synthetic Text Dataset - Augmentation Example

Dataset Summary

This dataset demonstrates text data augmentation. Starting from 100 original short text samples, multiple augmentation techniques were applied to expand the dataset to 1,000 samples.

Purpose

The dataset was created as part of a course exercise to explore text augmentation and its effect on classification tasks.

Composition

Instances: 100 original + 1200 augmented = 1,300… See the full description on the dataset page: https://huggingface.co/datasets/madhavkarthi/2025-24679-hw1-text-dataset-mkarthik.
H
Replication Data for: Active Learning Approaches for Labeling Text: Review...
dataverse.harvard.edu
dataone.org
Updated Dec 11, 2019
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Blake Miller; Fridolin Linder; Walter Mebane (2019). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Unique identifier
https://doi.org/10.7910/DVN/T88EAX
Dataset updated
Dec 11, 2019
Dataset provided by
Harvard Dataverse
Authors
Blake Miller; Fridolin Linder; Walter Mebane
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or `passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures and find that even with low-reliability active learning performs more efficiently than does random sampling.
MultiSocial
zenodo.org
data.niaid.nih.gov
+1more
Updated Aug 20, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal (2025). MultiSocial [Dataset]. http://doi.org/10.5281/zenodo.13846152
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.13846152
Dataset updated
Aug 20, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Dominik Macko; Dominik Macko; Jakub Kopal; Robert Moro; Robert Moro; Ivan Srba; Ivan Srba; Jakub Kopal
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MultiSocial is a dataset (described in a paper) for multilingual (22 languages) machine-generated text detection benchmark in social-media domain (5 platforms). It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual large language models by using 3 iterations of paraphrasing. The dataset has been anonymized to minimize amount of sensitive data by hiding email addresses, usernames, and phone numbers.

If you use this dataset in any publication, project, tool or in any other form, please, cite the paper.

Disclaimer

Due to data source (described below), the dataset may contain harmful, disinformation, or offensive content. Based on a multilingual toxicity detector, about 8% of the text samples are probably toxic (from 5% in WhatsApp to 10% in Twitter). Although we have used data sources of older date (lower probability to include machine-generated texts), the labeling (of human-written text) might not be 100% accurate. The anonymization procedure might not successfully hiden all the sensitive/personal content; thus, use the data cautiously (if feeling affected by such content, report the found issues in this regard to dpo[at]kinit.sk). The intended use if for non-commercial research purpose only.

Data Source

The human-written part consists of a pseudo-randomly selected subset of social media posts from 6 publicly available datasets:

Telegram data originated in Pushshift Telegram, containing 317M messages (Baumgartner et al., 2020). It contains messages from 27k+ channels. The collection started with a set of right-wing extremist and cryptocurrency channels (about 300 in total) and was expanded based on occurrence of forwarded messages from other channels. In the end, it thus contains a wide variety of topics and societal movements reflecting the data collection time.

Twitter data originated in CLEF2022-CheckThat! Task 1, containing 34k tweets on COVID-19 and politics (Nakov et al., 2022, combined with Sentiment140, containing 1.6M tweets on various topics (Go et al., 2009).

Gab data originated in the dataset containing 22M posts from Gab social network. The authors of the dataset (Zannettou et al., 2018) found out that “Gab is predominantly used for the dissemination and discussion of news and world events, and that it attracts alt-right users, conspiracy theorists, and other trolls.” They also found out that hate speech is much more prevalent there compared to Twitter, but lower than 4chan's Politically Incorrect board.

Discord data originated in Discord-Data, containing 51M messages. This is a long-context, anonymized, clean, multi-turn and single-turn conversational dataset based on Discord data scraped from a large variety of servers, big and small. According to the dataset authors, it contains around 0.1% of potentially toxic comments (based on the applied heuristic/classifier).

WhatsApp data originated in whatsapp-public-groups, containing 300k messages (Garimella & Tyson, 2018). The public dataset contains the anonymised data, collected for around 5 months from around 178 groups. Original messages were made available to us on request to dataset authors for research purposes.

From these datasets, we have pseudo-randomly sampled up to 1300 texts (up to 300 for test split and the remaining up to 1000 for train split if available) for each of the selected 22 languages (using a combination of automated approaches to detect the language) and platform. This process resulted in 61,592 human-written texts, which were further filtered out based on occurrence of some characters or their length, resulting in about 58k human-written texts.

The machine-generated part contains texts generated by 7 LLMs (Aya-101, Gemini-1.0-pro, GPT-3.5-Turbo-0125, Mistral-7B-Instruct-v0.2, opt-iml-max-30b, v5-Eagle-7B-HF, vicuna-13b). All these models were self-hosted except for GPT and Gemini, where we used the publicly available APIs. We generated the texts using 3 paraphrases of the original human-written data and then preprocessed the generated texts (filtered out cases when the generation obviously failed).

The dataset has the following fields:

'text' - a text sample,

'label' - 0 for human-written text, 1 for machine-generated text,

'multi_label' - a string representing a large language model that generated the text or the string "human" representing a human-written text,

'split' - a string identifying train or test split of the dataset for the purpose of training and evaluation respectively,

'language' - the ISO 639-1 language code identifying the detected language of the given text,

'length' - word count of the given text,

'source' - a string identifying the source dataset / platform of the given text,

'potential_noise' - 0 for text without identified noise, 1 for text with potential noise.

ToDo Statistics (under construction)
BL Newspapers sample plain-text data
zenodo.org
zip
Updated Aug 19, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Yann Ryan; Yann Ryan (2023). BL Newspapers sample plain-text data [Dataset]. http://doi.org/10.5281/zenodo.8262356
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8262356
Dataset updated
Aug 19, 2023
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Yann Ryan; Yann Ryan
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A dataset of .csv files each containing article texts from newspapers published on the Shared Research Repository.
Ecommerce Text Classification
kaggle.com
zip
Updated Oct 9, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Saurabh Shahane (2023). Ecommerce Text Classification [Dataset]. https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification
Explore at:
zip(8236809 bytes)Available download formats
Dataset updated
Oct 9, 2023
Authors
Saurabh Shahane
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This is the classification based E-commerce text dataset for 4 categories - "Electronics", "Household", "Books" and "Clothing & Accessories", which almost cover 80% of any E-commerce website.

The dataset is in ".csv" format with two columns - the first column is the class name and the second one is the datapoint of that class. The data point is the product and description from the e-commerce website.

The dataset has the following features :

Data Set Characteristics: Multivariate

Number of Instances: 50425

Number of classes: 4

Area: Computer science

Attribute Characteristics: Real

Number of Attributes: 1

Associated Tasks: Classification

Missing Values? No

Gautam. (2019). E commerce text dataset (version - 2) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3355823
d
Data from: ViTexOCR; a script to extract text overlays from digital video
catalog.data.gov
data.usgs.gov
+4more
Updated Nov 19, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
U.S. Geological Survey (2025). ViTexOCR; a script to extract text overlays from digital video [Dataset]. https://catalog.data.gov/dataset/vitexocr-a-script-to-extract-text-overlays-from-digital-video
Explore at:
Dataset updated
Nov 19, 2025
Dataset provided by
U.S. Geological Survey
Description
The ViTexOCR script presents a new method for extracting navigation data from videos with text overlays using optical character recognition (OCR) software. Over the past few decades, it was common for videos recorded during surveys to be overlaid with real-time geographic positioning satellite chyrons including latitude, longitude, date and time, as well as other ancillary data (such as speed, heading, or user input identifying fields). Embedding these data into videos provides them with utility and accuracy, but using the location data for other purposes, such as analysis in a geographic information system, is not possible when only available on the video display. Extracting the text data from imagery using software allows these videos to be located and analyzed in a geospatial context. The script allows a user to select a video, specify the text data types (e.g. latitude, longitude, date, time, or other), text color, and the pixel locations of overlay text data on a sample video frame. The script’s output is a data file containing the retrieved geospatial and temporal data. All functionality is bundled in a Python script that incorporates a graphical user interface and several other software dependencies.
American English Language Datasets | 150+ Years of Research | Textual Data |...
datarade.ai
Updated Jul 29, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Oxford Languages (2025). American English Language Datasets | 150+ Years of Research | Textual Data | Audio Data | Natural Language Processing (NLP) Data | US English Coverage [Dataset]. https://datarade.ai/data-products/american-english-language-datasets-150-years-of-research-oxford-languages
Explore at:
.json, .xml, .csv, .xls, .mp3, .wavAvailable download formats
Dataset updated
Jul 29, 2025
Dataset authored and provided by
Oxford Languageshttps://lexico.com/es
Area covered
United States
Description
Derived from over 150 years of lexical research, these comprehensive textual and audio data, focused on American English, provide linguistically annotated data. Ideal for NLP applications, LLM training and/or fine-tuning, as well as educational and game apps.

One of our flagship datasets, the American English data is expertly curated and linguistically annotated by professionals, with annual updates to ensure accuracy and relevance. The below datasets in American English are available for license:

American English Monolingual Dictionary Data

American English Synonyms and Antonyms Data

American English Pronunciations with Audio

Key Features (approximate numbers):

American English Monolingual Dictionary Data

Our American English Monolingual Dictionary Data is the foremost authority on American English, including detailed tagging and labelling covering parts of speech (POS), grammar, region, register, and subject, providing rich linguistic information. Additionally, all grammar and usage information is present to ensure relevance and accuracy.

Headwords: 140,000

Senses: 222,000

Sentence examples: 140,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

American English Synonyms and Antonyms Data

The American English Synonyms and Antonyms Dataset is a leading resource offering comprehensive, up-to-date coverage of word relationships in contemporary American English. It includes rich linguistic details such as precise definitions and part-of-speech (POS) tags, making it an essential asset for developing AI systems and language technologies that require deep semantic understanding.

Synonyms: 600,000

Antonyms: 22,000

Format: XML and JSON format

Delivery: Email (link-based file sharing) and REST API

Updated frequency: annually

American English Pronunciations with Audio (word-level)

This dataset provides IPA transcriptions and clean audio data in contemporary American English. It includes syllabified transcriptions, variant spellings, POS tags, and pronunciation group identifiers. The audio files are supplied separately and linked where available for seamless integration - perfect for teams building TTS systems, ASR models, and pronunciation engines.

Transcriptions (IPA): 250,000

Audio files: 180,000

Format: XLSX (for transcriptions), MP3 and WAV (audio files)

Updated frequency: annually

Use Cases:

We consistently work with our clients on new use cases as language technology continues to evolve. These include NLP applications, TTS, dictionary display tools, games, translation machine, AI training and fine-tuning, word embedding, and word sense disambiguation (WSD).

If you have a specific use case in mind that isn't listed here, we’d be happy to explore it with you. Don’t hesitate to get in touch with us at Growth.OL@oup.com to start the conversation.

Pricing:

Oxford Languages offers flexible pricing based on use case and delivery format. Our datasets are licensed via term-based IP agreements and tiered pricing for API-delivered data. Whether you’re integrating into a product, training an LLM, or building custom NLP solutions, we tailor licensing to your specific needs.

Contact our team or email us at Growth.OL@oup.com to explore pricing options and discover how our language data can support your goals. Please note that some datasets may have rights restrictions. Contact us for more information.

About the sample:

To help you explore the structure and features of our dataset on this platform, we provide a sample in CSV and/or JSON formats for one of the presented datasets, for preview purposes only, as shown on this page. This sample offers a quick and accessible overview of the data's contents and organization.

Our full datasets are available in various formats, depending on the language and type of data you require. These may include XML, JSON, TXT, XLSX, CSV, WAV, MP3, and other file types. Please contact us (Growth.OL@oup.com) if you would like to receive the original sample with full details.

Facebook

Twitter

Click to copy link

Link copied

Cite

Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun; Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun (2025). Tutorial Package for: Text as Data in Economic Analysis [Dataset]. http://doi.org/10.34894/KNDZ9T

Tutorial Package for: Text as Data in Economic Analysis

Explore at:

text/markdown(148), bin(493802528), text/markdown(405), csv(6678744), application/x-ipynb+json(56525), text/markdown(136), csv(8712017), txt(1706), text/x-python(3800), text/markdown(131), txt(194), text/markdown(179), csv(89054804), bin(43909246), csv(1600), xlsx(10436), bin(952), text/markdown(1743)Available download formats

Unique identifier

https://doi.org/10.34894/KNDZ9T

Dataset updated

Jun 26, 2025

Dataset provided by

DataverseNL

Authors

License

Attribution-NoDerivs 4.0 (CC BY-ND 4.0)https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically

Time period covered

Jan 1, 2002 - May 31, 2023

Dataset funded by

Institute for New Economic Thinking
Deutsche Forschungsgemeinschaft (403041268-TRR 266)

Description

This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.

Clear search

Close search

Google apps

Main menu

Tutorial Package for: Text as Data in Economic Analysis

example text data word & CSV format

Dataset

Contents

SIAM 2007 Text Mining Competition dataset

text-classification-dataset-example

Data from: Example text

Dataset

Contents

Text Document Classification Dataset

Text files from Gutenberg database

sample-text-dataset

Dataset

Contents

HTMLmetadata HTML formatted text files describing samples and spectra,...

feedback prize: line-by-line text dataset

Example Files to Accompany the Text Book Data Analysis: an Introduction,...

Text Script Analytics Code for Automatic Video Generation

EP full-text data for text analytics

2025-24679-hw1-text-dataset-mkarthik

Replication Data for: Active Learning Approaches for Labeling Text: Review...

MultiSocial

Disclaimer

Data Source

BL Newspapers sample plain-text data

Ecommerce Text Classification

Data from: ViTexOCR; a script to extract text overlays from digital video

American English Language Datasets | 150+ Years of Research | Textual Data |...

Tutorial Package for: Text as Data in Economic Analysis