100+ datasets found
  1. text-clustering-example-data

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Jacob Moore (2024). text-clustering-example-data [Dataset]. https://huggingface.co/datasets/billingsmoore/text-clustering-example-data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2024
    Authors
    Jacob Moore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset consists of 925 sentences in English paired with a broad topic descriptor for use as example data in product demonstrations or student projects.

    Curated by: billingsmoore
    Language(s) (NLP): English
    License: Apache License 2.0

      Direct Use

    This data can be loaded using the following Python code:

    from datasets import load_dataset

    ds = load_dataset('billingsmoore/text-clustering-example-data')

    It can then be clustered using the… See the full description on the dataset page: https://huggingface.co/datasets/billingsmoore/text-clustering-example-data.
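    As a rough illustration of one clustering route (an assumption, not the elided method from the dataset page), the loaded sentences can be embedded and grouped with KMeans; the 'train' split and 'text' column names are assumptions to verify against the dataset card:

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Assumed split and column names; verify against the dataset card.
    ds = load_dataset('billingsmoore/text-clustering-example-data')
    texts = ds['train']['text']

    # Embed each sentence, then cluster the embeddings into broad topics.
    embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(texts)
    labels = KMeans(n_clusters=10, n_init='auto').fit_predict(embeddings)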

  2. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.nasa.gov
    • +2more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining

    Description: This is the dataset used for the SIAM 2007 Text Mining competition, which focused on developing text mining algorithms for document classification. The documents are aviation safety reports documenting one or more problems that occurred during certain flights, and the goal was to label each document with respect to the types of problems described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.

    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.

    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format, and all documents for each set are contained in a single file. Each row in this file corresponds to a single document: the first characters on each line are the document number, and a tilde separates the document number from the text itself.

    Anomalies/Faults: This is a document category classification problem.
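    Given that layout, a minimal parsing sketch (the filename is a placeholder, not the dataset's actual file name):

    # One document per line: "<document number>~<text>".
    docs = {}
    with open('siam2007_documents.txt', encoding='utf-8') as f:
        for line in f:
            doc_id, _, text = line.rstrip('\n').partition('~')
            docs[doc_id.strip()] = text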

  3. Sample Data

    • data.researchdatafinder.qut.edu.au
    Updated May 31, 2023
    Cite
    Sample Data [Dataset]. https://data.researchdatafinder.qut.edu.au/dataset/discriminate-short-text2/resource/ffcc10ed-7592-4474-8e62-2de30002c845
    Explore at:
    Dataset updated
    May 31, 2023
    License

    http://researchdatafinder.qut.edu.au/display/n124876

    Description

    Sample text data. QUT Research Data Repository dataset resource, available for download.

  4. Sample text data

    • zenodo.org
    txt
    Updated May 18, 2023
    Cite
    Paul Kilgarriff (2023). Sample text data [Dataset]. http://doi.org/10.5281/zenodo.7944136
    Explore at:
    Available download formats: txt
    Dataset updated
    May 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paul Kilgarriff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a sample text dataset.

  5. text-classification-dataset-example

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Chien-Wei Chang (2024). text-classification-dataset-example [Dataset]. https://huggingface.co/datasets/cwchang/text-classification-dataset-example
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2024
    Authors
    Chien-Wei Chang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    cwchang/text-classification-dataset-example dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. AI vs Human Generated Contents

    • kaggle.com
    Updated Oct 6, 2024
    Cite
    Asfaq Ahmed 456 (2024). AI vs Human Generated Contents [Dataset]. https://www.kaggle.com/datasets/asfaqahmed456/ai-vs-human-generated-contents
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle
    Authors
    Asfaq Ahmed 456
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A dataset with 10 text samples. Each sample is labeled as either AI-generated (1) or human-generated (0). This dataset is suitable for text classification tasks such as detecting AI-generated content.

    This file contains text samples that are either generated by AI models or written by humans. Each entry is labeled to indicate whether the content is AI-generated or human-generated. This dataset can be used for various natural language processing tasks such as text classification, content analysis, and AI content detection.

    Column 1: text
    Description: "The actual content (text data), which may be a short paragraph or sentence. This is the primary feature for analysis."
    Data Type: String (Text)

    Column 2: label
    Description: "Binary label indicating whether the content is AI-generated or human-generated. '0' represents human-generated, and '1' represents AI-generated."
    Data Type: Integer (0 or 1)
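    A minimal baseline sketch using the two columns above (the CSV filename is hypothetical, and with only 10 samples the fit is purely illustrative):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical filename; 'text' and 'label' columns as documented above.
    df = pd.read_csv('ai_vs_human_generated.csv')
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(df['text'], df['label'])  # 0 = human-generated, 1 = AI-generated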

    The AI-generated content was created using advanced language models such as GPT-4, which were instructed to write text on various topics. The human-generated content was sourced from publicly available texts, including articles, blogs, and creative writing samples found on the internet. Care has been taken to ensure that all human-generated content is in the public domain or shared with permission, without any identifiable information.

    This dataset is static and will not receive regular updates. However, future versions may be released if new data becomes available or if users contribute additional examples to enhance the dataset.

  7. QuitNowTXT Text Messaging Library

    • healthdata.gov
    • data.virginia.gov
    • +2more
    application/rdfxml +5
    Updated Feb 13, 2021
    Cite
    (2021). QuitNowTXT Text Messaging Library [Dataset]. https://healthdata.gov/w/ks37-e557/default?cur=4E5z_TlScUJ
    Explore at:
    Available download formats: tsv, csv, application/rssxml, xml, json, application/rdfxml
    Dataset updated
    Feb 13, 2021
    Description

    Overview: The QuitNowTXT text messaging program is designed as a resource that can be adapted to specific contexts, including those outside the United States and in languages other than English. Based on evidence-based practices, this program is a smoking cessation intervention for smokers who are ready to quit smoking. Although evidence supports the use of text messaging as a platform to deliver cessation interventions, the maximum effect of the program is expected when it is integrated into other elements of a national tobacco control strategy.

    The QuitNowTXT program delivers tips, motivation, encouragement, and fact-based information via unidirectional and interactive bidirectional message formats. The core of the program consists of messages sent to the user based on a scheduled quit day identified by the user. Messages are sent for up to four weeks before the quit date and up to six weeks after it. Messages assessing mood, craving, and smoking status are also sent at various intervals, and the user receives messages back based on the response they submit. In addition, users can request assistance in dealing with craving, stress/mood, and responding to slips/relapses by texting specific key words to the service. Rotating automated messages are then returned to the user based on the keyword. Details of the program are provided below.

    Texting STOP to the service discontinues further texts. This option is offered every few messages, as required by United States cell phone providers, and cannot be removed if the program is used within the US.

    If web-based registration is used, it is suggested that users provide demographic information such as age, sex, and smoking frequency (daily or almost every day, most days, only a few days a week, only on weekends, a few times a month or less) in addition to their mobile phone number and quit date. This information is useful for assessing the reach of the program and for identifying a possible need to develop libraries for specific groups. Using only a mobile phone-based registration system reduces barriers to entry but limits the collection of additional data. At bare minimum, the quit date must be collected. At sign-up, participants can choose a quit date up to one month out. Text messages start up to 14 days before the specified quit date, and users can change their quit date at any time. The program can also be modified to provide texts to users who have already quit within the last month.

    One possible adaptation is a QuitNowTXT "light" version, which would let individuals who do not have unlimited text messaging capabilities, but would still like to receive support, control the number of messages they receive. In the light program, users can text any of the programmed keywords without fully opting in to the program.

    Program Design: The program is designed as a 14-day countdown to the quit date, followed by six weeks of daily messages. Each day within the program is identified as either pre-quit date (Q-#) or post-quit date (Q+#). If a user opts in fewer than 14 days before their quit date, the system begins sending messages on that day. For example, if they opt in four days prior to their quit date, the system sends a welcome message, recognizes that they are at Q-4 (four days before their quit date), and sends the message that everyone else receives at that point. As users progress through the program, they receive the messages outlined in the text message library, covering tips, informational content, motivational messaging, and keyword responses. Message frequency increases in the days leading up to and following the quit date, with a heavy emphasis on support, efficacy building, and actionable tips, and decreases further away from the quit date. If the user reports having started to smoke again, the system offers the option of continuing the program as planned or starting over with a new quit date.

    The system also assesses the user's mood, craving level, and smoking status several times during the program through MOOD, CRAVE, and STATUS messages. Whenever the system asks for a response, it sends a programmed reply based on the user's answer (e.g., if the user responds with MOOD = BAD, they receive a message customized to that response). These programmed response messages rotate throughout the course of the program. Users can also send the system one of three programmed keywords (CRAVE, MOOD, and SLIP), and the system sends unique, automated responses based on the texted keyword. There are 10 messages for each programmed keyword, rotating on a random basis to decrease the likelihood of getting the same response twice in a row. After the full six-week program ends, the system follows up at one, three, and six months to check on the user's smokefree status and offer additional assistance if needed.

    Message Types:
    • Tips: actionable strategies for managing cravings and dealing with quitting smoking in general.
    • Motivation/encouragement: messages that encourage users to keep going on their smokefree journey despite the difficulty and struggle they may be facing.
    • Information: facts and other salient points about the impact of smoking relevant to the user's socio-cultural environment.
    • Assessment: messages built into the program that collect information about the user's experience while quitting and provide immediate feedback based on the response. Assessment messages fall along three dimensions: mood, craving, and smoking status.
    • Reactive Messaging (Key Words): at any point, the user can initiate an interaction that returns a text message relevant to their request for help. In response to one of the key words, the system sends unique, automated responses. The key words cover topics relevant to various aspects of cessation.

  8. Tutorial Package for: Text as Data in Economic Analysis

    • dataverse.nl
    Updated Jun 26, 2025
    Cite
    Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun (2025). Tutorial Package for: Text as Data in Economic Analysis [Dataset]. http://doi.org/10.34894/KNDZ9T
    Explore at:
    Available download formats: text/markdown(148), bin(493802528), text/markdown(405), csv(6678744), application/x-ipynb+json(56525), text/markdown(136), csv(8712017), txt(1706), text/x-python(3800), text/markdown(131), txt(194), text/markdown(179), csv(89054804), bin(43909246), csv(1600), xlsx(10436), bin(952), text/markdown(1743)
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    DataverseNL
    Authors
    Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2002 - May 31, 2023
    Dataset funded by
    Institute for New Economic Thinking
    Deutsche Forschungsgemeinschaft (403041268-TRR 266)
    Description

    This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.

  9. BL Newspapers sample plain-text data

    • zenodo.org
    zip
    Updated Aug 19, 2023
    Cite
    Yann Ryan (2023). BL Newspapers sample plain-text data [Dataset]. http://doi.org/10.5281/zenodo.8262356
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yann Ryan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of .csv files each containing article texts from newspapers published on the Shared Research Repository.

  10. Artificial Intelligence (AI) Text Generator Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, China, India, Germany - Size and Forecast 2024-2028

    • technavio.com
    Updated Jul 15, 2024
    Cite
    Technavio (2024). Artificial Intelligence (AI) Text Generator Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, China, India, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/ai-text-generator-market-analysis
    Explore at:
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United States, Global
    Description


    Artificial Intelligence Text Generator Market Size 2024-2028

    The artificial intelligence (AI) text generator market size is forecast to increase by USD 908.2 million at a CAGR of 21.22% between 2023 and 2028.

    The market is experiencing significant growth due to several key trends. One of these trends is the increasing popularity of AI generators in various sectors, including education for e-learning applications. Another trend is the growing importance of speech-to-text technology, which is becoming increasingly essential for improving productivity and accessibility. However, data privacy and security concerns remain a challenge for the market, as generators process and store vast amounts of sensitive information. It is crucial for market participants to address these concerns through strong data security measures and transparent data handling practices to ensure customer trust and compliance with regulations. Overall, the AI generator market is poised for continued growth as it offers significant benefits in terms of efficiency, accuracy, and accessibility.
    

    What will be the Size of the Artificial Intelligence (AI) Text Generator Market During the Forecast Period?


    The market is experiencing significant growth as businesses and organizations seek to automate content creation across various industries. Driven by technological advancements in machine learning (ML) and natural language processing, AI generators are increasingly being adopted for downstream applications in sectors such as education, manufacturing, and e-commerce. 
    Moreover, these systems enable the creation of personalized content for global audiences in multiple languages, providing a competitive edge for businesses in an interconnected Internet economy. However, responsible AI practices are crucial to mitigate risks associated with biased content, misinformation, misuse, and potential misrepresentation.
    

    How is this Artificial Intelligence (AI) Text Generator Industry segmented and which is the largest segment?

    The artificial intelligence (AI) text generator industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Component
    • Solution
    • Service

    Application
    • Text to text
    • Speech to text
    • Image/video to text

    Geography
    • North America (US)
    • Europe (Germany, UK)
    • APAC (China, India)
    • South America
    • Middle East and Africa

    By Component Insights

    The solution segment is estimated to witness significant growth during the forecast period.
    

    Artificial Intelligence (AI) text generators have gained significant traction in various industries due to their efficiency and cost-effectiveness in content creation. These solutions utilize machine learning algorithms, such as deep neural networks, to analyze and learn from vast datasets of human-written text. By predicting the most probable word or sequence of words based on patterns and relationships identified in the training data, AI text generators produce personalized content for multiple languages and global audiences. Applications span industries including education, manufacturing, e-commerce, and entertainment & media. In the education industry, AI generators assist in creating personalized learning materials.


    The solution segment was valued at USD 184.50 million in 2018 and showed a gradual increase during the forecast period.

    Regional Analysis

    North America is estimated to contribute 33% to the growth of the global market during the forecast period.
    

    Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.


    The North American market holds the largest share in the market, driven by the region's technological advancements and increasing adoption of AI in various industries. AI text generators are increasingly utilized for content creation, customer service, virtual assistants, and chatbots, catering to the growing demand for high-quality, personalized content in sectors such as e-commerce and digital marketing. Moreover, the presence of tech giants like Google, Microsoft, and Amazon in North America, who are investing significantly in AI and machine learning, further fuels market growth. AI generators employ Machine Learning algorithms, Deep Neural Networks, and Natural Language Processing to generate content in multiple languages for global audiences.

    Market Dynamics

    Our researchers analyzed the data with 2023 as the base year, along with the key drivers, trends, and challenges.

  11. Data from: LVMED: Dataset of Latvian text normalisation samples for the medical domain

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Explore at:
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs
    Validation dataset: 7,185 sentence pairs
    Testing dataset: 7,984 sentence pairs

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
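    A small loading sketch for such sentence pairs (the filename and column names are assumptions; check the actual CSV header):

    import pandas as pd

    # Hypothetical filename and column names.
    train = pd.read_csv('lvmed_train.csv')
    # Each row pairs a sentence containing abbreviations with its fully written-out form.
    pairs = list(zip(train['source'], train['target']))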

  12. DataCI Continuous Text Classification Example Using Yelp Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 28, 2023
    Cite
    Yelp Inc. (2023). DataCI Continuous Text Classification Example Using Yelp Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8288432
    Explore at:
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Yelp (http://yelp.com/)
    Li, Yuanming
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We use the Yelp Review Dataset as the streaming data source for the DataCI example. The Yelp review dataset has been processed into a daily-based dataset by review date. Only the data from 2020-09-01 to 2020-11-30 is used, to simulate a streaming data scenario. Two versions of the training and validation datasets are downloaded:

    yelp_review_train@2020-10: from 2020-09-01 to 2020-10-15

    yelp_review_val@2020-10: from 2020-10-16 to 2020-10-31

    yelp_review_train@2020-11: from 2020-10-01 to 2020-11-15

    yelp_review_val@2020-11: from 2020-11-16 to 2020-11-30
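    A sketch of that date-based slicing (the raw filename and 'date' column follow the public Yelp dump, but treat them as assumptions):

    import pandas as pd

    # Assumed raw input: the public Yelp review dump with a 'date' column.
    df = pd.read_json('yelp_academic_dataset_review.json', lines=True)
    df['date'] = pd.to_datetime(df['date'])

    train_oct = df[df['date'].between('2020-09-01', '2020-10-15')]  # yelp_review_train@2020-10
    val_oct = df[df['date'].between('2020-10-16', '2020-10-31')]    # yelp_review_val@2020-10
    train_nov = df[df['date'].between('2020-10-01', '2020-11-15')]  # yelp_review_train@2020-11
    val_nov = df[df['date'].between('2020-11-16', '2020-11-30')]    # yelp_review_val@2020-11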

  13. Text Mining Data - SET

    • ieee-dataport.org
    Updated Mar 18, 2025
    Cite
    Kingsley Okoye (2025). Text Mining Data - SET [Dataset]. https://ieee-dataport.org/documents/text-mining-data-set
    Explore at:
    Dataset updated
    Mar 18, 2025
    Authors
    Kingsley Okoye
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Emotional classification (valence) in textual data has proved central to human experience analysis and natural language processing (NLP). This study implements a text mining model and algorithm, TM-EV (Text Mining for Emotional Valence Analysis), that determines the emotional valence (EV) shown by undergraduate students in their feedback (n=665,860) during the program (pre- and post-course) and its relationship with learning outcomes and performance.

  14. NLUCat

    • zenodo.org
    • huggingface.co
    • +1more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset for natural language understanding (NLU) in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions the annotator received before writing it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to take into account the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can easily be grouped for use in robust systems.

    The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the complete NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports given as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. The example text
    • annotation: `dict`. Annotation of the example
      • intent: `str`. Intent tag
      • slots: `list`. List of slots, each with:
        • Tag: `str`. Tag of the slot
        • Text: `str`. Text of the slot
        • Start_char: `int`. First character of the span
        • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }
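    A small sketch for loading the file and tallying intents (assuming the top level of NLUCat_dataset.json is a list of such example records):

    import json
    from collections import Counter

    # Assumption: the file is a JSON list of examples shaped like the record above.
    with open('NLUCat_dataset.json', encoding='utf-8') as f:
        examples = json.load(f)

    intent_counts = Counter(ex['annotation']['intent'] for ex in examples)
    print(intent_counts.most_common(5))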


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset was done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    * Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information is included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  15. A set of generated Instagram Data Download Packages (DDPs) to investigate their structure and content

    • data.niaid.nih.gov
    Updated Jan 28, 2021
    Cite
    Laura Boeschoten (2021). A set of generated Instagram Data Download Packages (DDPs) to investigate their structure and content [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4472605
    Explore at:
    Dataset updated
    Jan 28, 2021
    Dataset provided by
    Ruben van den Goorbergh
    Laura Boeschoten
    Daniel Oberski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Instagram data-download example dataset

    In this repository you can find a dataset consisting of 11 personal Instagram archives, or Data Download Packages (DDPs).

    How the data was generated

    These Instagram accounts were all newly created by a group of researchers who wanted to examine in detail the structure, and variety in structure, of Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week and communicated intensively with each other, so the data can also serve as an example of a network.

    The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses, and URLs, and the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties; however, only content of so-called 'professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.

    Obtaining your Instagram DDP

    After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.

    1. Go to www.instagram.com and log in
    2. Click on your profile picture, go to Settings and Privacy and Security
    3. Scroll to Data download and click Request download
    4. Enter your email address and click Next
    5. Enter your password and click Request download

    Instagram then delivered the data in a compressed zip folder with the format username_YYYYMMDD.zip (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.

    Data cleaning

    To comply with the Instagram user agreement, participants shared their full name, phone number, and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements so that the DDPs shared here do not contain any personal data of the participants.

    How this dataset can be used

    This dataset was generated with the intention of evaluating the performance of de-identification software. We invite other researchers to use this dataset, for example to investigate what type of data can be found in Instagram DDPs or to study their structure. The packages can also be used for example data analyses, although no substantive research questions can be answered with this data, as it does not reflect how research subjects behave 'in the wild'.

    Authors

    The data collection is executed by Laura Boeschoten, Ruben van den Goorbergh and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.

    Acknowledgments

    The researchers would like to thank everyone who participated in this data-generation project.

  16. Data from: WebText Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    Cite
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2022). WebText Dataset [Dataset]. https://paperswithcode.com/dataset/webtext
    Explore at:
    Dataset updated
    Jul 10, 2022
    Authors
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever
    Description

    WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using this as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

    WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.

  17. Data from: ViTexOCR; a script to extract text overlays from digital video

    • catalog.data.gov
    • data.usgs.gov
    • +5more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). ViTexOCR; a script to extract text overlays from digital video [Dataset]. https://catalog.data.gov/dataset/vitexocr-a-script-to-extract-text-overlays-from-digital-video
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    The ViTexOCR script presents a new method for extracting navigation data from videos with text overlays using optical character recognition (OCR) software. Over the past few decades, it was common for videos recorded during surveys to be overlaid with real-time geographic positioning satellite chyrons including latitude, longitude, date, and time, as well as other ancillary data (such as speed, heading, or user-input identifying fields). Embedding these data into videos gives them utility and accuracy, but using the location data for other purposes, such as analysis in a geographic information system, is not possible while the data are only available on the video display. Extracting the text data from imagery using software allows these videos to be located and analyzed in a geospatial context. The script allows a user to select a video, then specify the text data types (e.g., latitude, longitude, date, time, or other), text color, and the pixel locations of overlay text data on a sample video frame. The script's output is a data file containing the retrieved geospatial and temporal data. All functionality is bundled in a Python script that incorporates a graphical user interface and several other software dependencies.

  18. Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches

    • dataverse.harvard.edu
    Updated Dec 11, 2019
    Cite
    Blake Miller; Fridolin Linder; Walter Mebane (2019). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Blake Miller; Fridolin Linder; Walter Mebane
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the amount of data required to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label only a fraction of the documents one would need using random sampling (or 'passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures, and find that even with low reliability, active learning performs more efficiently than random sampling.

  19. Textual Entailment Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Textual Entailment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/textual-entailment-dataset/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Textual Entailment Dataset

    Textual Entailment Dataset with Labelled Text Pairs

    By SetFit (From Huggingface) [source]

    About this dataset

    The SetFit/mnli dataset is a comprehensive collection of textual entailment data designed to facilitate the development and evaluation of models for natural language understanding tasks. This dataset includes three distinct files: validation.csv, train.csv, and test.csv, each containing valuable information for training and evaluating textual entailment models.

    In these files, users will find various columns providing important details about the text pairs. The text1 and text2 columns indicate the first and second texts in each pair respectively, allowing researchers to analyze the relationships between these texts. Additionally, the label column provides a categorical value indicating the specific relationship between text1 and text2.

    To further aid in understanding the relationships expressed by these labels, there is an accompanying label_text column that offers a human-readable representation of each categorical label. This allows practitioners to interpret and analyze the labeled data more easily.

    Moreover, all three files in this dataset contain an additional index column called idx, which assists in organizing and referencing specific samples within the dataset during analysis or model development.

    It's worth noting that this SetFit/mnli dataset has been carefully prepared for textual entailment tasks specifically. To ensure accurate evaluation of model performance on such tasks, researchers can leverage validation.csv as a dedicated set of samples specifically reserved for validating their models' performance during training. The train.csv file contains ample training data with corresponding labels that can be utilized to effectively train reliable textual entailment models. Lastly, test.csv includes test samples designed for evaluating model performance on textual entailment tasks.

    By utilizing this extensive collection of high-quality data provided by the SetFit/mnli dataset, researchers can develop powerful models capable of accurately understanding natural language relationships expressed within text pairs across various domains.

    How to use the dataset

    • text1: This column contains the first text in a pair.
    • text2: This column contains the second text in a pair.
    • label: The label column indicates the relationship between text1 and text2 using categorical values.
    • label_text: The label_text column provides the text representation of the labels.

    To effectively use this dataset for your textual entailment task, follow these steps:

    1. Understanding the Columns

    Start by familiarizing yourself with the different columns present in each file of this dataset:

    • text1: The first text in a pair that needs to be evaluated for textual entailment.
    • text2: The second text in a pair that needs to be compared with text1 to determine its logical relationship.
    • label: This categorical field represents predefined relationships or categories between texts based on their meaning or logical inference.
    • label_text: A human-readable representation of each label category that helps understand their real-world implications.

    2. Data Exploration

    Before building models or applying any algorithms, it's essential to explore and understand your data thoroughly:

    • Analyze sample data points from each file (validation.csv, train.csv).
    • Identify any class imbalances within different labels present in your data distribution.

    3. Preprocessing Steps

    • Handle missing values: Check if there are any missing values (NaNs) within any columns and decide how to handle them.
    • Text cleaning: Depending on the nature of your task, implement appropriate text cleaning techniques like removing stop words, lowercasing, punctuation removal, etc.
    • Tokenization: Break down the text into individual tokens or words to facilitate further processing steps.

    4. Model Training and Evaluation

    Once your dataset is ready for modeling:

    • Split your data into training and testing sets using the train.csv and test.csv files. This division allows you to train models on a subset of data while evaluating their performance on an unseen portion.
    • Utilize machine learning or deep learning algorithms suitable for textual entailment tasks (e.g., BERT
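    A minimal baseline sketch following the steps above (a TF-IDF model over concatenated pairs rather than a BERT-style encoder; file and column names as documented in this card):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    # Crude pairing: join text1 and text2 into a single string per example.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train['text1'] + ' ||| ' + train['text2'], train['label'])
    print(clf.score(test['text1'] + ' ||| ' + test['text2'], test['label']))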

    Research Ideas

    • Natural Language Understanding: The dataset can be used for training and evaluating models that perform natural language understanding tasks, such as text classification, ...

  20. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of an ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n",
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
     "triples": [
      {
       "sub": "The Loco-Motion",
       "rel": "publication date",
       "obj": "01 January 1962"
      },
      {
       "sub": "The Loco-Motion",
       "rel": "lyrics by",
       "obj": "Gerry Goffin"
      },
      {
       "sub": "The Loco-Motion",
       "rel": "lyrics by",
       "obj": "Carole King"
      }
     ]
    }
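    A small scoring sketch for outputs in this format (plain exact-match F1 over (sub, rel, obj) triples; the benchmark's official evaluation may differ):

    def triple_f1(predicted, gold):
        # Compare sets of (sub, rel, obj) tuples using exact string match.
        p = {(t['sub'], t['rel'], t['obj']) for t in predicted}
        g = {(t['sub'], t['rel'], t['obj']) for t in gold}
        if not p or not g:
            return 0.0
        precision = len(p & g) / len(p)
        recall = len(p & g) / len(g)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)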
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repo is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
