Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Card for Dataset Name
This dataset consists of 925 sentences in English paired with a broad topic descriptor for use as example data in product demonstrations or student projects.
Curated by: billingsmoore
Language(s) (NLP): English
License: Apache License 2.0
Direct Use
This data can be loaded using the following Python code.

from datasets import load_dataset
ds = load_dataset('billingsmoore/text-clustering-example-data')
It can then be clustered using the… See the full description on the dataset page: https://huggingface.co/datasets/billingsmoore/text-clustering-example-data.
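The description is truncated above; as an illustration only, the sentences could be clustered by embedding them and running k-means. The 'text' column name, the embedding model, and the cluster count below are assumptions, not facts from the dataset card.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

ds = load_dataset('billingsmoore/text-clustering-example-data')
sentences = ds['train']['text']  # assumed column name

# Embed each sentence, then group the embeddings into broad topics.
model = SentenceTransformer('all-MiniLM-L6-v2')  # assumed model choice
embeddings = model.encode(sentences)
labels = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)  # assumed cluster count
print(labels[:20])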
Subject Area: Text Mining
Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.
How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.
Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself.
Anomalies/Faults: This is a document category classification problem.
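Given the format described above (one document per line, with a tilde separating the document number from the text), a minimal parsing sketch follows; the file name is a placeholder.

# Hedged sketch: parse 'document number ~ text' lines into a dict.
def load_asrs_documents(path):
    docs = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            doc_id, _, text = line.partition('~')  # split on the first tilde
            docs[doc_id.strip()] = text
    return docs

documents = load_asrs_documents('asrs_train.txt')  # placeholder file name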
http://researchdatafinder.qut.edu.au/display/n124876
Sample data of text. QUT Research Data Repository dataset resource available for download.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is just a sample text data.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
cwchang/text-classification-dataset-example dataset hosted on Hugging Face and contributed by the HF Datasets community
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
A dataset with 10 text samples. Each sample is labeled as either AI-generated (1) or human-generated (0). This dataset is suitable for text classification tasks such as detecting AI-generated content.
This file contains text samples that are either generated by AI models or written by humans. Each entry is labeled to indicate whether the content is AI-generated or human-generated. This dataset can be used for various natural language processing tasks such as text classification, content analysis, and AI content detection.
Column 1: text. Description: "The actual content (text data), which may be a short paragraph or sentence. This is the primary feature for analysis." Data Type: String (Text)
Column 2: label. Description: "Binary label indicating whether the content is AI-generated or human-generated. '0' represents human-generated, and '1' represents AI-generated." Data Type: Integer (0 or 1)
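As an illustration of the text classification use suggested above (not part of the dataset card), a minimal TF-IDF baseline over the two columns could look as follows; the CSV file name is a placeholder, and with only 10 samples the split is purely demonstrative.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('ai_human_text.csv')  # placeholder file name

# Stratified split keeps both labels in train and test; demonstrative only at n=10.
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label'], test_size=0.2, random_state=0, stratify=df['label'])

vectorizer = TfidfVectorizer()
clf = LogisticRegression().fit(vectorizer.fit_transform(X_train), y_train)
print(clf.score(vectorizer.transform(X_test), y_test))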
The AI-generated content was created using advanced language models such as GPT-4, which were instructed to write text on various topics. The human-generated content was sourced from publicly available texts, including articles, blogs, and creative writing samples found on the internet. Care has been taken to ensure that all human-generated content is in the public domain or shared with permission, without any identifiable information.
This dataset is static and will not receive regular updates. However, future versions may be released if new data becomes available or if users contribute additional examples to enhance the dataset.
Overview: The QuitNowTXT text messaging program is designed as a resource that can be adapted to specific contexts, including those outside the United States and in languages other than English. Based on evidence-based practices, this program is a smoking cessation intervention for smokers who are ready to quit smoking. Although evidence supports the use of text messaging as a platform to deliver cessation interventions, it is expected that the maximum effect of the program will be demonstrated when it is integrated into other elements of a national tobacco control strategy.

The QuitNowTXT program is designed to deliver tips, motivation, encouragement, and fact-based information via unidirectional and interactive bidirectional message formats. The core of the program consists of messages sent to the user based on a scheduled quit day identified by the user. Messages are sent for up to four weeks pre-quit date and up to six weeks post-quit date. Messages assessing mood, craving, and smoking status are also sent at various intervals, and the user receives messages back based on the response they have submitted. In addition, users can request assistance in dealing with cravings, stress/mood, and responding to slips/relapses by texting specific keywords to the QuitNowTXT service. Rotating automated messages are then returned to the user based on the keyword. Details of the program are provided below.

Texting STOP to the service discontinues further texts being sent. This option is provided every few messages, as required by United States cell phone providers. It is not an option to remove this feature if the program is used within the US.

If a web-based registration is used, it is suggested that users provide demographic information such as age, sex, and smoking frequency (daily or almost every day, most days, only a few days a week, only on weekends, a few times a month or less) in addition to their mobile phone number and quit date. This information will be useful for assessing the reach of the program, as well as identifying a possible need to develop libraries for specific groups. The use of only a mobile phone-based registration system reduces barriers to participant entry into the program but limits the collection of additional data. At bare minimum, the quit date must be collected. At sign-up, participants will have the option to choose a quit date up to one month out. Text messages will start up to 14 days before their specified quit date. Users also have the option of changing their quit date at any time if desired. The program can also be modified to provide texts to users who have already quit within the last month.

One possible adaptation of the program is to include a QuitNowTXT "light" version. This adaptation would allow individuals who do not have unlimited text messaging capabilities but would still like to receive support to participate by controlling the number of messages they receive. In the light program, users can text any of the programmed keywords without fully opting in to the program.

Program Design: The program is designed as a 14-day countdown to quit date, with a subsequent six weeks of daily messages. Each day within the program is identified as either a pre-quit date (Q-# days) or a post-quit date (Q+#). If a user opts into the program fewer than 14 days before their quit date, the system will begin sending messages on that day.
For example, if they opt in four days prior to their quit date, the system will send a welcome message and recognize that they are at Q-4 (four days before their quit date), and they will receive the message that everyone else receives four days before their quit date. As the user progresses through the program, they will receive the messages outlined in the text message library. Throughout the program, users will receive texts that cover a variety of content areas, including tips, informational content, motivational messaging, and keyword responses. The frequency of messages increases in the days leading up to and immediately following the user's quit date, with a heavy emphasis on support, efficacy building, and actionable tips. Further away from a user's quit date, the messages reduce in frequency. If the user says they have started to smoke again, the system will give them the option of continuing the program as planned or starting over and setting a new quit date. The system is also designed to assess the user's mood, craving level, and smoking status several times during the program. These assessment messages are characterized as MOOD, CRAVE, and STATUS messages. Whenever the system asks for a response from the user, it will send a programmed response based on the user's answer (i.e., if the user responds with MOOD = BAD, they will receive a message customized to that response). These programmed response messages rotate throughout the course of the program. Users can also send the system one of three programmed keywords (CRAVE, MOOD, and SLIP), and the system will send unique, automated responses based on the texted keyword. There are 10 messages for each of the programmed keywords, which rotate on a random basis, decreasing the likelihood that the user will get the same response twice in a row. After the full six-week program comes to an end, the system will follow up at one, three, and six months to check on the user's smokefree status and offer additional assistance if needed.

Message Types:
- Tips: Tips provide users with actionable strategies on how to manage cravings and deal with quitting smoking in general.
- Motivation/encouragement: Motivational messages encourage users to keep going on their smokefree journey despite the difficulty and struggle they may be facing.
- Information: Informational messages provide users with facts and other salient points about the impact of smoking relevant to their socio-cultural environment.
- Assessment: The assessment messages are built into the text messaging program and are designed to collect information about the user's experience as they are quitting and provide immediate feedback based on the user's response. Assessment messages fall along three dimensions: mood, craving, and smoking status.
- Reactive Messaging (Key Words): At any point, the user can initiate an interaction with the program that will return a text message relevant to the user's request for help. In response to the user texting one of the key words, the system will send them unique, automated responses. The key words cover topics relevant to various aspects of cessation.
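To make the mechanics concrete, here is a schematic sketch of the Q-day offset and keyword-rotation logic described above. It is not the program's actual implementation; all function names and message texts are hypothetical placeholders.

import random
from datetime import date

def quit_day_offset(quit_date, today=None):
    # Negative values are pre-quit days (Q-4 -> -4); positive are post-quit (Q+10 -> 10).
    today = today or date.today()
    return (today - quit_date).days

# Placeholder texts; the real library has 10 rotating messages per keyword.
KEYWORD_LIBRARY = {
    'CRAVE': [f'Craving tip #{i}' for i in range(1, 11)],
    'MOOD': [f'Mood support message #{i}' for i in range(1, 11)],
    'SLIP': [f'Slip/relapse support message #{i}' for i in range(1, 11)],
}
_last_sent = {}

def keyword_response(keyword):
    # Rotate randomly while never sending the same message twice in a row.
    options = [m for m in KEYWORD_LIBRARY[keyword] if m != _last_sent.get(keyword)]
    message = random.choice(options)
    _last_sent[keyword] = message
    return message

print(quit_day_offset(date(2024, 6, 15), today=date(2024, 6, 11)))  # -4, i.e. Q-4
print(keyword_response('CRAVE'))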
Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
License information was derived automatically
This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A dataset of .csv files each containing article texts from newspapers published on the Shared Research Repository.
Artificial Intelligence Text Generator Market Size 2024-2028
The artificial intelligence (AI) text generator market size is forecast to increase by USD 908.2 million at a CAGR of 21.22% between 2023 and 2028.
The market is experiencing significant growth due to several key trends. One of these trends is the increasing popularity of AI text generators in various sectors, including education for e-learning applications. Another trend is the growing importance of speech-to-text technology, which is becoming increasingly essential for improving productivity and accessibility. However, data privacy and security concerns remain a challenge for the market, as these generators process and store vast amounts of sensitive information. It is crucial for market participants to address these concerns through strong data security measures and transparent data handling practices to ensure customer trust and compliance with regulations. Overall, the AI text generator market is poised for continued growth, as it offers significant benefits in terms of efficiency, accuracy, and accessibility.
What will be the Size of the Artificial Intelligence (AI) Text Generator Market During the Forecast Period?
The market is experiencing significant growth as businesses and organizations seek to automate content creation across various industries. Driven by technological advancements in machine learning (ML) and natural language processing, AI generators are increasingly being adopted for downstream applications in sectors such as education, manufacturing, and e-commerce.
Moreover, these systems enable the creation of personalized content for global audiences in multiple languages, providing a competitive edge for businesses in an interconnected Internet economy. However, responsible AI practices are crucial to mitigate risks associated with biased content, misinformation, misuse, and potential misrepresentation.
How is this Artificial Intelligence (AI) Text Generator Industry segmented and which is the largest segment?
The artificial intelligence (AI) text generator industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.
Component
Solution
Service
Application
Text to text
Speech to text
Image/video to text
Geography
North America
US
Europe
Germany
UK
APAC
China
India
South America
Middle East and Africa
By Component Insights
The solution segment is estimated to witness significant growth during the forecast period.
Artificial Intelligence (AI) text generators have gained significant traction in various industries due to their efficiency and cost-effectiveness in content creation. These solutions utilize machine learning algorithms, such as deep neural networks, to analyze and learn from vast datasets of human-written text. By predicting the most probable word or sequence of words based on patterns and relationships identified in the training data, AI text generators produce personalized content for multiple languages and global audiences. Applications span industries including education, manufacturing, e-commerce, and entertainment & media. In the education industry, AI text generators assist in creating personalized learning materials.
The solution segment was valued at USD 184.50 million in 2018 and showed a gradual increase during the forecast period.
Regional Analysis
North America is estimated to contribute 33% to the growth of the global market during the forecast period.
Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.
The North American market holds the largest share, driven by the region's technological advancements and increasing adoption of AI across industries. AI text generators are increasingly utilized for content creation, customer service, virtual assistants, and chatbots, catering to the growing demand for high-quality, personalized content in sectors such as e-commerce and digital marketing. Moreover, the presence of tech giants like Google, Microsoft, and Amazon in North America, which are investing significantly in AI and machine learning, further fuels market growth. AI text generators employ machine learning algorithms, deep neural networks, and natural language processing to generate content in multiple languages for global audiences.
Market Dynamics
Our researchers analyzed the data with 2023 as the base year, along with the key drivers, trends, and challenges.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).
Training dataset: 64,665 sentence pairs. Validation dataset: 7,185 sentence pairs. Testing dataset: 7,984 sentence pairs.
All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
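A minimal loading sketch under stated assumptions: the split file names and the two column names (abbreviated source sentence, normalized target sentence) are not given above and are placeholders.

import pandas as pd

train = pd.read_csv('train.csv')            # 64,665 sentence pairs (placeholder name)
validation = pd.read_csv('validation.csv')  # 7,185 sentence pairs (placeholder name)
test = pd.read_csv('test.csv')              # 7,984 sentence pairs (placeholder name)

# Assumed columns: one holding the abbreviated sentence, one its normalized form.
print(train.head())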
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We are using the Yelp Review Dataset as the streaming data source for the DataCI example. We have processed the Yelp review dataset into a daily-based dataset by its date. In this dataset, we will only use the data from 2020-09-01 to 2020-11-30 to simulate the streaming data scenario. We are downloading two versions of the training and validation datasets (a date-filtering sketch follows the list):
yelp_review_train@2020-10: from 2020-09-01 to 2020-10-15
yelp_review_val@2020-10: from 2020-10-16 to 2020-10-31
yelp_review_train@2020-11: from 2020-10-01 to 2020-11-15
yelp_review_val@2020-11: from 2020-11-16 to 2020-11-30
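As an illustration of how these windows could be carved from a date-partitioned review table, a hedged sketch follows; the source file name and the 'date' column are assumptions.

import pandas as pd

reviews = pd.read_csv('yelp_reviews.csv', parse_dates=['date'])  # placeholder source

def window(df, start, end):
    # Inclusive date window matching the ranges listed above.
    return df[(df['date'] >= start) & (df['date'] <= end)]

train_2020_10 = window(reviews, '2020-09-01', '2020-10-15')
val_2020_10 = window(reviews, '2020-10-16', '2020-10-31')
train_2020_11 = window(reviews, '2020-10-01', '2020-11-15')
val_2020_11 = window(reviews, '2020-11-16', '2020-11-30')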
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Emotional classification (valence) in textual data has proved to be central to human experience analysis and natural language processing (NLP). This study implements a text mining model and algorithm, TM-EV (Text Mining for Emotional Valence Analysis), that determines the impact of emotional valence (EV) shown by undergraduate students in their feedback (n=665,860) during the program (pre- and post-course) to determine its relationship with the learning outcome and performance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
NLUCat is a dataset of NLU in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. In addition, each instruction is accompanied by the instructions the annotator received when writing it.
The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to take into account the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).
The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can be easily grouped to use them in robust systems.
The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).
This dataset can be used to train models for intent classification, spans identification and examples generation.
This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published on Hugging Face.
In this repository you'll find the following items:
This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.
Intent classification, spans identification and examples generation.
The dataset is in Catalan (ca-ES).
Three JSON files, one for each split.
Example
An example looks as follows:
{
"example": "Demana una ambulància; la meva dona està de part.",
"annotation": {
"intent": "call_emergency",
"slots": [
{
"Tag": "service",
"Text": "ambulància",
"Start_char": 11,
"End_char": 21
},
{
"Tag": "situation",
"Text": "la meva dona està de part",
"Start_char": 23,
"End_char": 48
}
]
}
}
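A minimal reading sketch for a split file, assuming each JSON file holds a list of objects shaped like the example above; the file name is a placeholder.

import json

with open('train.json', encoding='utf-8') as f:  # placeholder split file
    data = json.load(f)

for item in data[:3]:
    intent = item['annotation']['intent']
    slots = [(slot['Tag'], slot['Text']) for slot in item['annotation']['slots']]
    print(item['example'], intent, slots)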
We created this dataset to contribute to the development of language models in Catalan, a low-resource language.
When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.
Initial Data Collection and Normalization
We commissioned a company to create fictitious examples for the creation of this dataset.
Who are the source language producers?
We commissioned the writing of the examples to the company m47 labs.
Annotation process
The dataset was created in three steps, modeled on the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
* First step: translation or elaboration of the instructions given to the annotators to write the examples.
* Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
* Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to real situations.
Who are the annotators?
The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.
No personal or sensitive information is included.
The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.
We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.
When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
Likewise, they were asked to take care to avoid examples that reinforce the stereotypes that exist in this society, for example, by being careful with the gender or origin of personal names that are associated with certain activities.
Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)
This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Instagram data-download example dataset
In this repository you can find a dataset consisting of 11 personal Instagram archives, or Data-Download Packages (DDPs).
How the data was generated
These Instagram accounts were all newly created by a group of researchers who were interested in figuring out in detail the structure, and the variety in structure, of these Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week. The participants also communicated intensively with each other so that the data can be used as an example of a network.
The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses, and URLs. In addition, the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties. However, only content of so-called 'professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.
Obtaining your Instagram DDP
After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.
Instagram then delivered the data in a compressed zip folder with the format username_YYYYMMDD.zip (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.
Data cleaning
To comply with the Instagram user agreement, participants shared their full name, phone number, and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements so that the DDPs shared here do not contain any personal data of the participants.
How this dataset can be used
This dataset was generated with the intention to evaluate the performance of de-identification software. We invite other researchers to use this dataset, for example, to investigate what types of data can be found in Instagram DDPs or to investigate the structure of Instagram DDPs. The packages can also be used for example data analyses, although no substantive research questions can be answered using this data, as the data does not reflect how research subjects behave 'in the wild'.
Authors
The data collection was executed by Laura Boeschoten, Ruben van den Goorbergh, and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.
Acknowledgments
The researchers would like to thank everyone who participated in this data-generation project.
WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using this as a heuristic indicator for whether other users found the link interesting, educational, or just funny.
WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.
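WebText itself is internal to OpenAI and cannot be reproduced here; the toy sketch below only illustrates the two filters described above (the 3-karma threshold and the removal of Wikipedia links) on hypothetical records.

# Hypothetical submission records; only the filtering logic is illustrative.
submissions = [
    {'url': 'https://example.com/article', 'karma': 7},
    {'url': 'https://example.com/meme', 'karma': 2},
    {'url': 'https://en.wikipedia.org/wiki/Example', 'karma': 12},
]

links = {
    s['url'] for s in submissions
    if s['karma'] >= 3 and 'wikipedia.org' not in s['url']
}
print(links)  # only the first URL survives both filters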
The ViTexOCR script presents a new method for extracting navigation data from videos with text overlays using optical character recognition (OCR) software. Over the past few decades, it was common for videos recorded during surveys to be overlaid with real-time geographic positioning satellite chyrons including latitude, longitude, date and time, as well as other ancillary data (such as speed, heading, or user input identifying fields). Embedding these data into videos provides them with utility and accuracy, but using the location data for other purposes, such as analysis in a geographic information system, is not possible when only available on the video display. Extracting the text data from imagery using software allows these videos to be located and analyzed in a geospatial context. The script allows a user to select a video, specify the text data types (e.g. latitude, longitude, date, time, or other), text color, and the pixel locations of overlay text data on a sample video frame. The script’s output is a data file containing the retrieved geospatial and temporal data. All functionality is bundled in a Python script that incorporates a graphical user interface and several other software dependencies.
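ViTexOCR's own code is not shown here; the following hedged sketch illustrates the core idea with OpenCV and pytesseract: grab a frame, crop the user-specified overlay region, and OCR it. The video path and pixel coordinates are placeholders.

import cv2
import pytesseract

cap = cv2.VideoCapture('survey_video.mp4')  # placeholder path
ok, frame = cap.read()
if ok:
    x, y, w, h = 10, 10, 300, 30  # placeholder pixel location of one overlay field
    overlay = frame[y:y + h, x:x + w]
    gray = cv2.cvtColor(overlay, cv2.COLOR_BGR2GRAY)  # grayscale often helps OCR
    print(pytesseract.image_to_string(gray).strip())
cap.release()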
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which the data to be labeled by human coders are not chosen at random but rather targeted in such a way that the amount of data required to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions under which active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need under random sampling (or 'passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures and find that even with low reliability, active learning performs more efficiently than random sampling.
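The authors' code is not included here; the schematic sketch below illustrates the uncertainty-sampling idea on synthetic, imbalanced data: fit on a small labeled set, then repeatedly ask a human coder to label the pool documents the model is least sure about.

import numpy as np
from sklearn.linear_model import LogisticRegression

def query_most_uncertain(model, X_pool, k=10):
    # Pick the k documents whose positive-class probability is closest to 0.5.
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:k]

# Synthetic, imbalanced pool standing in for document feature vectors.
rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 20))
y_pool = (X_pool[:, 0] > 1.5).astype(int)  # rare positive class

# Seed set chosen to contain both classes.
labeled = list(np.where(y_pool == 1)[0][:5]) + list(np.where(y_pool == 0)[0][:15])

for _ in range(5):  # five rounds of active labeling
    model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    for i in query_most_uncertain(model, X_pool):
        if i not in labeled:
            labeled.append(int(i))  # the 'human coder' supplies y_pool[i]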
https://creativecommons.org/publicdomain/zero/1.0/
By SetFit (from Hugging Face)
The SetFit/mnli dataset is a comprehensive collection of textual entailment data designed to facilitate the development and evaluation of models for natural language understanding tasks. This dataset includes three distinct files: validation.csv, train.csv, and test.csv, each containing valuable information for training and evaluating textual entailment models.
In these files, users will find various columns providing important details about the text pairs. The text1 and text2 columns indicate the first and second texts in each pair respectively, allowing researchers to analyze the relationships between these texts. Additionally, the label column provides a categorical value indicating the specific relationship between text1 and text2.
To further aid in understanding the relationships expressed by these labels, there is an accompanying label_text column that offers a human-readable representation of each categorical label. This allows practitioners to interpret and analyze the labeled data more easily.
Moreover, all three files in this dataset contain an additional index column called idx, which assists in organizing and referencing specific samples within the dataset during analysis or model development.
It's worth noting that this SetFit/mnli dataset has been carefully prepared for textual entailment tasks specifically. To ensure accurate evaluation of model performance on such tasks, researchers can leverage validation.csv as a dedicated set of samples specifically reserved for validating their models' performance during training. The train.csv file contains ample training data with corresponding labels that can be utilized to effectively train reliable textual entailment models. Lastly, test.csv includes test samples designed for evaluating model performance on textual entailment tasks.
By utilizing this extensive collection of high-quality data provided by the SetFit/mnli dataset, researchers can develop powerful models capable of accurately understanding natural language relationships expressed within text pairs across various domains.
- text1: This column contains the first text in a pair.
- text2: This column contains the second text in a pair.
- label: The label column indicates the relationship between text1 and text2 using categorical values.
- label_text: The label_text column provides the text representation of the labels.
To effectively use this dataset for your textual entailment task, follow these steps:
1. Understanding the Columns
Start by familiarizing yourself with the different columns present in each file of this dataset:
- text1: The first text in a pair that needs to be evaluated for textual entailment.
- text2: The second text in a pair that needs to be compared with text1 to determine its logical relationship.
- label: This categorical field represents predefined relationships or categories between texts based on their meaning or logical inference.
- label_text: A human-readable representation of each label category that helps understand their real-world implications.
2. Data Exploration
Before building models or applying any algorithms, it's essential to explore and understand your data thoroughly:
- Analyze sample data points from each file (validation.csv, train.csv).
- Identify any class imbalances within different labels present in your data distribution.
3. Preprocessing Steps
- Handle missing values: Check if there are any missing values (NaNs) within any columns and decide how to handle them.
- Text cleaning: Depending on the nature of your task, implement appropriate text cleaning techniques like removing stop words, lowercasing, punctuation removal, etc.
- Tokenization: Break down the text into individual tokens or words to facilitate further processing steps.
4. Model Training and Evaluation
Once your dataset is ready for modeling:
- Split your data into training and testing sets using the train.csv and test.csv files. This division allows you to train models on a subset of data while evaluating their performance on an unseen portion.
- Utilize machine learning or deep learning algorithms suitable for textual entailment tasks (e.g., BERT).
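As a starting point for the workflow above, here is a hedged baseline sketch (an illustration, not part of the dataset card): it vectorizes the concatenated text pair with TF-IDF and fits a logistic-regression classifier.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Concatenate the pair with a separator token so the vectorizer sees both texts.
vectorizer = TfidfVectorizer(max_features=50000)
X_train = vectorizer.fit_transform(train['text1'] + ' [SEP] ' + train['text2'])
X_test = vectorizer.transform(test['text1'] + ' [SEP] ' + test['text2'])

clf = LogisticRegression(max_iter=1000).fit(X_train, train['label'])
print('test accuracy:', clf.score(X_test, test['label']))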
- Natural Language Understanding: The dataset can be used for training and evaluating models that perform natural language understanding tasks, such as text classification, ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is the repository for the ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.
It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.
An example
An example test sentence:
Test Sentence:
{"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by
American songwriters Gerry Goffin and Carole King."}
An example of ontology:
Ontology: Music Ontology
Expected Output:
{
"id": "ont_k_music_test_n",
"sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
"triples": [
{
"sub": "The Loco-Motion",
"rel": "publication date",
"obj": "01 January 1962"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Gerry Goffin"
},{
"sub": "The Loco-Motion",
"rel": "lyrics by",
"obj": "Carole King"
}
]
}
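The official evaluation scripts live in the repository's evaluation folder; as a hedged illustration of triple-level scoring in the same spirit (not the benchmark's own code), one could compute F1 over exact (sub, rel, obj) matches.

def triple_f1(predicted, expected):
    # Score exact (sub, rel, obj) matches between predicted and gold triples.
    pred = {(t['sub'], t['rel'], t['obj']) for t in predicted}
    gold = {(t['sub'], t['rel'], t['obj']) for t in expected}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

gold = [{'sub': 'The Loco-Motion', 'rel': 'lyrics by', 'obj': 'Gerry Goffin'}]
print(triple_f1(gold, gold))  # 1.0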
The data is released under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) License.
The structure of the repo is as follows.
benchmark: the code used to generate the benchmark
evaluation: evaluation scripts for calculating the results

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.
[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.
[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages