100+ datasets found
  1. text-clustering-example-data

    • huggingface.co
    Updated Nov 20, 2024
    Cite
    Jacob Moore (2024). text-clustering-example-data [Dataset]. https://huggingface.co/datasets/billingsmoore/text-clustering-example-data
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 20, 2024
    Authors
    Jacob Moore
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Card for Dataset Name

    This dataset consists of 925 sentences in English paired with a broad topic descriptor for use as example data in product demonstrations or student projects.

    Curated by: billingsmoore
    Language(s) (NLP): English
    License: Apache License 2.0

      Direct Use

    This data can be loaded using the following Python code:

    from datasets import load_dataset

    ds = load_dataset('billingsmoore/text-clustering-example-data')

    It can then be clustered using the… See the full description on the dataset page: https://huggingface.co/datasets/billingsmoore/text-clustering-example-data.
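    As a rough illustration of one clustering route (an assumption, not the elided method from the dataset page), the loaded sentences can be embedded and grouped with KMeans; the 'train' split and 'text' column names are assumptions to verify against the dataset card:

    from datasets import load_dataset
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    # Assumed split and column names; verify against the dataset card.
    ds = load_dataset('billingsmoore/text-clustering-example-data')
    texts = ds['train']['text']

    # Embed each sentence, then cluster the embeddings into broad topics.
    embeddings = SentenceTransformer('all-MiniLM-L6-v2').encode(texts)
    labels = KMeans(n_clusters=10, n_init='auto').fit_predict(embeddings)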

  2. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.nasa.gov
    • +2more
    Updated Apr 11, 2025
    Cite
    Dashlink (2025). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Explore at:
    Dataset updated
    Apr 11, 2025
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining

    Description: This is the dataset used for the SIAM 2007 Text Mining competition, which focused on developing text mining algorithms for document classification. The documents are aviation safety reports documenting one or more problems that occurred during certain flights, and the goal was to label each document with respect to the types of problems described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available.

    How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight.

    Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format, and all documents for each set are contained in a single file. Each row in this file corresponds to a single document: the first characters on each line are the document number, and a tilde separates the document number from the text itself.

    Anomalies/Faults: This is a document category classification problem.
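    Given that layout, a minimal parsing sketch (the filename is a placeholder, not the dataset's actual file name):

    # One document per line: "<document number>~<text>".
    docs = {}
    with open('siam2007_documents.txt', encoding='utf-8') as f:
        for line in f:
            doc_id, _, text = line.rstrip('\n').partition('~')
            docs[doc_id.strip()] = text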

  3. Sample Data

    • data.researchdatafinder.qut.edu.au
    Updated May 31, 2023
    Cite
    Sample Data [Dataset]. https://data.researchdatafinder.qut.edu.au/dataset/discriminate-short-text2/resource/ffcc10ed-7592-4474-8e62-2de30002c845
    Explore at:
    Dataset updated
    May 31, 2023
    License

    http://researchdatafinder.qut.edu.au/display/n124876

    Description

    Sample text data. QUT Research Data Repository dataset resource, available for download.

  4. Sample text data

    • zenodo.org
    txt
    Updated May 18, 2023
    Cite
    Paul Kilgarriff (2023). Sample text data [Dataset]. http://doi.org/10.5281/zenodo.7944136
    Explore at:
    Available download formats: txt
    Dataset updated
    May 18, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Paul Kilgarriff
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is a sample text dataset.

  5. text-classification-dataset-example

    • huggingface.co
    Updated Feb 7, 2024
    Cite
    Chien-Wei Chang (2024). text-classification-dataset-example [Dataset]. https://huggingface.co/datasets/cwchang/text-classification-dataset-example
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Feb 7, 2024
    Authors
    Chien-Wei Chang
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    cwchang/text-classification-dataset-example dataset hosted on Hugging Face and contributed by the HF Datasets community

  6. AI vs Human Generated Contents

    • kaggle.com
    Updated Oct 6, 2024
    Cite
    Asfaq Ahmed 456 (2024). AI vs Human Generated Contents [Dataset]. https://www.kaggle.com/datasets/asfaqahmed456/ai-vs-human-generated-contents
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Oct 6, 2024
    Dataset provided by
    Kaggle
    Authors
    Asfaq Ahmed 456
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    A dataset with 10 text samples. Each sample is labeled as either AI-generated (1) or human-generated (0). This dataset is suitable for text classification tasks such as detecting AI-generated content.

    This file contains text samples that are either generated by AI models or written by humans. Each entry is labeled to indicate whether the content is AI-generated or human-generated. This dataset can be used for various natural language processing tasks such as text classification, content analysis, and AI content detection.

    Column 1: text
    Description: "The actual content (text data), which may be a short paragraph or sentence. This is the primary feature for analysis."
    Data Type: String (Text)

    Column 2: label
    Description: "Binary label indicating whether the content is AI-generated or human-generated. '0' represents human-generated, and '1' represents AI-generated."
    Data Type: Integer (0 or 1)
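    A minimal baseline sketch using the two columns above (the CSV filename is hypothetical, and with only 10 samples the fit is purely illustrative):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical filename; 'text' and 'label' columns as documented above.
    df = pd.read_csv('ai_vs_human_generated.csv')
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(df['text'], df['label'])  # 0 = human-generated, 1 = AI-generated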

    The AI-generated content was created using advanced language models such as GPT-4, which were instructed to write text on various topics. The human-generated content was sourced from publicly available texts, including articles, blogs, and creative writing samples found on the internet. Care has been taken to ensure that all human-generated content is in the public domain or shared with permission, without any identifiable information.

    This dataset is static and will not receive regular updates. However, future versions may be released if new data becomes available or if users contribute additional examples to enhance the dataset.

  7. QuitNowTXT Text Messaging Library

    • healthdata.gov
    • data.virginia.gov
    • +2more
    application/rdfxml +5
    Updated Feb 13, 2021
    Cite
    (2021). QuitNowTXT Text Messaging Library [Dataset]. https://healthdata.gov/w/ks37-e557/default?cur=4E5z_TlScUJ
    Explore at:
    Available download formats: tsv, csv, application/rssxml, xml, json, application/rdfxml
    Dataset updated
    Feb 13, 2021
    Description

    Overview: The QuitNowTXT text messaging program is designed as a resource that can be adapted to specific contexts, including those outside the United States and in languages other than English. Based on evidence-based practices, this program is a smoking cessation intervention for smokers who are ready to quit smoking. Although evidence supports the use of text messaging as a platform to deliver cessation interventions, the maximum effect of the program is expected when it is integrated into other elements of a national tobacco control strategy.

    The QuitNowTXT program delivers tips, motivation, encouragement, and fact-based information via unidirectional and interactive bidirectional message formats. The core of the program consists of messages sent to the user based on a scheduled quit day identified by the user. Messages are sent for up to four weeks before the quit date and up to six weeks after it. Messages assessing mood, craving, and smoking status are also sent at various intervals, and the user receives messages back based on the response they submit. In addition, users can request assistance in dealing with craving, stress/mood, and responding to slips/relapses by texting specific key words to the service. Rotating automated messages are then returned to the user based on the keyword. Details of the program are provided below.

    Texting STOP to the service discontinues further texts. This option is offered every few messages, as required by United States cell phone providers, and cannot be removed if the program is used within the US.

    If web-based registration is used, it is suggested that users provide demographic information such as age, sex, and smoking frequency (daily or almost every day, most days, only a few days a week, only on weekends, a few times a month or less) in addition to their mobile phone number and quit date. This information is useful for assessing the reach of the program and for identifying a possible need to develop libraries for specific groups. Using only a mobile phone-based registration system reduces barriers to entry but limits the collection of additional data. At bare minimum, the quit date must be collected. At sign-up, participants can choose a quit date up to one month out. Text messages start up to 14 days before the specified quit date, and users can change their quit date at any time. The program can also be modified to provide texts to users who have already quit within the last month.

    One possible adaptation is a QuitNowTXT "light" version, which would let individuals who do not have unlimited text messaging capabilities, but would still like to receive support, control the number of messages they receive. In the light program, users can text any of the programmed keywords without fully opting in to the program.

    Program Design: The program is designed as a 14-day countdown to the quit date, followed by six weeks of daily messages. Each day within the program is identified as either pre-quit date (Q-#) or post-quit date (Q+#). If a user opts in fewer than 14 days before their quit date, the system begins sending messages on that day. For example, if they opt in four days prior to their quit date, the system sends a welcome message, recognizes that they are at Q-4 (four days before their quit date), and sends the message that everyone else receives at that point. As users progress through the program, they receive the messages outlined in the text message library, covering tips, informational content, motivational messaging, and keyword responses. Message frequency increases in the days leading up to and following the quit date, with a heavy emphasis on support, efficacy building, and actionable tips, and decreases further away from the quit date. If the user reports having started to smoke again, the system offers the option of continuing the program as planned or starting over with a new quit date.

    The system also assesses the user's mood, craving level, and smoking status several times during the program through MOOD, CRAVE, and STATUS messages. Whenever the system asks for a response, it sends a programmed reply based on the user's answer (e.g., if the user responds with MOOD = BAD, they receive a message customized to that response). These programmed response messages rotate throughout the course of the program. Users can also send the system one of three programmed keywords (CRAVE, MOOD, and SLIP), and the system sends unique, automated responses based on the texted keyword. There are 10 messages for each programmed keyword, rotating on a random basis to decrease the likelihood of getting the same response twice in a row. After the full six-week program ends, the system follows up at one, three, and six months to check on the user's smokefree status and offer additional assistance if needed.

    Message Types:
    • Tips: actionable strategies for managing cravings and dealing with quitting smoking in general.
    • Motivation/encouragement: messages that encourage users to keep going on their smokefree journey despite the difficulty and struggle they may be facing.
    • Information: facts and other salient points about the impact of smoking relevant to the user's socio-cultural environment.
    • Assessment: messages built into the program that collect information about the user's experience while quitting and provide immediate feedback based on the response. Assessment messages fall along three dimensions: mood, craving, and smoking status.
    • Reactive Messaging (Key Words): at any point, the user can initiate an interaction that returns a text message relevant to their request for help. In response to one of the key words, the system sends unique, automated responses. The key words cover topics relevant to various aspects of cessation.

  8. Tutorial Package for: Text as Data in Economic Analysis

    • dataverse.nl
    Updated Jun 26, 2025
    Cite
    Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun (2025). Tutorial Package for: Text as Data in Economic Analysis [Dataset]. http://doi.org/10.34894/KNDZ9T
    Explore at:
    Available download formats: text/markdown(148), bin(493802528), text/markdown(405), csv(6678744), application/x-ipynb+json(56525), text/markdown(136), csv(8712017), txt(1706), text/x-python(3800), text/markdown(131), txt(194), text/markdown(179), csv(89054804), bin(43909246), csv(1600), xlsx(10436), bin(952), text/markdown(1743)
    Dataset updated
    Jun 26, 2025
    Dataset provided by
    DataverseNL
    Authors
    Tarek Hassan; Stephan Hollander; Aakash Kalyani; Laurence Van Lent; Markus Schwedeler; Ahmed Tahoun
    License

    Attribution-NoDerivs 4.0 (CC BY-ND 4.0): https://creativecommons.org/licenses/by-nd/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2002 - May 31, 2023
    Dataset funded by
    Institute for New Economic Thinking
    Deutsche Forschungsgemeinschaft (403041268-TRR 266)
    Description

    This tutorial package, comprising both data and code, accompanies the article and is designed primarily to allow readers to explore the various vocabulary-building methods discussed in the paper. The article discusses how to apply computational linguistics techniques to analyze largely unstructured corporate-generated text for economic analysis. As a core example, we illustrate how textual analysis of earnings conference call transcripts can provide insights into how markets and individual firms respond to economic shocks, such as a nuclear disaster or a geopolitical event: insights that often elude traditional non-text data sources. This approach enables extracting actionable intelligence, supporting both policy-making and strategic corporate decision-making. We also explore applications using other sources of corporate-generated text, including patent documents and job postings. By incorporating computational linguistics techniques into the analysis of economic shocks, new opportunities arise for real-time economic data, offering a more nuanced understanding of market and firm responses in times of economic volatility.

  9. BL Newspapers sample plain-text data

    • zenodo.org
    zip
    Updated Aug 19, 2023
    Cite
    Yann Ryan (2023). BL Newspapers sample plain-text data [Dataset]. http://doi.org/10.5281/zenodo.8262356
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 19, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Yann Ryan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A dataset of .csv files each containing article texts from newspapers published on the Shared Research Repository.

  10. Artificial Intelligence (AI) Text Generator Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, China, India, Germany - Size and Forecast 2024-2028

    • technavio.com
    Updated Jul 15, 2024
    Cite
    Technavio (2024). Artificial Intelligence (AI) Text Generator Market Analysis North America, Europe, APAC, South America, Middle East and Africa - US, UK, China, India, Germany - Size and Forecast 2024-2028 [Dataset]. https://www.technavio.com/report/ai-text-generator-market-analysis
    Explore at:
    Dataset updated
    Jul 15, 2024
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    United States, Global
    Description


    Artificial Intelligence Text Generator Market Size 2024-2028

    The artificial intelligence (AI) text generator market size is forecast to increase by USD 908.2 million at a CAGR of 21.22% between 2023 and 2028.

    The market is experiencing significant growth due to several key trends. One of these trends is the increasing popularity of AI generators in various sectors, including education for e-learning applications. Another trend is the growing importance of speech-to-text technology, which is becoming increasingly essential for improving productivity and accessibility. However, data privacy and security concerns remain a challenge for the market, as generators process and store vast amounts of sensitive information. It is crucial for market participants to address these concerns through strong data security measures and transparent data handling practices to ensure customer trust and compliance with regulations. Overall, the AI generator market is poised for continued growth as it offers significant benefits in terms of efficiency, accuracy, and accessibility.
    

    What will be the Size of the Artificial Intelligence (AI) Text Generator Market During the Forecast Period?


    The market is experiencing significant growth as businesses and organizations seek to automate content creation across various industries. Driven by technological advancements in machine learning (ML) and natural language processing, AI generators are increasingly being adopted for downstream applications in sectors such as education, manufacturing, and e-commerce. 
    Moreover, these systems enable the creation of personalized content for global audiences in multiple languages, providing a competitive edge for businesses in an interconnected Internet economy. However, responsible AI practices are crucial to mitigate risks associated with biased content, misinformation, misuse, and potential misrepresentation.
    

    How is this Artificial Intelligence (AI) Text Generator Industry segmented and which is the largest segment?

    The artificial intelligence (AI) text generator industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2024-2028, as well as historical data from 2018-2022 for the following segments.

    Component
    • Solution
    • Service

    Application
    • Text to text
    • Speech to text
    • Image/video to text

    Geography
    • North America (US)
    • Europe (Germany, UK)
    • APAC (China, India)
    • South America
    • Middle East and Africa

    By Component Insights

    The solution segment is estimated to witness significant growth during the forecast period.
    

    Artificial Intelligence (AI) text generators have gained significant traction in various industries due to their efficiency and cost-effectiveness in content creation. These solutions utilize machine learning algorithms, such as deep neural networks, to analyze and learn from vast datasets of human-written text. By predicting the most probable word or sequence of words based on patterns and relationships identified in the training data, AI text generators produce personalized content for multiple languages and global audiences. Applications span industries including education, manufacturing, e-commerce, and entertainment & media. In the education industry, AI generators assist in creating personalized learning materials.


    The solution segment was valued at USD 184.50 million in 2018 and showed a gradual increase during the forecast period.

    Regional Analysis

    North America is estimated to contribute 33% to the growth of the global market during the forecast period.
    

    Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.


    The North American market holds the largest share in the market, driven by the region's technological advancements and increasing adoption of AI in various industries. AI text generators are increasingly utilized for content creation, customer service, virtual assistants, and chatbots, catering to the growing demand for high-quality, personalized content in sectors such as e-commerce and digital marketing. Moreover, the presence of tech giants like Google, Microsoft, and Amazon in North America, who are investing significantly in AI and machine learning, further fuels market growth. AI generators employ Machine Learning algorithms, Deep Neural Networks, and Natural Language Processing to generate content in multiple languages for global audiences.

    Market Dynamics

    Our researchers analyzed the data with 2023 as the base year, along with the key drivers, trends, and challenges.

  11. Data from: LVMED: Dataset of Latvian text normalisation samples for the medical domain

    • repository.clarin.lv
    Updated May 30, 2023
    Cite
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis (2023). LVMED: Dataset of Latvian text normalisation samples for the medical domain [Dataset]. https://repository.clarin.lv/repository/xmlui/handle/20.500.12574/85
    Explore at:
    Dataset updated
    May 30, 2023
    Authors
    Viesturs Jūlijs Lasmanis; Normunds Grūzītis
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The CSV dataset contains sentence pairs for a text-to-text transformation task: given a sentence that contains 0..n abbreviations, rewrite (normalize) the sentence in full words (word forms).

    Training dataset: 64,665 sentence pairs
    Validation dataset: 7,185 sentence pairs
    Testing dataset: 7,984 sentence pairs

    All sentences are extracted from a public web corpus (https://korpuss.lv/id/Tīmeklis2020) and contain at least one medical term.
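    A small loading sketch for such sentence pairs (the filename and column names are assumptions; check the actual CSV header):

    import pandas as pd

    # Hypothetical filename and column names.
    train = pd.read_csv('lvmed_train.csv')
    # Each row pairs a sentence containing abbreviations with its fully written-out form.
    pairs = list(zip(train['source'], train['target']))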

  12. DataCI Continuous Text Classification Example Using Yelp Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 28, 2023
    Cite
    Yelp Inc. (2023). DataCI Continuous Text Classification Example Using Yelp Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8288432
    Explore at:
    Dataset updated
    Aug 28, 2023
    Dataset provided by
    Yelp (http://yelp.com/)
    Li, Yuanming
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We use the Yelp Review Dataset as the streaming data source for the DataCI example. The Yelp review dataset has been processed into a daily-based dataset by review date. Only the data from 2020-09-01 to 2020-11-30 is used, to simulate a streaming data scenario. Two versions of the training and validation datasets are downloaded:

    yelp_review_train@2020-10: from 2020-09-01 to 2020-10-15

    yelp_review_val@2020-10: from 2020-10-16 to 2020-10-31

    yelp_review_train@2020-11: from 2020-10-01 to 2020-11-15

    yelp_review_val@2020-11: from 2020-11-16 to 2020-11-30
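    A sketch of that date-based slicing (the raw filename and 'date' column follow the public Yelp dump, but treat them as assumptions):

    import pandas as pd

    # Assumed raw input: the public Yelp review dump with a 'date' column.
    df = pd.read_json('yelp_academic_dataset_review.json', lines=True)
    df['date'] = pd.to_datetime(df['date'])

    train_oct = df[df['date'].between('2020-09-01', '2020-10-15')]  # yelp_review_train@2020-10
    val_oct = df[df['date'].between('2020-10-16', '2020-10-31')]    # yelp_review_val@2020-10
    train_nov = df[df['date'].between('2020-10-01', '2020-11-15')]  # yelp_review_train@2020-11
    val_nov = df[df['date'].between('2020-11-16', '2020-11-30')]    # yelp_review_val@2020-11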

  13. Text Mining Data - SET

    • ieee-dataport.org
    Updated Mar 18, 2025
    Cite
    Kingsley Okoye (2025). Text Mining Data - SET [Dataset]. https://ieee-dataport.org/documents/text-mining-data-set
    Explore at:
    Dataset updated
    Mar 18, 2025
    Authors
    Kingsley Okoye
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Emotional classification (valence) in textual data has proved central to human experience analysis and natural language processing (NLP). This study implements a text mining model and algorithm, TM-EV (Text Mining for Emotional Valence Analysis), that determines the emotional valence (EV) shown by undergraduate students in their feedback (n=665,860) during the program (pre- and post-course) and its relationship with learning outcomes and performance.

  14. NLUCat

    • zenodo.org
    • huggingface.co
    • +1more
    zip
    Updated Mar 4, 2024
    Cite
    Zenodo (2024). NLUCat [Dataset]. http://doi.org/10.5281/zenodo.10721193
    Explore at:
    Available download formats: zip
    Dataset updated
    Mar 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    NLUCat

    Dataset Description

    Dataset Summary

    NLUCat is a dataset for natural language understanding (NLU) in Catalan. It consists of nearly 12,000 instructions annotated with the most relevant intents and spans. Each instruction is also accompanied by the instructions the annotator received before writing it.

    The intents taken into account are the habitual ones of a virtual home assistant (activity calendar, IoT, list management, leisure, etc.), but specific ones have also been added to take into account the social and healthcare needs of vulnerable people (information on administrative procedures, menu and medication reminders, etc.).

    The spans have been annotated with a tag describing the type of information they contain. They are fine-grained, but can easily be grouped for use in robust systems.

    The examples are not only written in Catalan; they also take into account the geographical and cultural reality of the speakers of this language (geographic points, cultural references, etc.).

    This dataset can be used to train models for intent classification, spans identification and examples generation.

    This is the complete version of the dataset. A version prepared to train and evaluate intent classifiers has been published in HuggingFace.

    In this repository you'll find the following items:

    • NLUCat_annotation_guidelines.docx: the guidelines provided to the annotation team
    • NLUCat_dataset.json: the complete NLUCat dataset
    • NLUCat_stats.tsv: statistics about the NLUCat dataset
    • dataset: folder with the dataset as published in HuggingFace, split and prepared for training and evaluating intent classifiers
    • reports: folder with the reports given as feedback to the annotators during the annotation process

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0. Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Supported Tasks and Leaderboards

    Intent classification, spans identification and examples generation.

    Languages

    The dataset is in Catalan (ca-ES).

    Dataset Structure

    Data Instances

    Three JSON files, one for each split.

    Data Fields

    • example: `str`. The example text
    • annotation: `dict`. Annotation of the example
      • intent: `str`. Intent tag
      • slots: `list`. List of slots, each with:
        • Tag: `str`. Tag of the slot
        • Text: `str`. Text of the slot
        • Start_char: `int`. First character of the span
        • End_char: `int`. Last character of the span

    Example


    An example looks as follows:

    {
      "example": "Demana una ambulància; la meva dona està de part.",
      "annotation": {
        "intent": "call_emergency",
        "slots": [
          {
            "Tag": "service",
            "Text": "ambulància",
            "Start_char": 11,
            "End_char": 21
          },
          {
            "Tag": "situation",
            "Text": "la meva dona està de part",
            "Start_char": 23,
            "End_char": 48
          }
        ]
      }
    }
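    A small sketch for loading the file and tallying intents (assuming the top level of NLUCat_dataset.json is a list of such example records):

    import json
    from collections import Counter

    # Assumption: the file is a JSON list of examples shaped like the record above.
    with open('NLUCat_dataset.json', encoding='utf-8') as f:
        examples = json.load(f)

    intent_counts = Counter(ex['annotation']['intent'] for ex in examples)
    print(intent_counts.most_common(5))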


    Data Splits

    • NLUCat.train: 9128 examples
    • NLUCat.dev: 1441 examples
    • NLUCat.test: 1441 examples

    Dataset Creation

    Curation Rationale

    We created this dataset to contribute to the development of language models in Catalan, a low-resource language.

    When creating this dataset, we took into account not only the language but the entire socio-cultural reality of the Catalan-speaking population. Special consideration was also given to the needs of the vulnerable population.

    Source Data

    Initial Data Collection and Normalization

    We commissioned a company to create fictitious examples for the creation of this dataset.

    Who are the source language producers?

    We commissioned the writing of the examples to the company m47 labs.

    Annotations

    Annotation process

    The elaboration of this dataset was done in three steps, taking as a model the process followed by the NLU-Evaluation-Data dataset, as explained in the paper.
    * First step: translation or elaboration of the instructions given to the annotators to write the examples.
    * Second step: writing the examples. This step also includes the grammatical correction and normalization of the texts.
    * Third step: annotating the intents and the slots of each example. In this step, some modifications were made to the annotation guides to adjust them to the real situations.

    Who are the annotators?

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

    Personal and Sensitive Information

    No personal or sensitive information is included.

    The examples used for the preparation of this dataset are fictitious and, therefore, the information shown is not real.

    Considerations for Using the Data

    Social Impact of Dataset

    We hope that this dataset will help the development of virtual assistants in Catalan, a language that is often not taken into account, and that it will especially help to improve the quality of life of people with special needs.

    Discussion of Biases

    When writing the examples, the annotators were asked to take into account the socio-cultural reality (geographic points, artists and cultural references, etc.) of the Catalan-speaking population.
    Likewise, they were asked to be careful to avoid examples that reinforce the stereotypes that exist in this society. For example: be careful with the gender or origin of personal names that are associated with certain activities.

    Other Known Limitations

    [N/A]

    Additional Information

    Dataset Curators

    Language Technologies Unit at the Barcelona Supercomputing Center (langtech@bsc.es)

    This work has been promoted and financed by the Generalitat de Catalunya through the Aina project.

    Licensing Information

    This dataset can be used for any purpose, whether academic or commercial, under the terms of the CC BY 4.0.
    Give appropriate credit, provide a link to the license, and indicate if changes were made.

    Citation Information

    DOI

    Contributions

    The drafting of the examples and their annotation was entrusted to the company m47 labs through a public tender process.

  15. A set of generated Instagram Data Download Packages (DDPs) to investigate their structure and content

    • data.niaid.nih.gov
    Updated Jan 28, 2021
    Cite
    Laura Boeschoten (2021). A set of generated Instagram Data Download Packages (DDPs) to investigate their structure and content [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4472605
    Explore at:
    Dataset updated
    Jan 28, 2021
    Dataset provided by
    Ruben van den Goorbergh
    Laura Boeschoten
    Daniel Oberski
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Instagram data-download example dataset

    In this repository you can find a dataset consisting of 11 personal Instagram archives, or Data Download Packages (DDPs).

    How the data was generated

    These Instagram accounts were all newly created by a group of researchers who wanted to examine in detail the structure, and variety in structure, of Instagram DDPs. The participants used the Instagram accounts extensively for approximately a week and communicated intensively with each other, so the data can also serve as an example of a network.

    The data was primarily generated to evaluate the performance of de-identification software. Therefore, the text in the DDPs contains many randomly chosen (Dutch) first names, phone numbers, e-mail addresses, and URLs, and the images in the DDPs contain many faces and text as well. The DDPs contain faces and text (usernames) of third parties; however, only content of so-called 'professional accounts' is shared, such as accounts of famous individuals or institutions who self-consciously and actively seek publicity, and these sources are easily publicly available. Furthermore, the DDPs do not contain sensitive personal data of these individuals.

    Obtaining your Instagram DDP

    After using the Instagram accounts intensively for approximately a week, the participants requested their personal Instagram DDPs by using the following steps. You can follow these steps yourself if you are interested in your personal Instagram DDP.

    1. Go to www.instagram.com and log in
    2. Click on your profile picture, go to Settings and Privacy and Security
    3. Scroll to Data download and click Request download
    4. Enter your email address and click Next
    5. Enter your password and click Request download

    Instagram then delivered the data in a compressed zip folder with the format username_YYYYMMDD.zip (i.e., Instagram handle and date of download) to the participant, and the participants shared these DDPs with us.

    Data cleaning

    To comply with the Instagram user agreement, participants shared their full name, phone number, and e-mail address. In addition, Instagram logged the IP addresses the participants used during their active period on Instagram. After collecting the DDPs, we manually replaced such information with random replacements so that the DDPs shared here do not contain any personal data of the participants.

    How this dataset can be used

    This dataset was generated with the intention of evaluating the performance of de-identification software. We invite other researchers to use this dataset, for example to investigate what type of data can be found in Instagram DDPs or to study their structure. The packages can also be used for example data analyses, although no substantive research questions can be answered with this data, as it does not reflect how research subjects behave 'in the wild'.

    Authors

    The data collection is executed by Laura Boeschoten, Ruben van den Goorbergh and Daniel Oberski of Utrecht University. For questions, please contact l.boeschoten@uu.nl.

    Acknowledgments

    The researchers would like to thank everyone who participated in this data-generation project.

  16. Data from: WebText Dataset

    • paperswithcode.com
    Updated Jul 10, 2022
    Cite
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever (2022). WebText Dataset [Dataset]. https://paperswithcode.com/dataset/webtext
    Explore at:
    Dataset updated
    Jul 10, 2022
    Authors
    Alec Radford; Jeffrey Wu; Rewon Child; David Luan; Dario Amodei; Ilya Sutskever
    Description

    WebText is an internal OpenAI corpus created by scraping web pages with an emphasis on document quality. The authors scraped all outbound links from Reddit that received at least 3 karma, using this as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

    WebText contains the text subset of these 45 million links. It consists of over 8 million documents for a total of 40 GB of text. All Wikipedia documents were removed from WebText since it is a common data source for other datasets.

  17. Data from: ViTexOCR; a script to extract text overlays from digital video

    • catalog.data.gov
    • data.usgs.gov
    • +5more
    Updated Jul 6, 2024
    Cite
    U.S. Geological Survey (2024). ViTexOCR; a script to extract text overlays from digital video [Dataset]. https://catalog.data.gov/dataset/vitexocr-a-script-to-extract-text-overlays-from-digital-video
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    The ViTexOCR script presents a new method for extracting navigation data from videos with text overlays using optical character recognition (OCR) software. Over the past few decades, it was common for videos recorded during surveys to be overlaid with real-time geographic positioning satellite chyrons including latitude, longitude, date, and time, as well as other ancillary data (such as speed, heading, or user-input identifying fields). Embedding these data into videos gives them utility and accuracy, but using the location data for other purposes, such as analysis in a geographic information system, is not possible while the data are only available on the video display. Extracting the text data from imagery using software allows these videos to be located and analyzed in a geospatial context. The script allows a user to select a video, then specify the text data types (e.g., latitude, longitude, date, time, or other), text color, and the pixel locations of overlay text data on a sample video frame. The script's output is a data file containing the retrieved geospatial and temporal data. All functionality is bundled in a Python script that incorporates a graphical user interface and several other software dependencies.

  18. Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches

    • dataverse.harvard.edu
    Updated Dec 11, 2019
    Cite
    Blake Miller; Fridolin Linder; Walter Mebane (2019). Replication Data for: Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches [Dataset]. http://doi.org/10.7910/DVN/T88EAX
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 11, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Blake Miller; Fridolin Linder; Walter Mebane
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the amount of data required to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label only a fraction of the documents one would need using random sampling (or 'passive' learning) to achieve equally performing classifiers. We further investigate how varying levels of inter-coder reliability affect the active learning procedures, and find that even with low reliability, active learning performs more efficiently than random sampling.

  19. Textual Entailment Dataset

    • kaggle.com
    Updated Dec 5, 2023
    Cite
    The Devastator (2023). Textual Entailment Dataset [Dataset]. https://www.kaggle.com/datasets/thedevastator/textual-entailment-dataset/suggestions
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Dec 5, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    The Devastator
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Textual Entailment Dataset

    Textual Entailment Dataset with Labelled Text Pairs

    By SetFit (From Huggingface) [source]

    About this dataset

    The SetFit/mnli dataset is a comprehensive collection of textual entailment data designed to facilitate the development and evaluation of models for natural language understanding tasks. This dataset includes three distinct files: validation.csv, train.csv, and test.csv, each containing valuable information for training and evaluating textual entailment models.

    In these files, users will find various columns providing important details about the text pairs. The text1 and text2 columns indicate the first and second texts in each pair respectively, allowing researchers to analyze the relationships between these texts. Additionally, the label column provides a categorical value indicating the specific relationship between text1 and text2.

    To further aid in understanding the relationships expressed by these labels, there is an accompanying label_text column that offers a human-readable representation of each categorical label. This allows practitioners to interpret and analyze the labeled data more easily.

    Moreover, all three files in this dataset contain an additional index column called idx, which assists in organizing and referencing specific samples within the dataset during analysis or model development.

    It's worth noting that this SetFit/mnli dataset has been carefully prepared for textual entailment tasks specifically. To ensure accurate evaluation of model performance on such tasks, researchers can leverage validation.csv as a dedicated set of samples specifically reserved for validating their models' performance during training. The train.csv file contains ample training data with corresponding labels that can be utilized to effectively train reliable textual entailment models. Lastly, test.csv includes test samples designed for evaluating model performance on textual entailment tasks.

    By utilizing this extensive collection of high-quality data provided by the SetFit/mnli dataset, researchers can develop powerful models capable of accurately understanding natural language relationships expressed within text pairs across various domains.

    How to use the dataset

    • text1: This column contains the first text in a pair.
    • text2: This column contains the second text in a pair.
    • label: The label column indicates the relationship between text1 and text2 using categorical values.
    • label_text: The label_text column provides the text representation of the labels.

    To effectively use this dataset for your textual entailment task, follow these steps:

    1. Understanding the Columns

    Start by familiarizing yourself with the different columns present in each file of this dataset:

    • text1: The first text in a pair that needs to be evaluated for textual entailment.
    • text2: The second text in a pair that needs to be compared with text1 to determine its logical relationship.
    • label: This categorical field represents predefined relationships or categories between texts based on their meaning or logical inference.
    • label_text: A human-readable representation of each label category that helps understand their real-world implications.

    2. Data Exploration

    Before building models or applying any algorithms, it's essential to explore and understand your data thoroughly:

    • Analyze sample data points from each file (validation.csv, train.csv).
    • Identify any class imbalances within different labels present in your data distribution.

    3. Preprocessing Steps

    • Handle missing values: Check if there are any missing values (NaNs) within any columns and decide how to handle them.
    • Text cleaning: Depending on the nature of your task, implement appropriate text cleaning techniques like removing stop words, lowercasing, punctuation removal, etc.
    • Tokenization: Break down the text into individual tokens or words to facilitate further processing steps.

    4. Model Training and Evaluation

    Once your dataset is ready for modeling:

    • Split your data into training and testing sets using the train.csv and test.csv files. This division allows you to train models on a subset of data while evaluating their performance on an unseen portion.
    • Utilize machine learning or deep learning algorithms suitable for textual entailment tasks (e.g., BERT
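    A minimal baseline sketch following the steps above (a TF-IDF model over concatenated pairs rather than a BERT-style encoder; file and column names as documented in this card):

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train = pd.read_csv('train.csv')
    test = pd.read_csv('test.csv')

    # Crude pairing: join text1 and text2 into a single string per example.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train['text1'] + ' ||| ' + train['text2'], train['label'])
    print(clf.score(test['text1'] + ' ||| ' + test['text2'], test['label']))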

    Research Ideas

    • Natural Language Understanding: The dataset can be used for training and evaluating models that perform natural language understanding tasks, such as text classification, ...

  20. Data from: Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated May 23, 2023
    Cite
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata (2023). Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text [Dataset]. http://doi.org/10.5281/zenodo.7916716
    Explore at:
    Available download formats: zip
    Dataset updated
    May 23, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Nandana Mihindukulasooriya; Sanju Tiwari; Carlos F. Enguix; Kusum Lata
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is the repository for ISWC 2023 Resource Track submission for Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark to evaluate the capabilities of language models to generate KGs from natural language text guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and being faithful to the input sentences.

    It contains two datasets (i) Wikidata-TekGen with 10 ontologies and 13,474 sentences and (ii) DBpedia-WebNLG with 19 ontologies and 4,860 sentences.

    An example

    An example test sentence:

    Test Sentence:
    {"id": "ont_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by 
    American songwriters Gerry Goffin and Carole King."}
    

    An example of an ontology:

    Ontology: Music Ontology

    Expected Output:

    {
     "id": "ont_k_music_test_n",
     "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
     "triples": [
      {
       "sub": "The Loco-Motion",
       "rel": "publication date",
       "obj": "01 January 1962"
      },
      {
       "sub": "The Loco-Motion",
       "rel": "lyrics by",
       "obj": "Gerry Goffin"
      },
      {
       "sub": "The Loco-Motion",
       "rel": "lyrics by",
       "obj": "Carole King"
      }
     ]
    }
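    A small scoring sketch for outputs in this format (plain exact-match F1 over (sub, rel, obj) triples; the benchmark's official evaluation may differ):

    def triple_f1(predicted, gold):
        # Compare sets of (sub, rel, obj) tuples using exact string match.
        p = {(t['sub'], t['rel'], t['obj']) for t in predicted}
        g = {(t['sub'], t['rel'], t['obj']) for t in gold}
        if not p or not g:
            return 0.0
        precision = len(p & g) / len(p)
        recall = len(p & g) / len(g)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)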
    

    The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

    The structure of the repo is as follows.

    This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1] released under CC BY-SA 2.0 license and WebNLG 3.0 corpus [2] released under CC BY-NC-SA 4.0 license.

    [1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

    [2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
