CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords removed; features lowercased, lemmatized and POS-tagged) and stored in commonly used formats for text mining/computational text analysis. Users are advised to read the documentation for an explanation of the data collection process. This dataset includes the following items: 31 tables (one per day) of lowercased and lemmatized tokens with the following additional variables: POS tags, document id, sentence id, token id and publication date (stored as a tibble). A single document-feature matrix (DFM) with raw counts of feature frequencies in each news article (stored as a quanteda dfm object). The DFM comes with the following metadata for each document: date of publication and source URL. A metadata table with the following fields: document id, publication date, source url, news source and country of the news source. A list of sources included in the corpus, grouped by country name. All items are stored in formats readable in R. The documentation provides instructions on how to load the RDS files into R. If you use the data for your own project, please cite it using the information above. If you identify errors or missing sources, please contact us so that they can be addressed.
Subject Area: Text Mining. Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available. How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight. Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself. Anomalies/Faults: This is a document category classification problem.
Many existing complex space systems have a significant amount of historical maintenance and problem data stored in unstructured text form. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free-text reports are written by a number of different people, so the emphasis and wording vary considerably. With Mehran Sahami of Stanford University, I am putting together a book on text mining called "Text Mining: Theory and Applications", to be published by Taylor and Francis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Text Mining Dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Materials from a Voyant workshop conducted for library liaisons as part of TLT/Faculty Development/Digital Scholarship. Objectives: analyze text with Voyant; identify uses for text mining in instruction; design accessible instruction for working with text. Associated Research Guide: http://researchguides.loyno.edu/text_workshop
Title: Text-Analysis Dataset with Stopwords, Positive Words, and Negative Words
Description: This dataset is designed for text analysis tasks and contains three types of words: stopwords, positive words, and negative words. Stopwords are common words that are typically removed from text during preprocessing because they don't carry much meaning, such as "the," "and," "a," etc. Positive words are words that convey a positive sentiment, while negative words are words that convey a negative sentiment.
The stopwords were obtained from a standard list used in natural language processing, while the positive and negative words were obtained from publicly available sentiment lexicons.
Each word is provided as a separate entry in the dataset.
The dataset is provided in CSV format and is suitable for use in various text analysis tasks, such as sentiment analysis, text classification, and natural language processing.
Columns: Each CSV contains a single column with the specified set of words.
E.g., positive-words.txt: a+, abound, abounds, abundance, abundant, accessable, accessible, acclaim, acclaimed, acclamation, accolade, accolades, accommodative, and so on.
This dataset can be used to build models that can automatically classify text as positive or negative, or to identify which words are likely to carry more meaning in a given text.
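A minimal sketch of how word lists like these might be used for lexicon-based sentiment scoring. The file-loading helper and the tiny inline word sets are illustrative assumptions, not part of the dataset itself:

```python
def load_lexicon(path):
    """Read one word per line, skipping blank lines and ';' comment lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith(";")}

def sentiment_score(text, positive, negative, stopwords):
    """Count positive minus negative hits over the non-stopword tokens."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

# Hypothetical usage with tiny stand-in word sets instead of the CSVs:
pos = {"abundant", "acclaim", "accolade"}
neg = {"flawed", "poor"}
stop = {"the", "was", "but", "a"}
print(sentiment_score("The acclaim was abundant but the plot flawed",
                      pos, neg, stop))
```

A score above zero suggests net-positive wording; in a real pipeline the three sets would come from the provided stopword, positive-word and negative-word files.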
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Working in libraries affords many opportunities to engage with the challenges associated with text and data mining (TDM) of limited-access datasets. These challenges arise across interactions with diverse disciplinary communities at institutions with varying resources. Increasingly, the data requested by research communities fall outside the traditional purview of library collection development and research support. Solutions to TDM challenges are few, given gaps in understanding and misaligned values. Continued activity in light of these factors fosters an environment where weaknesses and threats are many and seemingly tractable opportunities are few. In the space of this brief statement I will introduce four challenges: (1) underdeveloped and inconsistent content provider effort to meet text and data mining needs, (2) misaligned values reinforced by ambiguous and/or overly restrictive content provider terms, (3) debt incurred by technical abstraction, and (4) purposeful technical opacity.
This dataset was created by Samratsingh Dikkhat
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was created by Zoumana
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data and avoid information loss and bias. Over the past 50 years, these methods have become more efficient and also more complex. Building on previous review studies, this paper analyzes what kinds of missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format. We utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily in the examination period. At the same time, simpler methods, like listwise and pairwise deletion, are still in widespread use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore "Text Mining with Machine Learning: Principles and Techniques" through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets.
1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic literature-acquisition scheme to ensure the traceability of the textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Here we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing. First, 55 materials science articles related to the NASICON system are collected via their Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document.
Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from deleting them directly, as they may contain valuable information such as chemical units. Instead, we convert all such symbols to a special token.
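A rough Python sketch of this kind of regex-based cleanup of PDF-extracted text. The specific patterns and the `<SYM>` placeholder are illustrative assumptions, not the authors' exact rules:

```python
import re

def clean_text(raw):
    """Illustrative cleanup of PDF-extracted text.

    The patterns below are examples of the approach described,
    not the paper's actual rule set.
    """
    text = raw.replace("\u00ad", "")           # drop soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)     # join words hyphenated across lines
    text = re.sub(r"\s*\n\s*", " ", text)      # collapse stray line breaks
    # Replace unusual symbols with a placeholder token instead of deleting
    # them, since they may carry meaning (e.g. chemical units).
    text = re.sub(r"[^\x20-\x7E]", " <SYM> ", text)
    return re.sub(r" {2,}", " ", text).strip()
```

For instance, `clean_text("NASICON con-\nductivity at 25 °C")` rejoins the hyphenated word and keeps a placeholder where the degree symbol was.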
https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title '[Coursera ] Text Mining and Analytics'
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRDECT-ID Dataset is a collection of Indonesian product review data annotated with emotion and sentiment labels. The data were collected from Tokopedia, one of the largest e-commerce platforms in Indonesia. The dataset contains Indonesian-language product reviews from 29 product categories on Tokopedia. Each product review is annotated with a single emotion: love, happiness, anger, fear, or sadness. A group of annotators assigned the emotion labels following annotation criteria created by an expert in clinical psychology. Other attributes related to the product review, such as Location, Price, Overall Rating, Number Sold, Total Review, and Customer Rating, are also extracted to support further research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository contains 1,300 transliterated Bengali comments, of which 647 are marked as positive and 653 as negative. The comments were manually annotated and collected from Facebook and YouTube using a data-scraping tool. Positive comments are labeled as 0 and negative comments are labeled as 1.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated from real-world text communication in group conversations. To maintain anonymity, the usernames and mobile numbers of the participants have been masked. This dataset can be used for education and research purposes only.
The main contribution of this research in text mining is to provide a standard dataset for research in the realm of mining chat conversations. We have observed that this dataset has immense potential for research use. Based on this dataset, applications include semantic search, sentiment analysis, semantic clustering of conversations, topic extraction, spam detection, etc. We offer this dataset for others to collaborate and research further possibilities.
We used our algorithms to extract the textual information from WhatsApp logs and stored it in an SQLite database file named "social conversation.db". The dataset contains 16,225 text messages from 839 distinct users, extracted from 17 WhatsApp groups.
Paper: Analysis of foul language usage in social media text conversation. Authors: Sumit Kawate and Kailas Patil. In: Int. J. Social Media and Interactive Learning Environments (IJSMILE), Vol. 5, Issue 3, Pages 227-251, Inderscience, 2017. DOI: https://doi.org/10.1504/IJSMILE.2017.087976
The data is stored in a .zip compressed archive. The uncompressed archive is 6,020 KB (5.87 MB). Extract it with any standard decompression software.
DATABASE/
|
+ Executable/                          Directory containing executable files.
|  |
|  + Social Conversation.db            Records of the database in .db format.
|
+ Source Code/                         Directory containing source code files.
|  |
|  + Social Conversation (csv).csv     Records of the database in .csv format.
|  + Social Conversation (db).db       Records of the database in .db format.
|  + Social Conversation (html).html   Records of the database in .html format.
|
+ Read Me/                             Directory containing the read-me file.
   |
   + read me.txt                       Detailed information about the dataset.
=Table Name= -> CONVERSATION
=Attributes=       =Meaning=
USER_ID            User id of the text message
TEXT_MSG           Actual text message
CONTACT_NUMBER     Contact number of the user (a few digits of the contact number are masked)
DATE               Date of the text message
TIME               Time of the text message

=Attributes=       =Format=
USER_ID            User Id
TEXT_MSG           (text message in any format)
CONTACT_NUMBER     +contactnumber
DATE               dd/mm/yy
TIME               hh:mm AM or hh:mm PM

=Attributes=       =Sample Example=
USER_ID            User 514
TEXT_MSG           Any deal on formal shoes with prime
CONTACT_NUMBER     +919xxxxx927
DATE               04/12/17
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract. Motivation: Antibodies are widely used reagents to test for expression of proteins. However, they may not always produce reliable results when they do not specifically bind to the target proteins their providers designed them for, leading to unreliable research results. Results: We developed a deep neural network system and tested its performance with a corpus of more than two thousand articles that reported uses of antibodies. We divided the problem into two tasks. Given an input article, the first task is to identify snippets about antibody specificity and to classify whether the snippets report any antibody that is nonspecific, and thus problematic. The second task is to link each of these snippets to one or more antibodies that the snippet refers to. We leveraged Research Resource Identifiers (RRIDs) to precisely identify antibodies linked to the extracted specificity snippets. The results show that it is feasible to construct a reliable knowledge base about problematic antibodies by text mining.
In 2023, information and communication was the industry with the most usage of artificial intelligence (AI) for text mining in Norway with a share of 16 percent. Wholesale trade, except of motor vehicles and motorcycles was the industry in second place with a 5 percent share.
http://www.gnu.org/licenses/gpl-3.0.en.html
===========================================
The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset" [1], which has been used across different research domains beyond Authorship Verification (AV). The intention behind this corpus is to provide researchers in the field of AV the opportunity to compare their results with each other.
===========================================
All texts are written in English.
===========================================
The corpus was transformed to meet the standardized format of the "PAN Authorship Identification corpora" [2]. It consists of 80 AV cases, evenly distributed between true (Y) and false (N) authorships, together with the ground truth (Y/N) for all AV cases. Each AV case comprises up to 5 documents (plain text files), where 2-4 documents stem from a known author, while the 5th document has an unknown authorship and is thus the subject of verification. Each document was written by a single author X and is mostly aggregated from several mails of X, in order to provide a sufficient length that captures X's writing style.
===========================================
All texts in the corpus were preprocessed by hand, which resulted in an overall processing time of more than 30 hours. The preprocessing includes de-duplication, normalization of UTF-8 symbols, and the removal of URLs, e-mail headers, signatures and other metadata. Beyond these, the texts themselves underwent a variety of cleaning procedures, including the removal of greeting/closing formulas, (telephone) numbers, named entities (names of people, companies, locations, etc.), quotes, and repetitions of identical characters/symbols and words. As a last preprocessing step, multiple successive blanks, newlines and tabs were substituted with a single blank.
===========================================
The length of each preprocessed text ranges from 2,200 to 5,000 characters. More precisely, the average length of a known document is 3,976 characters, while the average length of an unknown document is 3,899 characters.
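As an illustration only (not the method evaluated with this corpus), a common character n-gram baseline for such Y/N verification cases can be sketched as follows; the threshold value is an arbitrary assumption:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two Counter profiles."""
    dot = sum(v * b[g] for g, v in a.items() if g in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def verify(known_texts, unknown_text, threshold=0.5, n=3):
    """Answer 'Y' if the unknown document's profile is, on average,
    similar enough to the known author's documents (threshold is
    illustrative and would be tuned on held-out data)."""
    unk = char_ngrams(unknown_text, n)
    sims = [cosine(char_ngrams(k, n), unk) for k in known_texts]
    return "Y" if sum(sims) / len(sims) >= threshold else "N"
```

A real AV system on this corpus would compare its Y/N answers against the provided ground truth over all 80 cases.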
===========================================
https://link.springer.com/chapter/10.1007/978-3-319-98932-7_4
===========================================
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the text analysis performed with Natural Language Processing (NLP) techniques on scientific articles on Project Management (PM) collected from Scopus. It contains the list of tokens extracted from paper abstracts, together with the number of papers containing each token and its rate of growth. The tokens are used to identify emerging research topics in the PM literature.