CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords removed; features lowercased, lemmatized and POS-tagged) and stored in commonly used formats for text mining/computational text analysis. Users are advised to read the documentation for an explanation of the data collection process. This dataset includes the following items: 31 tables (one per day) of lowercased and lemmatized tokens with the following additional variables: POS tags, document id, sentence id, token id and publication date (stored as a tibble). A single document-feature matrix (DFM) with raw counts of feature frequencies in each news article (stored as a quanteda dfm object). The DFM comes with the following metadata for each document: date of publication and source URL. A metadata table with the following fields: document id, publication date, source url, news source and country of the news source. A list of sources included in the corpus, grouped by country name. All items are stored in formats readable in R. The documentation provides instructions on how to load the RDS files into R. If you use the data for your own project, please cite it using the information above. If you identify errors or missing sources, please contact us so that they can be addressed.
Subject Area: Text Mining. Description: This is the dataset used for the SIAM 2007 Text Mining competition. This competition focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports that documented one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems that were described. This is a subset of the Aviation Safety Reporting System (ASRS) dataset, which is publicly available. How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight. Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format. All documents for each set are contained in a single file. Each row in this file corresponds to a single document. The first characters on each line of the file are the document number, and a tilde separates the document number from the text itself. Anomalies/Faults: This is a document category classification problem.
Many existing complex space systems have a significant amount of historical maintenance and problem data stored in unstructured text form. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free-text reports are written by a number of different people, so the emphasis and wording vary considerably. With Mehran Sahami of Stanford University, I am putting together a book on text mining called "Text Mining: Theory and Applications", to be published by Taylor and Francis.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Text Mining Dataset
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Materials from a Voyant workshop conducted for library liaisons as part of TLT/Faculty Development/Digital Scholarship. Objectives: analyze text with Voyant; identify uses for text mining in instruction; design accessible instruction for working with text. Associated Research Guide: http://researchguides.loyno.edu/text_workshop
Title: Text-Analysis Dataset with Stopwords, Positive Words, and Negative Words
Description: This dataset is designed for text analysis tasks and contains three types of words: stopwords, positive words, and negative words. Stopwords are common words that are typically removed from text during preprocessing because they don't carry much meaning, such as "the," "and," "a," etc. Positive words are words that convey a positive sentiment, while negative words are words that convey a negative sentiment.
The stopwords were obtained from a standard list used in natural language processing, while the positive and negative words were obtained from publicly available sentiment lexicons.
Each word is provided as a separate entry in the dataset.
The dataset is provided in CSV format and is suitable for use in various text analysis tasks, such as sentiment analysis, text classification, and natural language processing.
Columns: Each CSV contains a single column with the specified set of words.
E.g., positive-words.txt: a+, abound, abounds, abundance, abundant, accessable, accessible, acclaim, acclaimed, acclamation, accolade, accolades, accommodative, and so on.
This dataset can be used to build models that can automatically classify text as positive or negative, or to identify which words are likely to carry more meaning in a given text.
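A minimal sketch of how word lists like these might be used for lexicon-based sentiment scoring. The file-loading helper and the tiny inline word sets are illustrative assumptions, not part of the dataset itself:

```python
def load_lexicon(path):
    """Read one word per line, skipping blank lines and ';' comment lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f
                if line.strip() and not line.startswith(";")}

def sentiment_score(text, positive, negative, stopwords):
    """Count positive minus negative hits over the non-stopword tokens."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)

# Hypothetical usage with tiny stand-in word sets instead of the CSVs:
pos = {"abundant", "acclaim", "accolade"}
neg = {"flawed", "poor"}
stop = {"the", "was", "but", "a"}
print(sentiment_score("The acclaim was abundant but the plot flawed",
                      pos, neg, stop))
```

A score above zero suggests net-positive wording; in a real pipeline the three sets would come from the provided stopword, positive-word and negative-word files.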
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Working in libraries affords many opportunities to engage with the challenges associated with text and data mining (TDM) of limited-access datasets. These challenges arise across interactions with diverse disciplinary communities at institutions with varying resources. Increasingly, the data requested by research communities fall outside the traditional purview of library collection development and research support. Solutions to TDM challenges are few, given gaps in understanding and misaligned values. Continued activity in light of these factors fosters an environment where weaknesses and threats are many and seemingly tractable opportunities are few. In the space of this brief statement I will introduce four challenges: (1) underdeveloped and inconsistent content provider effort to meet text and data mining needs, (2) misaligned values reinforced by ambiguous and/or overly restrictive content provider terms, (3) debt incurred by technical abstraction, and (4) purposeful technical opacity.
This dataset was created by Samratsingh Dikkhat
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was created by Zoumana
Released under CC0: Public Domain
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data and avoid information loss and bias. Over the past 50 years, these methods have become more efficient and also more complex. Building on previous review studies, this paper analyzes what kinds of missing data handling methods are used across scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format. We utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily in the examination period. At the same time, simpler methods, like listwise and pairwise deletion, are still in widespread use.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explore "Text Mining with Machine Learning: Principles and Techniques" through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets.
1. Framework overview. This paper proposes a pipeline to construct high-quality datasets for text mining in materials science. First, we utilize a traceable automatic literature-acquisition scheme to ensure the traceability of the textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.
2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Here we mainly introduce the details of the NASICON entity recognition dataset.
2.1 Data collection and preprocessing. First, 55 materials science articles related to the NASICON system are collected via their Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature, encompassing title, author, abstract, keywords, institution, publisher, and publication year, is retained and stored as a unified TXT document.
Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from deleting them directly, as they may contain valuable information such as chemical units. Instead, we convert all such symbols to a special token.
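A rough Python sketch of this kind of regex-based cleanup of PDF-extracted text. The specific patterns and the `<SYM>` placeholder are illustrative assumptions, not the authors' exact rules:

```python
import re

def clean_text(raw):
    """Illustrative cleanup of PDF-extracted text.

    The patterns below are examples of the approach described,
    not the paper's actual rule set.
    """
    text = raw.replace("\u00ad", "")           # drop soft hyphens
    text = re.sub(r"-\n(\w)", r"\1", text)     # join words hyphenated across lines
    text = re.sub(r"\s*\n\s*", " ", text)      # collapse stray line breaks
    # Replace unusual symbols with a placeholder token instead of deleting
    # them, since they may carry meaning (e.g. chemical units).
    text = re.sub(r"[^\x20-\x7E]", " <SYM> ", text)
    return re.sub(r" {2,}", " ", text).strip()
```

For instance, `clean_text("NASICON con-\nductivity at 25 °C")` rejoins the hyphenated word and keeps a placeholder where the degree symbol was.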
https://academictorrents.com/nolicensespecified
A BitTorrent file to download data with the title '[Coursera ] Text Mining and Analytics'
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PRDECT-ID Dataset is a collection of Indonesian product review data annotated with emotion and sentiment labels. The data were collected from Tokopedia, one of the largest e-commerce platforms in Indonesia. The dataset contains Indonesian-language product reviews from 29 product categories on Tokopedia. Each product review is annotated with a single emotion: love, happiness, anger, fear, or sadness. A group of annotators assigned the emotion labels following annotation criteria created by an expert in clinical psychology. Other attributes related to the product review, such as Location, Price, Overall Rating, Number Sold, Total Review, and Customer Rating, are also extracted to support further research.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository contains 1,300 transliterated Bengali comments, of which 647 are marked as positive and 653 as negative. The comments were manually annotated and collected from Facebook and YouTube using a data-scraping tool. Positive comments are labeled as 0 and negative comments are labeled as 1.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset was generated from real-world text communication in group conversations. To maintain anonymity, the usernames and mobile numbers of the participants have been masked. This dataset can be used for education and research purposes only.
The main contribution of this research in text mining is to provide a standard dataset for research in the realm of mining chat conversations. We have observed that this dataset has immense potential for research use. Based on this dataset, applications include semantic search, sentiment analysis, semantic clustering of conversations, topic extraction, spam detection, etc. We offer this dataset for others to collaborate and research further possibilities.
We used our algorithms to extract the textual information from WhatsApp logs and stored it in an SQLite database file named "social conversation.db". The dataset contains 16,225 text messages from 839 distinct users, extracted from 17 WhatsApp groups.
Paper: Analysis of foul language usage in social media text conversation. Authors: Sumit Kawate and Kailas Patil. In: Int. J. Social Media and Interactive Learning Environments (IJSMILE), Vol. 5, Issue 3, Pages 227-251, Inderscience, 2017. DOI: https://doi.org/10.1504/IJSMILE.2017.087976
The data is stored in a .zip compressed archive. The uncompressed archive is 6,020 KB (5.87 MB). Extract it with any standard decompression software.
DATABASE/
|
+ Executable/                          Directory containing executable files.
|  |
|  + Social Conversation.db            Records of the database in .db format.
|
+ Source Code/                         Directory containing source code files.
|  |
|  + Social Conversation (csv).csv     Records of the database in .csv format.
|  + Social Conversation (db).db       Records of the database in .db format.
|  + Social Conversation (html).html   Records of the database in .html format.
|
+ Read Me/                             Directory containing the read-me file.
   |
   + read me.txt                       Detailed information about the dataset.
=Table Name= -> CONVERSATION
=Attributes=       =Meaning=
USER_ID            User id of the text message
TEXT_MSG           Actual text message
CONTACT_NUMBER     Contact number of the user (a few digits of the contact number are masked)
DATE               Date of the text message
TIME               Time of the text message

=Attributes=       =Format=
USER_ID            User Id
TEXT_MSG           (text message in any format)
CONTACT_NUMBER     +contactnumber
DATE               dd/mm/yy
TIME               hh:mm AM or hh:mm PM

=Attributes=       =Sample Example=
USER_ID            User 514
TEXT_MSG           Any deal on formal shoes with prime
CONTACT_NUMBER     +919xxxxx927
DATE               04/12/17
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract. Motivation: Antibodies are widely used reagents to test for expression of proteins. However, they may not always produce reliable results when they do not specifically bind to the target proteins their providers designed them for, leading to unreliable research results. Results: We developed a deep neural network system and tested its performance with a corpus of more than two thousand articles that reported uses of antibodies. We divided the problem into two tasks. Given an input article, the first task is to identify snippets about antibody specificity and to classify whether the snippets report any antibody that is nonspecific, and thus problematic. The second task is to link each of these snippets to one or more antibodies that the snippet refers to. We leveraged Research Resource Identifiers (RRIDs) to precisely identify antibodies linked to the extracted specificity snippets. The results show that it is feasible to construct a reliable knowledge base about problematic antibodies by text mining.
In 2023, information and communication was the industry with the most usage of artificial intelligence (AI) for text mining in Norway with a share of 16 percent. Wholesale trade, except of motor vehicles and motorcycles was the industry in second place with a 5 percent share.
http://www.gnu.org/licenses/gpl-3.0.en.html
===========================================
The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset" [1], which has been used across different research domains beyond Authorship Verification (AV). The intention behind this corpus is to provide researchers in the field of AV the opportunity to compare their results with each other.
===========================================
All texts are written in English.
===========================================
The corpus was transformed to meet the standardized format of the "PAN Authorship Identification corpora" [2]. It consists of 80 AV cases, evenly distributed between true (Y) and false (N) authorships, together with the ground truth (Y/N) for all AV cases. Each AV case comprises up to 5 documents (plain text files), where 2-4 documents stem from a known author, while the 5th document has an unknown authorship and is thus the subject of verification. Each document was written by a single author X and is mostly aggregated from several mails of X, in order to provide a sufficient length that captures X's writing style.
===========================================
All texts in the corpus were preprocessed by hand, which resulted in an overall processing time of more than 30 hours. The preprocessing includes de-duplication, normalization of UTF-8 symbols, and the removal of URLs, e-mail headers, signatures and other metadata. Beyond these, the texts themselves underwent a variety of cleaning procedures, including the removal of greeting/closing formulas, (telephone) numbers, named entities (names of people, companies, locations, etc.), quotes, and repetitions of identical characters/symbols and words. As a last preprocessing step, multiple successive blanks, newlines and tabs were substituted with a single blank.
===========================================
The length of each preprocessed text ranges from 2,200 to 5,000 characters. More precisely, the average length of a known document is 3,976 characters, while the average length of an unknown document is 3,899 characters.
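As an illustration only (not the method evaluated with this corpus), a common character n-gram baseline for such Y/N verification cases can be sketched as follows; the threshold value is an arbitrary assumption:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two Counter profiles."""
    dot = sum(v * b[g] for g, v in a.items() if g in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def verify(known_texts, unknown_text, threshold=0.5, n=3):
    """Answer 'Y' if the unknown document's profile is, on average,
    similar enough to the known author's documents (threshold is
    illustrative and would be tuned on held-out data)."""
    unk = char_ngrams(unknown_text, n)
    sims = [cosine(char_ngrams(k, n), unk) for k in known_texts]
    return "Y" if sum(sims) / len(sims) >= threshold else "N"
```

A real AV system on this corpus would compare its Y/N answers against the provided ground truth over all 80 cases.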
===========================================
https://link.springer.com/chapter/10.1007/978-3-319-98932-7_4
===========================================
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the text analysis performed with Natural Language Processing (NLP) techniques on scientific articles on Project Management (PM) collected from Scopus. It contains the list of tokens extracted from paper abstracts, together with the number of papers containing each token and its rate of growth. The tokens are used to identify emerging research topics in the PM literature.