100+ datasets found
  1. Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis

    • dataverse.tdl.org
    • dataverse-prod.tdl.org
    application/gzip, csv +1
    Updated Mar 12, 2021
    Cite
    D. Madrid-Morales; P. Lindner; M. Periyasamy (2021). Corpus of African Digital News from 600 Websites Formatted for Text Mining / Computational Text Analysis [Dataset]. http://doi.org/10.18738/T8/UKJZ3E
    Explore at:
    application/gzip (33 archives, approx. 3.5 MB to 109 MB each), csv (13,256 bytes), pdf (118,565 bytes)
    Available download formats
    Dataset updated
    Mar 12, 2021
    Dataset provided by
    Texas Data Repository
    Authors
    D. Madrid-Morales; P. Lindner; M. Periyasamy
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Dec 6, 2020 - Jan 4, 2021
    Description

    This dataset includes a corpus of 200,000+ news articles published by 600 African news organizations between December 4, 2020 and January 3, 2021. The texts have been pre-processed (punctuation and English stopwords have been removed; features have been lowercased, lemmatized and POS-tagged) and stored in commonly used formats for text mining/computational text analysis. Users are advised to read the documentation for an explanation of the data collection process. This dataset includes the following items: 31 tables (one per day) of lowercased and lemmatized tokens with the following additional variables: POS tags, document id, sentence id, token id and publication date (stored as a tibble); a single document-feature matrix (DFM) with raw counts of feature frequencies in each news article (stored as a quanteda dfm object), with the date of publication and source URL as metadata for each document; a metadata table with the following fields: document id, publication date, source URL, news source and country of the news source; and a list of sources included in the corpus, grouped by country name. All items are stored in formats readable in R, and the documentation provides instructions on how to load the RDS files into R. If you use the data for your own project, please cite it using the information above. If you identify errors or missing sources, please contact us so that these can be addressed.
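The document-feature matrix described above holds raw counts of each feature per article. For readers who want to see what that structure amounts to, here is a minimal Python sketch (the dataset itself ships as R/quanteda objects; the toy documents and helper name below are illustrative assumptions):

```python
from collections import Counter

def build_dfm(documents):
    """Build a raw-count document-feature matrix as a list of dicts.

    Each row maps a feature (token) to its frequency in one document,
    mirroring the raw counts a quanteda dfm stores.
    """
    vocab = sorted({tok for doc in documents for tok in doc.split()})
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append({feat: counts.get(feat, 0) for feat in vocab})
    return vocab, rows

# Toy pre-processed (lowercased, lemmatized) articles -- illustrative only.
docs = ["election result announce", "election observer report result result"]
vocab, dfm = build_dfm(docs)
```

In the real dataset the rows would be the 200,000+ articles and the columns the lemmatized features, with publication date and source URL carried alongside as document metadata.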

  2. SIAM 2007 Text Mining Competition dataset

    • catalog.data.gov
    • data.amerigeoss.org
    Updated Dec 7, 2023
    Cite
    Dashlink (2023). SIAM 2007 Text Mining Competition dataset [Dataset]. https://catalog.data.gov/dataset/siam-2007-text-mining-competition-dataset
    Explore at:
    Dataset updated
    Dec 7, 2023
    Dataset provided by
    Dashlink
    Description

    Subject Area: Text Mining. Description: This is the dataset used for the SIAM 2007 Text Mining competition, which focused on developing text mining algorithms for document classification. The documents in question were aviation safety reports documenting one or more problems that occurred during certain flights. The goal was to label the documents with respect to the types of problems described. This is a subset of the publicly available Aviation Safety Reporting System (ASRS) dataset. How Data Was Acquired: The data for this competition came from human-generated reports on incidents that occurred during a flight. Sample Rates, Parameter Description, and Format: There is one document per incident. The datasets are in raw text format, with all documents for each set contained in a single file. Each row in this file corresponds to a single document. The first characters on each line are the document number, and a tilde separates the document number from the text itself. Anomalies/Faults: This is a document category classification problem.
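The one-document-per-line layout with a tilde separator can be parsed by splitting each line once on the first tilde. A small Python sketch (the sample lines are invented, not actual ASRS text):

```python
def parse_reports(lines):
    """Split 'docnum~text' lines into (document number, report text) pairs."""
    reports = []
    for line in lines:
        doc_id, _, text = line.partition("~")  # split on the first tilde only
        reports.append((int(doc_id), text.strip()))
    return reports

sample = [
    "1~HYDRAULIC PRESSURE WARNING DURING CLIMB.",
    "2~RUNWAY INCURSION REPORTED BY TOWER.",
]
parsed = parse_reports(sample)
```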

  3. Anomaly Detection with Text Mining

    • catalog.data.gov
    • data.nasa.gov
    • +2more
    Updated Dec 6, 2023
    Cite
    Dashlink (2023). Anomaly Detection with Text Mining [Dataset]. https://catalog.data.gov/dataset/anomaly-detection-with-text-mining
    Explore at:
    Dataset updated
    Dec 6, 2023
    Dataset provided by
    Dashlink
    Description

    Many existing complex space systems have a significant amount of historical maintenance and problem databases that are stored in unstructured text form. The problem that we address in this paper is the discovery of recurring anomalies and relationships between problem reports that may indicate larger systemic problems. We illustrate our techniques on data from discrepancy reports regarding software anomalies in the Space Shuttle. These free-text reports are written by a number of different people, so the emphasis and wording vary considerably. With Mehran Sahami from Stanford University, I'm putting together a book on text mining called "Text Mining: Theory and Applications" to be published by Taylor and Francis.

  4. Text Mining Dataset

    • zenodo.org
    bin
    Updated Jan 15, 2022
    Cite
    Anonymos (2022). Text Mining Dataset [Dataset]. http://doi.org/10.5281/zenodo.5853572
    Explore at:
    bin
    Available download formats
    Dataset updated
    Jan 15, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymos
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Text Mining Dataset

  5. Teaching & Learning Team Text Mining Workshop

    • figshare.com
    pdf
    Updated Aug 6, 2018
    Cite
    Elizabeth Joan Kelly (2018). Teaching & Learning Team Text Mining Workshop [Dataset]. http://doi.org/10.6084/m9.figshare.6938138.v1
    Explore at:
    pdf
    Available download formats
    Dataset updated
    Aug 6, 2018
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Elizabeth Joan Kelly
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Materials from a Voyant workshop conducted for library liaisons as part of TLT/Faculty Development/Digital Scholarship. Objectives: analyze text with Voyant; identify uses for text mining in instruction; design accessible instruction for working with text. Associated Research Guide: http://researchguides.loyno.edu/text_workshop

  6. Text-Analysis

    • kaggle.com
    Updated Apr 14, 2023
    Cite
    (2023). Text-Analysis [Dataset]. https://www.kaggle.com/datasets/vivek603/text-analysis
    Explore at:
    Croissant
    Croissant is a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 14, 2023
    Description

    Title: Text-Analysis Dataset with Stopwords, Positive Words, and Negative Words

    Description: This dataset is designed for text analysis tasks and contains three types of words: stopwords, positive words, and negative words. Stopwords are common words that are typically removed from text during preprocessing because they don't carry much meaning, such as "the," "and," "a," etc. Positive words are words that convey a positive sentiment, while negative words are words that convey a negative sentiment.

    The stopwords were obtained from a standard list used in natural language processing, while the positive and negative words were obtained from publicly available sentiment lexicons.

    Each word is provided as a separate entry in the dataset.

    The dataset is provided in CSV format and is suitable for use in various text analysis tasks, such as sentiment analysis, text classification, and natural language processing.

    Columns: each CSV contains a single column holding the specified set of words.

    E.g., positive-words.txt: a+, abound, abounds, abundance, abundant, accessable, accessible, acclaim, acclaimed, acclamation, accolade, accolades, accommodative, and so on.

    This dataset can be used to build models that can automatically classify text as positive or negative, or to identify which words are likely to carry more meaning in a given text.
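A simple baseline with such lexicons is a count-based polarity score: a text leans positive when it contains more positive than negative lexicon words. A hedged Python sketch (the tiny word sets below stand in for the full word lists shipped with the dataset):

```python
def polarity(text, positive, negative):
    """Score text by lexicon hits: >0 leans positive, <0 negative, 0 neutral."""
    tokens = text.lower().split()
    return sum(tok in positive for tok in tokens) - sum(tok in negative for tok in tokens)

# Tiny stand-in lexicons; the dataset ships full stopword/positive/negative lists.
POS = {"acclaimed", "abundant", "accessible"}
NEG = {"awful", "broken"}

score = polarity("the acclaimed release felt abundant not broken", POS, NEG)
```

In practice one would also drop the stopwords file's entries from the token stream before scoring, since those words carry little sentiment signal.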

  7. Data from: Text and Data Mining: Seeking Traction

    • osf.io
    Updated Mar 14, 2018
    Cite
    Thomas Padilla (2018). Text and Data Mining: Seeking Traction [Dataset]. https://osf.io/ns5xz
    Explore at:
    Dataset updated
    Mar 14, 2018
    Dataset provided by
    Center for Open Science (https://cos.io/)
    Authors
    Thomas Padilla
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Working in libraries affords many opportunities to engage with the challenges associated with text and data mining (TDM) of limited-access datasets. These challenges arise across interactions with diverse disciplinary communities at institutions with varying resources. Increasingly, data requested by research communities fall outside the traditional purview of library collection development and research support. Solutions to TDM challenges are few, given gaps in understanding and misaligned values. Continued activity in light of these factors fosters an environment where weaknesses and threats are many and seemingly tractable opportunities are few. In the space of this brief statement I will introduce four challenges: (1) underdeveloped and inconsistent content provider effort to meet text and data mining needs, (2) misaligned values reinforced by ambiguous and/or overly restrictive content provider terms, (3) debt incurred by technical abstraction, and (4) purposeful technical opacity.

  8. Text Mining

    • kaggle.com
    zip
    Updated Jul 2, 2022
    Cite
    Samratsingh Dikkhat (2022). Text Mining [Dataset]. https://www.kaggle.com/datasets/samratsinghdikkhat/text-mining
    Explore at:
    zip (25,421 bytes)
    Available download formats
    Dataset updated
    Jul 2, 2022
    Authors
    Samratsingh Dikkhat
    Description

    Dataset

    This dataset was created by Samratsingh Dikkhat

    Contents

  9. text-mining

    • kaggle.com
    zip
    Updated Apr 24, 2018
    Cite
    Zoumana (2018). text-mining [Dataset]. https://www.kaggle.com/keitazoumana/textmining
    Explore at:
    zip (409,259 bytes)
    Available download formats
    Dataset updated
    Apr 24, 2018
    Authors
    Zoumana
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Zoumana

    Released under CC0: Public Domain

    Contents

  10. Data from: Identifying Missing Data Handling Methods with Text Mining

    • openicpsr.org
    delimited
    Updated Mar 8, 2023
    Cite
    Krisztián Boros; Zoltán Kmetty (2023). Identifying Missing Data Handling Methods with Text Mining [Dataset]. http://doi.org/10.3886/E185961V1
    Explore at:
    delimited
    Available download formats
    Dataset updated
    Mar 8, 2023
    Dataset provided by
    Hungarian Academy of Sciences
    Authors
    Krisztián Boros; Zoltán Kmetty
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 1999 - Dec 31, 2016
    Description

    Missing data is an inevitable aspect of empirical research. Researchers have developed several techniques to handle missing data to avoid information loss and bias. Over the past 50 years, these methods have become more and more efficient and also more complex. Building on previous review studies, this paper analyzes what kinds of missing data handling methods are used across various scientific disciplines. For the analysis, we used nearly 50,000 scientific articles published between 1999 and 2016. JSTOR provided the data in text format, and we utilized a text-mining approach to extract the necessary information from our corpus. Our results show that the usage of advanced missing data handling methods such as Multiple Imputation or Full Information Maximum Likelihood estimation grew steadily over the examined period. At the same time, simpler methods, like listwise and pairwise deletion, are still in widespread use.
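A text-mining pass of the kind described can be approximated by matching method names against article text. A simplified Python stand-in (the keyword table and function are illustrative assumptions, not the authors' actual extraction pipeline):

```python
# Hypothetical phrase-to-label table; real pipelines would use many variants.
METHODS = {
    "multiple imputation": "Multiple Imputation",
    "full information maximum likelihood": "FIML",
    "listwise deletion": "Listwise deletion",
    "pairwise deletion": "Pairwise deletion",
}

def detect_methods(article_text):
    """Return the missing-data handling methods named in an article's text."""
    text = article_text.lower()
    return sorted({label for phrase, label in METHODS.items() if phrase in text})

found = detect_methods(
    "We address missingness with multiple imputation rather than listwise deletion."
)
```

Aggregating such per-article matches by publication year is what lets a study like this one chart the rise and fall of each method over time.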

  11. Data from: Text mining with machine learning : principles and techniques

    • workwithdata.com
    Updated Sep 20, 2022
    Cite
    Work With Data (2022). Text mining with machine learning : principles and techniques [Dataset]. https://www.workwithdata.com/book/text-mining-machine-learning-principles-techniques-book-by-jan-zizka-0000
    Explore at:
    Dataset updated
    Sep 20, 2022
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Explore Text mining with machine learning: principles and techniques through unique data from multiple sources: key facts, real-time news, interactive charts, detailed maps & open datasets.

  12. NASICON-type solid electrolyte materials named entity recognition dataset

    • scidb.cn
    Updated Apr 27, 2023
    Cite
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi (2023). NASICON-type solid electrolyte materials named entity recognition dataset [Dataset]. http://doi.org/10.57760/sciencedb.j00213.00001
    Explore at:
    Dataset updated
    Apr 27, 2023
    Dataset provided by
    ScienceDB
    Authors
    Liu Yue; Liu Dahui; Yang Zhengwei; Shi Siqi
    Description

    1. Framework overview. This paper proposed a pipeline to construct high-quality datasets for text mining in materials science. Firstly, we utilize a traceable automatic literature acquisition scheme to ensure the traceability of the textual data. Then, a data processing method driven by downstream tasks is applied to generate high-quality pre-annotated corpora conditioned on the characteristics of materials texts. On this basis, we define a general annotation scheme derived from the materials science tetrahedron to complete high-quality annotation. Finally, a conditional data augmentation model incorporating materials domain knowledge (cDA-DK) is constructed to augment the data quantity.

    2. Dataset information. The experimental datasets used in this paper include the Matscholar dataset publicly published by Weston et al. (DOI: 10.1021/acs.jcim.9b00470) and the NASICON entity recognition dataset constructed by ourselves. Here we mainly introduce the details of the NASICON entity recognition dataset.

    2.1 Data collection and preprocessing. Firstly, 55 materials science papers related to the NASICON system were collected through Crystallographic Information Files (CIF), which contain a wealth of structure-activity relationship information. Note that materials science literature is mostly stored as portable document format (PDF), with content arranged in columns and mixed with tables, images, and formulas, which significantly compromises the readability of the text sequence. To tackle this issue, we employ the text parser PDFMiner (a Python toolkit) to standardize, segment, and parse the original documents, thereby converting PDF literature into plain text. In this process, the entire textual information of the literature (title, author, abstract, keywords, institution, publisher, and publication year) is retained and stored as a unified TXT document. Subsequently, we apply rules based on Python regular expressions to remove redundant information, such as garbled characters and line breaks caused by figures, tables, and formulas. This results in a cleaner text corpus, enhancing its readability and enabling more efficient data analysis. Note that special symbols may also appear as garbled characters, but we refrain from directly deleting them, as they may contain valuable information such as chemical units. Therefore, we converted all such symbols to a special token.
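The regular-expression cleanup step described above can be sketched in Python; the specific patterns and the token name are illustrative assumptions rather than the authors' actual rules:

```python
import re

def clean_text(raw, special_token="<SYM>"):
    """Normalize parsed PDF text: replace non-ASCII symbols with a placeholder
    token (rather than deleting them, since they may be chemical units), then
    collapse line breaks and repeated whitespace into single spaces."""
    text = re.sub(r"[^\x00-\x7F]", f" {special_token} ", raw)
    text = re.sub(r"\s+", " ", text)  # squeeze line breaks left by figures/tables
    return text.strip()

cleaned = clean_text("Na3Zr2Si2PO12 conducts\nions at 25 \u00b0C")
```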

  13. [Coursera ] Text Mining and Analytics

    • academictorrents.com
    bittorrent
    Updated Jan 23, 2017
    Cite
    Coursera (2017). [Coursera ] Text Mining and Analytics [Dataset]. https://academictorrents.com/details/e2c129491a3841bfac5d7b08b41ad79387132a23
    Explore at:
    bittorrent
    Available download formats
    Dataset updated
    Jan 23, 2017
    Dataset authored and provided by
    Coursera (http://coursera.org/)
    License

    https://academictorrents.com/nolicensespecified

    Description

    A BitTorrent file to download data with the title '[Coursera ] Text Mining and Analytics'

  14. Product Reviews Dataset for Emotions Classification Tasks - Indonesian (PRDECT-ID) Dataset

    • data.mendeley.com
    Updated May 19, 2022
    + more versions
    Cite
    Rhio Sutoyo (2022). Product Reviews Dataset for Emotions Classification Tasks - Indonesian (PRDECT-ID) Dataset [Dataset]. http://doi.org/10.17632/574v66hf2v.1
    Explore at:
    Dataset updated
    May 19, 2022
    Authors
    Rhio Sutoyo
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    PRDECT-ID Dataset is a collection of Indonesian product review data annotated with emotion and sentiment labels. The data were collected from Tokopedia, one of the largest e-commerce platforms in Indonesia. The dataset contains product reviews in the Indonesian language from 29 product categories on Tokopedia. Each product review is annotated with a single emotion: love, happiness, anger, fear, or sadness. A group of annotators assigned the emotion labels following annotation criteria created by an expert in clinical psychology. Other attributes related to the product review, such as Location, Price, Overall Rating, Number Sold, Total Review, and Customer Rating, are also extracted to support further research.

  15. Positive And Negative Corpus

    • data.mendeley.com
    Updated May 24, 2022
    + more versions
    Cite
    Abdullah Al Taawab (2022). Positive And Negative Corpus [Dataset]. http://doi.org/10.17632/s6mtp2zzpc.3
    Explore at:
    Dataset updated
    May 24, 2022
    Authors
    Abdullah Al Taawab
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The repository contains 1,300 transliterated Bengali comments, of which 647 are labeled positive and 653 negative. The comments were manually annotated and collected from Facebook and YouTube using a data scraping tool. Positive comments are labeled 0 and negative comments are labeled 1.

  16. Social Communication Database

    • data.mendeley.com
    Updated Dec 9, 2019
    Cite
    Kailas PATIL (2019). Social Communication Database [Dataset]. http://doi.org/10.17632/wf5d5b2j52.1
    Explore at:
    Dataset updated
    Dec 9, 2019
    Authors
    Kailas PATIL
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset was generated from real-world text communication in group conversations. The usernames and mobile numbers of the participants have been anonymized. This dataset can be used for education and research purposes only.

    The main contribution of this research is to provide a standard dataset for research on mining chat conversations. We have observed that this dataset has considerable potential for research use: it can be applied to semantic search, sentiment analysis, semantic clustering of conversations, topic extraction, spam detection, etc. We offer this dataset for others to collaborate on and research further possibilities.

    We used our algorithms to extract the textual information from WhatsApp logs and stored it in a SQLite database file named "social conversation.db".

    This dataset contains 16,225 text messages from 839 distinct users, extracted from 17 WhatsApp groups.

    Paper: Analysis of foul language usage in social media text conversation. Authors: Sumit Kawate and Kailas Patil. Int. J. Social Media and Interactive Learning Environments (IJSMILE), Vol. 5, Issue 3, pp. 227-251, Inderscience, 2017. DOI: https://doi.org/10.1504/IJSMILE.2017.087976

    The data is stored in a .zip compressed archive. The uncompressed archive is 6,020 KB (5.87 MB). Extract with any standard decompression software.

    The archive contains the following items:

    DATABASE/
    + Executable/ (directory containing executable files)
    | + Social Conversation.db (records of the database in .db format)
    + Source Code/ (directory containing source code files)
    | + Social Conversation (csv).csv (records of the database in .csv format)
    | + Social Conversation (db).db (records of the database in .db format)
    | + Social Conversation (html).html (records of the database in .html format)
    + Read Me/ (directory containing the read-me file)
    | + read me.txt (detailed information about the dataset)

    The data format of the dataset is:

    Table name: CONVERSATION

    Attribute: Meaning
    USER_ID: user id of the text message
    TEXT_MSG: actual text message
    CONTACT_NUMBER: contact number of the user (a few digits are masked)
    DATE: date of the text message
    TIME: time of the text message

    Attribute: Format
    USER_ID: User <id>
    TEXT_MSG: text message in any format
    CONTACT_NUMBER: +<contact number>
    DATE: dd/mm/yy
    TIME: hh:mm AM or hh:mm PM

    Attribute: Sample example
    USER_ID: User 514
    TEXT_MSG: Any deal on formal shoes with prime
    CONTACT_NUMBER: +919xxxxx927
    DATE: 04/12/17
    TIME: 1:35 AM
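The CONVERSATION table can be reproduced and queried with Python's built-in sqlite3 module. An in-memory sketch (the column names follow the attribute list above; the inserted row mirrors the sample example):

```python
import sqlite3

# Build an in-memory database with the CONVERSATION schema described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE CONVERSATION (
        USER_ID TEXT, TEXT_MSG TEXT, CONTACT_NUMBER TEXT,
        DATE TEXT, TIME TEXT)"""
)
conn.execute(
    "INSERT INTO CONVERSATION VALUES (?, ?, ?, ?, ?)",
    ("User 514", "Any deal on formal shoes with prime",
     "+919xxxxx927", "04/12/17", "1:35 AM"),
)
row = conn.execute(
    "SELECT USER_ID, TEXT_MSG FROM CONVERSATION WHERE DATE = '04/12/17'"
).fetchone()
```

Against the distributed "social conversation.db" file, the same queries would run with `sqlite3.connect("social conversation.db")` instead of the in-memory database.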

  17. Data from: Antibody Watch: Text Mining Antibody Specificity from the Literature

    • live.european-language-grid.eu
    • zenodo.org
    ms-excel
    Updated Nov 24, 2023
    Cite
    (2023). Antibody Watch: Text Mining Antibody Specificity from the Literature [Dataset]. https://live.european-language-grid.eu/catalogue/corpus/7822
    Explore at:
    ms-excel
    Available download formats
    Dataset updated
    Nov 24, 2023
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Abstract. Motivation: Antibodies are widely used reagents to test for expression of proteins. However, they might not always reliably produce results when they do not specifically bind to the target proteins that their providers designed them for, leading to unreliable research results. Results: We developed a deep neural network system and tested its performance with a corpus of more than two thousand articles that reported uses of antibodies. We divided the problem into two tasks. Given an input article, the first task is to identify snippets about antibody specificity and classify whether the snippets report any antibody that is nonspecific, and thus problematic. The second task is to link each of these snippets to one or more antibodies that the snippet refers to. We leveraged Research Resource Identifiers (RRID) to precisely identify antibodies linked to the extracted specificity snippets. The results show that it is feasible to construct a reliable knowledge base about problematic antibodies by text mining.

  18. AI usage for text mining in Norway in 2023, by industry

    • statista.com
    Updated Mar 20, 2024
    Cite
    Statista (2024). AI usage for text mining in Norway in 2023, by industry [Dataset]. https://www.statista.com/statistics/1456827/text-mining-ai-usage-norway/
    Explore at:
    Dataset updated
    Mar 20, 2024
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Norway
    Description

    In 2023, information and communication was the industry with the highest usage of artificial intelligence (AI) for text mining in Norway, with a share of 16 percent. Wholesale trade (except of motor vehicles and motorcycles) was the industry in second place, with a 5 percent share.

  19. Enron Authorship Verification Corpus

    • data.mendeley.com
    • search.datacite.org
    Updated Sep 10, 2018
    Cite
    Oren Halvani (2018). Enron Authorship Verification Corpus [Dataset]. http://doi.org/10.17632/n77w7mygwg.2
    Explore at:
    Dataset updated
    Sep 10, 2018
    Authors
    Oren Halvani
    License

    http://www.gnu.org/licenses/gpl-3.0.en.html

    Description

    ===========================================

    Type of corpus:

    The "Enron Authorship Verification Corpus" is a derivative of the well-known "Enron Email Dataset" [1], which has been used across different research domains beyond Authorship Verification (AV). The intention behind this corpus is to give other researchers in the field of AV the opportunity to compare their results to each other.

    ===========================================

    Language:

    All texts are written in English.

    ===========================================

    Format of the corpus:

    The corpus was transformed to meet the standardized format of the "PAN Authorship Identification corpora" [2]. It consists of 80 AV cases, evenly distributed between true (Y) and false (N) authorships, together with the ground truth (Y/N) for all AV cases. Each AV case comprises up to 5 documents (plain text files), where 2-4 documents stem from a known author, while the 5th document has an unknown authorship and thus is the subject of verification. Each document has been written by a single author X and is mostly aggregated from several mails of X, in order to provide a sufficient length that captures X's writing style.

    ===========================================

    Preprocessing steps:

    All texts in the corpus were preprocessed by hand, which resulted in an overall processing time of more than 30 hours. The preprocessing includes de-duplication, normalization of UTF-8 symbols, and the removal of URLs, e-mail headers, signatures and other metadata. Beyond these, the texts themselves underwent a variety of cleaning procedures, including the removal of greetings/closing formulas, (telephone) numbers, named entities (names of people, companies, locations, etc.), quotes, and repetitions of identical characters/symbols and words. As a last preprocessing step, multiple successive blanks, newlines and tabs were substituted with a single blank.

    ===========================================

    Basic statistics:

    The length of each preprocessed text ranges from 2,200 to 5,000 characters. More precisely, the average length of each known document is 3,976 characters, while the average length of each unknown document is 3,899 characters.

    ===========================================

    Paper + Citation:

    https://link.springer.com/chapter/10.1007/978-3-319-98932-7_4

    ===========================================

    References:

    [1] https://www.cs.cmu.edu/~enron
    [2] http://pan.webis.de

  20. Emerging Topics in Project Management Research: A text mining analysis of Scientific Literautre

    • data.mendeley.com
    Updated Jan 13, 2022
    + more versions
    Cite
    Vito Giordano (2022). Emerging Topics in Project Management Research: A text mining analysis of Scientific Literautre [Dataset]. http://doi.org/10.17632/37g2cst34c.2
    Explore at:
    Dataset updated
    Jan 13, 2022
    Authors
    Vito Giordano
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    The dataset contains the text analysis performed with Natural Language Processing (NLP) techniques on scientific articles of Project Management (PM) collected from Scopus. It comprises the list of tokens extracted from paper abstracts, together with the number of papers containing each token and its rate of growth. The tokens are used to identify emerging research topics in the PM literature.
