100+ datasets found
  1. Data Mining Assignment

    • kaggle.com
    Updated Oct 30, 2024
    Cite
    Esat Dalbudak (2024). Data Mining Assignment [Dataset]. https://www.kaggle.com/datasets/esatdalbudak/data-mining-assignment/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 30, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Esat Dalbudak
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Esat Dalbudak

    Released under MIT

    Contents

  2. Dataset of book subjects that contain Learning Data Mining with Python

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain Learning Data Mining with Python [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Learning+Data+Mining+with+Python&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 2 rows and is filtered where the book is Learning Data Mining with Python. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.
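    As a hedged illustration of the filter the description refers to, the pandas sketch below (using a hypothetical local CSV export and a hypothetical column name "book") reproduces the same selection:

    import pandas as pd

    # Hypothetical local export of the Work With Data "book subjects" table.
    df = pd.read_csv("book_subjects.csv")

    # Keep only rows where the book is "Learning Data Mining with Python".
    subset = df[df["book"] == "Learning Data Mining with Python"]

    print(subset.shape)              # the description above reports 2 rows and 10 columns
    print(subset.columns.tolist())   # e.g. number of authors, earliest/latest publication date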

  3. Data from: Data Mining Using R:

    • kaggle.com
    Updated Jul 2, 2018
    Cite
    Data Science (2018). Data Mining Using R: [Dataset]. https://www.kaggle.com/ravali566/data-mining-using-r/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 2, 2018
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Data Science
    License

    CC0 1.0 Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Data Science

    Released under CC0: Public Domain

    Contents

  4. Data from: Data Mining Project

    • kaggle.com
    Updated May 14, 2022
    Cite
    Khanh Vương (2022). Data Mining Project [Dataset]. https://www.kaggle.com/khanhvng/data-mining-project/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    May 14, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Khanh Vương
    Description

    Dataset

    This dataset was created by Khanh Vương

    Contents

  5. Dataset of book subjects that contain Spatial data mining : theory and application

    • workwithdata.com
    Updated Nov 7, 2024
    Cite
    Work With Data (2024). Dataset of book subjects that contain Spatial data mining : theory and application [Dataset]. https://www.workwithdata.com/datasets/book-subjects?f=1&fcol0=j0-book&fop0=%3D&fval0=Spatial+data+mining+%3A+theory+and+application&j=1&j0=books
    Explore at:
    Dataset updated
    Nov 7, 2024
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about book subjects. It has 4 rows and is filtered where the book is Spatial data mining : theory and application. It features 10 columns including number of authors, number of books, earliest publication date, and latest publication date.

  6. Dataset of books about Data mining-Social aspects

    • workwithdata.com
    Updated Apr 17, 2025
    Cite
    Work With Data (2025). Dataset of books about Data mining-Social aspects [Dataset]. https://www.workwithdata.com/datasets/books?f=1&fcol0=j0-book_subject&fop0=%3D&fval0=Data+mining-Social+aspects&j=1&j0=book_subjects
    Explore at:
    Dataset updated
    Apr 17, 2025
    Dataset authored and provided by
    Work With Data
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is about books. It has 11 rows and is filtered where the book subject is Data mining-Social aspects. It features 9 columns including author, publication date, language, and book publisher.

  7. External utilities of items from Table 1.

    • plos.figshare.com
    xls
    Updated Feb 3, 2025
    Cite
    Loan T. T. Nguyen; Hoa Duong; An Mai; Bay Vo (2025). External utilities of items from Table 1. [Dataset]. http://doi.org/10.1371/journal.pone.0317427.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Loan T. T. Nguyen; Hoa Duong; An Mai; Bay Vo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Privacy is a critical issue in the age of data. Organizations and corporations that publicly share their data always have a major concern that their sensitive information may be leaked or extracted by rivals or attackers using data miners. High-utility itemset mining (HUIM) is an extension of frequent itemset mining (FIM) that deals with business data in the form of transaction databases, data that is also in danger of being stolen. To deal with this, a number of privacy-preserving data mining (PPDM) techniques have been introduced. An important topic in PPDM in recent years is privacy-preserving utility mining (PPUM). The goal of PPUM is to protect sensitive information, such as sensitive high-utility itemsets, in transaction databases and make it undiscoverable by data mining techniques. However, available PPUM methods do not consider the generalization of items in databases (categories, classes, groups, etc.). These algorithms only consider items at a specialized level, leaving item combinations at higher levels vulnerable to attacks. The insights gained from higher abstraction levels are somewhat more valuable than those from lower levels since they contain the outlines of the data. To address this issue, this work proposes two PPUM algorithms, MLHProtector and FMLHProtector, that operate at all abstraction levels in a transaction database to protect them from data mining algorithms. Empirical experiments showed that both algorithms successfully protect the itemsets from being compromised by attackers.
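    For readers unfamiliar with HUIM, the following minimal Python sketch (toy data, not the paper's tables) shows the utility computation the abstract refers to: an itemset's utility in a transaction is the sum of quantity times external (per-unit) utility over its items, summed over all transactions that contain the itemset.

    # Toy transaction database: each transaction maps item -> purchased quantity.
    database = [
        {"a": 2, "b": 1},
        {"a": 1, "c": 3},
        {"a": 2, "b": 2, "c": 1},
    ]
    # External (per-unit) utilities of items (toy values, not the paper's Table 1).
    external_utility = {"a": 5, "b": 2, "c": 1}

    def itemset_utility(itemset, database, external_utility):
        total = 0
        for transaction in database:
            if all(item in transaction for item in itemset):
                total += sum(transaction[i] * external_utility[i] for i in itemset)
        return total

    # An itemset is "high utility" if its utility reaches a minimum threshold.
    min_util = 15
    for itemset in [("a",), ("a", "b"), ("a", "c")]:
        u = itemset_utility(itemset, database, external_utility)
        print(itemset, u, "high-utility" if u >= min_util else "low-utility")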

  8. MC% ratios at different abstraction levels l.

    • figshare.com
    xls
    Updated Feb 3, 2025
    + more versions
    Cite
    Loan T. T. Nguyen; Hoa Duong; An Mai; Bay Vo (2025). MC% ratios at different abstraction levels l. [Dataset]. http://doi.org/10.1371/journal.pone.0317427.t014
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Loan T. T. Nguyen; Hoa Duong; An Mai; Bay Vo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Privacy is a critical issue in the age of data. Organizations and corporations that publicly share their data always have a major concern that their sensitive information may be leaked or extracted by rivals or attackers using data miners. High-utility itemset mining (HUIM) is an extension of frequent itemset mining (FIM) that deals with business data in the form of transaction databases, data that is also in danger of being stolen. To deal with this, a number of privacy-preserving data mining (PPDM) techniques have been introduced. An important topic in PPDM in recent years is privacy-preserving utility mining (PPUM). The goal of PPUM is to protect sensitive information, such as sensitive high-utility itemsets, in transaction databases and make it undiscoverable by data mining techniques. However, available PPUM methods do not consider the generalization of items in databases (categories, classes, groups, etc.). These algorithms only consider items at a specialized level, leaving item combinations at higher levels vulnerable to attacks. The insights gained from higher abstraction levels are somewhat more valuable than those from lower levels since they contain the outlines of the data. To address this issue, this work proposes two PPUM algorithms, MLHProtector and FMLHProtector, that operate at all abstraction levels in a transaction database to protect them from data mining algorithms. Empirical experiments showed that both algorithms successfully protect the itemsets from being compromised by attackers.

  9. Retrospective data mining project of Student Subject Experience Surveys from WEL418

    • researchdata.edu.au
    Updated 2023
    Cite
    Monica Short; Environmental and Social Justice Research Group (2023). Retrospective data mining project of Student Subject Experience Surveys from WEL418 [Dataset]. https://researchdata.edu.au/retrospective-mining-project-surveys-wel418/2923246
    Explore at:
    Dataset updated
    2023
    Dataset provided by
    Charles Sturt University (http://csu.edu.au/)
    Authors
    Monica Short; Environmental and Social Justice Research Group
    Time period covered
    2014 - Jun 17, 2022
    Description

    This data is the set of responses to Student Subject Experience Surveys from WEL418 case management for two academics, Katrina Gersbach and Dr Monica Short, for the sessions they taught in the period 2014 to 17 June 2022.

  10. Data Mining Project 1 Sapfile

    • kaggle.com
    Updated Feb 5, 2019
    Cite
    Prutchakorn (2019). Data Mining Project 1 Sapfile [Dataset]. https://www.kaggle.com/prutchakorn/data-mining-project-1-sapfile/notebooks
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 5, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Prutchakorn
    Description

    Dataset

    This dataset was created by Prutchakorn

    Contents

  11. Data Mining for IVHM using Sparse Binary Ensembles, Phase I

    • data.nasa.gov
    application/rdfxml +5
    Updated Jun 26, 2018
    Cite
    (2018). Data Mining for IVHM using Sparse Binary Ensembles, Phase I [Dataset]. https://data.nasa.gov/dataset/Data-Mining-for-IVHM-using-Sparse-Binary-Ensembles/qfus-evzq
    Explore at:
    Available download formats: xml, tsv, csv, application/rssxml, application/rdfxml, json
    Dataset updated
    Jun 26, 2018
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    In response to NASA SBIR topic A1.05, "Data Mining for Integrated Vehicle Health Management", Michigan Aerospace Corporation (MAC) asserts that our unique SPADE (Sparse Processing Applied to Data Exploitation) technology meets a significant fraction of the stated criteria and has functionality that enables it to handle many applications within the aircraft lifecycle. SPADE distills input data into highly quantized features and uses MAC's novel techniques for constructing Ensembles of Decision Trees to develop extremely accurate diagnostic/prognostic models for classification, regression, clustering, anomaly detection and semi-supervised learning tasks. These techniques are currently being employed to do Threat Assessment for satellites in conjunction with researchers at the Air Force Research Lab. Significant advantages of this approach include: 1) it is completely data driven; 2) training and evaluation are faster than conventional methods; 3) it operates effectively on huge datasets (> billion samples x > million features); and 4) it has proven to be as accurate as state-of-the-art techniques in many significant real-world applications. The specific goals for Phase 1 will be to work with domain experts at NASA and with our partners Boeing, SpaceX and GMV Space Systems to delineate a subset of problems that are particularly well-suited to this approach and to determine requirements for deploying algorithms on platforms of opportunity.
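    SPADE itself is proprietary; purely as a generic, hedged illustration of the ensemble-of-decision-trees idea it builds on, here is a short scikit-learn sketch on synthetic data (not NASA/MAC data):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for quantized diagnostic features.
    X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An ensemble of decision trees used as a diagnostic classifier.
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))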

  12. Replication Data for: Topic modeling of the quality of guest’s experience using latent Dirichlet allocation: western versus eastern perspectives

    • data.mendeley.com
    Updated Jan 18, 2023
    Cite
    Raksmey Sann (2023). Replication Data for: Topic modeling of the quality of guest’s experience using latent Dirichlet allocation: western versus eastern perspectives [Dataset]. http://doi.org/10.17632/fcvhhjmyfj.1
    Explore at:
    Dataset updated
    Jan 18, 2023
    Authors
    Raksmey Sann
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset includes replication data for the paper: Sann, R. and Lai, P.-C. (2023), "Topic modeling of the quality of guest’s experience using latent Dirichlet allocation: western versus eastern perspectives", Consumer Behavior in Tourism and Hospitality, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/CBTH-04-2022-0084

  13. Replication Data for: Policy Diffusion: The Issue-Definition Stage

    • search.dataone.org
    Updated Nov 22, 2023
    Cite
    Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno (2023). Replication Data for: Policy Diffusion: The Issue-Definition Stage [Dataset]. http://doi.org/10.7910/DVN/QEMNP1
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Gilardi, Fabrizio; Shipan, Charles R.; Wüest, Bruno
    Description

    We put forward a new approach to studying issue definition within the context of policy diffusion. Most studies of policy diffusion (the process by which policymaking in one government affects policymaking in other governments) have focused on policy adoptions. We shift the focus to an important but neglected aspect of this process: the issue-definition stage. We use topic models to estimate how policies are framed during this stage and how these frames are predicted by prior policy adoptions. Focusing on smoking restriction in U.S. states, our analysis draws upon an original dataset of over 52,000 paragraphs from newspapers covering 49 states between 1996 and 2013. We find that frames regarding the policy's concrete implications are predicted by prior adoptions in other states, while frames regarding its normative justifications are not. Our approach and findings open the way for a new perspective to studying policy diffusion in many different areas.

  14. Data from: Joint Behavior-Topic Model for Microblogs

    • researchdata.smu.edu.sg
    bin
    Updated May 31, 2023
    Cite
    QIU Minghui; Feida ZHU; Jing JIANG (2023). Joint Behavior-Topic Model for Microblogs [Dataset]. http://doi.org/10.25440/smu.12062724.v1
    Explore at:
    Available download formats: bin
    Dataset updated
    May 31, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    QIU Minghui; Feida ZHU; Jing JIANG
    License

    http://rightsstatements.org/vocab/InC/1.0/

    Description

    We propose an LDA-based behavior-topic model (B-LDA) which jointly models user topic interests and behavioral patterns. We focus the study of the model on online social network settings such as microblogs like Twitter, where the textual content is relatively short but user interactions are rich. Related publication: Qiu, M., Zhu, F., & Jiang, J. (2013). It is not just what we say, but how we say them: LDA-based behavior-topic model. In 2013 SIAM International Conference on Data Mining (SDM’13): 2-4 May, Austin, Texas (pp. 794-802). Philadelphia: SIAM. http://doi.org/10.1137/1.9781611972832.88
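    B-LDA is the authors' joint model; as a hedged baseline illustration of the plain topic-model component (LDA over short, microblog-style texts), a scikit-learn sketch:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "new phone battery lasts all day",
        "battery drains fast after the update",
        "great coffee shop downtown",
        "best espresso and pastries downtown",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Two latent topics; B-LDA additionally models user behaviour, which is omitted here.
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        print(f"topic {k}:", [terms[i] for i in topic.argsort()[-3:]])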

  15. Pokemon data mining 2020

    • kaggle.com
    Updated Jul 31, 2020
    Cite
    AJ Pass (2020). Pokemon data mining 2020 [Dataset]. https://www.kaggle.com/ajpass/pokemon-data-mining-2020/metadata
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 31, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    AJ Pass
    Description

    Context

    This dataset was obtained using a web scraper made in the following notebook, for learning purposes in data mining and web scraping:

    https://www.kaggle.com/ajpass/data-mining-web-scrapper-vol-1-pokedex

    Content

    Inside this dataset are the different generations of Pokémon with all their stats.

    Acknowledgements

    This dataset came from the knowledge I learned following a tutorial a year ago; because I couldn't find that tutorial again, I made a version with what I remembered.

  16. iKala Dataset

    • paperswithcode.com
    • opendatalab.com
    • +1 more
    + more versions
    Cite
    Tak-Shing Chan; Tzu-Chun Yeh; Zhe-Cheng Fan; Hung-Wei Chen; Li Su; Yi-Hsuan Yang; Jyh-Shing Roger Jang, iKala Dataset [Dataset]. https://paperswithcode.com/dataset/ikala
    Explore at:
    Authors
    Tak-Shing Chan; Tzu-Chun Yeh; Zhe-Cheng Fan; Hung-Wei Chen; Li Su; Yi-Hsuan Yang; Jyh-Shing Roger Jang
    Description

    The iKala dataset is a singing voice separation dataset that comprises 252 30-second excerpts sampled from 206 iKala songs (plus 100 hidden excerpts reserved for the MIREX data mining contest). The music accompaniment and the singing voice are recorded in the left and right channels, respectively. Additionally, human-labeled pitch contours and timestamped lyrics are provided.

    This dataset is not available anymore.
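    Since the accompaniment and the voice sit on separate stereo channels, a minimal sketch (assuming a 16-bit PCM WAV excerpt with a hypothetical filename) of separating them:

    import wave

    import numpy as np

    with wave.open("ikala_excerpt.wav", "rb") as wf:   # hypothetical filename
        assert wf.getnchannels() == 2
        frames = wf.readframes(wf.getnframes())

    stereo = np.frombuffer(frames, dtype=np.int16).reshape(-1, 2)
    accompaniment = stereo[:, 0]   # left channel
    singing_voice = stereo[:, 1]   # right channel
    print(accompaniment.shape, singing_voice.shape)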

    "T" as a sofa:

    The "T" horizontal strip can mimic the back of a sofa with a delicate cushion or details of the uphols or appliances with the color button.

    The "T" vertical strip can show a feet or arm of the sofa, shiny, yet firm.

    Merge "P":

    Put "P" next to "T", your curve to delicately with the top "T." It is intertwined. The circular part of "P" can show a cushion or a curved chair and synchronize the subject of furniture.

    Make sure "P" is visually relying on "T", which reflects the relationship of cohesion and balance.

    Coherence of "B" and "I":

    "B" can be aligned as a pair of cushions or a modern chair, with mild curves with glossy and modern aesthetics.

    "I" can be a symbol of a shiny furniture or a vertical light bar and completes the shapes without overburdess them.

    Color palette 4:

    Includes soft soil colors such as beige, top and gray shades, along with silent or silver gold tips to touch elegance.

    Consider a slope effect to enhance modernity, to keep colors elegant and complex.

    Connect the letters:

    Use the overlap or intertwined edges that the letters meet for the symbol of unity.

    The plan should allow viewers to distinguish each letter while feeling part of the same "structure".

    Background patterns:

    Use delicate geometric patterns or textures that mimic fabrics or furniture materials such as wood seeds or woven fibers.

    These patterns must remain minimalist and focus on highlighting the logo, while maintaining communication.

    While it deals with the subject of furniture and design, this concept conveys modernity, creativity and professional. If you like, I can create a draft design for better visualization.

  17. Data Mining - Insurance Claim

    • kaggle.com
    Updated Oct 9, 2021
    Cite
    Vinyas_Shreedhar0309 (2021). Data Mining - Insurance Claim [Dataset]. https://www.kaggle.com/datasets/vinyasshreedhar0309/data-mining-insurance-claim/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vinyas_Shreedhar0309
    Description

    Dataset

    This dataset was created by Vinyas_Shreedhar0309

    Contents

  18. Data from: Trends of Global Health Literacy Research (1995-2020): Analysis of Mapping Knowledge Domains Based on Citation Data Mining

    • beta.ukdataservice.ac.uk
    Updated 2021
    Cite
    Fengrui Hua (2021). Trends of Global Health Literacy Research (1995-2020): Analysis of Mapping Knowledge Domains Based on Citation Data Mining [Dataset]. http://doi.org/10.5255/ukda-sn-855097
    Explore at:
    Dataset updated
    2021
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    DataCite (https://www.datacite.org/)
    Authors
    Fengrui Hua
    Description

    This data could be used with CiteSpace to carry out a metric analysis of 9,492 health literacy papers included in Web of Science through mapping knowledge domains. The data processing is as follows: Publications with the subject term “Health Literacy” were searched in WoS, and the search was further optimized by the following conditions: language = English; document type = article + review. The number of search results was 9,888 (downloaded on September 19, 2020). Therefore, the period of the citing articles in our study is from January 1, 1995, to September 19, 2020. During the deduplication process, we excluded duplicate publications and articles with missing key fields, such as abstracts, keywords and references, resulting in 9,429 valid records for inclusion.
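    A minimal pandas sketch (hypothetical file and column names) of the deduplication step described above, dropping duplicate publications and records with missing key fields:

    import pandas as pd

    # Hypothetical local export of the WoS search results.
    records = pd.read_csv("wos_health_literacy_export.csv")

    key_fields = ["abstract", "keywords", "references"]
    cleaned = (
        records
        .drop_duplicates(subset=["title", "doi"])   # exclude duplicate publications
        .dropna(subset=key_fields)                  # exclude records missing key fields
    )
    print(len(records), "->", len(cleaned), "valid records")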

  19. LScD (Leicester Scientific Dictionary)

    • figshare.le.ac.uk
    docx
    Updated Apr 15, 2020
    + more versions
    Cite
    Neslihan Suzen (2020). LScD (Leicester Scientific Dictionary) [Dataset]. http://doi.org/10.25392/leicester.data.9746900.v3
    Explore at:
    Available download formats: docx
    Dataset updated
    Apr 15, 2020
    Dataset provided by
    University of Leicester
    Authors
    Neslihan Suzen
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Leicester
    Description

    LScD (Leicester Scientific Dictionary)

    April 2020, by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk / suzenneslihan@hotmail.com). Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes.

    [Version 3] The third version of LScD (Leicester Scientific Dictionary) is created from the updated LSC (Leicester Scientific Corpus), Version 2*. All pre-processing steps applied to build the new version of the dictionary are the same as in Version 2** and can be found in the description of Version 2 below; we did not repeat the explanation. After the pre-processing steps, the total number of unique words in the new version of the dictionary is 972,060. The files provided with this description are the same as those described for LScD Version 2 below.

    * Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v2
    ** Suzen, Neslihan (2019): LScD (Leicester Scientific Dictionary). figshare. Dataset. https://doi.org/10.25392/leicester.data.9746900.v2

    [Version 2] Getting Started

    This document provides the pre-processing steps for creating an ordered list of words from the LSC (Leicester Scientific Corpus) [1] and the description of LScD (Leicester Scientific Dictionary). This dictionary is created to be used in future work on the quantification of the meaning of research texts. R code for producing the dictionary from LSC and instructions for usage of the code are available in [2]. The code can also be used for lists of texts from other sources; amendments to the code may be required.

    LSC is a collection of abstracts of articles and proceedings papers published in 2014 and indexed by the Web of Science (WoS) database [3]. Each document contains title, list of authors, list of categories, list of research areas, and times cited. The corpus contains only documents in English. The corpus was collected in July 2018 and contains the number of citations from publication date to July 2018. The total number of documents in LSC is 1,673,824.

    LScD is an ordered list of words from the texts of abstracts in LSC. The dictionary stores 974,238 unique words and is sorted by the number of documents containing the word, in descending order. All words in the LScD are in stemmed form. The LScD contains the following information:
    1. Unique words in abstracts
    2. Number of documents containing each word
    3. Number of appearances of a word in the entire corpus

    Processing the LSC

    Step 1. Downloading the LSC Online: Use of the LSC is subject to acceptance of a request for the link by email. To access the LSC for research purposes, please email ns433@le.ac.uk. The data are extracted from Web of Science [3]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

    Step 2. Importing the Corpus to R: The full R code for processing the corpus can be found on GitHub [2]. All of the following steps can be applied to an arbitrary list of texts from any source, with changes of parameters. The structure of the corpus, such as file format and the names (and positions) of fields, should be taken into account to apply our code. The organisation of the CSV files of LSC is described in the README file for LSC [1].

    Step 3. Extracting Abstracts and Saving Metadata: The metadata (all fields in a document excluding abstracts) and the field of abstracts are separated. Metadata are then saved as MetaData.R. Fields of metadata are: List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.

    Step 4. Text Pre-processing Steps on the Collection of Abstracts: In this section, we present our approaches to pre-processing the abstracts of the LSC.
    1. Removing punctuation and special characters: This is the process of substituting all non-alphanumeric characters with a space. We did not substitute the character "-" in this step, because we need to keep words like "z-score", "non-payment" and "pre-processing" in order not to lose the actual meaning of such words. A process of uniting prefixes with words is performed in later steps of pre-processing.
    2. Lowercasing the text data: Lowercasing is performed to avoid treating words like "Corpus", "corpus" and "CORPUS" differently. The entire collection of texts is converted to lowercase.
    3. Uniting prefixes of words: Words containing prefixes joined with the character "-" are united as one word. The list of prefixes united for this research is given in the file "list_of_prefixes.csv". Most of the prefixes are extracted from [4]. We also added commonly used prefixes: 'e', 'extra', 'per', 'self' and 'ultra'.
    4. Substitution of words: Some words joined with "-" in the abstracts of the LSC require an additional substitution step to avoid losing the meaning of the word before removing the character "-". Some examples of such words are "z-test", "well-known" and "chi-square". These words have been substituted by "ztest", "wellknown" and "chisquare". Identification of such words was done by sampling abstracts from LSC. The full list of such words and the decisions taken for substitution are presented in the file "list_of_substitution.csv".
    5. Removing the character "-": All remaining "-" characters are replaced by space.
    6. Removing numbers: All digits not included in a word are replaced by space. All words that contain digits and letters are kept, because alphanumeric tokens such as chemical formulas might be important for our analysis. Some examples are "co2", "h2o" and "21st".
    7. Stemming: Stemming is the process of converting inflected words into their word stem. This step unites several forms of words with similar meaning into one form and also saves memory space and time [5]. All words in the LScD are stemmed to their word stem.
    8. Stop word removal: Stop words are words that are extremely common but provide little value in a language. Some common stop words in English are 'I', 'the', 'a', etc. We used the 'tm' package in R to remove stop words [6]. There are 174 English stop words listed in the package.

    Step 5. Writing the LScD into CSV Format: There are 1,673,824 plain processed texts for further analysis. All unique words in the corpus are extracted and written to the file "LScD.csv".

    The Organisation of the LScD

    The total number of words in the file "LScD.csv" is 974,238. Each field is described below:
    Word: Contains unique words from the corpus. All words are in lowercase and in their stem forms. The field is sorted by the number of documents that contain the word, in descending order.
    Number of Documents Containing the Word: A binary calculation is used: if a word exists in an abstract, it is counted once; if the word exists more than once in a document, the count is still 1. The total number of documents containing the word is the sum of these counts over the entire corpus.
    Number of Appearances in Corpus: How many times a word occurs in the corpus when the corpus is considered as one large document.

    Instructions for R Code

    LScD_Creation.R is an R script for processing the LSC to create an ordered list of words from the corpus [2]. Outputs of the code are saved as an RData file and in CSV format. Outputs of the code are:
    Metadata File: Includes all fields in a document excluding abstracts. Fields are List_of_Authors, Title, Categories, Research_Areas, Total_Times_Cited and Times_cited_in_Core_Collection.
    File of Abstracts: Contains all abstracts after the pre-processing steps defined in Step 4.
    DTM: The Document Term Matrix constructed from the LSC [6]. Each entry of the matrix is the number of times the word occurs in the corresponding document.
    LScD: An ordered list of words from LSC as defined in the previous section.

    The code can be used by:
    1. Downloading the folder 'LSC', 'list_of_prefixes.csv' and 'list_of_substitution.csv'
    2. Opening the LScD_Creation.R script
    3. Changing parameters in the script: replace with the full path of the directory with source files and the full path of the directory to write output files
    4. Running the full code.

    References
    [1] N. Suzen. (2019). LSC (Leicester Scientific Corpus) [Dataset]. Available: https://doi.org/10.25392/leicester.data.9449639.v1
    [2] N. Suzen. (2019). LScD-LEICESTER SCIENTIFIC DICTIONARY CREATION. Available: https://github.com/neslihansuzen/LScD-LEICESTER-SCIENTIFIC-DICTIONARY-CREATION
    [3] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
    [4] A. Thomas, "Common Prefixes, Suffixes and Roots," Center for Development and Learning, 2013.
    [5] C. Ramasubramanian and R. Ramya, "Effective pre-processing activities in text mining using improved Porter's stemming algorithm," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 12, pp. 4536-4538, 2013.
    [6] I. Feinerer, "Introduction to the tm Package: Text Mining in R," accessible online: https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf, 2013.
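    The authors' pipeline (Step 4 above) is implemented in R [2]; the following is a hedged Python sketch of the same pre-processing steps on a toy pair of abstracts, using a small hand-written stop-word list and NLTK's Porter stemmer in place of the R 'tm' package:

    import re
    from collections import Counter

    from nltk.stem import PorterStemmer

    abstracts = [
        "Pre-processing of z-score data: a well-known step.",
        "We report 21st-century text mining with the CO2 corpus.",
    ]
    prefixes = {"pre", "non", "self"}                        # small subset of list_of_prefixes.csv
    substitutions = {"z-score": "zscore", "well-known": "wellknown"}
    stop_words = {"a", "of", "the", "we", "with"}            # tiny stand-in for tm's stop-word list
    stemmer = PorterStemmer()

    doc_freq, corpus_freq = Counter(), Counter()
    for text in abstracts:
        text = re.sub(r"[^A-Za-z0-9\- ]", " ", text)         # 1. punctuation/special chars (keep "-")
        text = text.lower()                                  # 2. lowercase
        text = re.sub(r"\b(" + "|".join(prefixes) + r")-", r"\1", text)  # 3. unite prefixes
        for old, new in substitutions.items():               # 4. substitutions before dropping "-"
            text = text.replace(old, new)
        text = text.replace("-", " ")                        # 5. drop remaining "-"
        text = re.sub(r"\b\d+\b", " ", text)                 # 6. standalone digits only
        tokens = [stemmer.stem(t) for t in text.split() if t not in stop_words]  # 7-8. stem, stop words
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))                         # binary per-document count

    # Ordered word list: word, number of documents containing it, total occurrences in corpus.
    for word, n_docs in doc_freq.most_common():
        print(word, n_docs, corpus_freq[word])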

  20. Set of SML-HUIs.

    • plos.figshare.com
    xls
    Updated Feb 3, 2025
    Cite
    Loan T. T. Nguyen; Hoa Duong; An Mai; Bay Vo (2025). Set of SML-HUIs. [Dataset]. http://doi.org/10.1371/journal.pone.0317427.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Feb 3, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Loan T. T. Nguyen; Hoa Duong; An Mai; Bay Vo
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Privacy is a critical issue in the age of data. Organizations and corporations that publicly share their data always have a major concern that their sensitive information may be leaked or extracted by rivals or attackers using data miners. High-utility itemset mining (HUIM) is an extension of frequent itemset mining (FIM) that deals with business data in the form of transaction databases, data that is also in danger of being stolen. To deal with this, a number of privacy-preserving data mining (PPDM) techniques have been introduced. An important topic in PPDM in recent years is privacy-preserving utility mining (PPUM). The goal of PPUM is to protect sensitive information, such as sensitive high-utility itemsets, in transaction databases and make it undiscoverable by data mining techniques. However, available PPUM methods do not consider the generalization of items in databases (categories, classes, groups, etc.). These algorithms only consider items at a specialized level, leaving item combinations at higher levels vulnerable to attacks. The insights gained from higher abstraction levels are somewhat more valuable than those from lower levels since they contain the outlines of the data. To address this issue, this work proposes two PPUM algorithms, MLHProtector and FMLHProtector, that operate at all abstraction levels in a transaction database to protect them from data mining algorithms. Empirical experiments showed that both algorithms successfully protect the itemsets from being compromised by attackers.
