19 datasets found
  1. Datasets used for evaluating the customized version of Apriori algorithm.

    • plos.figshare.com
    zip
    Updated May 31, 2023
    Cite
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Datasets used for evaluating the customized version of Apriori algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.s001
    Available download formats
    zip
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A zip archive containing microbial abundance tables which were employed for deciphering association rules using the customised version of the Apriori algorithm. (ZIP)

  2. Number of association rules generated using the Apriori rule mining approach...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated using the Apriori rule mining approach with various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t001
    Available download formats
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summarised information pertaining to (a) the number of samples, (b) the number of generated association rules (total as well as rules that involve 3 or more genera), (c) the unique number of microbial genera involved in the identified association rules, (d) execution time, and (e) the number of rules generated using an alternative rule mining strategy (detailed in discussion section of the manuscript).

  3. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

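    Support = the number of occurrences in which the statement is true divided by the total number of statements

    Confidence = the support of the statement divided by the support of the premise (equivalently, the number of occurrences in which the statement is true divided by the number of occurrences of the premise)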
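    The two measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the authors report using RapidMiner, not this code, and the four toy "papers" and their feature labels below are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, premise, conclusion):
    """Occurrences of premise and conclusion together, divided by occurrences of the premise."""
    premise = frozenset(premise)
    both = premise | frozenset(conclusion)
    premise_hits = sum(1 for t in transactions if premise <= t)
    if premise_hits == 0:
        return 0.0  # the rule never fires, so report zero confidence
    both_hits = sum(1 for t in transactions if both <= t)
    return both_hits / premise_hits


# Hypothetical data: each paper is the set of nominal feature values it exhibits.
papers = [
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "reproducible"}),
    frozenset({"unsupervised", "reproducible"}),
]

rule_support = support(papers, {"supervised", "irreproducible"})
rule_confidence = confidence(papers, {"supervised"}, {"irreproducible"})
print(rule_support, rule_confidence)
```

    In this toy set the rule "supervised, therefore irreproducible" holds in 2 of 4 papers (support 0.5) and in 2 of the 3 papers that satisfy the premise (confidence of about 0.67).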

  4. Data Sheet 1_From data to decision: empirical application of machine...

    • figshare.com
    csv
    Updated Oct 17, 2025
    Cite
    Jing Zhao; Yuan Jiang; Xiuhua Zhang; Qing Ye; Qiang Zhao; Xianhua Wu; Linshen Wang (2025). Data Sheet 1_From data to decision: empirical application of machine learning in public space planning along the Grand Canal, Shandong Province, China.csv [Dataset]. http://doi.org/10.3389/fbuil.2025.1643104.s001
    Available download formats
    csv
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    Frontiers
    Authors
    Jing Zhao; Yuan Jiang; Xiuhua Zhang; Qing Ye; Qiang Zhao; Xianhua Wu; Linshen Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Shandong, China, Jinghang Waterway
    Description

    Introduction: In the process of urbanization, public space plays an increasingly important role in improving the livability and sustainability of cities. However, effectively understanding the preferences of different groups for public space and conducting reasonable planning integrated with environmental and infrastructure elements remains a challenge in urban planning. This is because traditional planning methods often fail to fully capture the detailed behavior of residents. Therefore, the purpose of this study was to explore the empirical application of machine learning technology to public space planning along the Grand Canal in Shandong Province (China), analyze the behavior patterns and preferences of residents regarding different public spaces, and thereby provide support for data-driven public space planning.

    Methods: Based on survey data from 1008 respondents across 4 cities, this study employed machine learning methods such as K-means clustering, association rule mining, and correlation analysis to investigate the relationships between visitor behavior and the environmental characteristics of public spaces.

    Results: The application of these methods yielded several important results. Cluster analysis identified three distinct groups: young and middle-aged local residents with a preference for accessibility, middle-aged and elderly groups enthusiastic about cultural engagement, and diverse transportation users with mixed spatial preferences. Additionally, association rule mining uncovered strong correlations between location types and perceived attributes such as cleanliness and aesthetics. Moreover, correlation analysis indicated statistically significant positive correlations between aesthetics and cleanliness, as well as between safety and cleanliness.

    Discussion: This research offers valuable data-driven insights for public space planning and management. It demonstrates that machine learning can effectively identify and quantify key factors influencing public space use. As a result, it provides more accurate policy recommendations for urban planners and ensures that public space planning better meets the needs of different groups. For urban planners, the findings can guide the optimization of facility layouts for specific groups. For instance, adding canal cultural display nodes for cultural engagement groups and improving barrier-free facilities for groups with high accessibility needs would enhance the inclusiveness and utilization efficiency of public spaces.

  5. Table_1_Predicting Anxiety in Routine Palliative Care Using...

    • frontiersin.figshare.com
    xlsx
    Updated Jun 1, 2023
    Cite
    Oliver Haas; Luis Ignacio Lopera Gonzalez; Sonja Hofmann; Christoph Ostgathe; Andreas Maier; Eva Rothgang; Oliver Amft; Tobias Steigleder (2023). Table_1_Predicting Anxiety in Routine Palliative Care Using Bayesian-Inspired Association Rule Mining.XLSX [Dataset]. http://doi.org/10.3389/fdgth.2021.724049.s001
    Available download formats
    xlsx
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Oliver Haas; Luis Ignacio Lopera Gonzalez; Sonja Hofmann; Christoph Ostgathe; Andreas Maier; Eva Rothgang; Oliver Amft; Tobias Steigleder
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We propose a novel knowledge extraction method based on Bayesian-inspired association rule mining to classify anxiety in heterogeneous, routinely collected data from 9,924 palliative patients. The method extracts association rules mined using lift and local support as selection criteria. The extracted rules are used to assess the maximum evidence supporting and rejecting anxiety for each patient in the test set. We evaluated the predictive accuracy by calculating the area under the receiver operating characteristic curve (AUC). The evaluation produced an AUC of 0.89 and a set of 55 atomic rules with one item in the premise and the conclusion, respectively. The selected rules include variables like pain, nausea, and various medications. Our method outperforms the previous state of the art (AUC = 0.72). We analyzed the relevance and novelty of the mined rules. Palliative experts were asked about the correlation between variables in the data set and anxiety. By comparing expert answers with the retrieved rules, we grouped rules into expected and unexpected ones and found several rules for which experts' opinions and the data-backed rules differ, most notably with the patients' sex. The proposed method offers a novel way to predict anxiety in palliative settings using routinely collected data with an explainable and effective model based on Bayesian-inspired association rule mining. The extracted rules give further insight into potential knowledge gaps in the palliative care field.
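    As a rough illustration of lift, one of the two selection criteria named above, the sketch below computes the standard lift of an atomic rule (one item in the premise, one in the conclusion). It is not the authors' Bayesian-inspired method, it does not implement their "local support", and the four toy patient records are hypothetical.

```python
def lift(records, premise, conclusion):
    """Standard lift: P(premise and conclusion) / (P(premise) * P(conclusion))."""
    n = len(records)
    a = sum(1 for r in records if premise in r)
    b = sum(1 for r in records if conclusion in r)
    ab = sum(1 for r in records if premise in r and conclusion in r)
    if a == 0 or b == 0:
        return 0.0  # lift is undefined when either side never occurs; report 0
    return (ab / n) / ((a / n) * (b / n))


# Hypothetical records: each patient is the set of items observed for them.
patients = [
    {"pain", "anxiety"},
    {"pain", "anxiety"},
    {"nausea"},
    set(),
]

pain_anxiety_lift = lift(patients, "pain", "anxiety")
print(pain_anxiety_lift)
```

    A lift above 1 means the premise and the conclusion co-occur more often than would be expected if they were independent; a lift of exactly 1 indicates no association.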

  6. Data from: Extended Comprehensive Study of Association Measures for Fault...

    • researchdata.smu.edu.sg
    • figshare.com
    zip
    Updated May 30, 2023
    Cite
    LUCIA Lucia; David LO; Lingxiao JIANG; Ferdian THUNG; Aditya BUDI (2023). Data from: Extended Comprehensive Study of Association Measures for Fault Localization [Dataset]. http://doi.org/10.25440/smu.12062814.v1
    Available download formats
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    SMU Research Data Repository (RDR)
    Authors
    LUCIA Lucia; David LO; Lingxiao JIANG; Ferdian THUNG; Aditya BUDI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This record contains the underlying research data for the publication "Extended Comprehensive Study of Association Measures for Fault Localization"; the full text is available from: https://ink.library.smu.edu.sg/sis_research/1818

    Spectrum-based fault localization is a promising approach to automatically locate root causes of failures quickly. Two well-known spectrum-based fault localization techniques, Tarantula and Ochiai, measure how likely a program element is to be a root cause of failures based on profiles of correct and failed program executions. These techniques are conceptually similar to association measures that have been proposed in statistics and data mining and have been utilized to quantify the relationship strength between two variables of interest (e.g., the use of a medicine and the cure rate of a disease). In this paper, we view fault localization as a measurement of the relationship strength between the execution of program elements and program failures. We investigate the effectiveness of 40 association measures from the literature on locating bugs. Our empirical evaluations involve single-bug and multiple-bug programs. We find there is no single best measure for all cases. Klosgen and Ochiai outperform other measures for localizing single-bug programs. For localizing multiple-bug programs, Added Value could localize the bugs with, on average, the smallest percentage of inspected code, although a number of other measures have similar performance. The accuracies of the measures in localizing multiple-bug programs are lower than for single-bug programs, which calls for future research.
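    For readers unfamiliar with the two techniques named above, the following sketch computes the commonly cited Tarantula and Ochiai suspiciousness scores for a single program element. The function and parameter names and the example counts are assumptions made for illustration; they are not taken from the dataset.

```python
import math


def tarantula(ef, ep, total_failed, total_passed):
    """Tarantula score; ef/ep = failed/passed runs that executed the element."""
    fail_ratio = ef / total_failed if total_failed else 0.0
    pass_ratio = ep / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0


def ochiai(ef, ep, total_failed):
    """Ochiai score: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


# Hypothetical element: executed by both failed runs (2 of 2) and 1 of 4 passed runs.
t_score = tarantula(ef=2, ep=1, total_failed=2, total_passed=4)
o_score = ochiai(ef=2, ep=1, total_failed=2)
print(t_score, o_score)
```

    Both scores lie between 0 and 1, and elements are inspected in decreasing order of score. An element executed only by passing runs receives 0 under both measures.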

  7. Number of association rules generated from the HMP (male) dataset with...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    + more versions
    Cite
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated from the HMP (male) dataset with various run-time thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t003
    Available download formats
    xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of association rules generated using the Apriori rule mining approach on the HMP (male) dataset at various values of support count and confidence thresholds. Table also depicts variations in number of rules due to adoption of various strategies that define the minimum abundance threshold for individual taxa to be considered for rule mining.
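    To make the role of the support count threshold concrete, here is a minimal sketch of the textbook level-wise Apriori search for frequent itemsets. It is not the authors' customised version, it assumes the abundance tables have already been reduced to presence/absence sets, and the genus names and samples are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # level-1 candidates
    frequent = {}
    k = 1
    while current:
        # Count candidates and keep those meeting the support count threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical samples: the genera considered present in each sample.
samples = [
    {"Bacteroides", "Prevotella", "Roseburia"},
    {"Bacteroides", "Prevotella"},
    {"Bacteroides", "Roseburia"},
    {"Prevotella"},
]

frequent_sets = apriori(samples, min_count=2)
for itemset, count in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

    Raising `min_count` shrinks the set of frequent itemsets, and with it the number of rules that can be derived from them, which is the kind of variation the table reports. Rules would then be filtered further by a confidence threshold.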

  8. Data from: Correlation between the green-island phenotype and Wolbachia...

    • datasetcatalog.nlm.nih.gov
    • search.dataone.org
    • +2more
    Updated Jun 22, 2016
    Cite
    Lopez-Vaamonde, Carlos; Gutzwiller, Florence; Dedeine, Franck; Giron, David; Kaiser, Wilfried (2016). Correlation between the green-island phenotype and Wolbachia infections during the evolutionary diversification of Gracillariidae leaf-mining moths [Dataset]. http://doi.org/10.5061/dryad.q4747
    Dataset updated
    Jun 22, 2016
    Authors
    Lopez-Vaamonde, Carlos; Gutzwiller, Florence; Dedeine, Franck; Giron, David; Kaiser, Wilfried
    Description

    Internally feeding herbivorous insects such as leaf miners have developed the ability to manipulate the physiology of their host plants in a way to best meet their metabolic needs and compensate for variation in food nutritional composition. For instance, some leaf miners can induce green-islands on yellow leaves in autumn, which are characterized by photosynthetically active green patches in otherwise senescing leaves. It has been shown that endosymbionts, and most likely bacteria of the genus Wolbachia, play an important role in green-island induction in the apple leaf-mining moth Phyllonorycter blancardella. However, it is currently not known how widespread is this moth-Wolbachia-plant interaction. Here, we studied the co-occurrence between Wolbachia and the green-island phenotype in 133 moth specimens belonging to 74 species of Lepidoptera including 60 Gracillariidae leaf miners. Using a combination of molecular phylogenies and ecological data (occurrence of green-islands), we show that the acquisitions of the green-island phenotype and Wolbachia infections have been associated through the evolutionary diversification of Gracillariidae. We also found intraspecific variability in both green-island formation and Wolbachia infection, with some species being able to form green-islands without being infected by Wolbachia. In addition, Wolbachia variants belonging to both A and B supergroups were found to be associated with green-island phenotype suggesting several independent origins of green-island induction. This study opens new prospects and raises new questions about the ecology and evolution of the tripartite association between Wolbachia, leaf miners, and their host plants.

  9. Data from: Empirical Study of the Relationship between Design Patterns and...

    • zenodo.org
    Updated Apr 2, 2020
    Cite
    Mahmoud Alfadel; Khalid Al-Jasser; Mohammad Alshayeb (2020). Empirical Study of the Relationship between Design Patterns and Code Smells [Dataset]. http://doi.org/10.5281/zenodo.3633081
    Dataset updated
    Apr 2, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mahmoud Alfadel; Khalid Al-Jasser; Mohammad Alshayeb
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Software systems are often developed in such a way that good practices in the object-oriented paradigm are not met, causing the occurrence of specific disharmonies, which are sometimes called code smells. Design patterns catalogue best practices for developing object-oriented software systems. Although code smells and design patterns are widely divergent, there might be a co-occurrence relation between them. The objective of this paper is to empirically evaluate whether the presence of design patterns is related to the presence of code smells at different granularity levels. We performed an empirical replication study using 20 design patterns and 13 code smells in ten small-size to medium-size, open-source Java-based systems. We applied statistical analysis and association rules. Results confirm that classes participating in design patterns have less smell-proneness and smell frequency than classes not participating in design patterns. We also noticed that every design pattern category acts in the same way in terms of smell-proneness in the subject systems. However, we observed, based on the association rule learning and the proposed validation technique, that some patterns may be associated with certain smells in some cases. For instance, Command patterns can co-occur with the God Class, Blob, and External Duplication smells.

    The published data set contains the following:

    1. List of the selected systems (source code files)
    2. The P-MARt: the design pattern repository as XML for the selected systems.
    3. Data of design patterns and code smells: We processed this data by parsing the design pattern XML file and running the smell detection tool (inFusion).
    4. The data of the data mining analysis.
  10. Characteristics of real datsets and parameter settings.

    • plos.figshare.com
    xls
    Updated May 31, 2023
    Cite
    Peng Cheng; Chun-Wei Lin; Jeng-Shyang Pan (2023). Characteristics of real datsets and parameter settings. [Dataset]. http://doi.org/10.1371/journal.pone.0127834.t002
    Available download formats
    xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Peng Cheng; Chun-Wei Lin; Jeng-Shyang Pan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Characteristics of real datsets and parameter settings.

  11. Food Reviews - Text Mining & Sentiment Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2023
    Cite
    vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
    Available download formats
    zip (1075643 bytes)
    Dataset updated
    Aug 4, 2023
    Authors
    vikram amin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Brief Description:
    - The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiments about the specialty foods that the company offers. This information has been collected through customer reviews on their website. The dataset consists of about 5000 reviews. They want the answers to the following questions:
      1. What are the most frequently used words in the customer reviews?
      2. How can the data be prepared for text analysis?
      3. What are the overall sentiments towards the products?

    • We will be using text mining and sentiment analysis (R programming) to offer insights to the CMO with regards to the food reviews

    Steps:
    - Set the working directory and read the data.
    - Data cleaning. Check for missing values and the data types of the variables.
    - Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
    - TEXT ACQUISITION and AGGREGATION. Create the corpus.
    - TEXT PRE-PROCESSING. Clean the text.
    - Replace special characters with " ". We use the tm_map function for this purpose.
    - Convert all letters to lower case.
    - Remove punctuation.
    - Remove whitespace.
    - Remove stopwords.
    - Remove numbers.
    - Stem the document.
    - Create the term document matrix.
    - Convert it into a matrix and find the frequency of each word.
    - Convert it into a data frame.
    - TEXT EXPLORATION. Find the words that appear most frequently and least frequently.
    - Create a word cloud.

    • TEXT MODELLING
    • Word association: find words that tend to appear together. Here we look for associations with the three most frequent (stemmed) words, "like", "tast", and "flavor", using a correlation limit of 0.2.
    • "like" has an association with "realli" (they appear together about 25% of the time), "dont" (24%), and "one" (21%).
    • "tast" does not have an association with any word at the set correlation limit.
    • "flavor" has an association with the word "chip" (they appear together about 27% of the time).
    • Sentiment analysis
    • element_id refers to the review number, sentence_id refers to the sentence number within the review, and word_count refers to the number of words in that sentence. Sentiment is either positive or negative.
    • Let us find out the overall sentiment score of all the reviews.
    • This indicates that the entire food review document has a marginally positive score.
    • Let us find out the sentiment score for each of the 5000 reviews.
    • (-1) indicates the most extreme negative sentiment and (+1) indicates the most extreme positive sentiment
    • Let us create a separate data frame for all the negative sentiments. In total there are 726 negative sentiments out of the total 5000 reviews (approx 15%).
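
    The pre-processing and word-frequency steps listed above can be approximated with the Python standard library. This is a sketch only: the original analysis was done in R with the tm package, the stopword list here is a tiny hypothetical stand-in, stemming is omitted, and the three reviews are made up.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the list used by tm).
STOPWORDS = {"the", "a", "is", "and", "it", "this", "i", "of", "to"}


def preprocess(text):
    """Lower-case, replace punctuation/digits/special characters with spaces, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [word for word in text.split() if word not in STOPWORDS]


def term_frequencies(reviews):
    """Count how often each remaining word occurs across all reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(preprocess(review))
    return counts


# Hypothetical reviews standing in for the real data.
reviews = [
    "I like the flavor!",
    "The taste is great, 10/10.",
    "Like it: great flavor and taste",
]

freq = term_frequencies(reviews)
print(freq.most_common(5))
```

    Calling `split()` without arguments also collapses repeated whitespace, which covers the "remove whitespace" step. A stemmer would additionally map words such as "really" and "taste" to the stems "realli" and "tast" seen in the results above.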
  12. Number of association rules generated from the prebiotics dataset with...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 3, 2023
    Cite
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated from the prebiotics dataset with various run-time thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t002
    Available download formats
    xls
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Number of association rules generated using the Apriori rule mining approach on the prebiotics dataset at various values of support count and confidence thresholds. Table also depicts variations in number of rules due to adoption of various strategies that define the minimum abundance threshold for individual taxa to be considered for rule mining.

  13. Table 1_Coping behavior toward occupational health risks among construction...

    • frontiersin.figshare.com
    docx
    Updated Sep 12, 2025
    Cite
    Xuesong Yang; Yuyan Ling; Liqun Wang; Yiqi Li; Mingrong Zeng (2025). Table 1_Coping behavior toward occupational health risks among construction workers: determinant identification using the COM-B model and data mining analysis.docx [Dataset]. http://doi.org/10.3389/fpubh.2025.1643332.s001
    Available download formats
    docx
    Dataset updated
    Sep 12, 2025
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Xuesong Yang; Yuyan Ling; Liqun Wang; Yiqi Li; Mingrong Zeng
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: China has the largest construction workforce in the world but faces severe occupational health challenges. Coping behaviors related to occupational health risks (CBOHR) are key to mitigating these hazards but remain understudied.

    Materials and methods: A cross-sectional survey of 484 construction workers was conducted to assess Capability, Opportunity, Motivation, and Behavior using the COM-B model. Structural equation modeling (SEM) was employed to test mediating pathways, and association-rule mining (ARM) was used to identify determinants of high- and low-level CBOHR.

    Results: The results showed that the COM-B framework, comprising three modules (Capability, Opportunity, and Motivation) with 15 behavior change domains and a Behavior module with eight specific CBOHRs, demonstrated satisfactory fit, reliability, and validity. Bootstrapping confirmed that Motivation fully mediates the relationship between Capability and Behavior and partially mediates the relationship between Opportunity and Behavior. ARM further identified key domains associated with high and low levels of CBOHR.

    Conclusion: Strongly correlated item sets identified through association rule analysis revealed domains strongly linked to both high (and low) levels of each CBOHR. This study is the first to integrate the COM-B model with data mining in the context of occupational health, highlighting “motivation–values–policy” as actionable levers for CBOHR interventions. The findings provide preliminary evidence to support the development of scalable worker health programs.

  14. Supporting Information S1 - Exploiting SNP Correlations within Random Forest...

    • plos.figshare.com
    pdf
    Updated May 31, 2023
    Vincent Botta; Gilles Louppe; Pierre Geurts; Louis Wehenkel (2023). Supporting Information S1 - Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies [Dataset]. http://doi.org/10.1371/journal.pone.0093379.s001
    Available download formats: pdf
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Vincent Botta; Gilles Louppe; Pierre Geurts; Louis Wehenkel
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Supplementary figures and tables. T-Trees algorithm: pseudo-code and implementation details. (PDF)

  15. Data from: A network approach for discovering spatially associated objects

    • figshare.com
    xls
    Updated Apr 13, 2024
    Tao Liang (2024). A network approach for discovering spatially associated objects [Dataset]. http://doi.org/10.6084/m9.figshare.25295452.v4
    Available download formats: xls
    Dataset updated
    Apr 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Tao Liang
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A network approach for discovering spatially associated objects. First, a spatial network model is constructed from the spatial characteristics of the objects. Next, reachable paths of topological relationships are mined and weighted associations between objects are calculated by AWTRA. Finally, the top-K associated objects are obtained based on the association ordering.

  16. Transactions.

    • plos.figshare.com
    xls
    Updated Jul 29, 2025
    + more versions
    Yongkang Ding (2025). Transactions. [Dataset]. http://doi.org/10.1371/journal.pone.0325925.t002
    Available download formats: xls
    Dataset updated
    Jul 29, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yongkang Ding
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Physical fitness refers to the health of all body functions, including cardiorespiratory endurance, muscle strength, flexibility, stamina, and body composition, which can help individuals effectively cope with daily activities and sports challenges. This paper explores the physical characteristics of basketball players, aiming to improve training effects through unique physical evaluation indicators and provide a theoretical framework for improving college basketball performance and training standards. The study adopted the Apriori association rule algorithm in data mining. First, the physical data of basketball players were collected and preprocessed. Then, frequent item sets were extracted through the association rule mining algorithm, association rules were generated, and the key factors affecting the physical performance of athletes were analyzed. The article’s results revealed the potential relationship between different physical characteristics and emphasized the application prospects of association rule mining in the physical evaluation of basketball players.

  17. Data after data processing.

    • plos.figshare.com
    xls
    Updated Sep 23, 2025
    Jiaqi Wang; Xiaolong Jiang; Yizhou He; Biyu Guan; Chao Deng (2025). Data after data processing. [Dataset]. http://doi.org/10.1371/journal.pone.0332623.t003
    Available download formats: xls
    Dataset updated
    Sep 23, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jiaqi Wang; Xiaolong Jiang; Yizhou He; Biyu Guan; Chao Deng
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In the new model of China’s dual-circulation economy, the opening-up and deepening of financial markets have imposed higher requirements on the risk management capacity of financial institutions, with the issue of loan customers losing contact and defaulting becoming an urgent concern. Based on desensitized samples of lost-linking customers (with multidimensional features such as communication behavior and loan qualifications), this study uses the FP-Growth algorithm to systematically mine association rules between loss-of-contact features and three modes: “Hide and Seek”, “Flee with the Money”, and “False Disappearance”, providing effective risk management strategies for financial institutions. Through association rule mining, this study reveals significant correlations between some feature combinations and lost-linking modes. The results reveal substantial variations in correlation strength among different feature combinations and lost-linking modes, and the association strength increases significantly with the prolongation of overdue time. The results provide banks with quantitative early warning signs based on feature combinations, which can be applied to risk-grading monitoring systems. The research emphasizes the requirement for combined analysis of multidimensional features and dynamic monitoring in precise risk control.

  18. Table2_Identification of a Gene Set Correlated With Immune Status in Ovarian...

    • frontiersin.figshare.com
    xls
    Updated Jun 10, 2023
    + more versions
    Lili Fan; Han Lei; Ying Lin; Zhengwei Zhou; Guang Shu; Zhipeng Yan; Haotian Chen; Tianxiang Zhang; Gang Yin (2023). Table2_Identification of a Gene Set Correlated With Immune Status in Ovarian Cancer by Transcriptome-Wide Data Mining.xls [Dataset]. http://doi.org/10.3389/fmolb.2021.670666.s003
    Available download formats: xls
    Dataset updated
    Jun 10, 2023
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Lili Fan; Han Lei; Ying Lin; Zhengwei Zhou; Guang Shu; Zhipeng Yan; Haotian Chen; Tianxiang Zhang; Gang Yin
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Immune checkpoint blocking (ICB) immunotherapy has achieved great success in the treatment of various malignancies. Although it has not been approved for the treatment of ovarian cancer (OC), it has been actively tested for the treatment of OC. However, biomarkers that could indicate the immune status of OC and predict the response to ICB are rare. We downloaded RNAseq and clinical data of OC from The Cancer Genome Atlas (TCGA). Data analysis revealed that both TMB-high and immunity-high status were significantly related to better survival of OC. Up-regulated differentially expressed genes (Up-DEGs) were identified by analyzing the gene expression levels. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed with the “GSVA” and “limma” packages in R software. The correlation of genes with overall survival was also analyzed by Kaplan-Meier survival analysis. Four genes, CXCL13, FCRLA, MS4A1, and PLA2G2D, were found to be positively correlated with better prognosis of OC and mainly involved in immune response-related pathways. Finally, TIMER and TIDE were used to predict gene immune function and its association with immunotherapy. We found that these four genes were positively correlated with better response to immune checkpoint blockade-based immunotherapy. Altogether, CXCL13, FCRLA, MS4A1, and PLA2G2D may be used as potential therapeutic genes for reflecting OC immune status and predicting response to immunotherapy.

  19. Table 1_SciLinker: a large-scale text mining framework for mapping...

    • frontiersin.figshare.com
    xlsx
    Updated Mar 19, 2025
    Dongyu Liu; Cora Ames; Shameer Khader; Franck Rapaport (2025). Table 1_SciLinker: a large-scale text mining framework for mapping associations among biological entities.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1528562.s001
    Available download formats: xlsx
    Dataset updated
    Mar 19, 2025
    Dataset provided by
    Frontiers
    Authors
    Dongyu Liu; Cora Ames; Shameer Khader; Franck Rapaport
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction: The biomedical literature is the go-to source of information regarding relationships between biological entities, including genes, diseases, cell types, and drugs, but the rapid pace of publication makes an exhaustive manual exploration impossible. In order to efficiently explore an up-to-date repository of millions of abstracts, we constructed an efficient and modular natural language processing pipeline and applied it to the entire PubMed abstract corpora. Methods: We developed SciLinker using open-source libraries and pre-trained named entity recognition models to identify human genes, diseases, cell types and drugs, normalizing these biological entities to the Unified Medical Language System (UMLS). We implemented a scoring schema to quantify the statistical significance of entity co-occurrences and applied a fine-tuned PubMedBERT model for gene-disease relationship extraction. Results: We identified and analyzed over 30 million association sentences, including more than 11 million gene-disease co-occurrence sentences, revealing more than 1.25 million unique gene-disease associations. We demonstrate SciLinker’s ability to extract specific gene-disease relationships using osteoporosis as a case study. We show how such an analysis benefits target identification as clinically validated targets are enriched in SciLinker-derived disease-associated genes. Moreover, this co-occurrence data can be used to construct disease-specific networks, providing insights into significant relationships among biological entities from scientific literature. Conclusion: SciLinker represents a novel text mining approach that extracts and quantifies associations between biomedical entities through co-occurrence analysis and relationship extraction from PubMed abstracts. Its modular design enables expansion to additional entities and text corpora, making it a versatile tool for transforming unstructured biomedical data into actionable insights for drug discovery.


