Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A zip archive containing microbial abundance tables which were employed for deciphering association rules using the customised version of the Apriori algorithm. (ZIP)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summarised information pertaining to (a) the number of samples, (b) the number of generated association rules (total as well as rules that involve 3 or more genera), (c) the unique number of microbial genera involved in the identified association rules, (d) execution time, and (e) the number of rules generated using an alternative rule mining strategy (detailed in discussion section of the manuscript).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that inform the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers to the proposed research questions and to formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships in the reported Deep Learning literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state of the art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:
1. Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into 35 features or attributes, which are available in the repository. We manually engineered these features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
2. Preprocessing. The preprocessing consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
3. Transformation. In this stage, we applied no data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.
4. Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be to uncover hidden relationships among the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state of the art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
5. Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that, given the premise, the conclusion is associated with it. E.g., given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = number of occurrences in which the statement is true divided by the total number of statements.
Confidence = number of occurrences in which the statement is true divided by the number of occurrences of the premise.
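The two definitions above can be illustrated with a minimal Python sketch. The four toy "papers" and the helper names `support` and `confidence` are hypothetical and only serve the illustration; they are not taken from the DL4SE repository or from RapidMiner.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               premise: Iterable[str],
               conclusion: Iterable[str]) -> float:
    """Occurrences of premise and conclusion together, divided by occurrences of the premise."""
    p = frozenset(premise)
    c = frozenset(conclusion)
    premise_hits = sum(1 for t in transactions if p <= t)
    if premise_hits == 0:
        return 0.0  # the rule never fires, so no confidence can be assigned
    rule_hits = sum(1 for t in transactions if (p | c) <= t)
    return rule_hits / premise_hits


# Hypothetical toy data: each paper is a set of nominal feature values.
papers = [
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "reproducible"}),
    frozenset({"unsupervised", "reproducible"}),
]

rule_support = support(papers, {"supervised", "irreproducible"})          # 2 of 4 papers
rule_confidence = confidence(papers, {"supervised"}, {"irreproducible"})  # 2 of 3 supervised papers
```

In this toy set the rule "Supervised Learning implies irreproducible" holds in 2 of 4 papers (support) and in 2 of the 3 papers that use supervised learning (confidence).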
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: In the process of urbanization, public space plays an increasingly important role in improving the livability and sustainability of cities. However, effectively understanding the preferences of different groups for public space and conducting reasonable planning integrated with environmental and infrastructure elements remains a challenge in urban planning. This is because traditional planning methods often fail to fully capture the detailed behavior of residents. Therefore, the purpose of this study was to explore the empirical application of machine learning technology to public space planning along the Grand Canal in Shandong Province (China), analyze the behavior patterns and preferences of residents regarding different public spaces, and thereby provide support for data-driven public space planning.
Methods: Based on survey data from 1008 respondents across 4 cities, this study employed machine learning methods such as K-means clustering, association rule mining, and correlation analysis to investigate the relationships between visitor behavior and the environmental characteristics of public spaces.
Results: The application of these methods yielded several important results. Cluster analysis identified three distinct groups: young and middle-aged local residents with a preference for accessibility, middle-aged and elderly groups enthusiastic about cultural engagement, and diverse transportation users with mixed spatial preferences. Additionally, association rule mining uncovered strong correlations between location types and perceived attributes such as cleanliness and aesthetics. Moreover, correlation analysis indicated statistically significant positive correlations between aesthetics and cleanliness, as well as between safety and cleanliness.
Discussion: This research offers valuable data-driven insights for public space planning and management. It demonstrates that machine learning can effectively identify and quantify key factors influencing public space use. As a result, it provides more accurate policy recommendations for urban planners and helps public space planning better meet the needs of different groups. For urban planners, the findings can guide the optimization of facility layouts for specific groups. For instance, planners can add canal cultural display nodes for cultural engagement groups and improve barrier-free facilities for groups with high accessibility needs, thereby enhancing the inclusiveness and utilization efficiency of public spaces.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We propose a novel knowledge extraction method based on Bayesian-inspired association rule mining to classify anxiety in heterogeneous, routinely collected data from 9,924 palliative patients. The method extracts association rules mined using lift and local support as selection criteria. The extracted rules are used to assess the maximum evidence supporting and rejecting anxiety for each patient in the test set. We evaluated the predictive accuracy by calculating the area under the receiver operating characteristic curve (AUC). The evaluation produced an AUC of 0.89 and a set of 55 atomic rules with one item in the premise and the conclusion, respectively. The selected rules include variables like pain, nausea, and various medications. Our method outperforms the previous state of the art (AUC = 0.72). We analyzed the relevance and novelty of the mined rules. Palliative experts were asked about the correlation between variables in the data set and anxiety. By comparing expert answers with the retrieved rules, we grouped rules into expected and unexpected ones and found several rules for which experts' opinions and the data-backed rules differ, most notably with the patients' sex. The proposed method offers a novel way to predict anxiety in palliative settings using routinely collected data with an explainable and effective model based on Bayesian-inspired association rule mining. The extracted rules give further insight into potential knowledge gaps in the palliative care field.
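As an illustration of lift as a selection criterion for atomic rules (one item in the premise, one in the conclusion), here is a minimal Python sketch. The patient records are invented toy data, the function name `lift` is hypothetical, and the paper's "local support" criterion is not reproduced because its definition is not given here.

```python
from typing import List, Set


def lift(records: List[Set[str]], premise: str, conclusion: str) -> float:
    """Lift of the atomic rule premise -> conclusion.

    lift = P(premise and conclusion) / (P(premise) * P(conclusion)).
    Values above 1 mean the two items co-occur more often than expected
    under independence; values below 1 mean less often.
    """
    n = len(records)
    p_premise = sum(1 for r in records if premise in r) / n
    p_conclusion = sum(1 for r in records if conclusion in r) / n
    p_both = sum(1 for r in records if premise in r and conclusion in r) / n
    if p_premise == 0 or p_conclusion == 0:
        return 0.0  # lift is undefined when either item never occurs
    return p_both / (p_premise * p_conclusion)


# Invented toy records: each set lists the items documented for one patient.
patients = [
    {"pain", "anxiety"},
    {"pain", "anxiety"},
    {"pain"},
    {"nausea"},
    {"nausea", "anxiety"},
    set(),
]

pain_lift = lift(patients, "pain", "anxiety")
```

Here pain and anxiety each occur in 3 of 6 records and together in 2, so the lift is (2/6) / (0.5 × 0.5) ≈ 1.33, which would count as evidence supporting anxiety in this toy setting.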
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This record contains the underlying research data for the publication "Extended Comprehensive Study of Association Measures for Fault Localization". The full text is available from: https://ink.library.smu.edu.sg/sis_research/1818
Spectrum-based fault localization is a promising approach to automatically locate root causes of failures quickly. Two well-known spectrum-based fault localization techniques, Tarantula and Ochiai, measure how likely a program element is to be a root cause of failures based on profiles of correct and failed program executions. These techniques are conceptually similar to association measures that have been proposed in statistics and data mining and have been used to quantify the relationship strength between two variables of interest (e.g., the use of a medicine and the cure rate of a disease). In this paper, we view fault localization as a measurement of the relationship strength between the execution of program elements and program failures. We investigate the effectiveness of 40 association measures from the literature on locating bugs. Our empirical evaluations involve single-bug and multiple-bug programs. We find that there is no single best measure for all cases. Klosgen and Ochiai outperform other measures for localizing single-bug programs. For multiple-bug programs, Added Value localizes the bugs with the smallest average percentage of inspected code, although a number of other measures have similar performance. The accuracies of the measures in localizing multiple-bug programs are lower than for single-bug programs, which calls for future research.
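For readers unfamiliar with the two measures named above, the following Python sketch computes the commonly cited Tarantula and Ochiai suspiciousness scores from execution counts. The parameter names (`ef`, `ep`, `nf`, `np_`) and the example counts are assumptions chosen for illustration; the sketch does not reproduce the study's tooling or its other 38 measures.

```python
import math


def ochiai(ef: int, ep: int, nf: int, np_: int) -> float:
    """Ochiai score: ef / sqrt((ef + nf) * (ef + ep)).

    ef / ep: failed / passed runs that execute the program element.
    nf / np_: failed / passed runs that do not execute it (np_ is unused here).
    """
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def tarantula(ef: int, ep: int, nf: int, np_: int) -> float:
    """Tarantula score: fail ratio / (fail ratio + pass ratio)."""
    fail_ratio = ef / (ef + nf) if (ef + nf) else 0.0
    pass_ratio = ep / (ep + np_) if (ep + np_) else 0.0
    total = fail_ratio + pass_ratio
    return fail_ratio / total if total else 0.0


# Hypothetical spectra for two statements over 2 failed and 3 passed runs.
spectra = {
    "s1": (2, 0, 0, 3),  # executed only by the failing runs
    "s2": (1, 3, 1, 0),  # executed by one failing run and all passing runs
}

# Rank statements from most to least suspicious according to Ochiai.
ranking = sorted(spectra, key=lambda s: ochiai(*spectra[s]), reverse=True)
```

Both measures rank `s1` above `s2` in this example, since `s1` is executed only by failing runs.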
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of association rules generated using the Apriori rule mining approach on the HMP (male) dataset at various support count and confidence thresholds. The table also shows how the number of rules varies with the strategy used to define the minimum abundance threshold for individual taxa to be considered for rule mining.
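The dependence of the rule count on the three thresholds can be sketched with a small level-wise Apriori in Python. This is a plain textbook Apriori, not the customised version used for the dataset, and the abundance values, the function names, and the single global abundance cut-off are all assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def binarize(samples: List[Dict[str, float]], min_abundance: float) -> List[FrozenSet[str]]:
    """Keep, per sample, only the taxa whose abundance reaches the threshold."""
    return [frozenset(t for t, a in s.items() if a >= min_abundance) for s in samples]


def frequent_itemsets(transactions: List[FrozenSet[str]],
                      min_support_count: int) -> Dict[FrozenSet[str], int]:
    """Level-wise Apriori: returns every frequent itemset with its support count."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent: Dict[FrozenSet[str], int] = {}
    while current:
        level = {}
        for cand in current:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support_count:
                level[cand] = count
        frequent.update(level)
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        # Join step: merge frequent itemsets that differ in exactly one item.
        joined = {a | b for a, b in combinations(keys, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def count_rules(transactions: List[FrozenSet[str]],
                min_support_count: int, min_confidence: float) -> int:
    """Number of rules premise -> rest that meet the confidence threshold."""
    frequent = frequent_itemsets(transactions, min_support_count)
    n_rules = 0
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for premise in combinations(itemset, size):
                if count / frequent[frozenset(premise)] >= min_confidence:
                    n_rules += 1
    return n_rules


# Invented relative abundances for three genera in four samples.
samples = [
    {"Bacteroides": 0.40, "Prevotella": 0.30, "Roseburia": 0.05},
    {"Bacteroides": 0.35, "Prevotella": 0.25, "Roseburia": 0.002},
    {"Bacteroides": 0.50, "Prevotella": 0.001, "Roseburia": 0.04},
    {"Bacteroides": 0.001, "Prevotella": 0.45, "Roseburia": 0.03},
]

strict = count_rules(binarize(samples, 0.01), min_support_count=2, min_confidence=0.6)
lenient = count_rules(binarize(samples, 0.001), min_support_count=2, min_confidence=0.6)
```

Lowering the abundance cut-off lets more taxa into each sample's item set, so more itemsets become frequent and more rules pass the confidence threshold; raising the confidence threshold has the opposite effect.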
Internally feeding herbivorous insects such as leaf miners have developed the ability to manipulate the physiology of their host plants in a way to best meet their metabolic needs and compensate for variation in food nutritional composition. For instance, some leaf miners can induce green-islands on yellow leaves in autumn, which are characterized by photosynthetically active green patches in otherwise senescing leaves. It has been shown that endosymbionts, and most likely bacteria of the genus Wolbachia, play an important role in green-island induction in the apple leaf-mining moth Phyllonorycter blancardella. However, it is currently not known how widespread this moth-Wolbachia-plant interaction is. Here, we studied the co-occurrence between Wolbachia and the green-island phenotype in 133 moth specimens belonging to 74 species of Lepidoptera including 60 Gracillariidae leaf miners. Using a combination of molecular phylogenies and ecological data (occurrence of green-islands), we show that the acquisitions of the green-island phenotype and Wolbachia infections have been associated through the evolutionary diversification of Gracillariidae. We also found intraspecific variability in both green-island formation and Wolbachia infection, with some species being able to form green-islands without being infected by Wolbachia. In addition, Wolbachia variants belonging to both A and B supergroups were found to be associated with the green-island phenotype, suggesting several independent origins of green-island induction. This study opens new prospects and raises new questions about the ecology and evolution of the tripartite association between Wolbachia, leaf miners, and their host plants.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Software systems are often developed in such a way that good practices in the object-oriented paradigm are not met, causing the occurrence of specific disharmonies, which are sometimes called code smells. Design patterns catalogue best practices for developing object-oriented software systems. Although code smells and design patterns are widely divergent, there might be a co-occurrence relation between them. The objective of this paper is to empirically evaluate whether the presence of design patterns is related to the presence of code smells at different granularity levels. We performed an empirical replication study using 20 design patterns and 13 code smells in ten small-size to medium-size, open-source Java-based systems. We applied statistical analysis and association rules. Results confirm that classes participating in design patterns have less smell-proneness and smell frequency than classes not participating in design patterns. We also noticed that every design pattern category acts in the same way in terms of smell-proneness in the subject systems. However, we observed, based on the association rule learning and the proposed validation technique, that some patterns may be associated with certain smells in some cases. For instance, Command patterns can co-occur with the God Class, Blob, and External Duplication smells.
The published data set contains the following:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Characteristics of real datasets and parameter settings.
License: https://creativecommons.org/publicdomain/zero/1.0/
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiments about the specialty foods that the company offers. This information has been collected through customer reviews on their website. The dataset consists of about 5000 reviews. They want the answers to the following questions:
1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?
Steps:
- Set the working directory and read the data.
- Data cleaning. Check for missing values and the data types of variables.
- Load the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
- TEXT ACQUISITION and AGGREGATION. Create the corpus.
- TEXT PRE-PROCESSING. Clean the text.
- Replace special characters with " ". We use the tm_map function for this purpose.
- Convert all letters to lower case.
- Remove punctuation.
- Remove whitespace.
- Remove stopwords.
- Remove numbers.
- Stem the document.
- Create the term-document matrix.
- Convert it into a matrix and find the frequency of words.
- Convert it into a data frame.
- TEXT EXPLORATION. Find the words that appear most frequently and least frequently.
- Create a word cloud.
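The cleaning and frequency-counting steps above can be sketched as follows. The original workflow uses R packages such as "tm"; this is only a rough standard-library Python analogue, and the stopword list, the three sample reviews, and the function names are invented for illustration. Stemming is left out because the Python standard library has no stemmer.

```python
import re
from collections import Counter
from typing import Iterable, List

# Tiny illustrative stopword list; real stopword lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "to", "i", "was", "are", "for", "in"}


def preprocess(review: str) -> List[str]:
    """Lower-case the text, strip special characters, punctuation and numbers, drop stopwords."""
    text = review.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # anything that is not a letter becomes a space
    tokens = text.split()                  # splitting also collapses repeated whitespace
    return [t for t in tokens if t not in STOPWORDS]


def term_frequencies(reviews: Iterable[str]) -> Counter:
    """Total count of every remaining term across all reviews."""
    counts: Counter = Counter()
    for review in reviews:
        counts.update(preprocess(review))
    return counts


# Invented sample reviews.
reviews = [
    "The granola is GREAT, great value!!",
    "Great taste; 2 bags of granola arrived stale.",
    "Stale chips :(",
]

freq = term_frequencies(reviews)
most_frequent = freq.most_common(3)  # terms sorted by descending count
```

The resulting counter plays the role of the word-frequency data frame in the steps above and could be passed on to a word cloud library.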
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Number of association rules generated using the Apriori rule mining approach on the prebiotics dataset at various support count and confidence thresholds. The table also shows how the number of rules varies with the strategy used to define the minimum abundance threshold for individual taxa to be considered for rule mining.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: China has the largest construction workforce in the world but faces severe occupational health challenges. Coping behaviors related to occupational health risks (CBOHR) are key to mitigating these hazards but remain understudied.
Materials and methods: A cross-sectional survey of 484 construction workers was conducted to assess Capability, Opportunity, Motivation, and Behavior using the COM-B model. Structural equation modeling (SEM) was employed to test mediating pathways, and association-rule mining (ARM) was used to identify determinants of high- and low-level CBOHR.
Results: The results showed that the COM-B framework, comprising three modules (Capability, Opportunity, and Motivation) with 15 behavior change domains and a Behavior module with eight specific CBOHRs, demonstrated satisfactory fit, reliability, and validity. Bootstrapping confirmed that Motivation fully mediates the relationship between Capability and Behavior and partially mediates the relationship between Opportunity and Behavior. ARM further identified key domains associated with high and low levels of CBOHR.
Conclusion: Strongly correlated item sets identified through association rule analysis revealed domains strongly linked to both high (and low) levels of each CBOHR. This study is the first to integrate the COM-B model with data mining in the context of occupational health, highlighting “motivation–values–policy” as actionable levers for CBOHR interventions. The findings provide preliminary evidence to support the development of scalable worker health programs.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Supplementary figures and tables. T-Trees algorithm: pseudo-code and implementation details. (PDF)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A network approach for discovering spatially associated objects. First, the spatial network model is constructed from the spatial characteristics of the objects. Then, the reachable paths of topological relationships are mined and the weighted associations between objects are calculated by AWTRA. Finally, the Top-K associated objects are obtained based on the association ordering.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Physical fitness refers to the health of all body functions, including cardiorespiratory endurance, muscle strength, flexibility, stamina, and body composition, which can help individuals effectively cope with daily activities and sports challenges. This paper explores the physical characteristics of basketball players, aiming to improve training effects through unique physical evaluation indicators and provide a theoretical framework for improving college basketball performance and training standards. The study adopted the Apriori association rule algorithm in data mining. First, the physical data of basketball players were collected and preprocessed. Then, frequent item sets were extracted through the association rule mining algorithm, association rules were generated, and the key factors affecting the physical performance of athletes were analyzed. The article’s results revealed the potential relationship between different physical characteristics and emphasized the application prospects of association rule mining in the physical evaluation of basketball players.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the new model of China’s dual-circulation economy, the opening-up and deepening of financial markets have imposed higher requirements on the risk management capacity of financial institutions, with the issue of loan customers losing contact and defaulting becoming an urgent concern. Based on desensitized samples of lost-linking customers (with multidimensional features such as communication behavior and loan qualifications), this study uses the FP-Growth algorithm to systematically mine association rules between loss-of-contact features and three modes: “Hide and Seek”, “Flee with the Money”, and “False Disappearance”, providing effective risk management strategies for financial institutions. Through association rule mining, this study reveals significant correlations between some feature combinations and lost-linking modes. The results reveal substantial variations in correlation strength among different feature combinations and lost-linking modes, and the association strength increases significantly with the prolongation of overdue time. The results provide banks with quantitative early warning signs based on feature combinations, which can be applied to risk-grading monitoring systems. The research emphasizes the requirement for combined analysis of multidimensional features and dynamic monitoring in precise risk control.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Immune checkpoint blocking (ICB) immunotherapy has achieved great success in the treatment of various malignancies. Although it has not been approved for the treatment of ovarian cancer (OC), it has been actively tested for this purpose. However, biomarkers that could indicate the immune status of OC and predict the response to ICB are rare. We downloaded RNAseq and clinical data of OC from The Cancer Genome Atlas (TCGA). Data analysis revealed that both TMB-high and immunity-high status were significantly related to better survival in OC. Up-regulated differentially expressed genes (Up-DEGs) were identified by analyzing the gene expression levels. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed with the “GSVA” and “limma” packages in R. The correlation of genes with overall survival was also analyzed by conducting Kaplan-Meier survival analysis. Four genes, CXCL13, FCRLA, MS4A1, and PLA2G2D, were found to be positively correlated with better prognosis of OC and mainly involved in immune response-related pathways. Finally, TIMER and TIDE were used to predict gene immune function and its association with immunotherapy. We found that these four genes were positively correlated with better response to immune checkpoint blockade-based immunotherapy. Altogether, CXCL13, FCRLA, MS4A1, and PLA2G2D may be used as potential therapeutic genes for reflecting OC immune status and predicting response to immunotherapy.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: The biomedical literature is the go-to source of information regarding relationships between biological entities, including genes, diseases, cell types, and drugs, but the rapid pace of publication makes an exhaustive manual exploration impossible. In order to efficiently explore an up-to-date repository of millions of abstracts, we constructed an efficient and modular natural language processing pipeline and applied it to the entire PubMed abstract corpus.
Methods: We developed SciLinker using open-source libraries and pre-trained named entity recognition models to identify human genes, diseases, cell types, and drugs, normalizing these biological entities to the Unified Medical Language System (UMLS). We implemented a scoring schema to quantify the statistical significance of entity co-occurrences and applied a fine-tuned PubMedBERT model for gene-disease relationship extraction.
Results: We identified and analyzed over 30 million association sentences, including more than 11 million gene-disease co-occurrence sentences, revealing more than 1.25 million unique gene-disease associations. We demonstrate SciLinker’s ability to extract specific gene-disease relationships using osteoporosis as a case study. We show how such an analysis benefits target identification, as clinically validated targets are enriched in SciLinker-derived disease-associated genes. Moreover, this co-occurrence data can be used to construct disease-specific networks, providing insights into significant relationships among biological entities from scientific literature.
Conclusion: SciLinker represents a novel text mining approach that extracts and quantifies associations between biomedical entities through co-occurrence analysis and relationship extraction from PubMed abstracts. Its modular design enables expansion to additional entities and text corpora, making it a versatile tool for transforming unstructured biomedical data into actionable insights for drug discovery.