19 datasets found

Datasets used for evaluating the customized version of Apriori algorithm.
plos.figshare.com
zip
Updated May 31, 2023
Cite
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Datasets used for evaluating the customized version of Apriori algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.s001
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.1371/journal.pone.0154493.s001
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A zip archive containing microbial abundance tables which were employed for deciphering association rules using the customised version of the Apriori algorithm. (ZIP)
Number of association rules generated using the Apriori rule mining approach...
plos.figshare.com
figshare.com
xls
Updated Jun 1, 2023
Cite
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated using the Apriori rule mining approach with various datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0154493.t001
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Summarised information pertaining to (a) the number of samples, (b) the number of generated association rules (total as well as rules that involve 3 or more genera), (c) the unique number of microbial genera involved in the identified association rules, (d) execution time, and (e) the number of rules generated using an alternative rule mining strategy (detailed in discussion section of the manuscript).
Data Analysis for the Systematic Literature Review of DL4SE
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jul 19, 2024
Cite
Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
Explore at:
Dataset updated
Jul 19, 2024
Dataset provided by
College of William and Mary
Washington and Lee University
Authors
Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts support arguments that inform the Systematic Literature Review of DL4SE.

The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce 35 features into 2 components for visualization purposes. PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance. In other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships on the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

Interpretation/Evaluation. We used the Knowledge Discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

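Support = the number of occurrences in which the statement is true divided by the total number of statements.

Confidence = the support of the statement divided by the support of the premise (equivalently, the number of occurrences in which the statement is true divided by the number of occurrences of the premise).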
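The two definitions above can be illustrated with a minimal Python sketch. The function names and the four toy "papers" are invented for illustration; they are not taken from the DL4SE repository, and the original analysis was run in RapidMiner rather than in Python.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, premise, conclusion):
    """Support of (premise and conclusion) divided by support of the premise."""
    both = support(transactions, set(premise) | set(conclusion))
    prem = support(transactions, premise)
    return both / prem if prem else 0.0


# Hypothetical toy data: each paper is a set of nominal feature values.
papers = [
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "irreproducible"}),
    frozenset({"supervised", "reproducible"}),
    frozenset({"unsupervised", "reproducible"}),
]

# Rule: supervised -> irreproducible
rule_support = support(papers, {"supervised", "irreproducible"})
rule_confidence = confidence(papers, {"supervised"}, {"irreproducible"})
```

In this toy data the rule holds in 2 of 4 papers, and the premise occurs in 3 of 4, so the confidence is 2/3.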
Data Sheet 1_From data to decision: empirical application of machine...
figshare.com
csv
Updated Oct 17, 2025
Cite
Jing Zhao; Yuan Jiang; Xiuhua Zhang; Qing Ye; Qiang Zhao; Xianhua Wu; Linshen Wang (2025). Data Sheet 1_From data to decision: empirical application of machine learning in public space planning along the Grand Canal, Shandong Province, China.csv [Dataset]. http://doi.org/10.3389/fbuil.2025.1643104.s001
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.3389/fbuil.2025.1643104.s001
Dataset updated
Oct 17, 2025
Dataset provided by
Frontiers
Authors
Jing Zhao; Yuan Jiang; Xiuhua Zhang; Qing Ye; Qiang Zhao; Xianhua Wu; Linshen Wang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Shandong, China, Jinghang Waterway
Description
Introduction: In the process of urbanization, public space plays an increasingly important role in improving the livability and sustainability of cities. However, effectively understanding the preferences of different groups for public space and conducting reasonable planning integrated with environmental and infrastructure elements remains a challenge in urban planning. This is because traditional planning methods often fail to fully capture the detailed behavior of residents. Therefore, the purpose of this study was to explore the empirical application of machine learning technology to public space planning along the Grand Canal in Shandong Province (China), analyze the behavior patterns and preferences of residents regarding different public spaces, and thereby provide support for data-driven public space planning.

Methods: Based on survey data from 1008 respondents across 4 cities, this study employed machine learning methods such as K-means clustering, association rule mining, and correlation analysis to investigate the relationships between visitor behavior and the environmental characteristics of public spaces.

Results: The application of these methods yielded several important results. Cluster analysis identified three distinct groups: young and middle-aged local residents with a preference for accessibility, middle-aged and elderly groups enthusiastic about cultural engagement, and diverse transportation users with mixed spatial preferences. Additionally, association rule mining uncovered strong correlations between location types and perceived attributes such as cleanliness and aesthetics. Moreover, correlation analysis indicated statistically significant positive correlations between aesthetics and cleanliness, as well as between safety and cleanliness.

Discussion: This research offers valuable data-driven insights for public space planning and management. It demonstrates that machine learning can effectively identify and quantify key factors influencing public space use. As a result, it provides more accurate policy recommendations for urban planners and ensures that public space planning better meets the needs of different groups. For urban planners, the findings can guide the optimization of facility layouts for specific groups, for instance by adding canal cultural display nodes for cultural engagement groups and improving barrier-free facilities for groups with high accessibility needs, thereby enhancing the inclusiveness and utilization efficiency of public spaces.
Table_1_Predicting Anxiety in Routine Palliative Care Using...
frontiersin.figshare.com
xlsx
Updated Jun 1, 2023
Cite
Oliver Haas; Luis Ignacio Lopera Gonzalez; Sonja Hofmann; Christoph Ostgathe; Andreas Maier; Eva Rothgang; Oliver Amft; Tobias Steigleder (2023). Table_1_Predicting Anxiety in Routine Palliative Care Using Bayesian-Inspired Association Rule Mining.XLSX [Dataset]. http://doi.org/10.3389/fdgth.2021.724049.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/fdgth.2021.724049.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Oliver Haas; Luis Ignacio Lopera Gonzalez; Sonja Hofmann; Christoph Ostgathe; Andreas Maier; Eva Rothgang; Oliver Amft; Tobias Steigleder
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
We propose a novel knowledge extraction method based on Bayesian-inspired association rule mining to classify anxiety in heterogeneous, routinely collected data from 9,924 palliative patients. The method extracts association rules mined using lift and local support as selection criteria. The extracted rules are used to assess the maximum evidence supporting and rejecting anxiety for each patient in the test set. We evaluated the predictive accuracy by calculating the area under the receiver operating characteristic curve (AUC). The evaluation produced an AUC of 0.89 and a set of 55 atomic rules with one item in the premise and the conclusion, respectively. The selected rules include variables like pain, nausea, and various medications. Our method outperforms the previous state of the art (AUC = 0.72). We analyzed the relevance and novelty of the mined rules. Palliative experts were asked about the correlation between variables in the data set and anxiety. By comparing expert answers with the retrieved rules, we grouped rules into expected and unexpected ones and found several rules for which experts' opinions and the data-backed rules differ, most notably with the patients' sex. The proposed method offers a novel way to predict anxiety in palliative settings using routinely collected data with an explainable and effective model based on Bayesian-inspired association rule mining. The extracted rules give further insight into potential knowledge gaps in the palliative care field.
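Lift, one of the two selection criteria named above, can be sketched as follows. This is the standard textbook definition of lift computed from raw counts. The function name and the example counts are invented for illustration, and the paper's "local support" criterion is not reproduced here because the listing does not define it.

```python
def lift(n_total, n_premise, n_conclusion, n_both):
    """Standard lift of a rule premise -> conclusion, from raw counts.

    lift = P(premise and conclusion) / (P(premise) * P(conclusion)),
    which simplifies to n_both * n_total / (n_premise * n_conclusion).
    Values above 1 mean the two co-occur more often than independence predicts.
    """
    if n_premise == 0 or n_conclusion == 0:
        return 0.0
    return (n_both * n_total) / (n_premise * n_conclusion)


# Hypothetical counts: 100 records, 40 with the premise item,
# 25 with the conclusion item, 20 with both.
example_lift = lift(n_total=100, n_premise=40, n_conclusion=25, n_both=20)
```

With these invented counts the rule has a lift of 2.0, meaning the conclusion is twice as frequent among records that satisfy the premise as in the data overall.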
Data from: Extended Comprehensive Study of Association Measures for Fault...
researchdata.smu.edu.sg
figshare.com
zip
Updated May 30, 2023
Cite
LUCIA Lucia; David LO; Lingxiao JIANG; Ferdian THUNG; Aditya BUDI (2023). Data from: Extended Comprehensive Study of Association Measures for Fault Localization [Dataset]. http://doi.org/10.25440/smu.12062814.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.25440/smu.12062814.v1
Dataset updated
May 30, 2023
Dataset provided by
SMU Research Data Repository (RDR)
Authors
LUCIA Lucia; David LO; Lingxiao JIANG; Ferdian THUNG; Aditya BUDI
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This record contains the underlying research data for the publication "Extended Comprehensive Study of Association Measures for Fault Localization"; the full text is available from https://ink.library.smu.edu.sg/sis_research/1818

Spectrum-based fault localization is a promising approach to automatically locate root causes of failures quickly. Two well-known spectrum-based fault localization techniques, Tarantula and Ochiai, measure how likely a program element is a root cause of failures based on profiles of correct and failed program executions. These techniques are conceptually similar to association measures that have been proposed in statistics and data mining and have been utilized to quantify the relationship strength between two variables of interest (e.g., the use of a medicine and the cure rate of a disease). In this paper, we view fault localization as a measurement of the relationship strength between the execution of program elements and program failures. We investigate the effectiveness of 40 association measures from the literature on locating bugs. Our empirical evaluations involve single-bug and multiple-bug programs. We find there is no best single measure for all cases. Klosgen and Ochiai outperform other measures for localizing single-bug programs. When localizing multiple-bug programs, Added Value could localize the bugs with the smallest percentage of inspected code on average, whereas a number of other measures have similar performance. The accuracies of the measures in localizing multi-bug programs are lower than for single-bug programs, which motivates future research.
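As a rough illustration of how Tarantula and Ochiai score program elements, the sketch below uses the formulas as they are commonly stated in the fault-localization literature. The formulas are not given in this record, and the statement names and coverage counts are invented.

```python
import math


def ochiai(ef, ep, total_failed):
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep)).

    ef / ep = number of failed / passed runs that executed the element.
    """
    denom = math.sqrt(total_failed * (ef + ep))
    return ef / denom if denom else 0.0


def tarantula(ef, ep, total_failed, total_passed):
    """Tarantula suspiciousness: failed ratio over (failed ratio + passed ratio)."""
    fail_ratio = ef / total_failed if total_failed else 0.0
    pass_ratio = ep / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0


# Hypothetical spectrum: statement -> (failed runs covering it, passed runs covering it)
spectrum = {"s1": (2, 1), "s2": (1, 6), "s3": (0, 8)}
TOTAL_FAILED, TOTAL_PASSED = 2, 8

# Rank statements from most to least suspicious under Ochiai.
ranking = sorted(
    spectrum,
    key=lambda s: ochiai(spectrum[s][0], spectrum[s][1], TOTAL_FAILED),
    reverse=True,
)
```

In this toy spectrum, `s1` is executed by both failing runs and only one passing run, so both measures rank it first. Other association measures could be swapped in behind the same ranking step.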
Number of association rules generated from the HMP (male) dataset with...
plos.figshare.com
xls
Updated Jun 1, 2023
Cite
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated from the HMP (male) dataset with various run-time thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0154493.t003
Dataset updated
Jun 1, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of association rules generated using the Apriori rule mining approach on the HMP (male) dataset at various values of support count and confidence thresholds. Table also depicts variations in number of rules due to adoption of various strategies that define the minimum abundance threshold for individual taxa to be considered for rule mining.
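To make the role of a minimum abundance threshold concrete, here is a small sketch that turns an abundance table into presence/absence transactions and counts itemset support. It assumes one simple strategy (a fixed cut-off per taxon). The listing does not say which strategies the authors actually compared, and the genera and abundance values are invented.

```python
def to_transactions(abundance_table, min_abundance):
    """Keep, for each sample, only the genera whose abundance meets the cut-off."""
    return [
        frozenset(genus for genus, value in sample.items() if value >= min_abundance)
        for sample in abundance_table
    ]


def support_count(transactions, itemset):
    """Number of samples in which all genera in `itemset` are present."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


# Hypothetical relative abundances for three samples.
samples = [
    {"Bacteroides": 0.40, "Prevotella": 0.02, "Roseburia": 0.10},
    {"Bacteroides": 0.35, "Prevotella": 0.20, "Roseburia": 0.00},
    {"Bacteroides": 0.05, "Prevotella": 0.30, "Roseburia": 0.08},
]

strict = to_transactions(samples, min_abundance=0.10)
lenient = to_transactions(samples, min_abundance=0.01)

pair = {"Bacteroides", "Prevotella"}
strict_count = support_count(strict, pair)
lenient_count = support_count(lenient, pair)
```

The same pair of genera reaches a support count of 1 under the strict cut-off and 3 under the lenient one, which is why the number of rules that pass a given support threshold can change with the abundance strategy.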
Data from: Correlation between the green-island phenotype and Wolbachia...
datasetcatalog.nlm.nih.gov
search.dataone.org
Updated Jun 22, 2016
Cite
Lopez-Vaamonde, Carlos; Gutzwiller, Florence; Dedeine, Franck; Giron, David; Kaiser, Wilfried (2016). Correlation between the green-island phenotype and Wolbachia infections during the evolutionary diversification of Gracillariidae leaf-mining moths [Dataset]. http://doi.org/10.5061/dryad.q4747
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.q4747
Dataset updated
Jun 22, 2016
Authors
Lopez-Vaamonde, Carlos; Gutzwiller, Florence; Dedeine, Franck; Giron, David; Kaiser, Wilfried
Description
Internally feeding herbivorous insects such as leaf miners have developed the ability to manipulate the physiology of their host plants in a way to best meet their metabolic needs and compensate for variation in food nutritional composition. For instance, some leaf miners can induce green-islands on yellow leaves in autumn, which are characterized by photosynthetically active green patches in otherwise senescing leaves. It has been shown that endosymbionts, and most likely bacteria of the genus Wolbachia, play an important role in green-island induction in the apple leaf-mining moth Phyllonorycter blancardella. However, it is currently not known how widespread this moth-Wolbachia-plant interaction is. Here, we studied the co-occurrence between Wolbachia and the green-island phenotype in 133 moth specimens belonging to 74 species of Lepidoptera including 60 Gracillariidae leaf miners. Using a combination of molecular phylogenies and ecological data (occurrence of green-islands), we show that the acquisitions of the green-island phenotype and Wolbachia infections have been associated through the evolutionary diversification of Gracillariidae. We also found intraspecific variability in both green-island formation and Wolbachia infection, with some species being able to form green-islands without being infected by Wolbachia. In addition, Wolbachia variants belonging to both A and B supergroups were found to be associated with green-island phenotype suggesting several independent origins of green-island induction. This study opens new prospects and raises new questions about the ecology and evolution of the tripartite association between Wolbachia, leaf miners, and their host plants.
Data from: Empirical Study of the Relationship between Design Patterns and...
zenodo.org
Updated Apr 2, 2020
Cite
Mahmoud Alfadel; Khalid Al-Jasser; Mohammad Alshayeb (2020). Empirical Study of the Relationship between Design Patterns and Code Smells [Dataset]. http://doi.org/10.5281/zenodo.3633081
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.3633081
Dataset updated
Apr 2, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mahmoud Alfadel; Khalid Al-Jasser; Mohammad Alshayeb
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Software systems are often developed in such a way that good practices in the object-oriented paradigm are not met, causing the occurrence of specific disharmonies, which are sometimes called code smells. Design patterns catalogue best practices for developing object-oriented software systems. Although code smells and design patterns are widely divergent, there might be a co-occurrence relation between them. The objective of this paper is to empirically evaluate if the presence of design patterns is related to the presence of code smells at different granularity levels. We performed an empirical replication study using 20 design patterns, and 13 code smells in ten small-size to medium-size, open-source Java-based systems. We applied statistical analysis and association rules. Results confirm that classes participating in design patterns have less smell-proneness and smell frequency than classes not participating in design patterns. We also noticed that every design pattern category acts in the same way in terms of smell-proneness in the subject systems. However, we observed, based on the association rules learning and the proposed validation technique, that some patterns may be associated with certain smells in some cases. For instance, Command patterns can co-occur with God Class, Blob, and External Duplication smells.

The published data set contains the following:

List of the selected systems (source code files)

The P-MARt: the design pattern repository as XML for the selected systems.

Data of design patterns and code smells: We processed this data by parsing the design pattern XML file and running the smell detection tool (inFusion).

The data of the data mining analysis.
Characteristics of real datsets and parameter settings.
plos.figshare.com
xls
Updated May 31, 2023
Cite
Peng Cheng; Chun-Wei Lin; Jeng-Shyang Pan (2023). Characteristics of real datsets and parameter settings. [Dataset]. http://doi.org/10.1371/journal.pone.0127834.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0127834.t002
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Peng Cheng; Chun-Wei Lin; Jeng-Shyang Pan
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Characteristics of real datsets and parameter settings.
Food Reviews - Text Mining & Sentiment Analysis
kaggle.com
zip
Updated Aug 4, 2023
Cite
vikram amin (2023). Food Reviews - Text Mining & Sentiment Analysis [Dataset]. https://www.kaggle.com/datasets/vikramamin/food-reviews-text-mining-and-sentiment-analysis
Explore at:
Available download formats: zip (1075643 bytes)
Dataset updated
Aug 4, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Brief Description: The Chief Marketing Officer (CMO) of Healthy Foods Inc. wants to understand customer sentiments about the specialty foods that the company offers. This information has been collected through customer reviews on their website. The dataset consists of about 5000 reviews. They want the answers to the following questions:
1. What are the most frequently used words in the customer reviews?
2. How can the data be prepared for text analysis?
3. What are the overall sentiments towards the products?

We will be using text mining and sentiment analysis (R programming) to offer insights to the CMO with regards to the food reviews

Steps:
- Set the working directory and read the data.
- Data cleaning. Check for missing values and data types of variables.
- Run the required libraries ("tm", "SnowballC", "dplyr", "sentimentr", "wordcloud2", "RColorBrewer").
- TEXT ACQUISITION and AGGREGATION. Create corpus.
- TEXT PRE-PROCESSING. Cleaning the text:
- Replace special characters with " ". We use the tm_map function for this purpose.
- Make all the alphabets lower case.
- Remove punctuations.
- Remove whitespace.
- Remove stopwords.
- Remove numbers.
- Stem the document.
- Create term document matrix.
- Convert into matrix and find out frequency of words.
- Convert into a data frame.
- TEXT EXPLORATION. Find out the words which appear most frequently and least frequently.
- Create Wordcloud.

TEXT MODELLING

Word association: find words that tend to appear together. Here we try to find the associations for the top three occurring words, "like", "tast", and "flavor", by setting a correlation limit of 0.2.

"like" has an association with "realli" (they appear about 25% of the time together), "dont" (24%), and "one" (21%)

"tast" does not have an association with any word with the set correlation limit

"flavor" has an association with the word "chip"(they appear about 27% of the time together)
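The idea of keeping only terms whose association with a target word reaches a correlation limit can be sketched as below. The original analysis was done in R with the "tm" package. This Python version is a stand-in that uses Pearson correlation between per-document term counts. The helper names and the four stemmed toy reviews are invented, so the numbers will not match the ones reported above.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0


def associations(docs, term, limit):
    """Terms whose count vector correlates with `term` at or above `limit`."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    counts = {word: [doc.split().count(word) for doc in docs] for word in vocab}
    result = {}
    for word in vocab:
        if word == term:
            continue
        r = pearson(counts[term], counts[word])
        if r >= limit:
            result[word] = round(r, 2)
    return result


# Hypothetical, already stemmed and cleaned reviews.
reviews = [
    "like flavor chip",
    "realli like tast",
    "flavor chip good",
    "realli like realli like",
]

like_assoc = associations(reviews, "like", limit=0.2)
```

Here only "realli" clears the 0.2 limit for "like"; words that never co-vary with the target, or that vary in the opposite direction, are dropped.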

Sentiment analysis

element_id refers to the review number, sentence_id refers to the sentence number within the review, and word_count refers to the number of words in that sentence. Sentiment would be either positive or negative.

Let us find out the overall sentiment score of all the reviews.

This indicates that the entire food review document has a marginally positive score

Let us find out the sentiment score for each of the 5000 reviews.

(-1) indicates the most extreme negative sentiment and (+1) indicates the most extreme positive sentiment

Let us create a separate data frame for all the negative sentiments. In total there are 726 negative sentiments out of the total 5000 reviews (approx 15%).
Number of association rules generated from the prebiotics dataset with...
plos.figshare.com
figshare.com
xls
Updated Jun 3, 2023
Cite
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated from the prebiotics dataset with various run-time thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0154493.t002
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of association rules generated using the Apriori rule mining approach on the prebiotics dataset at various values of support count and confidence thresholds. Table also depicts variations in number of rules due to adoption of various strategies that define the minimum abundance threshold for individual taxa to be considered for rule mining.
Table 1_Coping behavior toward occupational health risks among construction...
frontiersin.figshare.com
docx
Updated Sep 12, 2025
Cite
Xuesong Yang; Yuyan Ling; Liqun Wang; Yiqi Li; Mingrong Zeng (2025). Table 1_Coping behavior toward occupational health risks among construction workers: determinant identification using the COM-B model and data mining analysis.docx [Dataset]. http://doi.org/10.3389/fpubh.2025.1643332.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpubh.2025.1643332.s001
Dataset updated
Sep 12, 2025
Dataset provided by
Frontiers Media (http://www.frontiersin.org/)
Authors
Xuesong Yang; Yuyan Ling; Liqun Wang; Yiqi Li; Mingrong Zeng
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: China has the largest construction workforce in the world but faces severe occupational health challenges. Coping behaviors related to occupational health risks (CBOHR) are key to mitigating these hazards but remain understudied.

Materials and methods: A cross-sectional survey of 484 construction workers was conducted to assess Capability, Opportunity, Motivation, and Behavior using the COM-B model. Structural equation modeling (SEM) was employed to test mediating pathways, and association-rule mining (ARM) was used to identify determinants of high- and low-level CBOHR.

Results: The results showed that the COM-B framework, comprising three modules (Capability, Opportunity, and Motivation) with 15 behavior change domains and a Behavior module with eight specific CBOHRs, demonstrated satisfactory fit, reliability, and validity. Bootstrapping confirmed that Motivation fully mediates the relationship between Capability and Behavior and partially mediates the relationship between Opportunity and Behavior. ARM further identified key domains associated with high and low levels of CBOHR.

Conclusion: Strongly correlated item sets identified through association rule analysis revealed domains strongly linked to both high (and low) levels of each CBOHR. This study is the first to integrate the COM-B model with data mining in the context of occupational health, highlighting “motivation–values–policy” as actionable levers for CBOHR interventions. The findings provide preliminary evidence to support the development of scalable worker health programs.
Supporting Information S1 - Exploiting SNP Correlations within Random Forest...
plos.figshare.com
pdf
Updated May 31, 2023
Cite
Vincent Botta; Gilles Louppe; Pierre Geurts; Louis Wehenkel (2023). Supporting Information S1 - Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies [Dataset]. http://doi.org/10.1371/journal.pone.0093379.s001
Explore at:
Available download formats: pdf
Unique identifier
https://doi.org/10.1371/journal.pone.0093379.s001
Dataset updated
May 31, 2023
Dataset provided by
PLOS (http://plos.org/)
Authors
Vincent Botta; Gilles Louppe; Pierre Geurts; Louis Wehenkel
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Supplementary figures and tables. T-Trees algorithm: pseudo-code and implementation details. (PDF)
Data from: A network approach for discovering spatially associated objects
figshare.com
xls
Updated Apr 13, 2024
Cite
Tao Liang (2024). A network approach for discovering spatially associated objects [Dataset]. http://doi.org/10.6084/m9.figshare.25295452.v4
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.6084/m9.figshare.25295452.v4
Dataset updated
Apr 13, 2024
Dataset provided by
Figshare http://figshare.com/
Authors
Tao Liang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A network approach for discovering spatially associated objects. First, a spatial network model is constructed from the spatial characteristics of the objects. Next, reachable paths over topological relationships are mined and the weighted associations between objects are calculated with AWTRA. Finally, the top-K associated objects are obtained from the association ranking.
Transactions.
plos.figshare.com
xls
Updated Jul 29, 2025
+ more versions
Cite
Yongkang Ding (2025). Transactions. [Dataset]. http://doi.org/10.1371/journal.pone.0325925.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325925.t002
Dataset updated
Jul 29, 2025
Dataset provided by
PLOS http://plos.org/
Authors
Yongkang Ding
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Physical fitness refers to the health of all body functions, including cardiorespiratory endurance, muscle strength, flexibility, stamina, and body composition, which can help individuals effectively cope with daily activities and sports challenges. This paper explores the physical characteristics of basketball players, aiming to improve training effects through unique physical evaluation indicators and provide a theoretical framework for improving college basketball performance and training standards. The study adopted the Apriori association rule algorithm in data mining. First, the physical data of basketball players were collected and preprocessed. Then, frequent item sets were extracted through the association rule mining algorithm, association rules were generated, and the key factors affecting the physical performance of athletes were analyzed. The article’s results revealed the potential relationship between different physical characteristics and emphasized the application prospects of association rule mining in the physical evaluation of basketball players.
Data after data processing.
plos.figshare.com
xls
Updated Sep 23, 2025
Cite
Jiaqi Wang; Xiaolong Jiang; Yizhou He; Biyu Guan; Chao Deng (2025). Data after data processing. [Dataset]. http://doi.org/10.1371/journal.pone.0332623.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0332623.t003
Dataset updated
Sep 23, 2025
Dataset provided by
PLOS ONE
Authors
Jiaqi Wang; Xiaolong Jiang; Yizhou He; Biyu Guan; Chao Deng
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In the new model of China’s dual-circulation economy, the opening-up and deepening of financial markets have imposed higher requirements on the risk management capacity of financial institutions, with the issue of loan customers losing contact and defaulting becoming an urgent concern. Based on desensitized samples of lost-linking customers (with multidimensional features such as communication behavior and loan qualifications), this study uses the FP-Growth algorithm to systematically mine association rules between loss-of-contact features and three modes: “Hide and Seek”, “Flee with the Money”, and “False Disappearance”, providing effective risk management strategies for financial institutions. Through association rule mining, this study reveals significant correlations between some feature combinations and lost-linking modes. The results reveal substantial variations in correlation strength among different feature combinations and lost-linking modes, and the association strength increases significantly with the prolongation of overdue time. The results provide banks with quantitative early warning signs based on feature combinations, which can be applied to risk-grading monitoring systems. The research emphasizes the requirement for combined analysis of multidimensional features and dynamic monitoring in precise risk control.
Table2_Identification of a Gene Set Correlated With Immune Status in Ovarian...
frontiersin.figshare.com
xls
Updated Jun 10, 2023
+ more versions
Cite
Lili Fan; Han Lei; Ying Lin; Zhengwei Zhou; Guang Shu; Zhipeng Yan; Haotian Chen; Tianxiang Zhang; Gang Yin (2023). Table2_Identification of a Gene Set Correlated With Immune Status in Ovarian Cancer by Transcriptome-Wide Data Mining.xls [Dataset]. http://doi.org/10.3389/fmolb.2021.670666.s003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.3389/fmolb.2021.670666.s003
Dataset updated
Jun 10, 2023
Dataset provided by
Frontiers Media http://www.frontiersin.org/
Authors
Lili Fan; Han Lei; Ying Lin; Zhengwei Zhou; Guang Shu; Zhipeng Yan; Haotian Chen; Tianxiang Zhang; Gang Yin
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Immune checkpoint blocking (ICB) immunotherapy has achieved great success in the treatment of various malignancies. Although ICB has not been approved for the treatment of ovarian cancer (OC), it has been actively tested for this purpose. However, biomarkers that could indicate the immune status of OC and predict the response to ICB are rare. We downloaded RNAseq and clinical data of OC from The Cancer Genome Atlas (TCGA). Data analysis revealed that both TMBhigh and immunityhigh were significantly related to better survival in OC. Up-regulated differentially expressed genes (Up-DEGs) were identified by analyzing the gene expression levels. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed with the “GSVA” and “limma” packages in R. The correlation of genes with overall survival was also analyzed by Kaplan-Meier survival analysis. Four genes, CXCL13, FCRLA, MS4A1, and PLA2G2D, were found to be positively correlated with better prognosis of OC and mainly involved in immune response-related pathways. Finally, TIMER and TIDE were used to predict gene immune function and its association with immunotherapy. We found that these four genes were positively correlated with better response to immune checkpoint blockade-based immunotherapy. Altogether, CXCL13, FCRLA, MS4A1, and PLA2G2D may be used as potential therapeutic genes for reflecting OC immune status and predicting response to immunotherapy.
Table 1_SciLinker: a large-scale text mining framework for mapping...
frontiersin.figshare.com
xlsx
Updated Mar 19, 2025
Cite
Dongyu Liu; Cora Ames; Shameer Khader; Franck Rapaport (2025). Table 1_SciLinker: a large-scale text mining framework for mapping associations among biological entities.xlsx [Dataset]. http://doi.org/10.3389/frai.2025.1528562.s001
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.3389/frai.2025.1528562.s001
Dataset updated
Mar 19, 2025
Dataset provided by
Frontiers
Authors
Dongyu Liu; Cora Ames; Shameer Khader; Franck Rapaport
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Introduction: The biomedical literature is the go-to source of information regarding relationships between biological entities, including genes, diseases, cell types, and drugs, but the rapid pace of publication makes an exhaustive manual exploration impossible. In order to efficiently explore an up-to-date repository of millions of abstracts, we constructed an efficient and modular natural language processing pipeline and applied it to the entire PubMed abstract corpus. Methods: We developed SciLinker using open-source libraries and pre-trained named entity recognition models to identify human genes, diseases, cell types and drugs, normalizing these biological entities to the Unified Medical Language System (UMLS). We implemented a scoring schema to quantify the statistical significance of entity co-occurrences and applied a fine-tuned PubMedBERT model for gene-disease relationship extraction. Results: We identified and analyzed over 30 million association sentences, including more than 11 million gene-disease co-occurrence sentences, revealing more than 1.25 million unique gene-disease associations. We demonstrate SciLinker’s ability to extract specific gene-disease relationships using osteoporosis as a case study. We show how such an analysis benefits target identification as clinically validated targets are enriched in SciLinker-derived disease-associated genes. Moreover, this co-occurrence data can be used to construct disease-specific networks, providing insights into significant relationships among biological entities from scientific literature. Conclusion: SciLinker represents a novel text mining approach that extracts and quantifies associations between biomedical entities through co-occurrence analysis and relationship extraction from PubMed abstracts. Its modular design enables expansion to additional entities and text corpora, making it a versatile tool for transforming unstructured biomedical data into actionable insights for drug discovery.