13 datasets found

Apriori algorithm-based association rules.
plos.figshare.com
bin
Updated Aug 8, 2023
Cite
Xin Luo; Jijia Sun; Hong Pan; Dian Zhou; Ping Huang; Jingjing Tang; Rong Shi; Hong Ye; Ying Zhao; An Zhang (2023). Apriori algorithm-based association rules. [Dataset]. http://doi.org/10.1371/journal.pone.0289749.t001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0289749.t001
Dataset updated
Aug 8, 2023
Dataset provided by
PLOS ONE
Authors
Xin Luo; Jijia Sun; Hong Pan; Dian Zhou; Ping Huang; Jingjing Tang; Rong Shi; Hong Ye; Ying Zhao; An Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
In recent years, the prevalence of T2DM has been increasing annually; in particular, the personal and socioeconomic burden caused by multiple complications has become increasingly serious. This study aimed to screen out the high-risk complication combination of T2DM through various data mining methods, and to establish and evaluate a risk prediction model of the complication combination in patients with T2DM. Questionnaire surveys, physical examinations, and biochemical tests were conducted on 4,937 patients with T2DM, and 810 cases of sample data with complications were retained. The high-risk complication combination was screened by association rules based on the Apriori algorithm. Risk factors were screened using the LASSO regression model, random forest model, and support vector machine. A risk prediction model was established using logistic regression analysis, and a dynamic nomogram was constructed. Receiver operating characteristic (ROC) curves, Harrell’s concordance index (C-index), calibration curves, decision curve analysis (DCA), and internal validation were used to evaluate the differentiation, calibration, and clinical applicability of the models. This study found that patients with T2DM had a high-risk combination of lower extremity vasculopathy, diabetic foot, and diabetic retinopathy. Based on this, body mass index, diastolic blood pressure, total cholesterol, triglyceride, 2-hour postprandial blood glucose, and blood urea nitrogen levels were screened and used for the modeling analysis. The areas under the ROC curves of the internal and external validations were 0.768 (95% CI, 0.744−0.792) and 0.745 (95% CI, 0.669−0.820), respectively, and the C-index and AUC value were consistent. The calibration plots showed good calibration, and the risk threshold for DCA was 30–54%. In this study, we developed and evaluated a predictive model for the development of a high-risk complication combination while uncovering the pattern of complications in patients with T2DM. This model has a practical guiding effect on the health management of patients with T2DM in community settings.
Market Basket Analysis
kaggle.com
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Available metadata format: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Dataset updated
Dec 9, 2021
Dataset provided by
Kaggle, http://kaggle.com/
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets a customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all the transactions that happened over a period of time. The retailer will use the results to grow its business and to offer customers itemset suggestions, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rules are most used when you want to find associations between different objects in a set, that is, frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.
- Rule: bought computer mouse => bought mouse mat
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mouse Mat) = 0.80/0.09 = 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
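The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function name rule_metrics and its parameter names are made up for illustration, and it uses the standard definitions, in which the confidence of X => Y divides by the count of X (the antecedent) and lift divides the confidence by P(Y).

```python
from fractions import Fraction

def rule_metrics(n_total, n_x, n_y, n_both):
    """Return (support, confidence, lift) for the rule X => Y from raw counts."""
    support = Fraction(n_both, n_total)         # P(X and Y)
    confidence = Fraction(n_both, n_x)          # P(Y | X): share of X-buyers who also bought Y
    lift = confidence / Fraction(n_y, n_total)  # P(Y | X) / P(Y)
    return support, confidence, lift

# Counts from the example: 100 customers, 10 bought X, 9 bought Y, 8 bought both.
support, confidence, lift = rule_metrics(n_total=100, n_x=10, n_y=9, n_both=8)
print(float(support), float(confidence), round(float(lift), 2))
```

Exact fractions are used so that rounding does not creep into the intermediate steps. A lift well above 1 means the two items are bought together far more often than independence would predict.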

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.

Libraries in R

First, we need to load the required libraries. Each is described briefly below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - An opinionated collection of R packages designed for data science. The package makes it easy to install and load multiple 'tidyverse' packages in a single step.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

Next, we clean the data frame by removing missing values.

To apply association rule mining, we need to convert the data frame into transaction data so that all items bought together on one invoice are in ...
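The cleaning and transformation steps described above can be sketched in plain Python. This is not the author's R code (which relies on the packages listed earlier); it is a stand-in showing the idea of dropping rows with missing values and grouping item names by bill number. The rows below are invented, and to_transactions is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (BillNo, Itemname) rows; the values are invented for illustration.
rows = [
    ("100001", "tea"),
    ("100001", "mug"),
    ("100002", "tea"),
    ("100002", None),   # missing item name, will be dropped
    ("100003", "mug"),
    ("100001", "tea"),  # repeated item on the same bill
]

def to_transactions(rows):
    """Group item names by bill number, skipping rows with a missing field."""
    baskets = defaultdict(set)
    for bill_no, item in rows:
        if bill_no is None or item is None:
            continue  # remove rows with missing values
        baskets[bill_no].add(item.strip())
    # One sorted, de-duplicated item list per bill.
    return {bill: sorted(items) for bill, items in baskets.items()}

transactions = to_transactions(rows)
print(transactions)
```

Each resulting basket is the set of items on one invoice, which is the input shape that association rule algorithms expect.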
The hyperparameters of the apriori algorithm.
plos.figshare.com
xls
Updated Jun 21, 2023
Cite
Saeyeon Cheon; Thanin Methiyothin; Insung Ahn (2023). The hyperparameters of the apriori algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0282119.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0282119.t002
Dataset updated
Jun 21, 2023
Dataset provided by
PLOS ONE
Authors
Saeyeon Cheon; Thanin Methiyothin; Insung Ahn
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The hyperparameters of the apriori algorithm.
‘Groceries dataset ’ analyzed by Analyst-2
analyst-2.ai
Updated Aug 15, 2015
Cite
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com) (2015). ‘Groceries dataset ’ analyzed by Analyst-2 [Dataset]. https://analyst-2.ai/analysis/kaggle-groceries-dataset-b6be/136ba9af/?iid=001-023&v=presentation
Dataset updated
Aug 15, 2015
Dataset authored and provided by
Analyst-2 (analyst-2.ai) / Inspirient GmbH (inspirient.com)
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Analysis of ‘Groceries dataset ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/heeraldedhia/groceries-dataset on 28 January 2022.

--- Dataset description provided by original source is as follows ---

Association Rule Mining

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.

Details of the dataset

The dataset has 38765 rows of the purchase orders of people from the grocery stores. These orders can be analysed and association rules can be generated using Market Basket Analysis by algorithms like Apriori Algorithm.

Apriori Algorithm

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
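The level-wise procedure described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than a reference implementation: the function name apriori, the min_support parameter, and the toy baskets are all assumptions introduced here.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= b for b in baskets) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_support}

    # Level 1: frequent individual items.
    current = keep_frequent({frozenset([i]) for b in baskets for i in b})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Toy baskets, invented for illustration.
toy = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
frequent = apriori(toy, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so larger candidates are built only from itemsets that already passed the threshold.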

An example of Association Rules

Assume there are 100 customers: 10 of them bought milk, 8 bought butter, and 6 bought both.
Rule: bought milk => bought butter
support = P(Milk & Butter) = 6/100 = 0.06
confidence = support/P(Milk) = 0.06/0.10 = 0.60
lift = confidence/P(Butter) = 0.60/0.08 = 7.5

Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Some important terms:

Support: This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.

Confidence: This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y also appears.

Lift: This says how likely item Y is purchased when item X is purchased while controlling for how popular item Y is.
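The three terms above can be written as formulas. The notation here is an assumption added for illustration (T is the set of transactions, X and Y are itemsets); the numbers are those of the milk and butter example, with X = milk and Y = butter.

```latex
\begin{align*}
\operatorname{supp}(X) &= \frac{|\{\, t \in T : X \subseteq t \,\}|}{|T|} \\[4pt]
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}
  = \frac{0.06}{0.10} = 0.60 \\[4pt]
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
  = \frac{0.60}{0.08} = 7.5
\end{align*}
```

Lift is symmetric in X and Y, whereas confidence is not: the confidence of butter => milk would be 0.06/0.08 = 0.75.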

--- Original source retains full ownership of the source dataset ---
Stunting final dataset.
plos.figshare.com
bin
Updated Jan 24, 2025
Cite
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie (2025). Stunting final dataset. [Dataset]. http://doi.org/10.1371/journal.pone.0316452.s001
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0316452.s001
Dataset updated
Jan 24, 2025
Dataset provided by
PLOS ONE
Authors
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Background: Stunting is a vital indicator of chronic undernutrition that reveals a failure to reach linear growth. Investigating growth and nutrition status during adolescence, in addition to infancy and childhood, is crucial. However, the available studies in Ethiopia have usually focused on early childhood and used traditional statistical methods. Therefore, this study aimed to employ multiple machine learning algorithms to identify the most effective model for the prediction of stunting among adolescent girls in Ethiopia.
Methods: A total of 3156 weighted samples of adolescent girls aged 15–19 years were used from the 2016 Ethiopian Demographic and Health Survey dataset. The data were pre-processed, and 80% and 20% of the observations were used for training and testing the model, respectively. Eight machine learning algorithms were considered for model building and comparison. The performance of the predictive models was evaluated using evaluation metrics computed in Python. The synthetic minority oversampling technique was used for data balancing, and the Boruta algorithm was used to identify the best features. Association rule mining using an Apriori algorithm was employed in R to generate the best rules for the association between the independent features and the target feature.
Results: The random forest classifier (sensitivity = 81%, accuracy = 77%, precision = 75%, f1-score = 78%, AUC = 85%) outperformed the other ML algorithms considered in this study in predicting stunting. Region, poor wealth index, no formal education, unimproved toilet facility, rural residence, not using a contraceptive method, religion, age, no media exposure, occupation, and having one or more children were the top attributes for predicting stunting. Association rule mining identified the top seven rules most frequently associated with stunting among adolescent girls in Ethiopia.
Conclusion: The random forest classifier outperformed the other algorithms in predicting stunting and identifying its relevant predictors. The results show that machine learning algorithms can accurately predict stunting, making them potentially valuable as decision-support tools for the relevant stakeholders, and that emphasizing the identified predictors could be an important intervention to halt stunting among adolescent girls.
Generated datasets for frequent itemset mining algorithms - Dataset - LDM
service.tib.eu
Updated Dec 16, 2024
Cite
(2024). Generated datasets for frequent itemset mining algorithms - Dataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/generated-datasets-for-frequent-itemset-mining-algorithms
Dataset updated
Dec 16, 2024
Description
Generated datasets for frequent itemset mining algorithms Apriori, Eclat, and FP-Growth.
Socio-demographic characteristics among adolescent girls in Ethiopia, 2016...
figshare.com
plos.figshare.com
xls
Updated Jan 24, 2025
Cite
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie (2025). Socio-demographic characteristics among adolescent girls in Ethiopia, 2016 EDHS. [Dataset]. http://doi.org/10.1371/journal.pone.0316452.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316452.t001
Dataset updated
Jan 24, 2025
Dataset provided by
PLOS ONE
Authors
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Ethiopia
Description
Socio-demographic characteristics among adolescent girls in Ethiopia, 2016 EDHS.
The data of Apriori algorithm.
plos.figshare.com
txt
Updated May 23, 2024
Cite
Fangyuan Li; Xia Wang; Zenglei Feng; Jian Wang; Mengdi Li; Kun JIANG; Changli ZHAO (2024). The data of Apriori algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0302216.s002
Available download formats: txt
Unique identifier
https://doi.org/10.1371/journal.pone.0302216.s002
Dataset updated
May 23, 2024
Dataset provided by
PLOS ONE
Authors
Fangyuan Li; Xia Wang; Zenglei Feng; Jian Wang; Mengdi Li; Kun JIANG; Changli ZHAO
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Real-time monitoring of the risk status of a vehicle and its driver can assist the early detection and blocking control of single-vehicle accidents. However, complex risk coupling relationships are one of the main features of single-vehicle accidents with a high mortality rate. On the basis of investigating the coupling effect among multiple risk factors and establishing a safety management database covering the whole life cycle of vehicles, a single-vehicle driving risk network (SVDRN) with a three-level threshold was developed, and its topology features were analyzed to assess the importance of nodes. To avoid the one-sidedness of a single indicator, a multi-attribute comprehensive evaluation model was applied to measure the combined effect of the characteristic indicators of node importance. An algorithm for real-time monitoring of vehicle driving risk status was proposed to identify key risk chains. The results revealed that improper operation, speeding, loss of vehicle control, and inefficient driver management were, in that order, the top four risk factors in the comprehensive evaluation of node importance (mean value = 0.185, SD = 0.119). There were minor differences of 0.017 in node importance among environmental factors, among which non-standard road alignment had the largest value. Improper operation and non-standard road alignment formed the most strongly correlated combination of factors affecting road safety, with a support of 51.81% and a confidence of 69.35%. This identification algorithm for key risk chains, which combines node importance and its risk state threshold, can effectively determine the high-frequency risk transmission paths and risk factors through multi-vehicle tests, providing a basis for the centralized management of transport enterprises.
Differences in demographic and clinical characteristics between the no case...
plos.figshare.com
bin
Updated Aug 8, 2023
Cite
Xin Luo; Jijia Sun; Hong Pan; Dian Zhou; Ping Huang; Jingjing Tang; Rong Shi; Hong Ye; Ying Zhao; An Zhang (2023). Differences in demographic and clinical characteristics between the no case and case groups. [Dataset]. http://doi.org/10.1371/journal.pone.0289749.t002
Available download formats: bin
Unique identifier
https://doi.org/10.1371/journal.pone.0289749.t002
Dataset updated
Aug 8, 2023
Dataset provided by
PLOS, http://plos.org/
Authors
Xin Luo; Jijia Sun; Hong Pan; Dian Zhou; Ping Huang; Jingjing Tang; Rong Shi; Hong Ye; Ying Zhao; An Zhang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Differences in demographic and clinical characteristics between the no case and case groups.
Student’s t test of DGAARM and Apriori.
plos.figshare.com
xls
Updated Mar 4, 2024
Cite
Xiaoxuan Wu; Qiang Wen; Jun Zhu (2024). Student’s t test of DGAARM and Apriori. [Dataset]. http://doi.org/10.1371/journal.pone.0299865.t010
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0299865.t010
Dataset updated
Mar 4, 2024
Dataset provided by
PLOS ONE
Authors
Xiaoxuan Wu; Qiang Wen; Jun Zhu
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Understanding air quality requires a comprehensive understanding of its various factors. Most association rule techniques focus on high-frequency terms, ignoring the potential importance of low-frequency terms and wasting storage space unnecessarily. Therefore, a dynamic genetic association rule mining algorithm is proposed in this paper, which combines an improved dynamic genetic algorithm with an association rule mining algorithm to mine the importance of low-frequency terms. Firstly, in the chromosome coding phase of the genetic algorithm, an innovative multi-information coding strategy is proposed, which selectively stores similar values of different levels in one storage unit. It avoids storing all the values at once and facilitates efficient mining of valid rules later. Secondly, by weighting evaluation indicators such as support, confidence, and lift in association rule mining, a new evaluation index is formed, avoiding the need to set a minimum threshold for high-interest rules. Finally, in order to improve the mining performance of the rules, a dynamic crossover rate and mutation rate are set to improve the search efficiency of the algorithm. In the experimental stage, this paper uses the 2016 annual air quality data set of Beijing to verify the effectiveness of the unit point multi-information coding strategy in reducing rule storage space, the effectiveness of mining the rules formed by low-frequency itemsets, and the effectiveness of combining the rule mining algorithm with the swarm intelligence optimization algorithm in terms of search time and convergence. The unit point multi-information coding strategy reduced rule storage space consumption by 50%; the new evaluation index can mine more interesting rules, with interest levels of up to 90%, while also mining the rules formed by lower-frequency terms; and search time was reduced by about 20% compared with some meta-heuristic algorithms, while convergence improved.
Number of association rules generated from the prebiotics dataset with...
plos.figshare.com
xls
Updated Jun 3, 2023
Cite
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande (2023). Number of association rules generated from the prebiotics dataset with various run-time thresholds. [Dataset]. http://doi.org/10.1371/journal.pone.0154493.t002
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0154493.t002
Dataset updated
Jun 3, 2023
Dataset provided by
PLOS ONE
Authors
Disha Tandon; Mohammed Monzoorul Haque; Sharmila S. Mande
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Number of association rules generated using the Apriori rule mining approach on the prebiotics dataset at various values of support count and confidence thresholds. Table also depicts variations in number of rules due to adoption of various strategies that define the minimum abundance threshold for individual taxa to be considered for rule mining.
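How the number of rules shrinks as the thresholds tighten can be illustrated with a short Python sketch. The rules below are invented (support_count, confidence) pairs, not values from the prebiotics dataset, and count_rules is a hypothetical helper name.

```python
from itertools import product

# Hypothetical mined rules as (support_count, confidence) pairs; invented values.
rules = [(12, 0.91), (9, 0.84), (7, 0.95), (5, 0.72), (5, 0.88), (3, 0.99)]

def count_rules(rules, min_count, min_conf):
    """Count rules meeting both the support-count and the confidence threshold."""
    return sum(1 for count, conf in rules if count >= min_count and conf >= min_conf)

# Tabulate the rule count over a grid of thresholds.
grid = {
    (min_count, min_conf): count_rules(rules, min_count, min_conf)
    for min_count, min_conf in product([3, 5, 8], [0.7, 0.8, 0.9])
}
for (min_count, min_conf), n in sorted(grid.items()):
    print(f"support count >= {min_count}, confidence >= {min_conf}: {n} rules")
```

Raising either threshold can only keep the count the same or reduce it, which is why such tables are useful for choosing thresholds that leave a manageable number of rules.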
Table_1_Urban–Rural Differences in Patterns and Associated Factors of...
figshare.com
bin
Updated Jun 1, 2023
Cite
Chichen Zhang; Shujuan Xiao; Lei Shi; Yaqing Xue; Xiao Zheng; Fang Dong; Jiachi Zhang; Benli Xue; Huang Lin; Ping Ouyang (2023). Table_1_Urban–Rural Differences in Patterns and Associated Factors of Multimorbidity Among Older Adults in China: A Cross-Sectional Study Based on Apriori Algorithm and Multinomial Logistic Regression.XLS [Dataset]. http://doi.org/10.3389/fpubh.2021.707062.s001
Available download formats: bin
Unique identifier
https://doi.org/10.3389/fpubh.2021.707062.s001
Dataset updated
Jun 1, 2023
Dataset provided by
Frontiers
Authors
Chichen Zhang; Shujuan Xiao; Lei Shi; Yaqing Xue; Xiao Zheng; Fang Dong; Jiachi Zhang; Benli Xue; Huang Lin; Ping Ouyang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
China
Description
Introduction: Multimorbidity has become one of the key issues in the public health sector. This study aimed to explore the urban–rural differences in patterns and associated factors of multimorbidity in China and to provide scientific reference for the development of health management strategies to reduce health inequality between urban and rural areas.
Methods: A cross-sectional study, which used a multi-stage random sampling method, was conducted among 3,250 participants in the Shanxi province of China. The chi-square test was used to compare the prevalence of chronic diseases among older adults with different demographic characteristics. The Apriori algorithm and multinomial logistic regression were used to explore the patterns and associated factors of multimorbidity among older adults, respectively.
Results: The findings showed that 30.3% of older adults reported multimorbidity, with significantly higher proportions in rural areas. Among urban older adults, 10 binary chronic disease combinations with strong association strength were obtained. In addition, 11 binary chronic disease combinations and three ternary chronic disease combinations with strong association strength were obtained among rural older adults. There is a large gap between rural and urban areas in the patterns and factors associated with multimorbidity.
Conclusions: Multimorbidity was prevalent among older adults, and its patterns mainly consisted of two or three chronic diseases. The patterns and associated factors of multimorbidity varied between urban and rural regions. Expanding the study of urban–rural differences in multimorbidity will help the country formulate more reasonable public health policies to maximize the benefits of medical services for all.
Mining co-occurrence and sequence patterns from cancer diagnoses in New York...
plos.figshare.com
figshare.com
xlsx
Updated May 30, 2023
Cite
Yu Wang; Wei Hou; Fusheng Wang (2023). Mining co-occurrence and sequence patterns from cancer diagnoses in New York State [Dataset]. http://doi.org/10.1371/journal.pone.0194407
Available download formats: xlsx
Unique identifier
https://doi.org/10.1371/journal.pone.0194407
Dataset updated
May 30, 2023
Dataset provided by
PLOS ONE
Authors
Yu Wang; Wei Hou; Fusheng Wang
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
New York
Description
The goal of this study is to discover disease co-occurrence and sequence patterns from large-scale cancer diagnosis histories in New York State. In particular, we want to identify disparities among different patient groups. Our study will provide essential knowledge for clinical researchers to further investigate comorbidities and disease progression for improving the management of multiple diseases. We used inpatient discharge and outpatient visit records from the New York State Statewide Planning and Research Cooperative System (SPARCS) from 2011-2015. We grouped each patient’s visit history to generate diagnosis sequences for the seven most common cancer types. We performed frequent disease co-occurrence mining using the Apriori algorithm, and frequent disease sequence pattern discovery using the cSPADE algorithm. Different types of cancer demonstrated distinct patterns. Disparities in both disease co-occurrence and sequence patterns were observed among patients in different age groups. There were also considerable disparities in disease co-occurrence patterns with respect to different claim types (i.e., inpatient, outpatient, emergency department and ambulatory surgery). Disparities between genders were mostly found where the cancer types were gender specific. Supports of most patterns were usually higher for males than for females. Compared with secondary diagnosis codes, primary diagnosis codes convey more stable results. Two disease sequences consisting of the same diagnoses in different orders usually had different supports. Our results suggest that the methods adopted can generate potentially interesting and clinically meaningful disease co-occurrence and sequence patterns, and identify disparities among various patient groups. These patterns could imply comorbidities and disease progressions.
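The difference between co-occurrence support and sequence support can be illustrated with a toy Python sketch. It is a brute-force count, not the Apriori or cSPADE implementations used in the study; the diagnosis codes "A", "B", "C" and the histories are invented, and both function names are hypothetical.

```python
# Hypothetical diagnosis histories, one time-ordered list of codes per patient.
histories = [
    ["A", "B", "C"],
    ["B", "A"],
    ["A", "C", "B"],
    ["C", "B"],
]

def cooccurrence_support(histories, codes):
    """Fraction of histories containing all the codes, in any order."""
    wanted = set(codes)
    return sum(wanted <= set(h) for h in histories) / len(histories)

def sequence_support(histories, pattern):
    """Fraction of histories containing the pattern as an ordered subsequence."""
    def contains(history):
        remaining = iter(history)
        # 'code in remaining' consumes the iterator, so order is enforced.
        return all(code in remaining for code in pattern)
    return sum(contains(h) for h in histories) / len(histories)

print(cooccurrence_support(histories, ["A", "B"]))  # unordered pair
print(sequence_support(histories, ["A", "B"]))      # A diagnosed before B
print(sequence_support(histories, ["B", "A"]))      # B diagnosed before A
```

In this toy data the pair co-occurs in three of four histories, but the two orderings have different supports, which mirrors the observation that the same diagnoses in different orders usually have different supports.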
34 scholarly articles cite the dataset "Apriori algorithm-based association rules."