This dataset was created by Shashank
http://www.gnu.org/licenses/lgpl-3.0.html
This dataset is from my paper:
Heaton, J. (2016, March). Comparing dataset characteristics that favor the Apriori, Eclat or FP-Growth frequent itemset mining algorithms. In SoutheastCon 2016 (pp. 1-7). IEEE.
Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.
We generated two datasets that allow us to adjust two independent variables to create a total of 20 different transaction sets. We also provide the Python script that generated this data in a notebook. This Python script accepts the following parameters to specify the transaction set to produce:
Files contained in this dataset reside in two folders:
* freq-items-pct - We vary the frequent set density in these transaction sets.
* freq-items-tsz - We change the maximum number of items per basket in these transaction sets.
While you can vary basket count, the number of frequent sets, and the number of items in the script, they will remain fixed at this paper's above values. We determined that the basket count only had a small positive correlation.
The following listing shows the type of data generated for this research. Here we present an example file created with ten baskets out of 100 items, two frequent itemsets, a maximum basket size of 10, and a density of 0.5.
I36 I94
I71 I13 I91 I89 I34
F6 F5 F3 F4
I86
I39 I16 I49 I62 I31 I54 I91
I22 I31
I70 I85 I78 I63
F4 F3 F1 F6 F0 I69 I44
I82 I50 I9 I31 I57 I20
F4 F3 F1 F6 F0 I87
As you can see from the above file, the items are prefixed with either “I” or “F.” The “F” prefix indicates that the line contains one of the intentional frequent itemsets. Items with the “I” prefix are not part of an intentional frequent itemset. Of course, “I”-prefixed items might still form frequent itemsets by chance, as they are uniformly sampled from the pool of items to fill out the baskets. Each basket has a randomly chosen size, up to the maximum basket size. The frequent itemset density specifies the probability that each line contains one of the intentional frequent itemsets. Because we used a density of 0.5, approximately half of the lines above include one of the two intentional frequent itemsets. A frequent itemset line may have additional random “I”-prefixed items added so that the line reaches its randomly chosen length. If the selected frequent itemset causes the line to exceed its randomly chosen length, no truncation occurs. The intentional frequent itemsets are all constructed to be no larger than the maximum basket size.
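The generation rules described above can be sketched in Python roughly as follows. This is not the script from the paper's notebook; it is a minimal illustration written from the prose description. The function name `generate_baskets`, its parameter names, and the choice to draw the "F" items from a pool of `max_basket_size` items are all assumptions made for the example.

```python
import random


def generate_baskets(basket_count, item_count, frequent_set_count,
                     max_basket_size, density, seed=None):
    """Illustrative sketch only; names and the F-item pool size are assumptions."""
    rng = random.Random(seed)

    # Assumption: the intentional frequent itemsets are drawn from a small
    # pool of "F" items, so no itemset can exceed max_basket_size.
    f_pool = [f"F{i}" for i in range(max_basket_size)]
    frequent_sets = [
        rng.sample(f_pool, rng.randint(1, max_basket_size))
        for _ in range(frequent_set_count)
    ]

    baskets = []
    for _ in range(basket_count):
        # Each basket gets a random target size up to the maximum.
        target = rng.randint(1, max_basket_size)
        basket = []
        # With probability `density`, start from an intentional frequent itemset.
        if rng.random() < density:
            basket.extend(rng.choice(frequent_sets))
        # Pad with uniformly sampled "I" items; a frequent itemset longer
        # than the target is kept whole (no truncation).
        fill = min(max(0, target - len(basket)), item_count)
        basket.extend(f"I{i}" for i in rng.sample(range(item_count), fill))
        baskets.append(basket)
    return baskets, frequent_sets


# Same settings as the example file: ten baskets, 100 items,
# two frequent itemsets, maximum basket size 10, density 0.5.
baskets, frequent_sets = generate_baskets(10, 100, 2, 10, 0.5, seed=1)
for basket in baskets:
    print(" ".join(basket))
```

Passing a fixed `seed` makes the output reproducible, which is convenient when comparing algorithm runtimes on the same transaction set.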
Within the confines of this document, we embark on a comprehensive journey delving into the intricacies of a dataset meticulously curated for the purpose of association rules mining. This sophisticated data mining technique is a linchpin in the realms of market basket analysis. The dataset in question boasts an array of items commonly found in retail transactions, each meticulously encoded as a binary variable, with "1" denoting presence and "0" indicating absence in individual transactions.
Our dataset unfolds as an opulent tapestry of distinct columns, each dedicated to the representation of a specific item:
The raison d'être of this dataset is to serve as a catalyst for the discovery of intricate associations and patterns concealed within the labyrinthine network of customer transactions. Each row in this dataset mirrors a solitary transaction, while the values within each column serve as sentinels, indicating whether a particular item was welcomed into a transaction's embrace or relegated to the periphery.
The data within this repository is rendered in a binary symphony, where the enigmatic "1" enunciates the acquisition of an item, and the stoic "0" signifies its conspicuous absence. This binary manifestation serves to distill the essence of the dataset, centering the focus on item presence, rather than the quantum thereof.
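As a rough illustration of this binary format, the sketch below converts a list of transactions into one 0/1 column per item. The helper name `one_hot_encode` and the item names are hypothetical and are not taken from the dataset itself.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one 0/1 row per transaction."""
    columns = sorted({item for basket in transactions for item in basket})
    rows = []
    for basket in transactions:
        present = set(basket)  # quantity is ignored; only presence matters
        rows.append([1 if item in present else 0 for item in columns])
    return columns, rows


# Hypothetical transactions used only to show the shape of the encoding.
transactions = [
    ["bread", "milk"],
    ["milk"],
    ["bread", "eggs", "milk"],
]
columns, rows = one_hot_encode(transactions)
print(columns)
for row in rows:
    print(row)
```

Because each cell records only presence, buying the same item twice in one transaction still yields a single "1".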
This dataset unfurls its wings to encompass an assortment of prospective applications, including but not limited to:
The treasure trove of this dataset beckons the deployment of quintessential techniques, among them the venerable Apriori and FP-Growth algorithms. These stalwart algorithms are proficient at ferreting out the elusive frequent itemsets and invaluable association rules, shedding light on the arcane symphony of customer behavior and item co-occurrence patterns.
In closing, the association rules dataset unfurled before you offers an alluring odyssey, replete with the promise of discovering priceless patterns and affiliations concealed within the tapestry of transactional data. Through the artistry of data mining algorithms, businesses and analysts stand poised to unearth hitherto latent insights capable of steering the helm of strategic decisions, elevating the pantheon of customer experiences, and orchestrating the symphony of operational optimization.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data are provided to illustrate methods in evaluating systematic transactional data reuse in machine learning. A library account-based recommender system was developed using machine learning processing over transactional data of 383,828 transactions (or check-outs) sourced from a large multi-unit research library. The machine learning process utilized the FP-growth algorithm over the subject metadata associated with physical items that were checked-out together in the library. The purpose of this research is to evaluate the results of systematic transactional data reuse in machine learning. The analysis herein contains a large-scale network visualization of 180,441 subject association rules and corresponding node metrics.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. Spark application of the e-learning recommender system.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Mined rules by FP-Growth algorithm.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the new model of China’s dual-circulation economy, the opening-up and deepening of financial markets have imposed higher requirements on the risk management capacity of financial institutions, with the issue of loan customers losing contact and defaulting becoming an urgent concern. Based on desensitized samples of lost-linking customers (with multidimensional features such as communication behavior and loan qualifications), this study uses the FP-Growth algorithm to systematically mine association rules between loss-of-contact features and three modes: “Hide and Seek”, “Flee with the Money”, and “False Disappearance”, providing effective risk management strategies for financial institutions. Through association rule mining, this study reveals significant correlations between some feature combinations and lost-linking modes. The results reveal substantial variations in correlation strength among different feature combinations and lost-linking modes, and the association strength increases significantly with the prolongation of overdue time. The results provide banks with quantitative early warning signs based on feature combinations, which can be applied to risk-grading monitoring systems. The research emphasizes the requirement for combined analysis of multidimensional features and dynamic monitoring in precise risk control.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains transactional data collected for market basket analysis. Each row represents a single transaction with items purchased together. It is ideal for implementing association rule mining techniques such as Apriori, FP-Growth, and other machine learning algorithms.
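To make the idea of an association rule concrete, the following sketch computes support, confidence, and lift for item pairs by brute-force counting. It is neither Apriori nor FP-Growth (both of which prune or compress the search and handle larger itemsets); it only shows the quantities those algorithms report. The function name `pair_rules` and the sample transactions are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(transactions)
    baskets = [set(t) for t in transactions]
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]    # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # confidence / P(rhs)
            rules.append((lhs, rhs, support, confidence, lift))
    return rules


# Hypothetical baskets, one list per transaction.
transactions = [
    ["bread", "milk"],
    ["bread", "milk", "eggs"],
    ["milk"],
    ["bread"],
]
for lhs, rhs, support, confidence, lift in pair_rules(transactions):
    print(f"{lhs} -> {rhs}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift below 1 means the two items co-occur less often than they would if purchased independently, so a pair can meet the support threshold without being a useful rule.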
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
🛒 Retail POS Dataset for Market Basket Analysis
📌 Dataset Overview
This dataset is a synthetically generated retail Point-of-Sale (POS) dataset designed for Market Basket Analysis (MBA), Association Rule Mining, and Sales Pattern Identification. It simulates transactions in a supermarket/retail environment, where each order (basket) contains multiple items across different product categories.
The dataset is ideal for applying Apriori, FP-Growth, and ECLAT algorithms to uncover:
📊 Dataset Size
📦 Categories
The dataset includes items from 12 realistic retail categories:
📑 Column Description
| Column Name | Description |
|---|---|
| order_id | Unique ID for each order (basket) |
| user_id | Unique ID for customer |
| order_date | Date of the order |
| time | Time of the transaction (HH:MM:SS) |
| order_hour_of_day | Hour of purchase (6–22) |
| product_name | Purchased item name |
| quantity | Units of the product bought |
| price | Price of the product (in local currency) |
| category | Product category |
| product_id | Unique ID for product |
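Because the table stores one row per purchased product, the rows usually need to be grouped by `order_id` before frequent itemset mining. The sketch below shows one way to do that with the standard library. The column names `order_id` and `product_name` come from the table above, while the helper name `baskets_from_rows` and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def baskets_from_rows(rows):
    """Group long-format rows into one set of product names per order."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row["order_id"]].add(row["product_name"])
    return dict(baskets)


# Hypothetical rows standing in for the real CSV file.
sample_csv = """order_id,product_name,quantity
1001,Milk,2
1001,Bread,1
1002,Milk,1
"""

baskets = baskets_from_rows(csv.DictReader(io.StringIO(sample_csv)))
for order_id, items in baskets.items():
    print(order_id, sorted(items))
```

Using a set per order drops the quantity column, which matches the presence-only view that Apriori, FP-Growth, and ECLAT work with.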
🔍 Possible Use Cases
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Additional file 1. Table 1. Disease generalization in ICD-10 codes. Table 2. Comparison among OMOP ID, Concept Code and the generalization ICD-10 codes. Table 3. The rules verified by the literature. Table 4. The rules discovered by the FP-growth algorithm.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Introduction: Motilin (MLN) is a gastrointestinal (GI) hormone produced in the upper small intestine. Its most well understood function is to participate in Phase III of the migrating myoelectric complex component of GI motility. Changes in MLN availability are associated with GI diseases such as gastroesophageal reflux disease and functional dyspepsia. Furthermore, herbal medicines have been used for several years to treat various GI disorders. We systematically reviewed clinical and animal studies on how herbal medicine modulates MLN and thereby produces therapeutic effects, focusing mainly on GI function.

Methods: We searched the PubMed, Embase, Cochrane, and Web of Science databases to collect all articles published until 30 July 2023 that reported the measurement of plasma MLN levels in human randomized controlled trials and in vivo herbal medicine studies. The collected characteristics of the articles included the name and ingredients of the herbal medicine, physiological and symptomatic changes after administering the herbal medicine, changes in plasma MLN levels, key findings, and mechanisms of action. The frequency patterns (FPs) of botanical drug use and their correlations were investigated using an FP growth algorithm.

Results: Nine clinical studies with 1,308 participants and 20 animal studies were included in the final analyses. Herbal medicines in clinical studies have shown therapeutic effects in association with increased levels of MLN, including GI motility regulation and symptom improvement. Herbal medicines have also shown anti-stress, anti-tumor, and anti-inflammatory effects in vivo. Various biochemical markers may correlate with MLN levels. Markers that may have a positive correlation with plasma MLN levels included ghrelin, acetylcholine, and secretin, whereas those that may have a negative correlation included triglycerides and prostaglandin E2. Markers such as gastrin and somatostatin did not show any correlation with plasma MLN levels. Based on the FP growth algorithm, Glycyrrhiza uralensis and Paeonia japonica were the most frequently used species.

Conclusion: Herbal medicine may have therapeutic effects mainly on GI symptoms with involvement of MLN regulation and may be considered as an alternative option for the treatment of GI diseases. Further studies with more solid evidence are needed to confirm the efficacy and mechanisms of action of herbal medicines.

Systematic Review Registration: https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=443244, identifier CRD42023443244.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Background: Concerns about the role of chronically used medications in the clinical outcomes of the coronavirus disease 2019 (COVID-19) have remarkable potential for the breakdown of non-communicable diseases (NCDs) management by imposing ambivalence toward medication continuation. This study aimed to investigate the association of single or combinations of chronically used medications in NCDs with clinical outcomes of COVID-19.

Methods: This retrospective study was conducted on the intersection of two databases, the Iranian COVID-19 registry and Iran Health Insurance Organization. The primary outcome was death due to COVID-19 hospitalization, and secondary outcomes included length of hospital stay, Intensive Care Unit (ICU) admission, and ventilation therapy. The Anatomical Therapeutic Chemical (ATC) classification system was used for medication grouping. The frequent pattern growth algorithm was utilized to investigate the effect of medication combinations on COVID-19 outcomes.

Findings: Aspirin with chronic use in 10.8% of hospitalized COVID-19 patients was the most frequently used medication, followed by Atorvastatin (9.2%) and Losartan (8.0%). Adrenergics in combination with corticosteroids inhalants (ACIs) with an odds ratio (OR) of 0.79 (95% confidence interval: 0.68–0.92) were the most associated medications with less chance of ventilation therapy. Oxicams had the least OR of 0.80 (0.73–0.87) for COVID-19 death, followed by ACIs [0.85 (0.77–0.95)] and Biguanides [0.86 (0.82–0.91)].

Conclusion: The chronic use of most frequently used medications for NCDs management was not associated with poor COVID-19 outcomes. Thus, when indicated, physicians need to discourage patients with NCDs from discontinuing their medications for fear of possible adverse effects on COVID-19 prognosis.