26 datasets found
  1. Market Basket Analysis

    • kaggle.com
    zip
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Available download formats: zip (23875170 bytes)
    Dataset updated
    Dec 9, 2021
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that occurred over a period of time. The retailer will use the results to grow the business and to offer customers itemset suggestions, which should increase customer engagement, improve customer experience, and help identify customer behavior. I will approach this problem with association rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association rules are most useful when you want to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule {mouse mat} => {computer mouse}: support = P(mouse & mat) = 8/100 = 0.08; confidence = support / P(mouse mat) = 0.08/0.09 ≈ 0.89; lift = confidence / P(computer mouse) = 0.89/0.10 = 8.9. This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
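
    The same numbers can be reproduced directly from the counts; below is a minimal sketch in R using the hypothetical counts from the example, reading the rule as {mouse mat} => {computer mouse}:

    ```r
    # Hypothetical counts from the example above
    n_total <- 100  # customers
    n_mouse <- 10   # bought a computer mouse
    n_mat   <- 9    # bought a mouse mat
    n_both  <- 8    # bought both

    support    <- n_both / n_total                  # 0.08
    confidence <- n_both / n_mat                    # 8/9  ~ 0.89
    lift       <- confidence / (n_mouse / n_total)  # ~ 8.9
    c(support = support, confidence = confidence, lift = lift)
    ```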

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of Rule

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of Rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    [Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png]

    Libraries in R

    First, we need to load the required libraries. Each library is described briefly below, and a short loading sketch follows the list.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
    • tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.
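
    A minimal loading sketch in R, assuming all of the packages above are already installed (install.packages() would be needed otherwise):

    ```r
    # Load the packages described above (assumes they are installed).
    # plyr is loaded before the tidyverse so that dplyr functions are not masked.
    library(readxl)     # read Excel files
    library(plyr)       # split-apply-combine helpers
    library(tidyverse)  # loads ggplot2, dplyr and friends
    library(knitr)      # dynamic report generation
    library(magrittr)   # the %>% forward-pipe operator
    library(arules)     # transactions, frequent itemsets, association rules
    library(arulesViz)  # visualizations for rules and itemsets
    ```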

    [Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png]

    Data Pre-processing

    Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

    [Images: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png and https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png]

    Next, we clean the data frame by removing rows with missing values (a short loading-and-cleaning sketch follows the screenshot).

    [Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png]
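
    A minimal sketch of these two steps (not necessarily the author's exact code), assuming the file sits in the working directory and has the columns listed in the dataset description:

    ```r
    library(readxl)

    # Read the Excel file into a data frame (assumes it is in the working directory)
    retaildata <- read_excel("Assignment-1_Data.xlsx")
    str(retaildata)

    # Drop rows containing missing values (e.g. blank item names or customer IDs)
    retaildata <- retaildata[complete.cases(retaildata), ]
    summary(retaildata)
    ```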

    To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
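
    The conversion step is cut off above; a common way to do it with arules is sketched below, grouping the item names of each invoice into one basket and coercing the result to the 'transactions' class, followed by an illustrative apriori() run. The support/confidence thresholds are placeholders, not the author's values:

    ```r
    library(arules)
    library(arulesViz)

    # One basket per invoice: group item names by BillNo and drop duplicates
    baskets <- lapply(split(retaildata$Itemname, retaildata$BillNo), unique)
    trans   <- as(baskets, "transactions")
    summary(trans)

    # Mine association rules (thresholds are illustrative and need tuning)
    rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.8, minlen = 2))

    # Explore, filter and visualize the generated rules
    inspect(head(sort(rules, by = "lift"), 10))
    plot(rules, method = "graph")
    ```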

  2. Characteristics of cyclist crashes in Italy using latent class analysis and...

    • plos.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Gabriele Prati; Marco De Angelis; Víctor Marín Puchades; Federico Fraboni; Luca Pietrantoni (2023). Characteristics of cyclist crashes in Italy using latent class analysis and association rule mining [Dataset]. http://doi.org/10.1371/journal.pone.0171484
    Explore at:
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Gabriele Prati; Marco De Angelis; Víctor Marín Puchades; Federico Fraboni; Luca Pietrantoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Italy
    Description

    The factors associated with severity of the bicycle crashes may differ across different bicycle crash patterns. Therefore, it is important to identify distinct bicycle crash patterns with homogeneous attributes. The current study aimed at identifying subgroups of bicycle crashes in Italy and analyzing separately the different bicycle crash types. The present study focused on bicycle crashes that occurred in Italy during the period between 2011 and 2013. We analyzed categorical indicators corresponding to the characteristics of infrastructure (road type, road signage, and location type), road user (i.e., opponent vehicle and cyclist’s maneuver, type of collision, age and gender of the cyclist), vehicle (type of opponent vehicle), and the environmental and time period variables (time of the day, day of the week, season, pavement condition, and weather). To identify homogenous subgroups of bicycle crashes, we used latent class analysis. Using latent class analysis, the bicycle crash data set was segmented into 19 classes, which represents 19 different bicycle crash types. Logistic regression analysis was used to identify the association between class membership and severity of the bicycle crashes. Finally, association rules were conducted for each of the latent classes to uncover the factors associated with an increased likelihood of severity. Association rules highlighted different crash characteristics associated with an increased likelihood of severity for each of the 19 bicycle crash types.

  3. Association Rules and Semantic Relatedness (ARSR) - Evaluation Data

    • data.niaid.nih.gov
    Updated Jun 20, 2022
    Cite
    Leon Hutans (2022). Association Rules and Semantic Relatedness (ARSR) - Evaluation Data [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6655888
    Explore at:
    Dataset updated
    Jun 20, 2022
    Dataset provided by
    Friedrich Schiller University Jena
    Authors
    Leon Hutans
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    ARSR stands for "Association Rules and Semantic Relatedness", a recommender system that combines association rule mining and semantic relatedness to create recommendations for form fields.

    ARSR was evaluated with focus on recommending values for fields of metadata forms. The evaluation was performed with two sets of association rules (R1 and R2). R1 was primarily used to assess the recommendation performance. R2 served as an alternative and led to similar results.

    This dataset contains the raw data ("collected", JSON format) generated during the evaluation. It includes, among other things, the input (populated fields and target field), the expected output, and the top 40 generated recommendations for each test combination.

    In addition, the processed data ("analysed", CSV format) is provided. It is based on the raw data and is used to calculate metrics and plot the results.

    The source code for ARSR and the evaluation is available on GitLab (gitlab.com).

    Note: To perform the analysis on the raw data (collected) yourself, make sure to follow the setup instruction in the evaluation repository first. More specifically, install dependencies and unzip "data/cedar/test-instances.zip" so that URI mappings can be accessed. Then follow the instructions provided by the README file in the raw data archives (e.g. collected-R1.zip).

  4. DatasetofDatasets (DoD)

    • kaggle.com
    zip
    Updated Aug 12, 2024
    Cite
    Konstantinos Malliaridis (2024). DatasetofDatasets (DoD) [Dataset]. https://www.kaggle.com/terminalgr/datasetofdatasets-124-1242024
    Explore at:
    Available download formats: zip (7583 bytes)
    Dataset updated
    Aug 12, 2024
    Authors
    Konstantinos Malliaridis
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    This dataset is essentially the metadata from 164 datasets. Each of its lines concerns a dataset from which 22 features have been extracted, which are used to classify each dataset into one of the categories 0-Unmanaged, 2-INV, 3-SI, 4-NOA (DatasetType).

    This dataset consists of 164 rows. Each row is the metadata of another dataset. The target column is datasetType, which has 4 values indicating the dataset type. These are:

    2 - Invoice detail (INV): This dataset type is a special report (usually called a Detailed Sales Statement) produced by company accounting or Enterprise Resource Planning (ERP) software. Using an INV-type dataset directly for ARM is extremely convenient for users, as it relieves them of the tedious work of transforming data into another, more suitable form. INV-type data input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier creating a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items utilized for data mining (e.g., Product Code, Product Name, Product ID).

    3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.

    4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.

    0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent itemsets, for instance datasets characterized predominantly by numerical features with arbitrary values, or datasets that involve fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Such datasets are not considered in the present treatise and are termed (0) Unmanaged in the sequel.

    Determining the dataset type is crucial for ARM, and the current dataset is used to classify a dataset's type using a supervised machine learning model.

    There is also another dataset type, named 1 - Market Basket List (MBL), where each dataset row is a transaction. A transaction involves a variable number of items. However, due to this characteristic, these datasets can be easily categorized using procedural programming, and DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining": https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44

  5. Data Sheet 1_From data to decision: empirical application of machine...

    • figshare.com
    csv
    Updated Oct 17, 2025
    Cite
    Jing Zhao; Yuan Jiang; Xiuhua Zhang; Qing Ye; Qiang Zhao; Xianhua Wu; Linshen Wang (2025). Data Sheet 1_From data to decision: empirical application of machine learning in public space planning along the Grand Canal, Shandong Province, China.csv [Dataset]. http://doi.org/10.3389/fbuil.2025.1643104.s001
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    Frontiers
    Authors
    Jing Zhao; Yuan Jiang; Xiuhua Zhang; Qing Ye; Qiang Zhao; Xianhua Wu; Linshen Wang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Jinghang Waterway, China, Shandong
    Description

    Introduction: In the process of urbanization, public space plays an increasingly important role in improving the livability and sustainability of cities. However, effectively understanding the preferences of different groups for public space and conducting reasonable planning integrated with environmental and infrastructure elements remains a challenge in urban planning. This is because traditional planning methods often fail to fully capture the detailed behavior of residents. Therefore, the purpose of this study was to explore the empirical application of machine learning technology to public space planning along the Grand Canal in Shandong Province (China), analyze the behavior patterns and preferences of residents regarding different public spaces, and thereby provide support for data-driven public space planning.

    Methods: Based on survey data from 1008 respondents across 4 cities, this study employed machine learning methods such as K-means clustering, association rule mining, and correlation analysis to investigate the relationships between visitor behavior and the environmental characteristics of public spaces.

    Results: The application of these methods yielded several important results. Cluster analysis identified three distinct groups: young and middle-aged local residents with a preference for accessibility, middle-aged and elderly groups enthusiastic about cultural engagement, and diverse transportation users with mixed spatial preferences. Additionally, association rule mining uncovered strong correlations between location types and perceived attributes such as cleanliness and aesthetics. Moreover, correlation analysis indicated statistically significant positive correlations between aesthetics and cleanliness, as well as between safety and cleanliness.

    Discussion: This research offers valuable data-driven insights for public space planning and management. It demonstrates that machine learning can effectively identify and quantify key factors influencing public space use. As a result, it provides more accurate policy recommendations for urban planners and ensures that public space planning better meets the needs of different groups. For urban planners, the findings can guide the optimization of facility layouts for specific groups. For instance, adding canal cultural display nodes for cultural engagement groups and improving barrier-free facilities for groups with high accessibility needs, thereby enhancing the inclusiveness and utilization efficiency of public spaces.

  6. Data Analysis for the Systematic Literature Review of DL4SE

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Jul 19, 2024
    Cite
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk (2024). Data Analysis for the Systematic Literature Review of DL4SE [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4768586
    Explore at:
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    College of William and Mary
    Washington and Lee University
    Authors
    Cody Watson; Nathan Cooper; David Nader; Kevin Moran; Denys Poshyvanyk
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data Analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose Data Analysis (Xia & Gong; 2014). An Exploratory Data Analysis (EDA) comprises a set of statistical and data mining procedures to describe data. We ran EDA to provide statistical facts and inform conclusions. The mined facts allow attaining arguments that would influence the Systematic Literature Review of DL4SE.

    The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.

    Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD (Fayyad, et al; 1996). The KDD process extracts knowledge from a DL4SE structured database. This structured database was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD involves five stages:

    Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organize the data into 35 features or attributes that you find in the repository. In fact, we manually engineered features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.

    Preprocessing. The preprocessing applied was transforming the features into the correct type (nominal), removing outliers (papers that do not belong to the DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalize the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”. “Other Metrics” refers to unconventional metrics found during the extraction. Similarly, the same normalization was applied to other features like “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the paper by the data mining tasks or methods.

    Transformation. In this stage, we did not apply any data transformation method except for the clustering analysis. We performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibit the maximum reduction in variance; in other words, it helped us to identify the number of clusters to be used when tuning the explainable models.

    Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented to uncover hidden relationships in the extracted features (Correlations and Association Rules) and to categorize the DL4SE papers for a better segmentation of the state-of-the-art (Clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.

    Interpretation/Evaluation. We used Knowledge Discovery to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).

    We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.

    Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.

    Support = the number of occurrences in which the statement is true, divided by the total number of statements. Confidence = the support of the statement divided by the number of occurrences of the premise.

  7. Values of AIC, BIC, aBIC, and CAIC as a Function of the Number of Latent...

    • plos.figshare.com
    xls
    Updated Jun 1, 2023
    Cite
    Gabriele Prati; Marco De Angelis; Víctor Marín Puchades; Federico Fraboni; Luca Pietrantoni (2023). Values of AIC, BIC, aBIC, and CAIC as a Function of the Number of Latent Classes. [Dataset]. http://doi.org/10.1371/journal.pone.0171484.t002
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Gabriele Prati; Marco De Angelis; Víctor Marín Puchades; Federico Fraboni; Luca Pietrantoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Values of AIC, BIC, aBIC, and CAIC as a Function of the Number of Latent Classes.

  8. Retail Market Basket Transactions Dataset

    • kaggle.com
    Updated Aug 25, 2025
    Cite
    Wasiq Ali (2025). Retail Market Basket Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/retail-market-basket-transactions-dataset
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Wasiq Ali
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Overview

    The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
    It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.

    • File Name: Market_Basket_Optimisation.csv
    • Format: CSV (Comma-Separated Values)
    • Structure: Each row corresponds to one shopping basket. Each column in that row contains an item purchased in that basket.
    • Nature of Data: Transactional, categorical, sparse.
    • Primary Use Case: Discovering frequent itemsets and association rules to understand shopping patterns, product affinities, and to build recommender systems.

    Detailed Information

    📊 Dataset Composition

    • Transactions: 7,501 (each row = one basket).
    • Items (unique): Around 120 distinct products (e.g., bread, mineral water, chocolate, etc.).
    • Columns per row: Up to 20 possible items (not fixed; some rows have fewer, some more).
    • Data Type: Purely categorical (no numerical or continuous features).
    • Missing Values: Present in the form of empty cells (since not every basket has all 20 columns).
    • Duplicates: Some baskets may appear more than once — this is acceptable in transactional data as multiple customers can buy the same set of items.

    🛒 Nature of Transactions

    • Basket Definition: Each row captures items bought together during a single visit to the store.
    • Variability: Basket size varies from 1 to 20 items. Some customers buy only one product, while others purchase a full set of groceries.
    • Sparsity: Since there are ~120 unique items but only a handful appear in each basket, the dataset is sparse. Most entries in the one-hot encoded representation are zeros. (A loading sketch in R follows this list.)
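
    As a concrete illustration of the sparse representation described above, here is a minimal R sketch using the arules package to load the basket-format CSV into transactions and mine a few rules. The file name is the one given above; the thresholds are illustrative and would need tuning:

    ```r
    library(arules)

    # Each CSV row is one basket of comma-separated items
    trans <- read.transactions("Market_Basket_Optimisation.csv",
                               format = "basket", sep = ",",
                               rm.duplicates = TRUE)

    summary(trans)                        # transactions, items, density (sparsity)
    itemFrequencyPlot(trans, topN = 10)   # most frequently bought items

    # Frequent itemsets / rules with Apriori (illustrative thresholds)
    rules <- apriori(trans, parameter = list(supp = 0.003, conf = 0.2, minlen = 2))
    inspect(head(sort(rules, by = "lift"), 5))
    ```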

    🔎 Examples of Data

    Example transaction rows (simplified):

    Item 1          Item 2          Item 3     Item 4    ...
    Bread           Butter          Jam
    Mineral water   Chocolate       Eggs       Milk
    Spaghetti       Tomato sauce    Parmesan

    Here, empty cells mean no item was purchased in that slot.

    📈 Applications of This Dataset

    This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

    1. Association Rule Mining (Apriori, FP-Growth):

      • Discover rules like {Bread, Butter} ⇒ {Jam} with high support and confidence.
      • Identify cross-selling opportunities.
    2. Product Affinity Analysis:

      • Understand which items tend to be purchased together.
      • Helps with store layout decisions (placing related items near each other).
    3. Recommendation Engines:

      • Build systems that suggest "You may also like" products.
      • Example: If a customer buys pasta and tomato sauce, recommend cheese.
    4. Marketing Campaigns:

      • Bundle promotions and discounts on frequently co-purchased products.
      • Personalized offers based on buying history.
    5. Inventory Management:

      • Anticipate demand for certain product combinations.
      • Prevent stockouts of items that drive the purchase of others.

    📌 Key Insights Potentially Hidden in the Dataset

    • Popular Items: Some items (like mineral water, eggs, spaghetti) occur far more frequently than others.
    • Product Pairs: Frequent pairs and triplets (e.g., pasta + sauce + cheese) reflect natural meal-prep combinations.
    • Basket Size Distribution: Most customers buy fewer than 5 items, but a small fraction buy 10+ items, showing long-tail behavior.
    • Seasonality (if extended with timestamps): Certain items might show peaks in demand during weekends or holidays (though timestamps are not included in this dataset).

    📂 Dataset Limitations

    1. No Customer Identifiers:

      • We cannot track repeated purchases by the same customer.
      • Analysis is limited to basket-level insights.
    2. No Timestamps:

      • No temporal analysis (trends over time, seasonality) is possible.
    3. No Quantities or Prices:

      • We only know whether an item was purchased, not how many units or its cost.
    4. Sparse & Noisy:

      • Many baskets are small (1–2 items), which may produce weak or trivial rules.

    🔮 Potential Extensions

    • Synthetic Timestamps: Assign simulated timestamps to study temporal buying patterns.
    • Add Customer IDs: If merged with external data, one can perform personalized recommendations.
    • Price Data: Adding cost allows for profit-driven association rules (not just frequency-based).
    • Deep Learning Models: Sequence models (RNNs, Transformers) could be applied if temporal ordering of items is introduced.

    ...

  9. Association rules for the clinical feature dataset—Cardiovascular risk.

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 5, 2023
    Cite
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga (2023). Association rules for the clinical feature dataset—Cardiovascular risk. [Dataset]. http://doi.org/10.1371/journal.pone.0240269.t005
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Association rules for the clinical feature dataset—Cardiovascular risk.

  10. Association rules for the clinical feature and gene expression datasets in...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 4, 2023
    Cite
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga (2023). Association rules for the clinical feature and gene expression datasets in conjunction. [Dataset]. http://doi.org/10.1371/journal.pone.0240269.t007
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Association rules for the clinical feature and gene expression datasets in conjunction.

  11. Data from: Spotify Playlists

    • zenodo.org
    csv
    Updated Jan 24, 2025
    Cite
    Francesco Cambria; Francesco Cambria (2025). Spotify Playlists [Dataset]. http://doi.org/10.5281/zenodo.14728731
    Explore at:
    Available download formats: csv
    Dataset updated
    Jan 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Francesco Cambria; Francesco Cambria
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset was constructed based on the data found in Kaggle from Spotify.

    The files here reported can be used to build a property graph in Neo4J:

    • song.csv - contains all the data for the Song nodes.
    • artist.csv - contains the data for the Artist nodes.
    • playlist.csv - contains the data for the Playlist nodes.
    • user.csv - contains the data for the User nodes (those creating Playlists).
    • genre.csv - contains the data for the Genre nodes (a category for the Artists).
    • type.csv - contains the data for the Type nodes (a category for the Playlists).
    • sing.csv - contains the data for the SING relationship from Artist to Song nodes.
    • created.csv - contains the data for the CREATED relationship from User to Playlist nodes.
    • in.csv - contains the data for the IN relationship from Song to Playlist nodes.
    • of_type.csv - contains the data for the OFTYPE relationship from Playlist to Type nodes.
    • labelled.csv - contains the data for the LABELLED relationship from Artist to Genre nodes.

    This data was used as test dataset in the paper "MINE GRAPH RULE: A New GQL Operator for Mining Association Rules in Property Graph Databases".

  12. phishing-domain-detection

    • kaggle.com
    zip
    Updated Jan 30, 2022
    Cite
    raviraj kukade (2022). phishing-domain-detection [Dataset]. https://www.kaggle.com/ravirajkukade/phishingdomaindetection
    Explore at:
    Available download formats: zip (4850922 bytes)
    Dataset updated
    Jan 30, 2022
    Authors
    raviraj kukade
    Description

    Abstract:

    Phishing stands for a fraudulent process where an attacker tries to obtain sensitive information from the victim. Usually, these kinds of attacks are carried out via emails, text messages, or websites. Phishing websites, which are nowadays on a considerable rise, have the same look as legitimate sites. However, their backend is designed to collect sensitive information that is entered by the victim. Discovering and detecting phishing websites has recently also gained the machine learning community's attention, which has built models and performed classification of phishing websites. This paper presents two dataset variations that consist of 58,645 and 88,647 websites labeled as legitimate or phishing and allow researchers to train their classification models, build phishing detection systems, and mine association rules.

    Specifications Table:

    1. Subject- Computer Science

    2. Specific subject area- Artificial Intelligence

    3. Type of data- csv file

    4. How data were acquired- Data were acquired through the publicly available lists of phishing and legitimate websites, from which the features presented in the datasets were extracted.

    5. Data format Raw: csv file

    6. Parameters for data collection- For the phishing websites, only the ones from the PhishTank registry were included, which are verified from multiple users. For the legitimate websites, we included the websites from publicly available, community labeled and organized lists [1], and from the Alexa top ranking websites.

    7. Description of data collection- The data is comprised of features extracted from collections of website addresses. The data in total consists of 111 features, 96 of which are extracted from the website address itself, while the remaining 15 features were extracted using custom Python code.

    8. Data source location- Worldwide

    Value of the Data

    • These data consist of a collection of legitimate, as well as phishing website instances. Each website is represented by the set of features that denote whether the website is legitimate or not. Data can serve as input for the machine learning process.

    • Machine learning and data mining researchers can benefit from these datasets, as can computer security researchers and practitioners. Computer security enthusiasts can find these datasets interesting for building firewalls, intelligent ad blockers, and malware detection systems.

    • This dataset can help researchers and practitioners easily build classification models in systems preventing phishing attacks since the presented datasets feature the attributes which can be easily extracted.

    • Finally, the provided datasets could also be used as a performance benchmark for developing state-of-the-art machine learning methods for the task of phishing websites classification.

    Data Description

    The presented dataset was collected and prepared for the purpose of building and evaluating various classification methods for the task of detecting phishing websites based on the uniform resource locator (URL) properties, URL resolving metrics, and external services. A minimal classification sketch is given after the attribute list below. The attributes of the prepared dataset can be divided into six groups:

    • attributes based on the whole URL properties presented in Table 1,
    • attributes based on the domain properties presented in Table 2,
    • attributes based on the URL directory properties presented in Table 3,
    • attributes based on the URL file properties presented in Table 4,
    • attributes based on the URL parameter properties presented in Table 5, and
    • attributes based on the URL resolving data and external metrics presented in Table 6.
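
    As a quick illustration of the intended use (training a classification model on the extracted features), here is a minimal R sketch. The file name ("dataset_small.csv") and the label column name ("phishing") are assumptions for illustration only, not taken from the dataset documentation:

    ```r
    # Minimal classification sketch; file and label column names are assumptions
    library(randomForest)

    urls <- read.csv("dataset_small.csv")       # hypothetical file name
    urls$phishing <- as.factor(urls$phishing)   # hypothetical label column (0 = legitimate, 1 = phishing)

    set.seed(42)
    train_idx <- sample(nrow(urls), floor(0.8 * nrow(urls)))
    model <- randomForest(phishing ~ ., data = urls[train_idx, ])

    preds <- predict(model, urls[-train_idx, ])
    mean(preds == urls$phishing[-train_idx])    # hold-out accuracy
    ```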

  13. Table caption Nulla mi mi, venenatis sed ipsum varius, volutpat euismod...

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 2, 2023
    Cite
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga (2023). Table caption Nulla mi mi, venenatis sed ipsum varius, volutpat euismod diam. [Dataset]. http://doi.org/10.1371/journal.pone.0240269.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Table caption Nulla mi mi, venenatis sed ipsum varius, volutpat euismod diam.

  14. Retail Transaction Dataset

    • kaggle.com
    zip
    Updated May 30, 2025
    Cite
    bkcoban (2025). Retail Transaction Dataset [Dataset]. https://www.kaggle.com/datasets/bkcoban/retail-transaction-dataset/discussion
    Explore at:
    Available download formats: zip (538441 bytes)
    Dataset updated
    May 30, 2025
    Authors
    bkcoban
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains 30,000 unique retail transactions, each representing a customer's shopping basket in a simulated grocery store environment. The data was generated with realistic product combinations and purchase patterns, suitable for association rule mining, recommendation systems and market basket analysis.

    Each row corresponds to a single transaction, listing:

    • A unique transaction ID
    • A customer ID
    • The full list of products bought in that transaction
    • The time of the transaction

    The dataset includes products across various categories such as beverages, snacks, dairy, household items, fruits, vegetables and frozen foods.

    This data is entirely synthetic and does not contain any real user information.

  15. Summary of Logistic Regression Analysis Predicting the Severity of Bicycle...

    • figshare.com
    xls
    Updated Jun 18, 2023
    Cite
    Gabriele Prati; Marco De Angelis; Víctor Marín Puchades; Federico Fraboni; Luca Pietrantoni (2023). Summary of Logistic Regression Analysis Predicting the Severity of Bicycle Crashes. [Dataset]. http://doi.org/10.1371/journal.pone.0171484.t003
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 18, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Gabriele Prati; Marco De Angelis; Víctor Marín Puchades; Federico Fraboni; Luca Pietrantoni
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Summary of Logistic Regression Analysis Predicting the Severity of Bicycle Crashes.

  16. Data from: Investigation of vehicle-bicycle hit-and-run crashes

    • tandf.figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Siying Zhu (2023). Investigation of vehicle-bicycle hit-and-run crashes [Dataset]. http://doi.org/10.6084/m9.figshare.12918699.v1
    Explore at:
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Siying Zhu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although cycling has been promoted around the world as a sustainable mode of transportation, bicyclists are among the most vulnerable road users, subject to high injury and fatality risk. The vehicle-bicycle hit-and-run crashes degrade the morality and result in delays of medical services provided to victims. This paper aims to determine the significant factors that contribute to drivers’ hit-and-run behavior in vehicle-bicycle crashes and their interdependency based on a 6-year crash dataset of Victoria, Australia, with an integrated data mining framework. The framework integrates imbalanced data resampling, near zero variance predictor elimination, learning-based feature extraction with random forest algorithm, and association rule mining. The crash-related features that play the most important role in classifying hit-and-run crashes are identified as collision type, gender, age group, vehicle passengers involved, severity of accident, speed zone, road classification, divided road, region and peak hour. The result of the paper can further provide implications on the policies and counter-measures in order to prevent bicyclists from vehicle-bicycle hit-and-run collisions.

  17. Association rules for the clinical feature dataset—Diabetic dyslipidemia.

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 14, 2023
    Cite
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga (2023). Association rules for the clinical feature dataset—Diabetic dyslipidemia. [Dataset]. http://doi.org/10.1371/journal.pone.0240269.t006
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 14, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Association rules for the clinical feature dataset—Diabetic dyslipidemia.

  18. Description of the clinical features of the 143 subjects enrolled in this...

    • plos.figshare.com
    • datasetcatalog.nlm.nih.gov
    xls
    Updated Jun 5, 2023
    Cite
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga (2023). Description of the clinical features of the 143 subjects enrolled in this study (%ts stands for % of tooth sites). [Dataset]. http://doi.org/10.1371/journal.pone.0240269.t001
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 5, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Rosana Veroneze; Sâmia Cruz Tfaile Corbi; Bárbara Roque da Silva; Cristiane de S. Rocha; Cláudia V. Maurer-Morelli; Silvana Regina Perez Orrico; Joni A. Cirelli; Fernando J. Von Zuben; Raquel Mantuaneli Scarel-Caminaga
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Description of the clinical features of the 143 subjects enrolled in this study (%ts stands for % of tooth sites).

  19. Strong association rules for different emergency brake types.

    • plos.figshare.com
    xls
    Updated Sep 18, 2025
    Cite
    Yaqin He; Jun Xia; Jiayin Dai (2025). Strong association rules for different emergency brake types. [Dataset]. http://doi.org/10.1371/journal.pone.0320834.t008
    Explore at:
    Available download formats: xls
    Dataset updated
    Sep 18, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Yaqin He; Jun Xia; Jiayin Dai
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Strong association rules for different emergency brake types.

  20. Data from: Machine Learning Analysis of Cytotoxicity Determinants in...

    • acs.figshare.com
    xlsx
    Updated Oct 23, 2025
    Cite
    Elif Yildirim; Irem Cakir; Nazar Ileri-Ercan (2025). Machine Learning Analysis of Cytotoxicity Determinants in Nanoparticle-Based Rheumatoid Arthritis Therapies [Dataset]. http://doi.org/10.1021/acs.molpharmaceut.5c00661.s002
    Explore at:
    Available download formats: xlsx
    Dataset updated
    Oct 23, 2025
    Dataset provided by
    ACS Publications
    Authors
    Elif Yildirim; Irem Cakir; Nazar Ileri-Ercan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Nanoparticle-based therapies have gained attention in recent years as promising treatments for rheumatoid arthritis (RA), due to the potential offered for targeted delivery, controlled drug release, and improved biocompatibility. A deep understanding of the factors that drive cytotoxicity is crucial for safer and more effective nanomedicine formulations. To systematically analyze the determinants of cytotoxicity reported in the literature, we constructed a data set comprising 2,060 instances from 56 publications. Each instance was described by 23 features covering nanoparticle characteristics, cellular environment factors, and assay conditions potentially associated with cytotoxicity. Machine learning (ML) approaches were incorporated to gain deeper insight into key cytotoxicity drivers. We combined Boruta for feature selection, Random Forest (RF) for cytotoxicity prediction and feature importance evaluation, and Association Rule Mining (ARM) for rule-based, hidden pattern discovery. Boruta feature selection results identified the drug and nanoparticle concentration, core–shell material, and cell type as major determinants of cytotoxicity. The RF model demonstrated a strong predictive performance, further confirming the significance of these features. Moreover, ARM revealed high-confidence association rules linking specific conditions, such as high drug concentrations and poly(aspartic acid)-based systems, to cytotoxic outcomes. This structured machine learning framework provides a foundation for optimizing nanoparticle formulations that balance therapeutic efficacy with cellular safety in RA therapy.
