Market basket analysis with the Apriori algorithm
The retailer wants to target customers with suggestions on the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that occurred over a period of time. The retailer will use the results to grow the business and offer customers itemset suggestions, allowing us to increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem with Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.
Association rules are most useful when you want to discover associations between different objects in a set, i.e., to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.89

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
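These three metrics can also be checked numerically. The sketch below recomputes them in Python from the toy counts above (note that the confidence of A => B is the support divided by P(A), the antecedent):

```python
# Toy numbers from the example: 100 customers, 10 bought a computer
# mouse, 9 bought a mouse mat, 8 bought both.
n_customers = 100
n_mouse = 10
n_mat = 9
n_both = 8

# Rule: {computer mouse} => {mouse mat}
support = n_both / n_customers                   # P(mouse & mat)
confidence = support / (n_mouse / n_customers)   # support / P(antecedent)
lift = confidence / (n_mat / n_customers)        # confidence / P(consequent)

print(f"support    = {support:.2f}")     # 0.08
print(f"confidence = {confidence:.2f}")  # 0.80
print(f"lift       = {lift:.2f}")        # 8.89
```

A lift well above 1, as here, indicates that the two items co-occur far more often than chance would predict.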
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each is described briefly below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
Next, we will clean our data frame by removing missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
To apply association rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice will be in ...
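The walkthrough itself uses R, but the idea of this conversion step can be sketched in Python as well: group the long-format rows by invoice so that each basket contains everything bought together. The invoice IDs and item names below are made up for illustration only:

```python
from collections import defaultdict

# Hypothetical long-format rows: (invoice_id, item), as they might
# look after cleaning the retail data frame.
rows = [
    ("536365", "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("536365", "WHITE METAL LANTERN"),
    ("536366", "HAND WARMER UNION JACK"),
    ("536366", "HAND WARMER RED POLKA DOT"),
    ("536365", "CREAM CUPID HEARTS COAT HANGER"),
]

# Group items by invoice so each basket holds one transaction.
baskets = defaultdict(set)
for invoice, item in rows:
    baskets[invoice].add(item)

transactions = [sorted(items) for items in baskets.values()]
for t in transactions:
    print(t)
```

In R, the equivalent step is handled by coercing the grouped data to the `transactions` class from the arules package before calling `apriori`.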
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The factors associated with the severity of bicycle crashes may differ across different bicycle crash patterns. Therefore, it is important to identify distinct bicycle crash patterns with homogeneous attributes. The current study aimed at identifying subgroups of bicycle crashes in Italy and analyzing the different bicycle crash types separately. The present study focused on bicycle crashes that occurred in Italy between 2011 and 2013. We analyzed categorical indicators corresponding to the characteristics of the infrastructure (road type, road signage, and location type), road user (i.e., opponent vehicle and cyclist’s maneuver, type of collision, age and gender of the cyclist), vehicle (type of opponent vehicle), and the environmental and time period variables (time of day, day of the week, season, pavement condition, and weather). To identify homogeneous subgroups of bicycle crashes, we used latent class analysis, which segmented the bicycle crash data set into 19 classes representing 19 different bicycle crash types. Logistic regression analysis was used to identify the association between class membership and severity of the bicycle crashes. Finally, association rules were mined for each of the latent classes to uncover the factors associated with an increased likelihood of severity. The association rules highlighted different crash characteristics associated with an increased likelihood of severity for each of the 19 bicycle crash types.
License: MIT License, https://opensource.org/licenses/MIT
ARSR stands for "Association Rules and Semantic Relatedness", a recommender system that combines association rule mining and semantic relatedness to create recommendations for form fields.
ARSR was evaluated with focus on recommending values for fields of metadata forms. The evaluation was performed with two sets of association rules (R1 and R2). R1 was primarily used to assess the recommendation performance. R2 served as an alternative and led to similar results.
This dataset contains the raw data ("collected", JSON format) generated during the evaluation. It includes, among other things, the input (populated fields and target field), the expected output, and the top 40 generated recommendations for each test combination.
In addition, the processed data ("analysed", CSV format) is provided. It is based on the raw data and is used to calculate metrics and plot the results.
The source code for ARSR and the evaluation is available on GitLab (gitlab.com).
Note: To perform the analysis on the raw data (collected) yourself, make sure to follow the setup instructions in the evaluation repository first. More specifically, install the dependencies and unzip "data/cedar/test-instances.zip" so that URI mappings can be accessed. Then follow the instructions provided by the README file in the raw data archives (e.g. collected-R1.zip).
License: MIT License, https://opensource.org/licenses/MIT
This dataset is essentially the metadata of 164 datasets. Each of its rows concerns a dataset from which 22 features have been extracted; these features are used to classify each dataset into one of the categories 0-Unmanaged, 2-INV, 3-SI, or 4-NOA.
The dataset consists of 164 rows, each being the metadata of another dataset. The target column is DatasetType, which has 4 values indicating the dataset type. These are:
2 - Invoice detail (INV): This dataset type is a special report (usually called a Detailed Sales Statement) produced by company accounting or Enterprise Resource Planning (ERP) software. Using an INV-type dataset directly for ARM is extremely convenient for users, as it relieves them of the tedious work of transforming the data into another, more suitable form. INV-type data input typically includes a header, but only two of its attributes are essential for data mining. The first attribute serves as the grouping identifier that creates a unique transaction (e.g., Invoice ID, Order Number), while the second attribute contains the items used for data mining (e.g., Product Code, Product Name, Product ID).
3 - Sparse Item (SI): This type is widespread in Association Rules Mining (ARM). It involves a header and a fixed number of columns. Each item corresponds to a column. Each row represents a transaction. The typical cell stores a value, usually one character in length, that depicts the presence or absence of the item in the corresponding transaction. The absence character must be identified or declared before the Association Rules Mining process takes place.
4 - Nominal Attributes (NOA): This type is commonly used in Machine Learning and Data Mining tasks. It involves a fixed number of columns. Each column registers nominal/categorical values. The presence of a header row is optional. However, in cases where no header is provided, there is a risk of extracting incorrect rules if similar values exist in different attributes of the dataset. The potential values for each attribute can vary.
0 - Unmanaged for ARM: On the other hand, not all datasets are suitable for extracting useful association rules or frequent item sets; for instance, datasets characterized predominantly by numerical features with arbitrary values, or datasets that involve fragmented or mixed data types. For such datasets, ARM processing becomes possible only by introducing a data discretization stage, which in turn introduces information loss. Datasets of this kind are not considered in the present treatise and are termed (0) Unmanaged in the sequel.
Determining the dataset type is a crucial step for ARM, and the current dataset is used to classify a dataset's type using a supervised machine learning model.
There is also another dataset type, named 1 - Market Basket List (MBL), where each dataset row is a transaction involving a variable number of items. Due to this characteristic, these datasets can be easily categorized using procedural programming, and DoD does not include instances of them. For more details about dataset types, please refer to the article "WebApriori: a web application for association rules mining": https://link.springer.com/chapter/10.1007/978-3-030-49663-0_44
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Introduction: In the process of urbanization, public space plays an increasingly important role in improving the livability and sustainability of cities. However, effectively understanding the preferences of different groups for public space, and conducting reasonable planning integrated with environmental and infrastructure elements, remains a challenge in urban planning, because traditional planning methods often fail to fully capture the detailed behavior of residents. Therefore, the purpose of this study was to explore the empirical application of machine learning technology to public space planning along the Grand Canal in Shandong Province (China), analyze the behavior patterns and preferences of residents regarding different public spaces, and thereby provide support for data-driven public space planning.

Methods: Based on survey data from 1008 respondents across 4 cities, this study employed machine learning methods such as K-means clustering, association rule mining, and correlation analysis to investigate the relationships between visitor behavior and the environmental characteristics of public spaces.

Results: The application of these methods yielded several important results. Cluster analysis identified three distinct groups: young and middle-aged local residents with a preference for accessibility, middle-aged and elderly groups enthusiastic about cultural engagement, and diverse transportation users with mixed spatial preferences. Additionally, association rule mining uncovered strong correlations between location types and perceived attributes such as cleanliness and aesthetics. Moreover, correlation analysis indicated statistically significant positive correlations between aesthetics and cleanliness, as well as between safety and cleanliness.

Discussion: This research offers valuable data-driven insights for public space planning and management. It demonstrates that machine learning can effectively identify and quantify key factors influencing public space use. As a result, it provides more accurate policy recommendations for urban planners and ensures that public space planning better meets the needs of different groups. For urban planners, the findings can guide the optimization of facility layouts for specific groups, for instance by adding canal cultural display nodes for cultural engagement groups and improving barrier-free facilities for groups with high accessibility needs, thereby enhancing the inclusiveness and utilization efficiency of public spaces.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Data analysis is the process that supports decision-making and informs arguments in empirical studies. Descriptive statistics, Exploratory Data Analysis (EDA), and Confirmatory Data Analysis (CDA) are the approaches that compose data analysis (Xia & Gong, 2014). An EDA comprises a set of statistical and data mining procedures to describe data. We ran an EDA to provide statistical facts and inform conclusions. The mined facts support arguments that influence the Systematic Literature Review of DL4SE.
The Systematic Literature Review of DL4SE requires formal statistical modeling to refine the answers for the proposed research questions and formulate new hypotheses to be addressed in the future. Hence, we introduce DL4SE-DA, a set of statistical processes and data mining pipelines that uncover hidden relationships among Deep Learning reported literature in Software Engineering. Such hidden relationships are collected and analyzed to illustrate the state-of-the-art of DL techniques employed in the software engineering context.
Our DL4SE-DA is a simplified version of the classical Knowledge Discovery in Databases, or KDD, process (Fayyad et al., 1996). The KDD process extracts knowledge from a structured DL4SE database, which was the product of multiple iterations of data gathering and collection from the inspected literature. The KDD process involves five stages:
Selection. This stage was led by the taxonomy process explained in section xx of the paper. After collecting all the papers and creating the taxonomies, we organized the data into the 35 features or attributes that you find in the repository. In fact, we manually engineered the features from the DL4SE papers. Some of the features are venue, year published, type of paper, metrics, data-scale, type of tuning, learning algorithm, SE data, and so on.
Preprocessing. The preprocessing applied consisted of transforming the features into the correct type (nominal), removing outliers (papers that do not belong to DL4SE), and re-inspecting the papers to extract missing information produced by the normalization process. For instance, we normalized the feature “metrics” into “MRR”, “ROC or AUC”, “BLEU Score”, “Accuracy”, “Precision”, “Recall”, “F1 Measure”, and “Other Metrics”, where “Other Metrics” refers to unconventional metrics found during the extraction. The same normalization was applied to other features such as “SE Data” and “Reproducibility Types”. This separation into more detailed classes contributes to a better understanding and classification of the papers by the data mining tasks or methods.
Transformation. In this stage, we did not apply any data transformation method, except for the clustering analysis: we performed a Principal Component Analysis (PCA) to reduce the 35 features to 2 components for visualization purposes. Furthermore, PCA also allowed us to identify the number of clusters that exhibits the maximum reduction in variance; in other words, it helped us identify the number of clusters to use when tuning the explainable models.
Data Mining. In this stage, we used three distinct data mining tasks: Correlation Analysis, Association Rule Learning, and Clustering. We decided that the goal of the KDD process should be oriented toward uncovering hidden relationships among the extracted features (correlations and association rules) and categorizing the DL4SE papers for a better segmentation of the state-of-the-art (clustering). A clear explanation is provided in the subsection “Data Mining Tasks for the SLR of DL4SE”.
Interpretation/Evaluation. We used the knowledge discovery process to automatically find patterns in our papers that resemble “actionable knowledge”. This actionable knowledge was generated by conducting a reasoning process on the data mining outcomes. This reasoning process produces an argument support analysis (see this link).
We used RapidMiner as our software tool to conduct the data analysis. The procedures and pipelines were published in our repository.
Overview of the most meaningful Association Rules. Rectangles are both Premises and Conclusions. An arrow connecting a Premise with a Conclusion implies that given some premise, the conclusion is associated. E.g., Given that an author used Supervised Learning, we can conclude that their approach is irreproducible with a certain Support and Confidence.
Support = the number of occurrences where the statement is true, divided by the total number of statements.
Confidence = the support of the statement, divided by the number of occurrences of the premise.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Values of AIC, BIC, aBIC, and CAIC as a Function of the Number of Latent Classes.
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.
File: Market_Basket_Optimisation.csv. Example transaction rows (simplified):
| Item 1 | Item 2 | Item 3 | Item 4 | ... |
|---|---|---|---|---|
| Bread | Butter | Jam | | |
| Mineral water | Chocolate | Eggs | Milk | |
| Spaghetti | Tomato sauce | Parmesan | | |
Here, empty cells mean no item was purchased in that slot.
This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

- Association Rule Mining (Apriori, FP-Growth): e.g., {Bread, Butter} ⇒ {Jam} with high support and confidence
- Product Affinity Analysis
- Recommendation Engines
- Marketing Campaigns
- Inventory Management

Characteristics to keep in mind:

- No customer identifiers
- No timestamps
- No quantities or prices
- Sparse and noisy
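To make a rule such as {Bread, Butter} ⇒ {Jam} concrete, the sketch below computes its support and confidence over just the three simplified example rows shown earlier (a real analysis would of course run over the full transaction file):

```python
# The three simplified example baskets from the table above.
transactions = [
    {"Bread", "Butter", "Jam"},
    {"Mineral water", "Chocolate", "Eggs", "Milk"},
    {"Spaghetti", "Tomato sauce", "Parmesan"},
]

antecedent = {"Bread", "Butter"}
consequent = {"Jam"}

n = len(transactions)
# Count baskets containing the antecedent, and those containing all items.
n_ante = sum(antecedent <= t for t in transactions)
n_both = sum((antecedent | consequent) <= t for t in transactions)

support = n_both / n          # fraction of all baskets with every item
confidence = n_both / n_ante  # among baskets that contain the antecedent

print(f"support    = {support:.3f}")     # 0.333
print(f"confidence = {confidence:.3f}")  # 1.000
```

Libraries such as mlxtend (Python) or arules (R) perform the same counting over every candidate itemset, pruned by a minimum support threshold.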
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Association rules for the clinical feature dataset—Cardiovascular risk.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Association rules for the clinical feature and gene expression datasets in conjunction.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This dataset was constructed based on Spotify data found on Kaggle.
The files reported here can be used to build a property graph in Neo4j:
This data was used as test dataset in the paper "MINE GRAPH RULE: A New GQL Operator for Mining Association Rules in Property Graph Databases".
Abstract:
Phishing is a fraudulent process in which an attacker tries to obtain sensitive information from the victim. Usually, these kinds of attacks are carried out via emails, text messages, or websites. Phishing websites, which are nowadays on a considerable rise, have the same look as legitimate sites; however, their backend is designed to collect the sensitive information inputted by the victim. Discovering and detecting phishing websites has recently gained the machine learning community's attention as well, with models built to classify phishing websites. This paper presents two dataset variations that consist of 58,645 and 88,647 websites labeled as legitimate or phishing, allowing researchers to train classification models, build phishing detection systems, and mine association rules.
Subject: Computer Science
Specific subject area: Artificial Intelligence
Type of data: csv file
How data were acquired: Data were acquired through the publicly available lists of phishing and legitimate websites, from which the features presented in the datasets were extracted.
Data format: Raw (csv file)
Parameters for data collection: For the phishing websites, only the ones from the PhishTank registry were included, which are verified by multiple users. For the legitimate websites, we included the websites from publicly available, community-labeled and organized lists [1], and from the Alexa top-ranking websites.
Description of data collection: The data comprise features extracted from collections of website addresses. The data consist of 111 features in total, 96 of which are extracted from the website address itself, while the remaining 15 features were extracted using custom Python code.
Data source location: Worldwide
• These data consist of a collection of legitimate, as well as phishing website instances. Each website is represented by the set of features that denote whether the website is legitimate or not. Data can serve as input for the machine learning process.
• Machine learning and data mining researchers can benefit from these datasets, as can computer security researchers and practitioners. Computer security enthusiasts can find these datasets interesting for building firewalls, intelligent ad blockers, and malware detection systems.
• This dataset can help researchers and practitioners easily build classification models in systems preventing phishing attacks since the presented datasets feature the attributes which can be easily extracted.
• Finally, the provided datasets could also be used as a performance benchmark for developing state-of-the-art machine learning methods for the task of phishing websites classification.
Data Description
The presented dataset was collected and prepared for the purpose of building and evaluating various classification methods for the task of detecting phishing websites based on the uniform resource locator (URL) properties, URL resolving metrics, and external services. The attributes of the prepared dataset can be divided into six groups:
• attributes based on the whole URL properties, presented in Table 1,
• attributes based on the domain properties, presented in Table 2,
• attributes based on the URL directory properties, presented in Table 3,
• attributes based on the URL file properties, presented in Table 4,
• attributes based on the URL parameter properties, presented in Table 5, and
• attributes based on the URL resolving data and external metrics, presented in Table 6.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License: CC0 1.0 (Public Domain Dedication), https://creativecommons.org/publicdomain/zero/1.0/
This dataset contains 30,000 unique retail transactions, each representing a customer's shopping basket in a simulated grocery store environment. The data was generated with realistic product combinations and purchase patterns, suitable for association rule mining, recommendation systems and market basket analysis.
Each row corresponds to a single transaction, listing:
The dataset includes products across various categories such as beverages, snacks, dairy, household items, fruits, vegetables and frozen foods.
This data is entirely synthetic and does not contain any real user information.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Summary of Logistic Regression Analysis Predicting the Severity of Bicycle Crashes.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Although cycling has been promoted around the world as a sustainable mode of transportation, bicyclists are among the most vulnerable road users, subject to high injury and fatality risk. Vehicle-bicycle hit-and-run crashes raise moral concerns and result in delays to the medical services provided to victims. This paper aims to determine the significant factors that contribute to drivers’ hit-and-run behavior in vehicle-bicycle crashes, and their interdependency, based on a 6-year crash dataset from Victoria, Australia, using an integrated data mining framework. The framework integrates imbalanced data resampling, near-zero-variance predictor elimination, learning-based feature extraction with the random forest algorithm, and association rule mining. The crash-related features that play the most important role in classifying hit-and-run crashes are identified as collision type, gender, age group, vehicle passengers involved, severity of accident, speed zone, road classification, divided road, region, and peak hour. The results of the paper can further inform policies and countermeasures to protect bicyclists from vehicle-bicycle hit-and-run collisions.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Association rules for the clinical feature dataset—Diabetic dyslipidemia.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Description of the clinical features of the 143 subjects enrolled in this study (%ts stands for % of tooth sites).
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Strong association rules for different emergency brake types.
License: Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
Nanoparticle-based therapies have gained attention in recent years as promising treatments for rheumatoid arthritis (RA), due to the potential offered for targeted delivery, controlled drug release, and improved biocompatibility. A deep understanding of the factors that drive cytotoxicity is crucial for safer and more effective nanomedicine formulations. To systematically analyze the determinants of cytotoxicity reported in the literature, we constructed a data set comprising 2,060 instances from 56 publications. Each instance was described by 23 features covering nanoparticle characteristics, cellular environment factors, and assay conditions potentially associated with cytotoxicity. Machine learning (ML) approaches were incorporated to gain deeper insight into key cytotoxicity drivers. We combined Boruta for feature selection, Random Forest (RF) for cytotoxicity prediction and feature importance evaluation, and Association Rule Mining (ARM) for rule-based, hidden pattern discovery. Boruta feature selection results identified the drug and nanoparticle concentration, core–shell material, and cell type as major determinants of cytotoxicity. The RF model demonstrated a strong predictive performance, further confirming the significance of these features. Moreover, ARM revealed high-confidence association rules linking specific conditions, such as high drug concentrations and poly(aspartic acid)-based systems, to cytotoxic outcomes. This structured machine learning framework provides a foundation for optimizing nanoparticle formulations that balance therapeutic efficacy with cellular safety in RA therapy.