7 datasets found

Characteristics that Favor Freq-Itemset Algorithms
kaggle.com
Updated Oct 24, 2020
Cite
Jeff Heaton (2020). Characteristics that Favor Freq-Itemset Algorithms [Dataset]. https://www.kaggle.com/jeffheaton/characteristics-that-favor-freqitemset-algorithms
Dataset updated
Oct 24, 2020
Dataset provided by
Kaggle
Authors
Jeff Heaton
License
http://www.gnu.org/licenses/lgpl-3.0.html
Description
Source Paper

This dataset is from my paper:

Heaton, J. (2016, March). Comparing dataset characteristics that favor the Apriori, Eclat or FP-Growth frequent itemset mining algorithms. In SoutheastCon 2016 (pp. 1-7). IEEE.

Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.

Files Generated

We generated two datasets that allow us to adjust two independent variables to create a total of 20 different transaction sets. We also provide the Python script that generated this data in a notebook. This Python script accepts the following parameters to specify the transaction set to produce:

Transaction/Basket count: 5 million default

Number of items: 50,000 default

Number of frequent sets: 100 default

Max transaction/basket size: independent variable, 5-100 range

Frequent set density: independent variable, 0.1 to 0.8 range

Files contained in this dataset reside in two folders:
* freq-items-pct - We vary the frequent set density in these transaction sets.
* freq-items-tsz - We change the maximum number of items per basket in these transaction sets.

Although the script lets you vary the basket count, the number of frequent sets, and the number of items, these remain fixed at the values listed above for this paper. We determined that the basket count had only a small positive correlation.

File Content

The following listing shows the type of data generated for this research. Here we present an example file created with ten baskets out of 100 items, two frequent itemsets, a maximum basket size of 10, and a density of 0.5.

I36 I94
I71 I13 I91 I89 I34
F6 F5 F3 F4
I86
I39 I16 I49 I62 I31 I54 I91
I22 I31
I70 I85 I78 I63
F4 F3 F1 F6 F0 I69 I44
I82 I50 I9 I31 I57 I20
F4 F3 F1 F6 F0 I87

As you can see from the above file, the items are prefixed with either “I” or “F”. The “F” prefix indicates that the line contains one of the intentional frequent itemsets. Items with the “I” prefix are not part of an intentional frequent itemset. Of course, “I”-prefixed items might still form frequent itemsets by chance, as they are uniformly sampled from the pool of items to fill out the baskets. Each basket is given a randomly chosen size, up to the maximum basket size. The frequent itemset density specifies the probability that a line contains one of the intentional frequent itemsets. Because we used a density of 0.5, approximately half of the lines above include one of the two intentional frequent itemsets. A frequent itemset line may have additional random “I”-prefixed items added so that the line reaches its randomly chosen length. If the selected frequent itemset causes the line to exceed its randomly chosen length, no truncation occurs. The intentional frequent itemsets are all constrained to be no larger than the maximum basket size.
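The generation procedure described above can be sketched in a few lines of Python. This is only an illustration of the description, not the author's notebook script: the function name `generate_baskets`, the minimum basket size of 1, the `seed` argument, and sampling filler items without repetition within a basket are all assumptions made here.

```python
import random

def generate_baskets(num_baskets, num_items, frequent_sets,
                     max_basket_size, density, seed=0):
    """Hypothetical sketch of the generator described in the text."""
    rng = random.Random(seed)
    baskets = []
    for _ in range(num_baskets):
        # Each basket gets a random target size up to the maximum
        # (a minimum of 1 is an assumption).
        target_len = rng.randint(1, max_basket_size)
        basket = []
        # With probability `density`, start the line with one of the
        # intentional frequent itemsets; it is never truncated.
        if rng.random() < density:
            basket.extend(rng.choice(frequent_sets))
        # Fill the remainder with uniformly sampled "I" items
        # (sampling without repetition is an assumption).
        missing = max(0, target_len - len(basket))
        basket.extend(f"I{i}" for i in rng.sample(range(num_items), missing))
        baskets.append(basket)
    return baskets

# The two frequent itemsets visible in the example listing.
frequent_sets = [["F6", "F5", "F3", "F4"], ["F4", "F3", "F1", "F6", "F0"]]
baskets = generate_baskets(num_baskets=10, num_items=100,
                           frequent_sets=frequent_sets,
                           max_basket_size=10, density=0.5)
for basket in baskets:
    print(" ".join(basket))
```

The parameters in the call mirror the example file: ten baskets, 100 items, two frequent itemsets, a maximum basket size of 10, and a density of 0.5.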
Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Available download formats: zip (23875170 bytes)
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions on the itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's data; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rules are most often used to find associations between different objects in a set, that is, to find frequent patterns in a transaction database. They can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a Computer Mouse, 9 bought a Mat for Mouse, and 8 bought both. For the rule "bought Computer Mouse => bought Mat for Mouse":
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Computer Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mat for Mouse) = 0.80/0.09 = 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
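To make the arithmetic concrete, here is a minimal Python sketch using the standard definitions for a rule A => B: support is P(A and B), confidence is P(A and B) / P(A), and lift is confidence / P(B). The helper name `rule_metrics` is hypothetical; the counts are the ones from the example.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for the rule antecedent => consequent."""
    support = n_both / n_total                   # P(A and B)
    confidence = n_both / n_antecedent           # P(B | A)
    lift = confidence / (n_consequent / n_total) # P(B | A) / P(B)
    return support, confidence, lift

# 100 customers: 10 bought a mouse, 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.08 confidence=0.80 lift=8.89
```

A lift well above 1 suggests the two items are bought together far more often than they would be if the purchases were independent.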

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.


Libraries in R

First, we need to load the required libraries. Each one is described briefly below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.


Next, we clean the data frame by removing missing values.


To apply association rule mining, we need to convert the data frame into transaction data so that all items that are bought together in one invoice will be in ...
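The source performs this conversion in R; as a language-neutral illustration, the same idea (dropping rows with missing values, then grouping item names by invoice) can be sketched in plain Python. The column names `BillNo` and `Itemname` come from the dataset description above; the rows and the helper name `to_transactions` are invented for this example.

```python
from collections import defaultdict

# Made-up rows for illustration; only the column names come from the dataset.
rows = [
    {"BillNo": "100001", "Itemname": "computer mouse"},
    {"BillNo": "100001", "Itemname": "mouse mat"},
    {"BillNo": "100002", "Itemname": "computer mouse"},
    {"BillNo": "100002", "Itemname": None},            # missing value
    {"BillNo": "100003", "Itemname": "mouse mat"},
    {"BillNo": "100001", "Itemname": "computer mouse"}, # repeated item
]

def to_transactions(rows):
    """Group item names by invoice number, skipping incomplete rows."""
    baskets = defaultdict(set)
    for row in rows:
        if row["BillNo"] is None or row["Itemname"] is None:
            continue  # drop rows with missing values
        baskets[row["BillNo"]].add(row["Itemname"])
    # One sorted item list per invoice, i.e. one transaction per bill.
    return {bill: sorted(items) for bill, items in baskets.items()}

transactions = to_transactions(rows)
for bill, items in transactions.items():
    print(bill, items)
```

Using a set per invoice means an item that appears several times on one bill counts once, which matches the usual presence/absence view of a transaction.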
Accuracy and AUC value of ML algorithms using three hyper parameter tuning...
plos.figshare.com
xls
Updated Jan 24, 2025
Cite
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie (2025). Accuracy and AUC value of ML algorithms using three hyper parameter tuning techniques. [Dataset]. http://doi.org/10.1371/journal.pone.0316452.t003
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316452.t003
Dataset updated
Jan 24, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Accuracy and AUC value of ML algorithms using three hyper parameter tuning techniques.
Book123
kaggle.com
zip
Updated Nov 23, 2020
Cite
gaurav9712 (2020). Book123 [Dataset]. https://www.kaggle.com/gaurav9712/book123
Available download formats: zip (4913 bytes)
Dataset updated
Nov 23, 2020
Authors
gaurav9712
Description
Prepare rules for all the data sets:
1) Try different values of support and confidence. Observe the change in the number of rules for different support and confidence values.
2) Change the minimum length in the apriori algorithm.
3) Visualize the obtained rules using different plots.
Market basket analysis
kaggle.com
zip
Updated Feb 17, 2024
Cite
Boopathi M (2024). Market basket analysis [Dataset]. https://www.kaggle.com/datasets/boopathi09945/market-basket-analysis
Available download formats: zip (72860 bytes)
Dataset updated
Feb 17, 2024
Authors
Boopathi M
Description
Market basket analysis with Python as we uncover hidden patterns and relationships within transactional data. Discover how algorithms like Apriori can reveal valuable insights into customer behavior, product associations, and purchasing trends. Explore the power of data-driven decision-making in retail, marketing, and beyond, as we navigate through the fascinating realm of market basket analysis.
Socio-demographic characteristics among adolescent girls in Ethiopia, 2016...
figshare.com
datasetcatalog.nlm.nih.gov
xls
Updated Jan 24, 2025
Cite
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie (2025). Socio-demographic characteristics among adolescent girls in Ethiopia, 2016 EDHS. [Dataset]. http://doi.org/10.1371/journal.pone.0316452.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0316452.t001
Dataset updated
Jan 24, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Alemu Birara Zemariam; Biruk Beletew Abate; Addis Wondmagegn Alamaw; Eyob shitie Lake; Gizachew Yilak; Mulat Ayele; Befkad Derese Tilahun; Habtamu Setegn Ngusie
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Ethiopia
Description
Socio-demographic characteristics among adolescent girls in Ethiopia, 2016 EDHS.
market_basket_optimization
kaggle.com
zip
Updated Feb 11, 2023
Cite
Rupak Roy/ Bob (2023). market_basket_optimization [Dataset]. https://www.kaggle.com/rupakroy/market-basket-optimization
Available download formats: zip (47991 bytes)
Dataset updated
Feb 11, 2023
Authors
Rupak Roy/ Bob
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The dataset is specially curated for association rule learning with Apriori and Eclat in Python, to predict shopping behavior.

Apriori is a powerful algorithm for understanding associations among products. Take the example of a supermarket where most people who buy eggs also buy milk and baking soda. The likely reason is that they want to bake a cake for New Year's Eve.

So we can see there is an association between eggs, milk, and baking soda. Knowing this, we can simply put all three items together on the shelf, which should increase our sales.

Let’s perform Apriori with the help of an example.
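As one possible illustration, here is a compact, unoptimized Python sketch of the level-wise Apriori idea on a toy version of the eggs, milk, and baking soda example. The transactions are invented, and the function name `apriori` and its `min_count` parameter (an absolute support count rather than a fraction) are assumptions made for this sketch, not part of the dataset or any library.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets in at least min_count baskets."""
    baskets = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(itemset <= basket for basket in baskets)

    def keep_frequent(candidates):
        counted = {c: count(c) for c in candidates}
        return {c: n for c, n in counted.items() if n >= min_count}

    # Level 1: frequent single items.
    current = keep_frequent({frozenset([i]) for b in baskets for i in b})
    frequent, k = {}, 1
    while current:
        frequent.update(current)
        k += 1
        prev = list(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        current = keep_frequent(candidates)
    return frequent

# Invented toy baskets.
transactions = [
    ["eggs", "milk", "baking soda"],
    ["eggs", "milk", "baking soda"],
    ["eggs", "milk"],
    ["milk", "bread"],
    ["eggs", "bread"],
]
frequent = apriori(transactions, min_count=2)
for itemset, n in sorted(frequent.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum count of 2, the three-item set of eggs, milk, and baking soda survives, while pairs involving bread are pruned because they occur in only one basket each.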

