Market basket analysis with Apriori algorithm
The retailer wants to target customers with suggestions for itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all transactions that happened over a period of time. The retailer will use the results to grow the business: by suggesting itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem with Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.
Association Rule mining is most useful when you want to discover associations between different objects in a set, that is, to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows a retailer to identify relationships between items.
Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.
- Rule: bought computer mouse => bought mouse mat
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support / P(Mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(Mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
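These quantities are simple ratios, so they are easy to verify directly; here is a minimal R sketch with the counts from the toy example above hard-coded:

```r
# Counts from the toy example above
n_total <- 100  # all customers
n_mouse <- 10   # bought a computer mouse
n_mat   <- 9    # bought a mouse mat
n_both  <- 8    # bought both

support    <- n_both / n_total               # P(Mouse & Mat) = 0.08
confidence <- support / (n_mouse / n_total)  # P(Mat | Mouse) = 0.80
lift       <- confidence / (n_mat / n_total) # 0.80 / 0.09 ~ 8.9

c(support = support, confidence = confidence, lift = lift)
```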
Number of Attributes: 7
https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png
First, we need to load the required libraries; each is briefly described below.
https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png
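The screenshot above lists the packages; as a rough sketch, a typical R stack for this kind of analysis looks like the following (the exact set of packages in the original write-up is an assumption):

```r
library(readxl)    # read the .xlsx workbook
library(dplyr)     # data cleaning and manipulation
library(arules)    # association rule mining (Apriori)
library(arulesViz) # plotting the resulting rules
```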
Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.
https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png
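A sketch of the loading step (the exact sheet layout and column names are assumptions):

```r
library(readxl)

# Read the transaction data from the Excel workbook
retail <- read_excel("Assignment-1_Data.xlsx")

head(retail)  # first rows, as in the screenshots above
str(retail)   # column names and types
```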
Next, we clean our data frame by removing missing values.
https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png
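A sketch of the cleaning step, assuming missing values simply need to be dropped:

```r
# Remove rows with missing values (e.g. transactions without an item name)
retail <- na.omit(retail)
sum(is.na(retail))  # should now be 0
```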
To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together on one invoice end up in a single transaction.
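A sketch of this conversion and of the mining step with arules, assuming the invoice column is called BillNo and the item column Itemname (both names are assumptions), and with illustrative thresholds:

```r
library(arules)

# One transaction per invoice: group all items bought on the same bill
trans <- as(split(retail$Itemname, retail$BillNo), "transactions")
summary(trans)

# Mine association rules; the thresholds here are illustrative, not tuned
rules <- apriori(trans,
                 parameter = list(supp = 0.01, conf = 0.5, minlen = 2))
inspect(head(sort(rules, by = "lift"), 5))  # strongest rules by lift
```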
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Summarised information pertaining to (a) the number of samples, (b) the number of generated association rules (total as well as rules that involve 3 or more genera), (c) the unique number of microbial genera involved in the identified association rules, (d) execution time, and (e) the number of rules generated using an alternative rule mining strategy (detailed in discussion section of the manuscript).
License: GNU LGPL 3.0, http://www.gnu.org/licenses/lgpl-3.0.html
This dataset is from my paper:
Heaton, J. (2016, March). Comparing dataset characteristics that favor the Apriori, Eclat or FP-Growth frequent itemset mining algorithms. In SoutheastCon 2016 (pp. 1-7). IEEE.
Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance between these three algorithms, by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.
We generated two datasets that allow us to adjust two independent variables to create a total of 20 different transaction sets. We also provide the Python script that generated this data in a notebook. This Python script accepts parameters specifying the transaction set to produce: the number of baskets, the number of items, the number of frequent itemsets, the maximum basket size, and the frequent itemset density.
Files contained in this dataset reside in two folders:
* freq-items-pct - We vary the frequent set density in these transaction sets.
* freq-items-tsz - We change the maximum number of items per basket in these transaction sets.
While you can vary the basket count, the number of frequent sets, and the number of items in the script, they remain fixed at the values used in the paper. We determined that the basket count had only a small positive correlation with run time.
The following listing shows the type of data generated for this research. Here we present an example file created with ten baskets drawn from 100 items, two frequent itemsets, a maximum basket size of 10, and a density of 0.5.
```
I36 I94
I71 I13 I91 I89 I34
F6 F5 F3 F4
I86
I39 I16 I49 I62 I31 I54 I91
I22 I31
I70 I85 I78 I63
F4 F3 F1 F6 F0 I69 I44
I82 I50 I9 I31 I57 I20
F4 F3 F1 F6 F0 I87
```
As you can see from the above file, the items are prefixed with either “I” or “F.” The “F” prefix indicates that the line contains one of the frequent itemsets. Items with the “I” prefix are not part of an intentional frequent itemset. Of course, “I”-prefixed items might still form frequent itemsets, as they are uniformly sampled from the item universe to fill out nonfrequent itemsets. Each basket is given a random size, up to the maximum basket size. The frequent itemset density specifies the probability of each line containing one of the intentional frequent itemsets. Because we used a density of 0.5, approximately half of the lines above include one of the two intentional frequent itemsets. A frequent itemset line may have additional random “I”-prefixed items added so that the line reaches its randomly chosen length. If the selected frequent itemset causes the generated line to exceed its randomly chosen length, no truncation occurs. The intentional frequent itemsets are all generated to be no larger than the maximum basket size.
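Files in this space-separated basket format can be read directly by R's arules package; a minimal sketch (the file name is a placeholder, and the support threshold is illustrative):

```r
library(arules)

# Load a generated basket file, one transaction per line
trans <- read.transactions("freq-items-example.txt",
                           format = "basket", sep = " ")

# Two of the three algorithms compared in the paper are available in arules
# (FP-Growth is not part of arules itself)
freq_a <- apriori(trans, parameter = list(supp = 0.3,
                                          target = "frequent itemsets"))
freq_e <- eclat(trans, parameter = list(supp = 0.3))
inspect(sort(freq_a, by = "support"))
```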
License: GNU GPL 2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
Association Rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules discovered in transaction data using measures of interestingness.
The dataset has 38,765 rows of purchase orders from grocery stores. These orders can be analysed, and association rules can be generated through Market Basket Analysis with algorithms such as Apriori.
Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets, as long as those itemsets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to derive association rules that highlight general trends in the database; this has applications in domains such as market basket analysis.
Assume there are 100 customers: 10 of them bought milk, 8 bought butter, and 6 bought both.
- Rule: bought milk => bought butter
- support = P(Milk & Butter) = 6/100 = 0.06
- confidence = support / P(Milk) = 0.06/0.10 = 0.60
- lift = confidence / P(Butter) = 0.60/0.08 = 7.5
Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
Support: This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.
Confidence: This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y also appears.
Lift: This says how likely item Y is purchased when item X is purchased while controlling for how popular item Y is.
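In symbols, for a rule X => Y over a set of transactions T, the three measures above are conventionally defined as:

```latex
\operatorname{supp}(X \Rightarrow Y) = \frac{\lvert \{ t \in T : X \cup Y \subseteq t \} \rvert}{\lvert T \rvert}, \qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)}, \qquad
\operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
```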
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Accuracy and AUC values of ML algorithms using three hyperparameter tuning techniques.
License: CC0 1.0 Universal (Public Domain), https://creativecommons.org/publicdomain/zero/1.0/
This dataset is specially curated for Association Rule Learning with **Apriori and Eclat** in Python to predict shopping behavior.
Apriori is one of the most powerful algorithms for understanding associations among products. Take the example of a supermarket where most people who buy eggs also buy milk and baking soda; probably they want to bake a cake for New Year's Eve.
So we can see there is an association between eggs, milk, and baking soda. Knowing this association, we can place all three items together on a shelf, which is very likely to increase sales.
Let’s perform Apriori with the help of an example.
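As a minimal sketch of such an example (shown here with R's arules, which implements both Apriori and Eclat; the toy baskets below are invented for illustration):

```r
library(arules)

# Toy baskets echoing the eggs/milk/baking-soda story above
baskets <- list(
  c("eggs", "milk", "baking soda"),
  c("eggs", "milk", "baking soda", "flour"),
  c("eggs", "milk"),
  c("bread", "butter"),
  c("eggs", "baking soda")
)
trans <- as(baskets, "transactions")

# Illustrative thresholds, not tuned
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.7))
sets  <- eclat(trans, parameter = list(supp = 0.4))
inspect(rules)  # e.g. {milk} => {eggs} with confidence 1.0
```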
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Quantitative data in the database for Example 1.
License: MIT License, https://opensource.org/licenses/MIT
License information was derived automatically
A set of records describing research papers from the various laboratories of the University of Nantes. The articles were extracted using the HAL API. Each record contains the attributes described in the table below:
| Column | Description |
|---|---|
| id | A unique identifier |
| title | The title of the article |
| abstract | The abstract of the article |
| keywords | The set of keywords associated with the article |
| authors | The article's authors |
| venue | A set of categories, e.g. math, computer science, etc. |
| date | The date of publication in ISO 8601 |
| teams | The set of teams associated with the article's authors |
Sample record:
```json
{
"id": 5096000,
"title": "Diffusion-based spectral super-resolution of third octave acoustic sensor data: is privacy at risk ?",
"abstract": "Third octave spectral recording of acoustic sensor data is an effective way of measuring the environment. While there is strong evidence that slow (1s frame, 1 Hz rate) and fast (125ms frame, 8Hz rate) versions lead by-design to unintelligible speech if reconstructed, the advent of high quality reconstruction methods based on diffusion may pose a threat, as those approaches can embed a significant amount of a priori knowledge when learned over extensive speech datasets.<p>This paper aims to assess this risk at three levels of attacks with a growing level of a priori knowledge considered at the learning of the diffusion model, a) none, b) multi-speaker data excluding the target speaker and c) target speaker. Without any prior regarding the speech profile of the speaker (levels a and b), our results suggest a rather low risk as the worderror-rate both for humans and automatic recognition remains higher than 89%.</p>",
"keywords": [
"speech privacy",
"generative audio",
"acoustic sensor networks",
"audio encoding"
],
"authors": [
"Modan Tailleur",
"Chaymae Benaatia",
"Mathieu Lagrange",
"Pierre Aumond",
"Vincent Tourre"
],
"venue": [
"info.info-ai"
],
"date": "2025-09-08 00:00:00+00:00",
"teams": [
"AAU",
"LS2N"
]
}
```
Note that the data is a bit noisy when it comes to the keywords, authors, and venue attributes: some of these pieces of information are not filled in by the authors, so you may find a significant number of articles with empty lists for these attributes.
The dataset can be used for simple multi-class classification, multi-label classification or in graph learning by handling each Article, Keyword, Author and Venue as separate nodes with relations.
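A minimal loading sketch (the file name articles.json and its being a single JSON array are assumptions):

```r
library(jsonlite)

# Parse the records into a data frame; list-valued attributes stay lists
articles <- fromJSON("articles.json", simplifyDataFrame = TRUE)

# Drop the noisy rows mentioned above: empty keyword/author lists
has_meta <- lengths(articles$keywords) > 0 & lengths(articles$authors) > 0
labeled  <- articles[has_meta, ]
table(lengths(labeled$venue))  # how many venue labels each article has
```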
Refer to this notebook for an example : https://www.kaggle.com/code/fellahabdelnour13/articles-venue-classification?scriptVersionId=268587852
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Socio-demographic characteristics among adolescent girls in Ethiopia, 2016 EDHS.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Participant characteristics of the pooled data set*.