80 datasets found

Market Basket Analysis
kaggle.com
zip
Updated Dec 9, 2021
Cite
Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
Explore at:
zip (23875170 bytes)
Dataset updated
Dec 9, 2021
Authors
Aslan Ahmedov
Description
Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow in its industry and to suggest itemsets to customers, which should increase customer engagement, improve customer experience, and help identify customer behavior. I will solve this problem with Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another data item.

Introduction

Association rules are most often used to find associations between different objects in a set, in particular frequent patterns in a transaction database. They can tell you which items customers frequently buy together, and they allow a retailer to identify relationships between items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.

- Rule: bought computer mouse => bought mouse mat
- support = P(Mouse & Mat) = 8/100 = 0.08
- confidence = support/P(Mouse) = 0.08/0.10 = 0.80
- lift = confidence/P(Mat) = 0.80/0.09 = 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
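The arithmetic for a rule X => Y can be checked with a few lines of Python. This is a minimal sketch using the standard definitions (confidence divides the joint support by the antecedent's support, lift divides confidence by the consequent's support). The function name `rule_metrics` is an illustrative assumption and is not taken from the dataset author's notebook.

```python
def rule_metrics(n_total, n_x, n_y, n_both):
    """Return (support, confidence, lift) for the rule X => Y from raw counts."""
    support = n_both / n_total            # P(X and Y)
    confidence = n_both / n_x             # P(Y | X) = support / P(X)
    lift = confidence / (n_y / n_total)   # confidence / P(Y)
    return support, confidence, lift


# Counts from the example: 100 customers, 10 bought a mouse (X),
# 9 bought a mouse mat (Y), 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(round(support, 2), round(confidence, 2), round(lift, 2))  # 0.08 0.8 8.89
```

A lift well above 1, as here, suggests the two items are bought together far more often than they would be if purchases were independent.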

Strategy

Data Import

Data Understanding and Exploration

Transformation of the data – so that it is ready to be consumed by the association rules algorithm

Running association rules

Exploring the rules generated

Filtering the generated rules

Visualization of Rule

Dataset Description

File name: Assignment-1_Data

List name: retaildata

File format: .xlsx

Number of Rows: 522065

Number of Attributes: 7

BillNo: 6-digit number assigned to each transaction. Nominal.

Itemname: Product name. Nominal.

Quantity: The quantities of each product per transaction. Numeric.

Date: The day and time when each transaction was generated. Numeric.

Price: Product price. Numeric.

CustomerID: 5-digit number assigned to each customer. Nominal.

Country: Name of the country where each customer resides. Nominal.


Libraries in R

First, we need to load the required libraries. Each library is briefly described below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).

arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.

tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.

readxl - Read Excel Files in R.

plyr - Tools for Splitting, Applying and Combining Data.

ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

knitr - Dynamic Report generation in R.

magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.

dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.


Data Pre-processing

Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.


Next, we clean the data frame by removing missing values.


To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
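The conversion from one-row-per-item data to one-basket-per-invoice can be sketched as follows. The original workflow uses R; this is only an illustrative Python equivalent, and the sample rows, the helper name `to_baskets`, and the choice to skip rows with missing values are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows in the dataset's long format: (BillNo, Itemname).
rows = [
    ("100001", "bread"),
    ("100001", "butter"),
    ("100002", "milk"),
    ("100002", None),      # a row with a missing item name
    ("100003", "bread"),
]


def to_baskets(rows):
    """Group item names by invoice number, skipping rows with missing values."""
    baskets = defaultdict(set)
    for bill_no, item in rows:
        if bill_no is None or item is None:
            continue  # drop incomplete rows, mirroring the cleaning step
        baskets[bill_no].add(item)
    return dict(baskets)


baskets = to_baskets(rows)
for bill_no, items in sorted(baskets.items()):
    print(bill_no, sorted(items))
```

Each resulting set is one transaction, which is the input shape that association rule algorithms expect.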
Basket Analysis (Association Rule Mining)
kaggle.com
zip
Updated Apr 25, 2023
Cite
vikram amin (2023). Basket Analysis (Association Rule Mining) [Dataset]. https://www.kaggle.com/datasets/vikramamin/basket-analysis-association-rule-mining
Explore at:
zip (345413 bytes)
Dataset updated
Apr 25, 2023
Authors
vikram amin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The basket dataset contains a list of items available for purchase by customers. These items can also be found in sets, for example milk and sugar.

The analysis being done is to ascertain for the retailers which item or sets of items are purchased. Sometimes it so happens that the purchase of an item by the customer leads the customer to purchase another item as well. It is a sort of an association of items. This is called "Association Rule Mining".

It shows which items appear together in a transaction or relation. It’s majorly used by retailers, grocery stores, an online marketplace that has a large transactional database.

We wouldn’t want to calculate all associations between every possible combination of products. Instead, we would want to select only potentially “relevant” rules from the set of all possible rules. Therefore, we use the measures support, confidence and lift to reduce the number of relationships we need to analyze.

Support says how popular an item is, as measured in the proportion of transactions in which an item set appears.

Confidence says how likely item Y is to be purchased when item X is purchased. It is measured by the proportion of transactions with item X in which item Y also appears (Support/Antecedent (LHS)).

Lift says how likely item Y is purchased when item X is purchased while controlling for how popular item Y is. (Confidence/Consequent (RHS))
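The three measures can be computed directly from a list of transactions. The following is a minimal sketch under stated assumptions: the toy transactions and the function names are illustrative and do not come from the dataset.

```python
def support(transactions, itemset):
    """Proportion of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support of (lhs and rhs) divided by the support of the antecedent lhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence divided by the support of the consequent rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical baskets for illustration only.
transactions = [
    {"milk", "sugar"},
    {"milk", "sugar", "bread"},
    {"milk"},
    {"bread"},
]

s = support(transactions, {"milk", "sugar"})
c = confidence(transactions, {"milk"}, {"sugar"})
l = lift(transactions, {"milk"}, {"sugar"})
print(s, round(c, 3), round(l, 3))
```

Filtering candidate rules by minimum thresholds on these values is what reduces the number of relationships that need to be inspected.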
Groceries dataset
kaggle.com
zip
Updated Sep 17, 2020
Cite
Heeral Dedhia (2020). Groceries dataset [Dataset]. https://www.kaggle.com/heeraldedhia/groceries-dataset
Explore at:
zip (263057 bytes)
Dataset updated
Sep 17, 2020
Authors
Heeral Dedhia
License
http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Description
Association Rule Mining

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.

Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.

Details of the dataset

The dataset has 38765 rows of the purchase orders of people from the grocery stores. These orders can be analysed and association rules can be generated using Market Basket Analysis by algorithms like Apriori Algorithm.

Apriori Algorithm

Apriori is an algorithm for frequent itemset mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent itemsets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
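The level-wise search described above can be sketched in plain Python. This is a simplified illustration of the idea (grow frequent itemsets one item at a time and prune candidates that have an infrequent subset), not an optimized implementation; the function name `apriori`, its signature, and the toy baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Keep only candidates whose support reaches the threshold.
        result = {}
        for c in candidates:
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                result[c] = s
        return result

    # Level 1: frequent individual items.
    current = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = frequent(candidates)
        all_frequent.update(current)
        k += 1
    return all_frequent


# Hypothetical baskets for illustration only.
baskets = [{"milk", "butter"}, {"milk", "butter", "bread"}, {"milk"}, {"bread"}]
result = apriori(baskets, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Association rules are then derived from each frequent itemset by splitting it into an antecedent and a consequent and keeping the splits that meet a confidence threshold.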

An example of Association Rules

Assume there are 100 customers: 10 of them bought milk, 8 bought butter, and 6 bought both.

Rule: bought milk => bought butter
support = P(Milk & Butter) = 6/100 = 0.06
confidence = support/P(Milk) = 0.06/0.10 = 0.60
lift = confidence/P(Butter) = 0.60/0.08 = 7.5

Note: this example is extremely small. In practice, a rule needs the support of several hundred transactions, before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.

Some important terms:

Support: This says how popular an itemset is, as measured by the proportion of transactions in which an itemset appears.

Confidence: This says how likely item Y is purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y also appears.

Lift: This says how likely item Y is purchased when item X is purchased while controlling for how popular item Y is.
The collected raw Tara data set.
plos.figshare.com
zip
Updated Aug 22, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Zhibo Chen; Zi-Tong Lu; Xue-Ting Song; Yu-Fan Gao; Jian Xiao (2025). The collected raw Tara data set. [Dataset]. http://doi.org/10.1371/journal.pone.0300490.s002
Explore at:
zip
Unique identifier
https://doi.org/10.1371/journal.pone.0300490.s002
Dataset updated
Aug 22, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Zhibo Chen; Zi-Tong Lu; Xue-Ting Song; Yu-Fan Gao; Jian Xiao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Omics-wide association analysis is a very important tool for medicine and human health studies. However, modern omics data sets often exhibit high dimensionality, a response with unknown distribution, features with unknown distributions, and unknown complex association relationships between the response and its explanatory features. Reliable association analysis results depend on accurate modeling of such data sets. Most existing association analysis methods rely on specific model assumptions and lack effective false discovery rate (FDR) control. To address these limitations, the paper first applies a single index model to omics data. The model is robust in that it allows the relationship between the response variable and a linear combination of covariates to be connected by any unknown monotonic link function, and both the random error and the covariates can follow any unknown distribution. Based on this model, the paper then combines a rank-based approach and a symmetrized data aggregation approach to develop a novel and robust feature selection method that achieves fine-mapping of risk features while controlling the false positive rate of selection. The theoretical results support the proposed method, and the analysis of simulated data shows that the new method performs effectively and robustly in all scenarios. The new method is also used to analyze two real datasets and identifies some risk features not reported in existing findings.
Retail Market Basket Transactions Dataset
kaggle.com
Updated Aug 25, 2025
Cite
Wasiq Ali (2025). Retail Market Basket Transactions Dataset [Dataset]. https://www.kaggle.com/datasets/wasiqaliyasir/retail-market-basket-transactions-dataset
Dataset updated
Aug 25, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Wasiq Ali
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
Overview

The Market_Basket_Optimisation dataset is a classic transactional dataset often used in association rule mining and market basket analysis.
It consists of multiple transactions where each transaction represents the collection of items purchased together by a customer in a single shopping trip.

File Name: Market_Basket_Optimisation.csv

Format: CSV (Comma-Separated Values)

Structure: Each row corresponds to one shopping basket. Each column in that row contains an item purchased in that basket.

Nature of Data: Transactional, categorical, sparse.

Primary Use Case: Discovering frequent itemsets and association rules to understand shopping patterns, product affinities, and to build recommender systems.

Detailed Information

📊 Dataset Composition

Transactions: 7,501 (each row = one basket).

Items (unique): Around 120 distinct products (e.g., bread, mineral water, chocolate, etc.).

Columns per row: Up to 20 possible items (not fixed; most rows have fewer).

Data Type: Purely categorical (no numerical or continuous features).

Missing Values: Present in the form of empty cells (since not every basket has all 20 columns).

Duplicates: Some baskets may appear more than once — this is acceptable in transactional data as multiple customers can buy the same set of items.

🛒 Nature of Transactions

Basket Definition: Each row captures items bought together during a single visit to the store.

Variability: Basket size varies from 1 to 20 items. Some customers buy only one product, while others purchase a full set of groceries.

Sparsity: Since there are ~120 unique items but only a handful appear in each basket, the dataset is sparse. Most entries in the one-hot encoded representation are zeros.
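The sparsity of the one-hot representation can be checked with the standard library alone. This is a minimal sketch: the three sample rows are hypothetical stand-ins in the same ragged layout as Market_Basket_Optimisation.csv, and the helper names `read_baskets` and `one_hot` are assumptions made for the example.

```python
import csv
import io

# Hypothetical sample in the same layout as the CSV: one basket per row,
# a variable number of item cells per row.
sample = "bread,butter,jam\nmineral water,chocolate,eggs,milk\nspaghetti\n"


def read_baskets(handle):
    """Read ragged CSV rows into a list of item sets, ignoring empty cells."""
    baskets = []
    for row in csv.reader(handle):
        items = {cell.strip() for cell in row if cell.strip()}
        if items:
            baskets.append(items)
    return baskets


def one_hot(baskets):
    """Return (sorted item names, 0/1 matrix with one row per basket)."""
    columns = sorted(set().union(*baskets))
    matrix = [[1 if item in b else 0 for item in columns] for b in baskets]
    return columns, matrix


baskets = read_baskets(io.StringIO(sample))
columns, matrix = one_hot(baskets)
zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(matrix) * len(columns))
print(len(columns), "items,", round(sparsity, 2), "of cells are zero")
```

To run this on the real file, pass an open file handle (opened with `newline=""`) to `read_baskets` instead of the in-memory sample.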

🔎 Examples of Data

Example transaction rows (simplified):

| Item 1 | Item 2 | Item 3 | Item 4 | ... |
| Bread | Butter | Jam | | |
| Mineral water | Chocolate | Eggs | Milk | |
| Spaghetti | Tomato sauce | Parmesan | | |

Here, empty cells mean no item was purchased in that slot.

📈 Applications of This Dataset

This dataset is frequently used in data mining, analytics, and recommendation systems. Common applications include:

Association Rule Mining (Apriori, FP-Growth):

Discover rules like {Bread, Butter} ⇒ {Jam} with high support and confidence.

Identify cross-selling opportunities.

Product Affinity Analysis:

Understand which items tend to be purchased together.

Helps with store layout decisions (placing related items near each other).

Recommendation Engines:

Build systems that suggest "You may also like" products.

Example: If a customer buys pasta and tomato sauce, recommend cheese.

Marketing Campaigns:

Bundle promotions and discounts on frequently co-purchased products.

Personalized offers based on buying history.

Inventory Management:

Anticipate demand for certain product combinations.

Prevent stockouts of items that drive the purchase of others.

📌 Key Insights Potentially Hidden in the Dataset

Popular Items: Some items (like mineral water, eggs, spaghetti) occur far more frequently than others.

Product Pairs: Frequent pairs and triplets (e.g., pasta + sauce + cheese) reflect natural meal-prep combinations.

Basket Size Distribution: Most customers buy fewer than 5 items, but a small fraction buy 10+ items, showing long-tail behavior.

Seasonality (if extended with timestamps): Certain items might show peaks in demand during weekends or holidays (though timestamps are not included in this dataset).

📂 Dataset Limitations

No Customer Identifiers:

We cannot track repeated purchases by the same customer.

Analysis is limited to basket-level insights.

No Timestamps:

No temporal analysis (trends over time, seasonality) is possible.

No Quantities or Prices:

We only know whether an item was purchased, not how many units or its cost.

Sparse & Noisy:

Many baskets are small (1–2 items), which may produce weak or trivial rules.

🔮 Potential Extensions

Synthetic Timestamps: Assign simulated timestamps to study temporal buying patterns.

Add Customer IDs: If merged with external data, one can perform personalized recommendations.

Price Data: Adding cost allows for profit-driven association rules (not just frequency-based).

Deep Learning Models: Sequence models (RNNs, Transformers) could be applied if temporal ordering of items is introduced.

...
Analysis, Modeling, and Simulation (AMS) Testbed Development and Evaluation...
catalog.data.gov
data.bts.gov
+3more
Updated Dec 7, 2023
Cite
Federal Highway Administration (2023). Analysis, Modeling, and Simulation (AMS) Testbed Development and Evaluation to Support Dynamic Mobility Applications (DMA) and Active Transportation and Demand Management (ATDM) Programs: Dallas Testbed Analysis Plan [supporting datasets] [Dataset]. https://catalog.data.gov/dataset/analysis-modeling-and-simulation-ams-testbed-development-and-evaluation-to-support-dynamic-d4e77
Explore at:
Dataset updated
Dec 7, 2023
Dataset provided by
Federal Highway Administration (https://highways.dot.gov/)
Description
The datasets in this zip file are in support of Intelligent Transportation Systems Joint Program Office (ITS JPO) report FHWA-JPO-16-385, "Analysis, Modeling, and Simulation (AMS) Testbed Development and Evaluation to Support Dynamic Mobility Applications (DMA) and Active Transportation and Demand Management (ATDM) Programs — Evaluation Report for ATDM Program," https://rosap.ntl.bts.gov/view/dot/32520 and FHWA-JPO-16-373, "Analysis, modeling, and simulation (AMS) testbed development and evaluation to support dynamic mobility applications (DMA) and active transportation and demand management (ATDM) programs : Dallas testbed analysis plan," https://rosap.ntl.bts.gov/view/dot/32106. The files in this zip file are specifically related to the Dallas Testbed. The compressed zip files total 2.2 GB in size. The files have been uploaded as-is; no further documentation was supplied by NTL. All located .docx files were converted to .pdf document files, which are an open, archival format. These pdfs were then added to the zip file alongside the original .docx files. These files can be unzipped using any zip compression/decompression software. This zip file contains files in the following formats: .pdf document files which can be read using any pdf reader; .cvs text files which can be read using any text editor; .txt text files which can be read using any text editor; .docx document files which can be read in Microsoft Word and some other word processing programs; .xlsx spreadsheet files which can be read in Microsoft Excel and some other spreadsheet programs; .dat data files which may be text or multimedia; as well as GIS or mapping files in the following formats: .mxd, .dbf, .prj, .sbn, .shp, .shp.xml, which may be opened in ArcGIS or other GIS software. [software requirements] These files were last accessed in 2017.
Data_Sheet_1_Genome-wide association analysis and admixture mapping in a...
datasetcatalog.nlm.nih.gov
Updated Sep 4, 2024
Cite
Rivero, Joe; Rajabli, Farid; Beecham, Gary W.; McInerney, Katalina F.; Dalgard, Clifton L.; Scott, Kyle; Valladares, Glenies S.; Cuccaro, Michael L.; Akgun, Bilcag; Vance, Jeffery M.; Pericak-Vance, Margaret A.; Bussies, Parker L.; Hamilton-Nelson, Kara L.; Griswold, Anthony J.; Tejada, Sergio; Feliciano-Astacio, Briseida E.; Sanchez, Jose J.; Adams, Larry D. (2024). Data_Sheet_1_Genome-wide association analysis and admixture mapping in a Puerto Rican cohort supports an Alzheimer disease risk locus on chromosome 12.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001301710
Explore at:
Dataset updated
Sep 4, 2024
Authors
Rivero, Joe; Rajabli, Farid; Beecham, Gary W.; McInerney, Katalina F.; Dalgard, Clifton L.; Scott, Kyle; Valladares, Glenies S.; Cuccaro, Michael L.; Akgun, Bilcag; Vance, Jeffery M.; Pericak-Vance, Margaret A.; Bussies, Parker L.; Hamilton-Nelson, Kara L.; Griswold, Anthony J.; Tejada, Sergio; Feliciano-Astacio, Briseida E.; Sanchez, Jose J.; Adams, Larry D.
Description
Introduction: Hispanic/Latino populations are underrepresented in Alzheimer Disease (AD) genetic studies. Puerto Ricans (PR), a three-way admixed (European, African, and Amerindian) population is the second-largest Hispanic group in the continental US. We aimed to conduct a genome-wide association study (GWAS) and comprehensive analyses to identify novel AD susceptibility loci and characterize known AD genetic risk loci in the PR population.

Materials and methods: Our study included Whole Genome Sequencing (WGS) and phenotype data from 648 PR individuals (345 AD, 303 cognitively unimpaired). We used a generalized linear-mixed model adjusting for sex, age, population substructure, and genetic relationship matrix. To infer local ancestry, we merged the dataset with the HGDP/1000G reference panel. Subsequently, we conducted univariate admixture mapping (AM) analysis.

Results: We identified suggestive signals within the SLC38A1 and SCN8A genes on chromosome 12q13. This region overlaps with an area of linkage of AD in previous studies (12q13) in independent data sets further supporting. Univariate African AM analysis identified one suggestive ancestral block (p = 7.2×10−6) located in the same region. The ancestry-aware approach showed that this region has both European and African ancestral backgrounds and both contributing to the risk in this region. We also replicated 11 different known AD loci -including APOE- identified in mostly European studies, which is likely due to the high European background of the PR population.

Conclusion: PR GWAS and AM analysis identified a suggestive AD risk locus on chromosome 12, which includes the SLC38A1 and SCN8A genes. Our findings demonstrate the importance of designing GWAS and ancestry-aware approaches and including underrepresented populations in genetic studies of AD.
Table_3_Applying machine-learning to rapidly analyze large qualitative text...
datasetcatalog.nlm.nih.gov
figshare.com
Updated Oct 31, 2023
+ more versions
Cite
Amlôt, Richard; Bondaronek, Paulina; Towler, Lauren; Papakonstantinou, Trisevgeni; Chadborn, Tim; Ainsworth, Ben; Yardley, Lucy (2023). Table_3_Applying machine-learning to rapidly analyze large qualitative text datasets to inform the COVID-19 pandemic response: comparing human and machine-assisted topic analysis techniques.DOCX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001083700
Explore at:
Dataset updated
Oct 31, 2023
Authors
Amlôt, Richard; Bondaronek, Paulina; Towler, Lauren; Papakonstantinou, Trisevgeni; Chadborn, Tim; Ainsworth, Ben; Yardley, Lucy
Description
Introduction: Machine-assisted topic analysis (MATA) uses artificial intelligence methods to help qualitative researchers analyze large datasets. This is useful for researchers to rapidly update healthcare interventions during changing healthcare contexts, such as a pandemic. We examined the potential to support healthcare interventions by comparing MATA with “human-only” thematic analysis techniques on the same dataset (1,472 user responses from a COVID-19 behavioral intervention).

Methods: In MATA, an unsupervised topic-modeling approach identified latent topics in the text, from which researchers identified broad themes. In human-only codebook analysis, researchers developed an initial codebook based on previous research that was applied to the dataset by the team, who met regularly to discuss and refine the codes. Formal triangulation using a “convergence coding matrix” compared findings between methods, categorizing them as “agreement”, “complementary”, “dissonant”, or “silent”.

Results: Human analysis took much longer than MATA (147.5 vs. 40 h). Both methods identified key themes about what users found helpful and unhelpful. Formal triangulation showed both sets of findings were highly similar. All MATA codes were classified as in agreement or complementary to the human themes. When findings differed slightly, this was due to human researcher interpretations or nuance from human-only analysis.

Discussion: Results produced by MATA were similar to human-only thematic analysis, with substantial time savings. For simple analyses that do not require an in-depth or subtle understanding of the data, MATA is a useful tool that can support qualitative researchers to interpret and analyze large datasets quickly. This approach can support intervention development and implementation, such as enabling rapid optimization during public health emergencies.
Analysis, Modeling, and Simulation (AMS) Testbed Development and Evaluation...
catalog.data.gov
data.virginia.gov
+2more
Updated Dec 7, 2023
Cite
Federal Aviation Administration (2023). Analysis, Modeling, and Simulation (AMS) Testbed Development and Evaluation to Support Dynamic Mobility Applications (DMA) and Active Transportation and Demand Management (ATDM) Programs: calibration Report for Phoenix Testbed [supporting datasets] [Dataset]. https://catalog.data.gov/dataset/analysis-modeling-and-simulation-ams-testbed-development-and-evaluation-to-support-dynamic
Explore at:
Dataset updated
Dec 7, 2023
Dataset provided by
Federal Aviation Administration
Area covered
Phoenix
Description
The datasets in this zip file are in support of FHWA-JPO-16-379, Analysis, Modeling, and Simulation (AMS) Testbed Development and Evaluation to Support Dynamic Mobility Applications (DMA) and Active Transportation and Demand Management (ATDM) Programs - calibration Report for Phoenix Testbed : Final Report. The compressed zip file totals 1.1 GB in size. The zip file has been uploaded as-is; no further documentation was supplied by NTL, except as noted: All located .docx files were converted to .pdf document files, which are an archival format. These .pdfs were then added to the zip file alongside the original .docx files. The initial zip file presented to NTL contained uncompressed datasets and duplicative zip files of the files. In order to make the overall size of this zip file more manageable, duplicative files were deleted. The zip file can be unzipped using any zip compression/decompression software. This zip file contains files in the following formats: .pdf document files which can be read using any pdf reader; .cvs text files which can be read using any text editor; .docx document files which can be read in Microsoft Word and some other word processing programs; .txt text files which can be opened with any text editor; .xlsx spreadsheet files which can be read in Microsoft Excel and some other spreadsheet programs; .cfg computer configuration files; .db database files, which can be opened with many database programs; .rif raster image files, which may have been created by the Corel Painter image editing application, a proprietary software program, although other image programs may open the files [software requirements]. These files were last accessed in 2017.
1.35 Student Support Satisfaction (summary)
catalog.data.gov
data-academy.tempe.gov
+4more
Updated Nov 15, 2025
Cite
City of Tempe (2025). 1.35 Student Support Satisfaction (summary) [Dataset]. https://catalog.data.gov/dataset/1-35-student-support-satisfaction-summary
Explore at:
Dataset updated
Nov 15, 2025
Dataset provided by
City of Tempe
Description
This dataset provides the annual results, by school year, from the student surveys. The survey questions assess satisfaction with overall service for individuals who receive assistance from CARE 7 Youth Support Specialists. Students who receive services from Youth Specialists are given the opportunity to complete a survey regarding their satisfaction with the services provided. A student can complete a survey every time they meet with a Youth Support Specialist. The survey is voluntary.

Data Dictionary

Additional Information
Source: Department generated survey
Contact: Maria Gonzalez
Contact Email: Maria_Gonzalez@tempe.gov
Data Source Type: Excel spreadsheet
Preparation Method: Responses of "Very Satisfied" and "Satisfied" from two school districts are combined and summarized.
Publish Frequency: Annual
Publish Method: Manual
UCDP External Support Dataset
researchdata.se
gimi9.com
Updated Aug 7, 2024
+ more versions
Cite
Stina Högbladh; Therése Pettersson; Lotta Themnér (2024). UCDP External Support Dataset [Dataset]. https://researchdata.se/en/catalogue/dataset/ext0034-1
Explore at:
Dataset updated
Aug 7, 2024
Dataset provided by
Uppsala University
Authors
Stina Högbladh; Therése Pettersson; Lotta Themnér
Time period covered
1975 - 2010
Description
The UCDP, Uppsala Conflict Data Program, contains a large amount of data on organised violence, armed violence, and peacemaking. There is information from 1946 up to today, and the datasets are updated continuously. The data can be downloaded for free and are available in several different versions.

The UCDP External Support Data contains information on external support in intrastate conflicts, 1975-2010. It provides information on the kind of support, the external actor, and the specific year. The data is divided into two separate datasets which are analogous, i.e. they contain identical data structured in a different manner to simplify various types of research such as different types of statistical analyses:

One dataset provides data where the unit of analysis is a warring party-year, providing information on the existence, type, and provider of external support for all warring parties (actors) coded as active in UCDP data, on an annual basis. The dataset contains information for the time-period 1975–2010. It involves 29 variables and 3606 individuals/objects.

One dataset provides data where the unit of analysis is the warring party-supporter-year, i.e. each row in the dataset contains information on the type of support that a warring party receives from a specific external party in a given year, using dummy variables for each category of support. The dataset contains information for the time-period 1975–2010. It involves 30 variables and 6519 individuals/objects.
The preprocessed HNSCC dataset, which contains 2,000 gene expression values,...
plos.figshare.com
zip
Updated Aug 22, 2025
Cite
Zhibo Chen; Zi-Tong Lu; Xue-Ting Song; Yu-Fan Gao; Jian Xiao (2025). The preprocessed HNSCC dataset, which contains 2,000 gene expression values, the logarithm of survival time, and a censoring indicator, can also be available. [Dataset]. http://doi.org/10.1371/journal.pone.0300490.s004
Explore at:
zip
Unique identifier
https://doi.org/10.1371/journal.pone.0300490.s004
Dataset updated
Aug 22, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Zhibo Chen; Zi-Tong Lu; Xue-Ting Song; Yu-Fan Gao; Jian Xiao
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The preprocessed HNSCC dataset, which contains 2,000 gene expression values, the logarithm of survival time, and a censoring indicator, is also available.
Association analysis of high-low outlier road intersection crashes within...
zivahub.uct.ac.za
xlsx
Updated Jun 7, 2024
+ more versions
Cite
Simone Vieira; Simon Hull; Roger Behrens (2024). Association analysis of high-low outlier road intersection crashes within the CoCT in 2017, 2018, 2019 and 2021 [Dataset]. http://doi.org/10.25375/uct.25975741.v2
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.25375/uct.25975741.v2
Dataset updated
Jun 7, 2024
Dataset provided by
University of Cape Town
Authors
Simone Vieira; Simon Hull; Roger Behrens
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
City of Cape Town
Description
This dataset provides comprehensive information on road intersection crashes recognised as "high-low" outliers within the City of Cape Town. It includes detailed records of all intersection crashes and their corresponding crash attribute combinations, which were prevalent in at least 5% of the total "high-low" outlier road intersection crashes for the years 2017, 2018, 2019, and 2021. The dataset is meticulously organised according to support metric values, ranging from 0,05 to 0,0278, with entries presented in descending order.

Data Specifics
Data Type: Geospatial-temporal categorical data
File Format: Excel document (.xlsx)
Size: 675 KB
Number of Files: The dataset contains a total of 10212 association rules
Date Created: 23rd May 2024

Methodology
Data Collection Method: The descriptive road traffic crash data per crash victim involved in the crashes was obtained from the City of Cape Town Network Information
Software: ArcGIS Pro, Python
Processing Steps: Following the spatio-temporal analyses and the derivation of "high-low" outlier fishnet grid cells from a cluster and outlier analysis, all the road intersection crashes that occurred within the "high-low" outlier fishnet grid cells were extracted to be processed by association analysis. The association analysis of these crashes was processed using Python software and involved the use of a 0,05 support metric value. Consequently, commonly occurring crash attributes among at least 5% of the "high-low" outlier road intersection crashes were extracted for inclusion in this dataset.

Geospatial Information
Spatial Coverage:
West Bounding Coordinate: 18°20'E
East Bounding Coordinate: 19°05'E
North Bounding Coordinate: 33°25'S
South Bounding Coordinate: 34°25'S
Coordinate System: South African Reference System (Lo19) using the Universal Transverse Mercator projection

Temporal Information
Temporal Coverage:
Start Date: 01/01/2017
End Date: 31/12/2021 (2020 data omitted)
Variantscape datasets
zenodo.org
csv
Updated Apr 23, 2025
Cite
Marie Wosny (2025). Variantscape datasets [Dataset]. http://doi.org/10.5281/zenodo.15268056
Explore at:
Available download formats: csv
Unique identifier
https://doi.org/10.5281/zenodo.15268056
Dataset updated
Apr 23, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Marie Wosny
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Time period covered
Apr 23, 2025
Description
Variantscape dataset
LLM-based extraction of genetic variants and biomedical entities from titles and abstracts of biomedical publications. These datasets support the analysis of literature-derived co-associations between genetic variants, cancer types, and treatments, enabling downstream network analysis, hypothesis generation, and discovery in precision oncology.

1. Dataset: Cleaned literature dataset for biomedical entity extraction (2014–2024)
"cleaned_OpenAlex.csv"
A pre-processed, cleaned, and structured dataset of cancer-related biomedical publications (2014–2024) retrieved from OpenAlex, containing titles, abstracts, and metadata curated for downstream NLP and LLM-based biomedical entity extraction.

2. Dataset: Binary entity matrix for co-association and network analysis
"dataset_for_analysis.csv"
Final binary matrix dataset derived from NLP- and LLM-based entity extraction on cancer-related literature. Entities include genetic variants, cancer types, and treatments, enabling co-occurrence and network analysis, and the investigation of literature-derived co-associations.

3. Dataset: LLM-based classification of variant-treatment co-associations
"variant_treatment_relationship_consensus.csv"
Dataset capturing LLM-based classification and consensus on co-associations between genetic variants and treatments.

4. Dataset: Metadata mapping for entity extraction and analysis
"metadata_mapping_transposed.csv"
Transposed, row-indexed metadata mapping file used for identification of each column as a variant, cancer type, treatment, study design element, or publication-derived metadata.
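
As a rough illustration of how a binary entity matrix supports co-occurrence analysis, the following Python sketch counts how often each pair of entities appears in the same publication. The column names and rows are invented for illustration and are not taken from dataset_for_analysis.csv.

```python
from itertools import combinations

# Hypothetical binary matrix: one row per publication, one column per entity.
columns = ["BRAF_V600E", "melanoma", "vemurafenib"]
rows = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]


def cooccurrence_counts(columns, rows):
    """Count publications in which both entities of each column pair are present."""
    counts = {}
    for i, j in combinations(range(len(columns)), 2):
        counts[(columns[i], columns[j])] = sum(
            1 for row in rows if row[i] and row[j]
        )
    return counts


counts = cooccurrence_counts(columns, rows)
for pair, n in counts.items():
    print(pair, n)
```

The resulting pair counts can be read as weighted edges of a co-occurrence network, which is the kind of downstream analysis the description mentions.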
Data from: Multi-Source Distributed System Data for AI-powered Analytics
zenodo.org
zip
Updated Nov 10, 2022
Cite
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao (2022). Multi-Source Distributed System Data for AI-powered Analytics [Dataset]. http://doi.org/10.5281/zenodo.3549604
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.3549604
Dataset updated
Nov 10, 2022
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Sasho Nedelkoski; Jasmin Bogatinovski; Ajay Kumar Mandapati; Soeren Becker; Jorge Cardoso; Odej Kao
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Abstract:

In recent years there has been an increased interest in Artificial Intelligence for IT Operations (AIOps). This field utilizes monitoring data from IT systems, big data platforms, and machine learning to automate various operations and maintenance (O&M) tasks for distributed systems.
The major contributions have been materialized in the form of novel algorithms.
Typically, researchers took the challenge of exploring one specific type of observability data sources, such as application logs, metrics, and distributed traces, to create new algorithms.
Nonetheless, due to the low signal-to-noise ratio of monitoring data, there is a consensus that only the analysis of multi-source monitoring data will enable the development of useful algorithms that have better performance.
Unfortunately, existing datasets usually contain only a single source of data, often logs or metrics. This limits the possibilities for greater advances in AIOps research.
Thus, we generated high-quality multi-source data composed of distributed traces, application logs, and metrics from a complex distributed system. This paper provides detailed descriptions of the experiment, statistics of the data, and identifies how such data can be analyzed to support O&M tasks such as anomaly detection, root cause analysis, and remediation.

General Information:

This repository contains simple scripts for computing data statistics and a link to the multi-source distributed system dataset.

You may find details of this dataset in the original paper:

Sasho Nedelkoski, Jasmin Bogatinovski, Ajay Kumar Mandapati, Soeren Becker, Jorge Cardoso, Odej Kao, "Multi-Source Distributed System Data for AI-powered Analytics".

If you use the data, implementation, or any details of the paper, please cite!

BIBTEX:

@inproceedings{nedelkoski2020multi,
  title={Multi-source Distributed System Data for AI-Powered Analytics},
  author={Nedelkoski, Sasho and Bogatinovski, Jasmin and Mandapati, Ajay Kumar and Becker, Soeren and Cardoso, Jorge and Kao, Odej},
  booktitle={European Conference on Service-Oriented and Cloud Computing},
  pages={161--176},
  year={2020},
  organization={Springer}
}

The multi-source/multimodal dataset is composed of distributed traces, application logs, and metrics produced by running a complex distributed system (OpenStack). In addition, we provide the workload and fault scripts together with the Rally report, which can serve as ground truth. We provide two datasets, which differ in how the workload is executed. The sequential_data is generated by executing a workload of sequential user requests. The concurrent_data is generated by executing a workload of concurrent user requests.

The raw logs in both datasets contain the same files. Users who want the logs filtered by time with respect to the two datasets should refer to the timestamps in the metrics (they provide the time window). In addition, we suggest using the provided aggregated time-ranged logs for both datasets in CSV format.

Important: The logs and the metrics are synchronized with respect to time, and both are recorded in CEST (Central European Summer Time). The traces are in UTC (Coordinated Universal Time, 2 hours behind CEST). They should be synchronized if the user develops multimodal methods. Please read the IMPORTANT_experiment_start_end.txt file before working with the data.
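
One way to align the two clocks is to convert the log and metric timestamps to UTC before joining them with the traces. The Python sketch below assumes the logs are in CEST, i.e. UTC+2, as the 2-hour offset in the note suggests. The timestamp format and the sample value are hypothetical; the real files may use a different layout.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; a fixed offset avoids depending on a time zone database.
CEST = timezone(timedelta(hours=2), name="CEST")


def cest_to_utc(timestamp: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    """Parse a naive CEST timestamp string and return an aware UTC datetime."""
    local = datetime.strptime(timestamp, fmt).replace(tzinfo=CEST)
    return local.astimezone(timezone.utc)


# Hypothetical log timestamp, used only to show the 2-hour shift.
utc_time = cest_to_utc("2019-07-01 12:00:00")
print(utc_time.isoformat())
```

A fixed offset is enough here because the recording window is described as lying entirely in one zone; data spanning a daylight-saving change would need a proper time zone database instead.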

Our GitHub repository with the code for the workloads and scripts for basic analysis can be found at: https://github.com/SashoNedelkoski/multi-source-observability-dataset/
Data from: Do intrapersonal factors mediate the association of social...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Mar 16, 2017
Cite
Abbott, Gavin; Ball, Kylie; Brug, Johannes; Timperio, Anna; Velde, Saskia J. te; Middelweerd, Anouk (2017). Do intrapersonal factors mediate the association of social support with physical activity in young women living in socioeconomically disadvantaged neighbourhoods? A longitudinal mediation analysis [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001767289
Explore at:
Dataset updated
Mar 16, 2017
Authors
Abbott, Gavin; Ball, Kylie; Brug, Johannes; Timperio, Anna; Velde, Saskia J. te; Middelweerd, Anouk
Description
Background
Levels of physical activity (PA) decrease when transitioning from adolescence into young adulthood. Evidence suggests that social support and intrapersonal factors (self-efficacy, outcome expectations, PA enjoyment) are associated with PA. The aim of the present study was to explore whether cross-sectional and longitudinal associations of social support from family and friends with leisure-time PA (LTPA) among young women living in disadvantaged areas were mediated by intrapersonal factors (PA enjoyment, outcome expectations, self-efficacy).

Methods
Survey data were collected from 18–30 year-old women living in disadvantaged suburbs of Victoria, Australia as part of the READI study in 2007–2008 (T0, N = 1197), with follow-up data collected in 2010–2011 (T1, N = 357) and 2012–2013 (T2, N = 271). A series of single-mediator models were tested using baseline (T0) and longitudinal data from all three time points with residual change scores for changes between measurements.

Results
Cross-sectional analyses showed that social support was associated with LTPA both directly and indirectly, mediated by intrapersonal factors. Each intrapersonal factor explained between 5.9–37.5% of the associations. None of the intrapersonal factors were significant mediators in the longitudinal analyses.

Conclusions
Results from the cross-sectional analyses suggest that the associations of social support from family and from friends with LTPA are mediated by intrapersonal factors (PA enjoyment, outcome expectations and self-efficacy). However, longitudinal analyses did not confirm these findings.
Data from: Gene-Based Association Analysis Identified Novel Genes Associated...
datasetcatalog.nlm.nih.gov
plos.figshare.com
Updated Mar 26, 2015
Cite
Mo, Xing-Bo; Lei, Shu-Feng; Zhang, Yong-Hong; Deng, Fei-Yan; Lu, Xin; Zhang, Zeng-Li (2015). Gene-Based Association Analysis Identified Novel Genes Associated with Bone Mineral Density [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001915185
Explore at:
Dataset updated
Mar 26, 2015
Authors
Mo, Xing-Bo; Lei, Shu-Feng; Zhang, Yong-Hong; Deng, Fei-Yan; Lu, Xin; Zhang, Zeng-Li
Description
Genetic factors contribute to the variation of bone mineral density (BMD), which is a major risk factor of osteoporosis. The aim of this study was to identify more “novel” genes for BMD. Based on the publicly available SNP-based P values, we performed an initial gene-based analysis in a total of 32,961 individuals. Furthermore, we performed differential expression, pathway and protein-protein interaction analyses to find supplementary evidence to support the significance of the identified genes. About 21,695 genes for femoral neck (FN)-BMD and 21,683 genes for lumbar spine (LS)-BMD were analyzed using gene-based association analysis. A total of 35 FN-BMD associated genes and 53 LS-BMD associated genes were identified (P < 2.3×10⁻⁶) after Bonferroni correction. Among them, 64 genes have not been reported in previous SNP-based genome-wide association studies. Differential expression analysis further supported the significant associations of 14 genes with FN-BMD and 19 genes with LS-BMD. In particular, WNT3 and WNT9B in the Wnt signaling pathway for FN-BMD were further supported by pathway analysis and protein-protein interaction analysis. The present study took advantage of the gene-based association method to perform a supplementary analysis of the GWAS dataset and found some BMD-associated genes. The evidence taken together supported the importance of Wnt signaling pathway genes in determining osteoporosis. Our findings provided more insights into the genetic basis of osteoporosis.
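
The reported significance cutoff is consistent with a Bonferroni correction over the number of genes tested. The short Python sketch below shows the arithmetic, assuming a family-wise significance level of 0.05; that level is not stated in the abstract and is an assumption here.

```python
ALPHA = 0.05  # assumed family-wise significance level (not stated in the abstract)


def bonferroni_threshold(n_tests: int, alpha: float = ALPHA) -> float:
    """Per-test P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


fn_threshold = bonferroni_threshold(21695)  # genes tested for FN-BMD
ls_threshold = bonferroni_threshold(21683)  # genes tested for LS-BMD
print(f"FN-BMD: {fn_threshold:.2e}, LS-BMD: {ls_threshold:.2e}")
```

Both cutoffs come out at roughly 2.3 in units of one millionth, which matches the threshold quoted in the description.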
Table1_Genetic association-based functional analysis detects HOGA1 as a...
datasetcatalog.nlm.nih.gov
frontiersin.figshare.com
Updated Aug 12, 2022
+ more versions
Cite
Cho, Yoon Shin; Kim, Myungsuk; Kwak, Soo Heon; Park, Kyong Soo; Randy, Ahmad; Song, No Joon; Nho, Chu Won; Lim, Eun Bi; Park, Kye Won; Ahn, Yeongseon (2022). Table1_Genetic association-based functional analysis detects HOGA1 as a potential gene involved in fat accumulation.XLSX [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000449482
Explore at:
Dataset updated
Aug 12, 2022
Authors
Cho, Yoon Shin; Kim, Myungsuk; Kwak, Soo Heon; Park, Kyong Soo; Randy, Ahmad; Song, No Joon; Nho, Chu Won; Lim, Eun Bi; Park, Kye Won; Ahn, Yeongseon
Description
Although there are a number of discoveries from genome-wide association studies (GWAS) for obesity, linking GWAS results to biology has not been successful. We sought to discover causal genes for obesity by conducting functional studies on genes detected from genetic association analysis. Gene-based association analysis of 917 individual exome sequences showed that HOGA1 attains exome-wide significance (p-value < 2.7 × 10⁻⁶) for body mass index (BMI). The mRNA expression of HOGA1 is significantly increased in human adipose tissues from obese individuals in the Genotype-Tissue Expression (GTEx) dataset, which supports the genetic association of HOGA1 with BMI. Functional analyses employing cell- and animal model-based approaches were performed to gain insights into the functional relevance of Hoga1 in obesity. Adipogenesis was retarded when Hoga1 was knocked down by siRNA treatment in a mouse 3T3-L1 cell line and a similar inhibitory effect was confirmed in mice with down-regulated Hoga1. Hoga1 antisense oligonucleotide (ASO) treatment reduced body weight, blood lipid level, blood glucose, and adipocyte size in high-fat diet-induced mice. In addition, several lipogenic genes including Srebf1, Scd1, Lp1, and Acaca were down-regulated, while lipolytic genes Cpt1l, Ppara, and Ucp1 were up-regulated. Taken together, HOGA1 is a potential causal gene for obesity as it plays a role in excess body fat development.
Dataset for social support paper in Stata format.
datasetcatalog.nlm.nih.gov
figshare.com
Updated Jul 30, 2024
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Govia, Ishtar; Wilks, Rainford J.; Francis, Damian K.; Blake, Alphanso L.; Younger-Coleman, Novie O.; Ferguson, Trevor S.; McFarlane, Shelly R.; McKenzie, Joette A.; Tulloch-Reid, Marshall K.; Williams, David R.; Walters, Renee; Bennett, Nadia R. (2024). Dataset for social support paper in Stata format. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0001386084
Explore at:
Dataset updated
Jul 30, 2024
Authors
Govia, Ishtar; Wilks, Rainford J.; Francis, Damian K.; Blake, Alphanso L.; Younger-Coleman, Novie O.; Ferguson, Trevor S.; McFarlane, Shelly R.; McKenzie, Joette A.; Tulloch-Reid, Marshall K.; Williams, David R.; Walters, Renee; Bennett, Nadia R.
Description
Recent studies have suggested that high levels of social support can encourage better health behaviours and result in improved cardiovascular health. In this study we evaluated the association between social support and ideal cardiovascular health among urban Jamaicans. We conducted a cross-sectional study among urban residents in Jamaica’s south-east health region. Socio-demographic data and information on cigarette smoking, physical activity, dietary practices, blood pressure, body size, cholesterol, and glucose were collected by trained personnel. The outcome variable, ideal cardiovascular health, was defined as having optimal levels of ≥5 of these characteristics (ICH-5) according to the American Heart Association definitions. Social support exposure variables included number of friends (network size), number of friends willing to provide loans (instrumental support) and number of friends providing advice (informational support). Principal component analysis was used to create a social support score using these three variables. Survey-weighted logistic regression models were used to evaluate the association between ICH-5 and social support score. Analyses included 841 participants (279 males, 562 females) with mean age of 47.6 ± 18.42 years. ICH-5 prevalence was 26.6% (95%CI 22.3, 31.0) with no significant sex difference (male 27.5%, female 25.7%). In sex-specific, multivariable logistic regression models, social support score was inversely associated with ICH-5 among males (OR 0.67 [95%CI 0.51, 0.89], p = 0.006) but directly associated among females (OR 1.26 [95%CI 1.04, 1.53], p = 0.020) after adjusting for age and community SES. Living in poorer communities was also significantly associated with higher odds of ICH-5 among males, while living in communities with high property value was associated with higher odds of ICH among females.
In this study, a higher level of social support was associated with better cardiovascular health among women, but poorer cardiovascular health among men in urban Jamaica. Further research should explore these associations and identify appropriate interventions to promote cardiovascular health.
Michigan Public Policy Survey Restricted Use Datasets
datasearch.gesis.org
Updated Aug 27, 2016
+ more versions
Cite
Center for Local, State, and Urban Policy (2016). Michigan Public Policy Survey Restricted Use Datasets [Dataset]. http://doi.org/10.3886/E55175V2
Explore at:
Unique identifier
https://doi.org/10.3886/E55175V2
Dataset updated
Aug 27, 2016
Dataset provided by
da|ra (Registration agency for social science and economic data)
Authors
Center for Local, State, and Urban Policy
Area covered
Michigan
Description
The Michigan Public Policy Survey (MPPS) is a program of state-wide surveys of local government leaders in Michigan. The MPPS is designed to fill an important information gap in the policymaking process. While there are ongoing surveys of the business community and of the citizens of Michigan, before the MPPS there were no ongoing surveys of local government officials that were representative of all general purpose local governments in the state. Therefore, while we knew the policy priorities and views of the state's businesses and citizens, we knew very little about the views of the local officials who are so important to the economies and community life throughout Michigan. The MPPS was launched in 2009 by the Center for Local, State, and Urban Policy (CLOSUP) at the University of Michigan and is conducted in partnership with the Michigan Association of Counties, Michigan Municipal League, and Michigan Townships Association. The associations provide CLOSUP with contact information for the survey's respondents, and consult on survey topics. CLOSUP makes all decisions on survey design, data analysis, and reporting, and receives no funding support from the associations. The surveys investigate local officials' opinions and perspectives on a variety of important public policy issues and solicit factual information about their localities relevant to policymaking. Over time, the program has covered issues such as fiscal, budgetary and operational policy, fiscal health, public sector compensation, workforce development, local-state governmental relations, intergovernmental collaboration, economic development strategies and initiatives such as placemaking and economic gardening, the role of local government in environmental sustainability, energy topics such as hydraulic fracturing ("fracking") and wind power, trust in government, views on state policymaker performance, opinions on the impacts of the Federal Stimulus Program (ARRA), and more. 
The program will investigate many other issues relevant to local and state policy in the future. A searchable database of every question the MPPS has asked is available on CLOSUP's website. Results of MPPS surveys are currently available as reports, and via online data tables. The MPPS datasets are being released in two forms: public-use datasets and restricted-use datasets. Unlike the public-use datasets, the restricted-use datasets represent full MPPS survey waves, and include all of the survey questions from a wave. Restricted-use datasets also allow for multiple waves to be linked together for longitudinal analysis. The MPPS staff do still modify these restricted-use datasets to remove jurisdiction and respondent identifiers and to recode other variables in order to protect confidentiality. However, it is theoretically possible that a researcher might be able, in some rare cases, to use enough variables from a full dataset to identify a unique jurisdiction, so access to these datasets is restricted and approved on a case-by-case basis. CLOSUP encourages researchers interested in the MPPS to review the codebooks included in this data collection to see the full list of variables including those not found in the public-use datasets, and to explore the MPPS data using the public-use datasets. On 2016-08-20, the openICPSR web site was moved to new software. In the migration process, some projects were not published in the new system because the decisions made in the old site did not map easily to the new setup. This project is temporarily available as restricted data while ICPSR verifies that all files were migrated correctly.

Item 1	Item 2	Item 3	Item 4
Bread	Butter	Jam
Mineral water	Chocolate	Eggs	Milk
Spaghetti	Tomato sauce	Parmesan

Cite

Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis

Market Basket Analysis

Analyzing Consumer Behaviour Using MBA Association Rule Mining

Explore at:

2 scholarly articles cite this dataset (View in Google Scholar)

Available download formats: zip (23875170 bytes)

Dataset updated

Dec 9, 2021

Authors

Aslan Ahmedov

Description

Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for itemsets that a customer is most likely to purchase. I was given a dataset containing a retailer's transaction data, which covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, which should increase customer engagement, improve the customer experience, and help identify customer behavior. I will solve this problem using association rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rule mining is most often used when you want to find associations between different objects in a set. It looks for frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.
- Rule: bought computer mouse => bought mouse mat
- support = P(mouse & mat) = 8/100 = 0.08
- confidence = support / P(mouse) = 0.08/0.10 = 0.80
- lift = confidence / P(mat) = 0.80/0.09 ≈ 8.9
This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
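
The three metrics can be checked with a few lines of Python. This sketch uses the standard definitions for a rule A => B: support is the share of transactions containing both A and B, confidence is support divided by the share containing A, and lift is confidence divided by the share containing B. The function name and argument names are illustrative only.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


# 100 customers, 10 bought a mouse, 9 bought a mat, 8 bought both.
support, confidence, lift = rule_metrics(100, 10, 9, 8)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.08 confidence=0.80 lift=8.89
```

A lift well above 1 means the two items are bought together far more often than would be expected if the purchases were independent.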

Strategy

Data Import
Data Understanding and Exploration
Transformation of the data, so that it is ready to be consumed by the association rules algorithm
Running association rules
Exploring the rules generated
Filtering the generated rules
Visualization of the rules

Dataset Description

File name: Assignment-1_Data
List name: retaildata
File format: .xlsx
Number of rows: 522065
Number of Attributes: 7
- BillNo: 6-digit number assigned to each transaction. Nominal.
- Itemname: Product name. Nominal.
- Quantity: The quantities of each product per transaction. Numeric.
- Date: The day and time when each transaction was generated. Numeric.
- Price: Product price. Numeric.
- CustomerID: 5-digit number assigned to each customer. Nominal.
- Country: Name of the country where each customer resides. Nominal.

https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

Libraries in R

First, we need to load the required libraries. Each one is briefly described below.

arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
readxl - Read Excel Files in R.
plyr - Tools for Splitting, Applying and Combining Data.
ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
knitr - Dynamic Report generation in R.
magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
tidyverse - This package is designed to make it easy to install and load multiple 'tidyverse' packages in a single step.

https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

Next, we clean the data frame by removing missing values.

https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

To apply association rule mining, we need to convert the data frame into transaction data, so that all items that are bought together in one invoice will be in ...
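
The conversion step can be sketched in plain Python. This is not the R workflow with 'arules' that the description uses; it is an illustration that assumes the goal is one basket per invoice, grouped by the BillNo column, with rows missing a bill number or item name dropped. The sample rows are invented.

```python
from collections import defaultdict

# Hypothetical rows using the BillNo and Itemname columns from the dataset description.
rows = [
    {"BillNo": "536365", "Itemname": "Bread"},
    {"BillNo": "536365", "Itemname": "Butter"},
    {"BillNo": "536366", "Itemname": "Milk"},
    {"BillNo": "536365", "Itemname": "Bread"},  # repeated line on the same bill
    {"BillNo": "536366", "Itemname": None},     # missing item name, dropped
]


def to_transactions(rows):
    """Group item names by bill number, dropping incomplete rows and duplicates."""
    baskets = defaultdict(set)
    for row in rows:
        if row["BillNo"] is None or row["Itemname"] is None:
            continue
        baskets[row["BillNo"]].add(row["Itemname"])
    return {bill: sorted(items) for bill, items in baskets.items()}


transactions = to_transactions(rows)
for bill, items in transactions.items():
    print(bill, items)
```

Using a set per bill means each item counts once per transaction, which is the form that support and confidence calculations expect.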
