45 datasets found
  1. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all transactions that happened over a period of time. The retailer will use the results to grow the business: by suggesting relevant itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association Rules are most used when you are planning to discover associations between different objects in a set. They work well for finding frequent patterns in a transaction database: they can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
    - support = P(mouse & mat) = 8/100 = 0.08
    - confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
    - lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9
    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
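    The example above can be reproduced in a few lines of Python; a minimal sketch with toy transactions constructed to match the counts in the example:

```python
# Toy transactions matching the example: 100 customers, 10 bought a
# mouse, 9 bought a mat, 8 bought both.
transactions = (
    [{"mouse", "mat"}] * 8      # bought both
    + [{"mouse"}] * 2           # mouse only (10 mouse buyers in total)
    + [{"mat"}] * 1             # mat only (9 mat buyers in total)
    + [set()] * 89              # bought neither
)

n = len(transactions)
support = sum({"mouse", "mat"} <= t for t in transactions) / n
p_mouse = sum("mouse" in t for t in transactions) / n
p_mat = sum("mat" in t for t in transactions) / n

confidence = support / p_mouse  # P(mat | mouse) for the rule mouse => mat
lift = confidence / p_mat       # how much likelier than buying a mat by chance

print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.8 8.9
```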

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load the required libraries. Below is a short description of each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean the data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
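    The original write-up does this conversion in R with plyr/arules; as an illustration of the same grouping step, a minimal pure-Python sketch (toy rows, using the BillNo and Itemname fields from the dataset description):

```python
from collections import defaultdict

# Toy rows shaped like the dataset's (BillNo, Itemname) columns.
rows = [
    ("536365", "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("536365", "WHITE METAL LANTERN"),
    ("536366", "HAND WARMER UNION JACK"),
]

# Group item names by invoice: each invoice becomes one transaction
# (basket), which is the shape association-rule mining expects.
baskets = defaultdict(list)
for bill_no, item in rows:
    baskets[bill_no].append(item)

transactions = list(baskets.values())
print(len(transactions))  # 2 baskets from 3 rows
```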

  2. roberta-fine-tuned

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    Thibaut Juill (2023). roberta-fine-tuned [Dataset]. https://www.kaggle.com/datasets/thibautjuill/roberta-fine-tuned
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Thibaut Juill
    Description

    Fine-tuned model based on roberta-base: https://www.kaggle.com/datasets/abhishek/roberta-base

    This model was trained for the CommonLit - Evaluate Student Summaries competition (https://www.kaggle.com/competitions/commonlit-evaluate-student-summaries/overview). Please follow the rules of the competition before using this model.

  3. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14247; LLM texts: 3004

      See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1: This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
  4. Online Casino Dataset (Gambling)

    • kaggle.com
    Updated Jan 4, 2023
    Cite
    Yogendra S.R (2023). Online Casino Dataset (Gambling) [Dataset]. http://doi.org/10.34740/kaggle/dsv/4807301
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Kaggle
    Authors
    Yogendra S.R
    Description

    I collected this data from an online casino platform (SQLite3 data). Game types:
    1. Dragon Tiger 20-20
    2. Lucky 7 A
    3. Lucky 7 B

    Problem example for game Lucky 7 A: 1) We only know that cards A to 6 = Low, card 7 = Tie, and cards 8 to K = High.

    But we don't know the positions of the cards

    2) G_id is useless.

    3) In the game we can see the last 10 Game Results, and from that, we can create a probability model to predict the next game.

    If you want a game account: world 777. Note: I'm not promoting a gambling site (the account is only for real-world testing and experience).

    Theoretical solution for game Lucky 7 A. Total cards in game: 416 (8 decks).
    - High cards (8 to K): 6 ranks * 4 suits = 24 cards per deck; 24 * 8 = 192 cards, probability 192/416 = 46%.
    - Low cards (A to 6): 6 ranks * 4 suits = 24 cards per deck; 24 * 8 = 192 cards, probability 192/416 = 46%.
    - Seven: 1 rank * 4 suits = 4 cards per deck; 4 * 8 = 32 cards, probability 32/416 = 8%.
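    The card counts are simple arithmetic; a quick Python check, assuming 8 decks of 52 cards as the description states:

```python
# Card-count check for Lucky 7 A, assuming 8 decks of 52 cards.
decks = 8
total = decks * 52                 # 416 cards in play

low = 6 * 4 * decks                # ranks A-6, 4 suits per deck -> 192
high = 6 * 4 * decks               # ranks 8-K, 4 suits per deck -> 192
sevens = 1 * 4 * decks             # rank 7, 4 suits per deck -> 32

assert low + high + sevens == total
print(round(low / total, 2), round(high / total, 2), round(sevens / total, 2))
# 0.46 0.46 0.08
```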

    Database Objective: Create an AI to predict the Next Game.

    Game rules (Lucky 7 A and Lucky 7 B): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3327356%2Ff07592ade1e5f23bc00074f1cb0cf0ed%2Flucky7-rules.jpg?generation=1676109609602066&alt=media

    Game rules (Dragon and Tiger): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3327356%2F272389ed0edf50d2d97cd8c6ad578ac1%2Fdragon-tiger-20-rules.jpg?generation=1676109713320755&alt=media

    Dataset details:

    | ID | G_ID | Result |
    | --- | --- | --- |
    | Auto-increment number | Game round ID | Game result |

    Challenge:

    Create an AI to find a pattern and predict the next move in the game. Recommended AI approaches for a card-guessing game:
    1. Reinforcement learning with Q-learning - basic game
    2. Reinforcement learning with DQN - intermediate game (pattern recognition is not part of the game)
    3. Reinforcement learning with DQN + LSTM/GRU - advanced game with pattern recognition, similar to a human learning style

  5. Sales Dataset with Natural Language Statement

    • kaggle.com
    Updated Oct 1, 2024
    Cite
    Gurpreet Singh India (2024). Sales Dataset with Natural Language Statement [Dataset]. https://www.kaggle.com/datasets/gurpreetsinghindia/sales-data-with-natural-language
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gurpreet Singh India
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains 10,000 simulated sales transaction records, each represented in natural language with diverse sentence structures. It is designed to mimic how different users might describe the same type of transaction in varying ways, making it ideal for Natural Language Processing (NLP) tasks, text-based data extraction, and accounting automation projects.

    Each record in the dataset includes the following fields:

    • Sale Date: The date on which the transaction took place.
    • Customer Name: A randomly generated customer name.
    • Product: The type of product purchased.
    • Quantity: The quantity of the product purchased.
    • Unit Price: The price per unit of the product.
    • Total Amount: The total price for the purchased products.
    • Tax Rate: The percentage of tax applied to the transaction.
    • Payment Method: The method by which the payment was made (e.g., Credit Card, Debit Card, UPI, etc.).
    • Sentence: A natural language description of the sales transaction. The sentence structure is varied to simulate different ways people describe the same type of sales event.

    Use Cases:
    • NLP Training: This dataset is suitable for training models to extract structured information (e.g., date, customer, amount) from natural language descriptions of sales transactions.
    • Accounting Automation: The dataset can be used to build or test systems that automate posting of sales transactions based on unstructured text input.
    • Text Data Preprocessing: It provides a good resource for developing methods to preprocess and standardize varying formats of text descriptions.
    • Chatbot Training: This dataset can help train chatbots or virtual assistants that handle accounting or customer inquiries by understanding different ways of expressing the same transaction details.
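    As an illustration of the information-extraction use case, a minimal Python sketch using a regular expression; the sentence wording and field pattern here are hypothetical, since the dataset deliberately varies its sentence structures and a real extractor would need many patterns or a trained model:

```python
import re

# Hypothetical sale description; this pattern matches only this toy format.
sentence = ("On 2024-09-12, John Doe bought 3 units of Laptop "
            "at 55000.00 each, paying by Credit Card.")

pattern = re.compile(
    r"On (?P<date>\d{4}-\d{2}-\d{2}), (?P<customer>[\w ]+) bought "
    r"(?P<qty>\d+) units of (?P<product>[\w ]+) at (?P<price>[\d.]+) each, "
    r"paying by (?P<method>[\w ]+)\."
)

# Extract the structured record and derive the total amount.
record = pattern.search(sentence).groupdict()
record["total"] = int(record["qty"]) * float(record["price"])
print(record["customer"], record["total"])  # John Doe 165000.0
```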

    Key Features:
    • High Variability: Sentences are structured in numerous ways to simulate natural human language variations.
    • Randomized Data: Names, dates, products, quantities, prices, and payment methods are randomized, ensuring no duplication.
    • Multi-Field Information: Each record contains key sales information essential for accounting and business use cases.

    Potential Applications:
    • Use for Named Entity Recognition (NER) tasks.
    • Apply for information extraction challenges.
    • Create pattern recognition models to understand different sentence structures.
    • Test rule-based systems or machine learning models for sales data entry and accounting automation.

    License: Ensure that the dataset is appropriately licensed according to your intended use. For general public and research purposes, choose a CC0: Public Domain license, unless specific restrictions apply.

  6. HuBMAP: 512x512 full size tiles

    • kaggle.com
    Updated Nov 17, 2020
    Cite
    xhlulu (2020). HuBMAP: 512x512 full size tiles [Dataset]. https://www.kaggle.com/xhlulu/hubmap-512x512-full-size-tiles/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 17, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    xhlulu
    Description

    This dataset was modified from @iafoss 's notebook to create full sized 512x512px images. It has been derived from HuBMAP's competition data. By using this dataset, you acknowledge and accept the rules of the competition, which is non-exhaustively summarized below:

    DATA ACCESS AND USE: Open Source

    Competitions are open to residents of the United States and worldwide, except that if you are a resident of Crimea, Cuba, Iran, Syria, North Korea, Sudan, or are subject to U.S. export controls or sanctions, you may not enter the Competition. Other local rules and regulations may apply to you, so please check your local laws to ensure that you are eligible to participate in skills-based competitions. The Competition Sponsor reserves the right to award alternative Prizes where needed to comply with local laws.

    1. COMPETITION DATA.

    "Competition Data" means the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to participants.

    A. Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education. The Competition Sponsor reserves the right to disqualify any participant who uses the Competition Data other than as permitted by the Competition Website and these Rules.

    B. Data Security. You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.

    C. External Data. You may use data other than the Competition Data (“External Data”) to develop and test your models and Submissions. However, you will (i) ensure the External Data is available to use by all participants of the competition for purposes of the competition at no cost to the other participants and (ii) post such access to the External Data for the participants to the official competition forum prior to the Entry Deadline.

  7. test-model

    • kaggle.com
    Updated Dec 2, 2024
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alain De Long
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Alain De Long

    Released under MIT

    Contents

  8. 20210106 reproducibility and output

    • kaggle.com
    Updated Jan 6, 2021
    Cite
    Roberto Lofaro (2021). 20210106 reproducibility and output [Dataset]. http://doi.org/10.34740/kaggle/dsv/1822349
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 6, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Roberto Lofaro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    For the Kaggle 2020 Survey contest, I produced a notebook.

    It is also available as a GitHub repository under the GPL.

    As the notebook is structured as a report (Executive Summary, references, notes, and of course a storyline-through-data), I added some functionality to either visualize within the notebook, or export the output to files.

    This could be useful as a template if, as I often did in the past, you have to release the data, the reproducible analysis, and a report formatted according to corporate standards, while ensuring consistency and easier comparison across versions, as well as keeping a historical archive of the evolution.

    2021-01-29: released the free edition of the book, 174 pages; it can be read online on issuu.com.

    This is the free, online-reading-only version of the book with the same title that will be available on https://leanpub.com/ai-organizational-scalability by end of February 2021.

    An experiment in transitioning to open data, freeing the approach to report writing I have used for decades with customers in cultural, organizational, and technological change activities.

    If you are just interested in the general concepts and approach, jump to Chapter 9 (a 10-page narrative across the book, with hyperlinks to details).

    The free version is here: https://issuu.com/robertolofaro/docs/ai-organizational-scalability-and-kaggle-survey_v1

    The published edition uses hyperlinks to allow at least three different reading approaches, besides the usual sequential and serendipity-based: report structure, explanatory, and a narrative about the future of Artificial Intelligence within a corporate environment.

    Content

    This dataset contains: * a single large HTML file exported from the Jupyter notebook * a ZIP file containing the files generated by the notebook when run with the option to generate output as files.

    Each generated file is cross-referenced to the section that produced it, and all the charts etc. are: * either SVGs * or, for Plotly charts, HTML files that can be hosted anywhere, as they connect to the Plotly server and reproduce the chart "as is" from the notebook at the time the HTML was produced, without needing access to the original data or dataset.

    Along with those "visual" files, there is also a text file that is a log of the execution, as built by all the "print" statements within the notebook.

    Acknowledgements

    Obviously, as I only started studying Python in March 2020, I owe a huge debt to everyone who posted online solutions to e.g. how to streamline a radar chart or a heatmap, plus countless other minutiae that I absorbed over the months and that are variously used in this notebook.

    Inspiration

    I have worked on software projects since the 1980s, building data-based models and presentations since then, and generally interfacing with business users and (senior) managers since I was in my early 20s.

    Hence, I am used to documenting, and wanted to use this opportunity to try working on a deadline to produce something like what I produced in the past in various forms, but using just a single Jupyter notebook, Python, and a single data file (the one provided by Kaggle), in the shortest time possible, to see if it was feasible.

    There is plenty of room for improvement, but I look forward to learning more thanks to all the notebooks shared here, and to contributing when (as now) I think my past non-Python experience could be useful to bridge between data and business.

    Hence, all my datasets and notebooks are generally CC BY, adding SA only when I want to avoid the data or content risking being distorted.

  9. Data from: Red Wine Quality

    • kaggle.com
    zip
    Updated Nov 27, 2017
    Cite
    UCI Machine Learning (2017). Red Wine Quality [Dataset]. https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
    Explore at:
    zip (26176 bytes)
    Dataset updated
    Nov 27, 2017
    Dataset authored and provided by
    UCI Machine Learning
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

    This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down if requested.)

    Content

    For more information, read [Cortez et al., 2009].
    Input variables (based on physicochemical tests):
    1 - fixed acidity
    2 - volatile acidity
    3 - citric acid
    4 - residual sugar
    5 - chlorides
    6 - free sulfur dioxide
    7 - total sulfur dioxide
    8 - density
    9 - pH
    10 - sulphates
    11 - alcohol
    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

    Tips

    An interesting exercise, aside from regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. 7 or higher classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting, you should be able to get an AUC of 0.88 (without even using a random forest algorithm).

    KNIME is a great tool (GUI) that can be used for this.
    1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram node for basic EDA.
    2 - File Reader to Rule Engine node, to turn the 10-point scale into a dichotomous variable (good wine vs. the rest); the code to put in the rule engine is something like this:
    - $quality$ > 6.5 => "good"
    - TRUE => "bad"
    3 - Rule Engine node output to Column Filter node input, to filter out your original 10-point feature (this prevents leakage)
    4 - Column Filter node output to Partitioning node input (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified')
    5 - Partitioning node train split output to Decision Tree Learner node input
    6 - Partitioning node test split output to Decision Tree Predictor node input
    7 - Decision Tree Learner node model output to Decision Tree Predictor node model input
    8 - Decision Tree Predictor output to ROC node input (here you can evaluate your model based on the AUC value)
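    The Rule Engine step above ($quality$ > 6.5 => "good") has a one-line equivalent in code; a minimal Python sketch with toy scores, not the actual dataset:

```python
# Equivalent of the KNIME Rule Engine rule: quality > 6.5 => "good", else "bad".
def label(quality: float) -> str:
    return "good" if quality > 6.5 else "bad"

# Toy quality scores on the dataset's 0-10 scale; after labelling, the
# original quality column would be dropped to prevent leakage.
scores = [5, 6, 7, 8, 4]
labels = [label(q) for q in scores]
print(labels)  # ['bad', 'bad', 'good', 'good', 'bad']
```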

    Inspiration

    Use machine learning to determine which physiochemical properties make a wine 'good'!

    Acknowledgements

    This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down at first request. I am not the owner of this dataset.)

    Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Relevant publication

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

  10. "Halli Galli" board game dataset

    • kaggle.com
    Updated Mar 29, 2024
    Cite
    wwffyy (2024). "Halli Galli" board game dataset [Dataset]. https://www.kaggle.com/datasets/wwffyy/halli-galli-board-game-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 29, 2024
    Dataset provided by
    Kaggle
    Authors
    wwffyy
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    "Halli Galli" is a tabletop game centered around quick reactions. The game consists of 54 cards featuring 4 types of fruit: bananas, lemons, strawberries, and grapes. Each card shows between 1 and 5 fruits. The main mechanism of the game is for players to take turns playing cards from their hand. When a fruit appears five times or in multiples of five, the first player to notice and ring the bell wins all the cards on the table, placing them face down in their pile. If a player rings the bell incorrectly, they must give each player one card as a penalty. The game continues until a player runs out of cards, at which point they are eliminated.

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19542212%2F3ec6d4be646b01c0a5d269d473baf64a%2F3.jpg?generation=1710233513066528&alt=media
    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19542212%2F99e3a3c8a82cfed237eb6bff490985f3%2F20.jpg?generation=1710233543699702&alt=media

    We created a dataset for the board game "Halli Galli" to simulate different random card-playing scenarios. The dataset involves randomly placing the 54 cards on a fixed-size tabletop, generating data, and determining the wins and losses of the images based on the game rules. When the quantity of a certain fruit is 5 or a multiple of 5, the image label is 1; otherwise, it is 0. The dataset contains 20,000 training images, 2,000 validation images, and 2,000 test images. This dataset aims to explore the semantic understanding and logical reasoning abilities of various visual models in the absence of given game rules. We hope to discover a visual model with logical reasoning capabilities through this dataset, providing a new direction for development in the field of computer vision. If you want to know about the data generation code and the related models' performance on this dataset, please visit my repository: https://github.com/gitcat-404/Halli-Galli-Dataset
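    The labelling rule described above (label 1 when some fruit's total on the table is a positive multiple of 5) can be sketched as follows; note the real dataset contains images, so a model must infer the fruit counts from pixels rather than receiving them directly:

```python
from collections import Counter

# Label an arrangement of face-up cards: 1 if any fruit's total count
# is a positive multiple of 5, else 0 (the bell-ringing condition).
def label(cards: list[tuple[str, int]]) -> int:
    totals = Counter()
    for fruit, count in cards:
        totals[fruit] += count
    return int(any(n % 5 == 0 for n in totals.values() if n > 0))

print(label([("banana", 3), ("banana", 2), ("lemon", 4)]))  # 1 (5 bananas)
print(label([("strawberry", 4), ("grape", 2)]))             # 0
```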

  11. Fraud detection models

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    ShyamSUBEDI (2025). Fraud detection models [Dataset]. https://www.kaggle.com/datasets/shyamsubedi/fraud-detection-models
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShyamSUBEDI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by ShyamSUBEDI

    Released under MIT

    Contents

  12. Suicidal Ideation - Reddit Dataset

    • kaggle.com
    Updated Dec 13, 2023
    Cite
    Varun (2023). Suicidal Ideation - Reddit Dataset [Dataset]. https://www.kaggle.com/datasets/rvarun11/suicidal-ideation-reddit-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Varun
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    All posts were manually annotated as Suicidal or Non-Suicidal based on the following rules:
    1. Suicidal text
    - Posts that conveyed definite signs of suicidal ideation, or showed signs of extreme suffering from mental health illnesses like depression, were marked in this category due to their relation to suicidal intent.
    - Posts that included detailed planning of suicide or asked questions related to committing suicide, e.g. "Hello, hypothetically what would be a good way to go without loved ones knowing?".
    - Posts like "The weather today is so awful that it makes me want to kill myself hahaha" were carefully removed.
    - These posts were marked as "1".
    2. Non-suicidal text
    - Posts that did not have anything related to suicide or self-harm were marked in this category.
    - Posts that used words related to suicide or self-harm in the context of news or information.
    - Posts that talked about the suicide of some other person at some other time.
    - These posts were marked as "0". This was the default category.

    Our annotators included one university professor and three university students, who were carefully instructed on how to annotate each post. The instructions are given below:

    1. Select only one of the two categories mentioned above.
    2. Select the default category in case of any doubt.
    3. Remove any ambiguous posts which seemed very confusing, after discussing them with the other annotators.
    4. Annotate a maximum of 100-200 posts in one session to avoid mental fatigue.
    5. Since the majority of posts in the dataset were extremely long (more than 1,000 words), allow a maximum of two annotation sessions in a day.

    Once the annotators completed their tasks, they were divided into pairs, and each annotator verified the annotations of the other. Any disagreement was carefully resolved, and the final annotation was mutually agreed upon by the pair. This helped validate each annotation.

  13. NYC Open Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    NYC Open Data (2019). NYC Open Data [Dataset]. https://www.kaggle.com/datasets/nycopendata/new-york
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    NYC Open Data
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/

    Content

    Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:

    • Over 8 million 311 service requests from 2012-2016

    • More than 1 million motor vehicle collisions 2012-present

    • Citi Bike stations and 30 million Citi Bike trips 2013-present

    • Over 1 billion Yellow and Green Taxi rides from 2009-present

    • Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015

    This dataset is deprecated and not being updated.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://opendata.cityofnewyork.us/

    https://cloud.google.com/blog/big-data/2017/01/new-york-city-public-datasets-now-available-on-google-bigquery

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

    The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

    Banner Photo by @bicadmedia from Unsplash.

    Inspiration

    On which New York City streets are you most likely to find a loud party?

    Can you find the Virginia Pines in New York City?

    Where was the only collision caused by an animal that injured a cyclist?

    What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?

    image: https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png

  14. SNN embedded model

    • kaggle.com
    Updated Jul 24, 2023
    Megakawa (2023). SNN embedded model [Dataset]. https://www.kaggle.com/datasets/megakawa/snn-embedded-model/versions/3
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Megakawa
    Description

    Dataset

    This dataset was created by Megakawa

  15. Retail Store Star Schema Dataset

    • kaggle.com
    Updated Apr 22, 2025
    Shrinivas Vishnupurikar (2025). Retail Store Star Schema Dataset [Dataset]. https://www.kaggle.com/datasets/shrinivasv/retail-store-star-schema-dataset
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrinivas Vishnupurikar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🛍️ Retail Star Schema (Normalized & Denormalized) – Synthetic Dataset

    This dataset provides a simulated retail data warehouse designed using star schema modeling principles.

    It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.

    📁 Dataset Structure

    This dataset has two fact tables:

    • fact_sales_normalized.csv – No columns from the dim_* tables have been denormalised into the fact table. (Schema diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12492162%2F11f3c0350acd609e6b9d9336d0abb448%2FNormalized-Retail-Star-Schema.png?generation=1745327115564885&alt=media)

    • fact_sales_denormalized.csv – Specific columns from certain dim_* tables have been denormalised into the fact table. (Schema diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12492162%2Fb567c752c7bc8bc55d9d6142d6ac40cf%2FDenormalized-Retial-Star-Schema.png?generation=1745327148166677&alt=media)

    However, the dim_* tables stay the same for both:

    • Dim_Customers.csv
    • Dim_Products.csv
    • Dim_Stores.csv
    • Dim_Dates.csv
    • Dim_Salesperson
    • Dim_Campaign
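
    The practical difference between the two fact layouts can be illustrated with a minimal Python sketch; the table contents and column names below are hypothetical stand-ins, not the dataset's actual schema:

    ```python
    # Hypothetical dimension lookups keyed by surrogate key.
    dim_products = {101: {"product_name": "Mouse", "category": "Electronics"}}
    dim_stores = {7: {"store_city": "Pune"}}

    # Normalized fact row: foreign keys and measures only; attributes need joins.
    fact_normalized = {"product_id": 101, "store_id": 7, "amount": 24.99}

    # Denormalized fact row: selected dim_* columns folded in at load time,
    # trading redundancy for join-free reads.
    fact_denormalized = {
        **fact_normalized,
        "product_name": dim_products[101]["product_name"],
        "store_city": dim_stores[7]["store_city"],
    }

    print(fact_denormalized["product_name"])  # available without a join
    ```

    Queries against the denormalized table skip the join at the cost of repeating dimension values on every fact row.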

    🧠 Use Cases

    • Practice star schema design and dimensional modeling
    • Learn how to denormalize dimensions for BI and analytics performance
    • Benchmark analytical queries (joins, aggregations, filtering)
    • Test data pipelines, ETL/ELT transformations, and query optimization strategies

    • Explore how denormalization affects storage, redundancy, and performance

    📌 Notes

    All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.

    Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.

    📎 Credits

    Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.

  16. HMS ensemble models

    • kaggle.com
    Updated Feb 11, 2024
    Danial Zakaria (2024). HMS ensemble models [Dataset]. https://www.kaggle.com/datasets/nartaa/hms-ensemble-models/code
    Dataset updated
    Feb 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Danial Zakaria
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Danial Zakaria

    Released under MIT

  17. hms-model-88-data

    • kaggle.com
    Updated Mar 18, 2024
    greySnow (2024). hms-model-88-data [Dataset]. https://www.kaggle.com/datasets/shlomoron/hms-model-88-data
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    greySnow
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by greySnow

    Released under Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

  18. hms-model-93-data

    • kaggle.com
    Updated Mar 18, 2024
    greySnow (2024). hms-model-93-data [Dataset]. https://www.kaggle.com/datasets/shlomoron/hms-model-93-data/code
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    greySnow
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by greySnow

    Released under Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

  19. Simplified D&D Rules

    • kaggle.com
    Updated Apr 10, 2025
    WeirdSal (2025). Simplified D&D Rules [Dataset]. https://www.kaggle.com/datasets/salimoradi/simplified-d-and-d-rules
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Kaggle
    Authors
    WeirdSal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    I used the ChatGPT O3-Mini model to generate a simplified guide for beginners in playing Dungeons & Dragons. This dataset is used in a chatbot I developed for a Q&A regarding the rules for playing D&D at the beginner level. This project was a capstone for Google's 5-day AI program.

  20. patent-autoDL-init-model

    • kaggle.com
    Updated Apr 29, 2022
    medicine-wave (2022). patent-autoDL-init-model [Dataset]. https://www.kaggle.com/datasets/medicinewave/patentautodlinitmodel/suggestions
    Dataset updated
    Apr 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    medicine-wave
    Description

    Dataset

    This dataset was created by medicine-wave

Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis

Market Basket Analysis

Analyzing Consumer Behaviour Using MBA Association Rule Mining

2 scholarly articles cite this dataset.
Dataset updated
Dec 9, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aslan Ahmedov
Description

Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, so that we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rule mining is most often used when you want to build associations between different objects in a set, i.e., to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.

  • Rule: bought computer mouse => bought mouse mat
  • support = P(mouse & mat) = 8/100 = 0.08
  • confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
  • lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
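
The arithmetic above can be verified with a few lines of plain Python:

```python
# Toy counts from the example: 100 customers, 10 bought a mouse,
# 9 bought a mat, 8 bought both.
n_customers = 100
n_mouse, n_mat, n_both = 10, 9, 8

support = n_both / n_customers                  # P(mouse & mat) = 0.08
confidence = support / (n_mouse / n_customers)  # 0.08 / 0.10 = 0.80
lift = confidence / (n_mat / n_customers)       # 0.80 / 0.09 ≈ 8.9

print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.8 8.9
```

A lift well above 1 indicates the two items are bought together far more often than chance would predict.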

Strategy

  • Data Import
  • Data Understanding and Exploration
  • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
  • Running association rules
  • Exploring the rules generated
  • Filtering the generated rules
  • Visualization of Rule
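
The walkthrough below runs these steps in R with arules; purely as an illustration of the counting that "Running association rules" performs, here is a minimal pure-Python sketch with made-up transactions and a made-up support threshold:

```python
from collections import Counter
from itertools import combinations

# Toy transactions standing in for the retail invoices.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

min_support = 0.5  # pairs seen in fewer than half the invoices are discarded
n = len(transactions)

# Count item pairs across transactions (the "running association rules" step).
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

# Keep pairs meeting the support threshold (the "filtering" step).
frequent_pairs = {
    pair: c / n for pair, c in pair_counts.items() if c / n >= min_support
}
print(frequent_pairs)  # {('bread', 'milk'): 0.5, ('bread', 'butter'): 0.5}
```

Real algorithms like Apriori extend this idea to itemsets of any size while pruning candidates whose subsets are already infrequent.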

Dataset Description

  • File name: Assignment-1_Data
  • List name: retaildata
  • File format: .xlsx
  • Number of Rows: 522065
  • Number of Attributes: 7

    • BillNo: 6-digit number assigned to each transaction. Nominal.
    • Itemname: Product name. Nominal.
    • Quantity: The quantities of each product per transaction. Numeric.
    • Date: The day and time when each transaction was generated. Numeric.
    • Price: Product price. Numeric.
    • CustomerID: 5-digit number assigned to each customer. Nominal.
    • Country: Name of the country where each customer resides. Nominal.

image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

Libraries in R

First, we need to load the required libraries. Below, I briefly describe each library.

  • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
  • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
  • tidyverse - An opinionated collection of R packages designed for data science; it also makes it easy to install and load multiple 'tidyverse' packages in a single step.
  • readxl - Read Excel Files in R.
  • plyr - Tools for Splitting, Applying and Combining Data.
  • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
  • knitr - Dynamic Report generation in R.
  • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
  • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

Next, we will clean our data frame by removing missing values.

image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

To apply association rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...
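
As an illustration of this transformation step (in Python rather than the R used in the walkthrough, and with made-up rows), grouping items by invoice looks like:

```python
from collections import defaultdict

# Hypothetical rows mirroring the dataset's BillNo / Itemname columns.
rows = [
    ("536365", "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("536365", "WHITE METAL LANTERN"),
    ("536366", "HAND WARMER UNION JACK"),
]

# Group item names by invoice number so each basket is one transaction.
baskets = defaultdict(list)
for bill_no, item in rows:
    baskets[bill_no].append(item)

transactions = list(baskets.values())
print(len(transactions))  # 2 baskets
```

In R, the equivalent result is achieved by splitting items on the invoice column and coercing the list to the arules "transactions" class.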
