45 datasets found
  1. Market Basket Analysis

    • kaggle.com
    Updated Dec 9, 2021
    Cite
    Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Aslan Ahmedov
    Description

    Market Basket Analysis

    Market basket analysis with Apriori algorithm

    The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a dataset containing a retailer's transaction data, covering all transactions that happened over a period of time. The retailer will use the results to grow the business: by suggesting relevant itemsets to customers, we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, an unsupervised learning technique that checks for the dependency of one data item on another.

    Introduction

    Association Rules are most used when you are planning to discover associations between different objects in a set. They work well for finding frequent patterns in a transaction database: they can tell you which items customers frequently buy together, allowing the retailer to identify relationships between items.

    An Example of Association Rules

    Assume there are 100 customers: 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both. For the rule "bought computer mouse => bought mouse mat":
    - support = P(mouse & mat) = 8/100 = 0.08
    - confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
    - lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9
    This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
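    The example above can be reproduced in a few lines of Python; a minimal sketch with toy transactions constructed to match the counts in the example:

```python
# Toy transactions matching the example: 100 customers, 10 bought a
# mouse, 9 bought a mat, 8 bought both.
transactions = (
    [{"mouse", "mat"}] * 8      # bought both
    + [{"mouse"}] * 2           # mouse only (10 mouse buyers in total)
    + [{"mat"}] * 1             # mat only (9 mat buyers in total)
    + [set()] * 89              # bought neither
)

n = len(transactions)
support = sum({"mouse", "mat"} <= t for t in transactions) / n
p_mouse = sum("mouse" in t for t in transactions) / n
p_mat = sum("mat" in t for t in transactions) / n

confidence = support / p_mouse  # P(mat | mouse) for the rule mouse => mat
lift = confidence / p_mat       # how much likelier than buying a mat by chance

print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.8 8.9
```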

    Strategy

    • Data Import
    • Data Understanding and Exploration
    • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
    • Running association rules
    • Exploring the rules generated
    • Filtering the generated rules
    • Visualization of rules

    Dataset Description

    • File name: Assignment-1_Data
    • List name: retaildata
    • File format: .xlsx
    • Number of rows: 522065
    • Number of Attributes: 7

      • BillNo: 6-digit number assigned to each transaction. Nominal.
      • Itemname: Product name. Nominal.
      • Quantity: The quantities of each product per transaction. Numeric.
      • Date: The day and time when each transaction was generated. Numeric.
      • Price: Product price. Numeric.
      • CustomerID: 5-digit number assigned to each customer. Nominal.
      • Country: Name of the country where each customer resides. Nominal.

    Image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

    Libraries in R

    First, we need to load the required libraries. Below is a short description of each library.

    • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
    • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
    • tidyverse - The tidyverse is an opinionated collection of R packages designed for data science.
    • readxl - Read Excel Files in R.
    • plyr - Tools for Splitting, Applying and Combining Data.
    • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
    • knitr - Dynamic Report generation in R.
    • magrittr - Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
    • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

    Image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

    Data Pre-processing

    Next, we need to upload Assignment-1_Data.xlsx to R to read the dataset. Now we can see our data in R.

    Image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
    Image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

    Next, we clean the data frame by removing missing values.

    Image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

    To apply Association Rule mining, we need to convert the data frame into transaction data, so that all items bought together in one invoice will be in ...
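    The original write-up does this conversion in R with plyr/arules; as an illustration of the same grouping step, a minimal pure-Python sketch (toy rows, using the BillNo and Itemname fields from the dataset description):

```python
from collections import defaultdict

# Toy rows shaped like the dataset's (BillNo, Itemname) columns.
rows = [
    ("536365", "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("536365", "WHITE METAL LANTERN"),
    ("536366", "HAND WARMER UNION JACK"),
]

# Group item names by invoice: each invoice becomes one transaction
# (basket), which is the shape association-rule mining expects.
baskets = defaultdict(list)
for bill_no, item in rows:
    baskets[bill_no].append(item)

transactions = list(baskets.values())
print(len(transactions))  # 2 baskets from 3 rows
```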

  2. roberta-fine-tuned

    • kaggle.com
    Updated Aug 3, 2023
    Cite
    Thibaut Juill (2023). roberta-fine-tuned [Dataset]. https://www.kaggle.com/datasets/thibautjuill/roberta-fine-tuned
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Aug 3, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Thibaut Juill
    Description

    Fine-tuned model based on roberta-base: https://www.kaggle.com/datasets/abhishek/roberta-base

    This model was trained for the CommonLit - Evaluate Student Summaries competition (https://www.kaggle.com/competitions/commonlit-evaluate-student-summaries/overview). Please follow the rules of the competition before using this model.

  3. LLM: 7 prompt training dataset

    • kaggle.com
    Updated Nov 15, 2023
    Cite
    Carl McBride Ellis (2023). LLM: 7 prompt training dataset [Dataset]. https://www.kaggle.com/datasets/carlmcbrideellis/llm-7-prompt-training-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Carl McBride Ellis
    License

    https://cdla.io/sharing-1-0/

    Description
    • Version 4: Adding the data from "LLM-generated essay using PaLM from Google Gen-AI" kindly generated by Kingki19 / Muhammad Rizqi.
      File: train_essays_RDizzl3_seven_v2.csv
      Human texts: 14247; LLM texts: 3004

      See also: a new dataset of an additional 4900 LLM generated texts: LLM: Mistral-7B Instruct texts



    • Version 3: "The RDizzl3 Seven"
      File: train_essays_RDizzl3_seven_v1.csv

    • "Car-free cities"

    • "Does the electoral college work?"

    • "Exploring Venus"

    • "The Face on Mars"

    • "Facial action coding system"

    • "A Cowboy Who Rode the Waves"

    • "Driverless cars"

    How this dataset was made: see the notebook "LLM: Make 7 prompt train dataset"

    • Version 2: (train_essays_7_prompts_v2.csv) This dataset is composed of 13,712 human texts and 1638 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.

    Namely:

    • "Car-free cities"
    • "Does the electoral college work?"
    • "Exploring Venus"
    • "The Face on Mars"
    • "Facial action coding system"
    • "Seeking multiple opinions"
    • "Phones and driving"

    This dataset is a derivative of the datasets

    as well as the original competition training dataset

    • Version 1: This dataset is composed of 13,712 human texts and 1165 AI-LLM generated texts originating from 7 of the PERSUADE 2.0 corpus prompts.
  4. Online Casino Dataset (Gambling)

    • kaggle.com
    Updated Jan 4, 2023
    Cite
    Yogendra S.R (2023). Online Casino Dataset (Gambling) [Dataset]. http://doi.org/10.34740/kaggle/dsv/4807301
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 4, 2023
    Dataset provided by
    Kaggle
    Authors
    Yogendra S.R
    Description

    I collected this data from an online casino platform (SQLite3 data). Game types:
    1. Dragon Tiger 20-20
    2. Lucky 7 A
    3. Lucky 7 B

    Problem example for game Lucky 7 A: 1) We only know that cards A to 6 = Low, card 7 = Tie, and cards 8 to K = High.

    But we don't know the positions of the cards

    2) G_id is useless.

    3) In the game we can see the last 10 Game Results, and from that, we can create a probability model to predict the next game.

    If you want a game account: world 777. Note: I'm not promoting a gambling site (the account is only for real-world testing and experience).

    Theoretical solution for game Lucky 7 A. Total cards in game: 416 (8 decks).
    - High cards (8 to K): 6 ranks * 4 suits = 24 cards per deck; 24 * 8 = 192 cards, probability 192/416 = 46%.
    - Low cards (A to 6): 6 ranks * 4 suits = 24 cards per deck; 24 * 8 = 192 cards, probability 192/416 = 46%.
    - Seven: 1 rank * 4 suits = 4 cards per deck; 4 * 8 = 32 cards, probability 32/416 = 8%.
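    The card counts are simple arithmetic; a quick Python check, assuming 8 decks of 52 cards as the description states:

```python
# Card-count check for Lucky 7 A, assuming 8 decks of 52 cards.
decks = 8
total = decks * 52                 # 416 cards in play

low = 6 * 4 * decks                # ranks A-6, 4 suits per deck -> 192
high = 6 * 4 * decks               # ranks 8-K, 4 suits per deck -> 192
sevens = 1 * 4 * decks             # rank 7, 4 suits per deck -> 32

assert low + high + sevens == total
print(round(low / total, 2), round(high / total, 2), round(sevens / total, 2))
# 0.46 0.46 0.08
```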

    Database Objective: Create an AI to predict the Next Game.

    Game rules (Lucky 7 A and Lucky 7 B): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3327356%2Ff07592ade1e5f23bc00074f1cb0cf0ed%2Flucky7-rules.jpg?generation=1676109609602066&alt=media

    Game rules (Dragon and Tiger): https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3327356%2F272389ed0edf50d2d97cd8c6ad578ac1%2Fdragon-tiger-20-rules.jpg?generation=1676109713320755&alt=media

    Dataset details:

    | ID | G_ID | Result |
    | --- | --- | --- |
    | Auto-increment number | Game round ID | Game result |

    Challenge:

    Create an AI to find a pattern and predict the next move in the game. Recommended AI approaches for a card-guessing game:
    1. Reinforcement learning with Q-learning - basic game
    2. Reinforcement learning with DQN - intermediate game (pattern recognition is not part of the game)
    3. Reinforcement learning with DQN + LSTM/GRU - advanced game with pattern recognition, similar to a human learning style

  5. Sales Dataset with Natural Language Statement

    • kaggle.com
    Updated Oct 1, 2024
    Cite
    Gurpreet Singh India (2024). Sales Dataset with Natural Language Statement [Dataset]. https://www.kaggle.com/datasets/gurpreetsinghindia/sales-data-with-natural-language
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 1, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Gurpreet Singh India
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains 10,000 simulated sales transaction records, each represented in natural language with diverse sentence structures. It is designed to mimic how different users might describe the same type of transaction in varying ways, making it ideal for Natural Language Processing (NLP) tasks, text-based data extraction, and accounting automation projects.

    Each record in the dataset includes the following fields:

    • Sale Date: The date on which the transaction took place.
    • Customer Name: A randomly generated customer name.
    • Product: The type of product purchased.
    • Quantity: The quantity of the product purchased.
    • Unit Price: The price per unit of the product.
    • Total Amount: The total price for the purchased products.
    • Tax Rate: The percentage of tax applied to the transaction.
    • Payment Method: The method by which the payment was made (e.g., Credit Card, Debit Card, UPI, etc.).
    • Sentence: A natural language description of the sales transaction. The sentence structure is varied to simulate different ways people describe the same type of sales event.

    Use Cases:
    • NLP Training: This dataset is suitable for training models to extract structured information (e.g., date, customer, amount) from natural language descriptions of sales transactions.
    • Accounting Automation: The dataset can be used to build or test systems that automate posting of sales transactions based on unstructured text input.
    • Text Data Preprocessing: It provides a good resource for developing methods to preprocess and standardize varying formats of text descriptions.
    • Chatbot Training: This dataset can help train chatbots or virtual assistants that handle accounting or customer inquiries by understanding different ways of expressing the same transaction details.
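    As an illustration of the information-extraction use case, a minimal Python sketch using a regular expression; the sentence wording and field pattern here are hypothetical, since the dataset deliberately varies its sentence structures and a real extractor would need many patterns or a trained model:

```python
import re

# Hypothetical sale description; this pattern matches only this toy format.
sentence = ("On 2024-09-12, John Doe bought 3 units of Laptop "
            "at 55000.00 each, paying by Credit Card.")

pattern = re.compile(
    r"On (?P<date>\d{4}-\d{2}-\d{2}), (?P<customer>[\w ]+) bought "
    r"(?P<qty>\d+) units of (?P<product>[\w ]+) at (?P<price>[\d.]+) each, "
    r"paying by (?P<method>[\w ]+)\."
)

# Extract the structured record and derive the total amount.
record = pattern.search(sentence).groupdict()
record["total"] = int(record["qty"]) * float(record["price"])
print(record["customer"], record["total"])  # John Doe 165000.0
```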

    Key Features:
    • High Variability: Sentences are structured in numerous ways to simulate natural human language variations.
    • Randomized Data: Names, dates, products, quantities, prices, and payment methods are randomized, ensuring no duplication.
    • Multi-Field Information: Each record contains key sales information essential for accounting and business use cases.

    Potential Applications:
    • Use for Named Entity Recognition (NER) tasks.
    • Apply for information extraction challenges.
    • Create pattern recognition models to understand different sentence structures.
    • Test rule-based systems or machine learning models for sales data entry and accounting automation.

    License: Ensure that the dataset is appropriately licensed according to your intended use. For general public and research purposes, choose a CC0: Public Domain license, unless specific restrictions apply.

  6. HuBMAP: 512x512 full size tiles

    • kaggle.com
    Updated Nov 17, 2020
    Cite
    xhlulu (2020). HuBMAP: 512x512 full size tiles [Dataset]. https://www.kaggle.com/xhlulu/hubmap-512x512-full-size-tiles/discussion
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Nov 17, 2020
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    xhlulu
    Description

    This dataset was modified from @iafoss 's notebook to create full sized 512x512px images. It has been derived from HuBMAP's competition data. By using this dataset, you acknowledge and accept the rules of the competition, which is non-exhaustively summarized below:

    DATA ACCESS AND USE: Open Source

    Competitions are open to residents of the United States and worldwide, except that if you are a resident of Crimea, Cuba, Iran, Syria, North Korea, Sudan, or are subject to U.S. export controls or sanctions, you may not enter the Competition. Other local rules and regulations may apply to you, so please check your local laws to ensure that you are eligible to participate in skills-based competitions. The Competition Sponsor reserves the right to award alternative Prizes where needed to comply with local laws.

    1. COMPETITION DATA.

    "Competition Data" means the data or datasets available from the Competition Website for the purpose of use in the Competition, including any prototype or executable code provided on the Competition Website. The Competition Data will contain private and public test sets. Which data belongs to which set will not be made available to participants.

    A. Data Access and Use. You may access and use the Competition Data for any purpose, whether commercial or non-commercial, including for participating in the Competition and on Kaggle.com forums, and for academic research and education. The Competition Sponsor reserves the right to disqualify any participant who uses the Competition Data other than as permitted by the Competition Website and these Rules.

    B. Data Security. You agree to use reasonable and suitable measures to prevent persons who have not formally agreed to these Rules from gaining access to the Competition Data. You agree not to transmit, duplicate, publish, redistribute or otherwise provide or make available the Competition Data to any party not participating in the Competition. You agree to notify Kaggle immediately upon learning of any possible unauthorized transmission of or unauthorized access to the Competition Data and agree to work with Kaggle to rectify any unauthorized transmission or access.

    C. External Data. You may use data other than the Competition Data (“External Data”) to develop and test your models and Submissions. However, you will (i) ensure the External Data is available to use by all participants of the competition for purposes of the competition at no cost to the other participants and (ii) post such access to the External Data for the participants to the official competition forum prior to the Entry Deadline.

  7. test-model

    • kaggle.com
    Updated Dec 2, 2024
    Cite
    The citation is currently not available for this dataset.
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 2, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Alain De Long
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Alain De Long

    Released under MIT

    Contents

  8. 20210106 reproducibility and output

    • kaggle.com
    Updated Jan 6, 2021
    Cite
    Roberto Lofaro (2021). 20210106 reproducibility and output [Dataset]. http://doi.org/10.34740/kaggle/dsv/1822349
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jan 6, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Roberto Lofaro
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Context

    For the Kaggle 2020 Survey contest, I produced a notebook.

    It is also available as a GitHub repository under the GPL.

    As the notebook is structured as a report (Executive Summary, references, notes, and of course a storyline-through-data), I added some functionality to either visualize within the notebook, or export the output to files.

    This could be useful as a template if, as I often did in the past, you have to release the data, the reproducible analysis, and a report formatted according to corporate standards, while ensuring consistency and easier comparison across versions, as well as keeping a historical archive of the evolution.

    2021-01-29: released the free edition of the book, 174 pages; it can be read online on issuu.com.

    This is the free, online-reading-only version of the book with the same title that will be available on https://leanpub.com/ai-organizational-scalability by end of February 2021.

    An experiment in transitioning to open data, freeing the approach to report writing I have used for decades with customers in cultural, organizational, and technological change activities.

    If you are just interested in the general concepts and approach, jump to Chapter 9 (a 10-page narrative across the book, with hyperlinks to details).

    The free version is here: https://issuu.com/robertolofaro/docs/ai-organizational-scalability-and-kaggle-survey_v1

    The published edition uses hyperlinks to allow at least three different reading approaches, besides the usual sequential and serendipity-based: report structure, explanatory, and a narrative about the future of Artificial Intelligence within a corporate environment.

    Content

    This dataset contains: * a single large HTML file exported from the Jupyter notebook * a ZIP file containing the files generated by the notebook when run with the option to generate output as files.

    Each generated file is cross-referenced to the section that produced it, and all the charts etc. are: * either SVGs * or, for Plotly charts, HTML files that can be hosted anywhere, as they connect to the Plotly server and reproduce the chart "as is" from the notebook at the time the HTML was produced, without needing access to the original data or dataset.

    Along with those "visual" files, there is also a text file that is a log of the execution, as built by all the "print" statements within the notebook.

    Acknowledgements

    Obviously, as I only started studying Python in March 2020, I owe a huge debt to everyone who posted online solutions to e.g. how to streamline a radar chart or a heatmap, plus countless other minutiae that I absorbed over the months and that are variously used in this notebook.

    Inspiration

    I have worked on software projects since the 1980s, building data-based models and presentations since then, and generally interfacing with business users and (senior) managers since I was in my early 20s.

    Hence, I am used to documenting, and wanted to use this opportunity to try working on a deadline to produce something like what I produced in the past in various forms, but using just a single Jupyter notebook, Python, and a single data file (the one provided by Kaggle), in the shortest time possible, to see if it was feasible.

    There is plenty of room for improvement, but I look forward to learning more thanks to all the notebooks shared here, and to contributing when (as now) I think my past non-Python experience could be useful to bridge between data and business.

    Hence, all my datasets and notebooks are generally CC BY, adding SA only when I want to avoid the data or content risking being distorted.

  9. Data from: Red Wine Quality

    • kaggle.com
    zip
    Updated Nov 27, 2017
    Cite
    UCI Machine Learning (2017). Red Wine Quality [Dataset]. https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
    Explore at:
    zip (26176 bytes)
    Dataset updated
    Nov 27, 2017
    Dataset authored and provided by
    UCI Machine Learning
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

    These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones).

    This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down if requested.)

    Content

    For more information, read [Cortez et al., 2009].
    Input variables (based on physicochemical tests):
    1 - fixed acidity
    2 - volatile acidity
    3 - citric acid
    4 - residual sugar
    5 - chlorides
    6 - free sulfur dioxide
    7 - total sulfur dioxide
    8 - density
    9 - pH
    10 - sulphates
    11 - alcohol
    Output variable (based on sensory data):
    12 - quality (score between 0 and 10)

    Tips

    An interesting exercise, aside from regression modelling, is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. 7 or higher classified as 'good/1' and the remainder as 'not good/0'. This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value. Without doing any kind of feature engineering or overfitting, you should be able to get an AUC of 0.88 (without even using a random forest algorithm).

    KNIME is a great tool (GUI) that can be used for this.
    1 - File Reader (for csv) to Linear Correlation node and to Interactive Histogram node for basic EDA.
    2 - File Reader to Rule Engine node, to turn the 10-point scale into a dichotomous variable (good wine vs. the rest); the code to put in the rule engine is something like this:
    - $quality$ > 6.5 => "good"
    - TRUE => "bad"
    3 - Rule Engine node output to Column Filter node input, to filter out your original 10-point feature (this prevents leakage)
    4 - Column Filter node output to Partitioning node input (your standard train/test split, e.g. 75%/25%; choose 'random' or 'stratified')
    5 - Partitioning node train split output to Decision Tree Learner node input
    6 - Partitioning node test split output to Decision Tree Predictor node input
    7 - Decision Tree Learner node model output to Decision Tree Predictor node model input
    8 - Decision Tree Predictor output to ROC node input (here you can evaluate your model based on the AUC value)
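    The Rule Engine step above ($quality$ > 6.5 => "good") has a one-line equivalent in code; a minimal Python sketch with toy scores, not the actual dataset:

```python
# Equivalent of the KNIME Rule Engine rule: quality > 6.5 => "good", else "bad".
def label(quality: float) -> str:
    return "good" if quality > 6.5 else "bad"

# Toy quality scores on the dataset's 0-10 scale; after labelling, the
# original quality column would be dropped to prevent leakage.
scores = [5, 6, 7, 8, 4]
labels = [label(q) for q in scores]
print(labels)  # ['bad', 'bad', 'good', 'good', 'bad']
```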

    Inspiration

    Use machine learning to determine which physiochemical properties make a wine 'good'!

    Acknowledgements

    This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down at first request. I am not the owner of this dataset.)

    Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

    Relevant publication

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

  10. "Halli Galli" board game dataset

    • kaggle.com
    Updated Mar 29, 2024
    Cite
    wwffyy (2024). "Halli Galli" board game dataset [Dataset]. https://www.kaggle.com/datasets/wwffyy/halli-galli-board-game-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 29, 2024
    Dataset provided by
    Kaggle
    Authors
    wwffyy
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    "Halli Galli" is a tabletop game centered around quick reactions. The game consists of 54 cards featuring 4 types of fruit: bananas, lemons, strawberries, and grapes. Each card shows between 1 and 5 fruits. The main mechanism of the game is for players to take turns playing cards from their hand. When a fruit appears five times or in multiples of five, the first player to notice and ring the bell wins all the cards on the table, placing them face down in their pile. If a player rings the bell incorrectly, they must give each player one card as a penalty. The game continues until a player runs out of cards, at which point they are eliminated.

    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19542212%2F3ec6d4be646b01c0a5d269d473baf64a%2F3.jpg?generation=1710233513066528&alt=media
    Image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F19542212%2F99e3a3c8a82cfed237eb6bff490985f3%2F20.jpg?generation=1710233543699702&alt=media

    We created a dataset for the board game "Halli Galli" to simulate different random card-playing scenarios. The dataset involves randomly placing the 54 cards on a fixed-size tabletop, generating data, and determining the wins and losses of the images based on the game rules. When the quantity of a certain fruit is 5 or a multiple of 5, the image label is 1; otherwise, it is 0. The dataset contains 20,000 training images, 2,000 validation images, and 2,000 test images. This dataset aims to explore the semantic understanding and logical reasoning abilities of various visual models in the absence of given game rules. We hope to discover a visual model with logical reasoning capabilities through this dataset, providing a new direction for development in the field of computer vision. If you want to know about the data generation code and the related models' performance on this dataset, please visit my repository: https://github.com/gitcat-404/Halli-Galli-Dataset
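    The labelling rule described above (label 1 when some fruit's total on the table is a positive multiple of 5) can be sketched as follows; note the real dataset contains images, so a model must infer the fruit counts from pixels rather than receiving them directly:

```python
from collections import Counter

# Label an arrangement of face-up cards: 1 if any fruit's total count
# is a positive multiple of 5, else 0 (the bell-ringing condition).
def label(cards: list[tuple[str, int]]) -> int:
    totals = Counter()
    for fruit, count in cards:
        totals[fruit] += count
    return int(any(n % 5 == 0 for n in totals.values() if n > 0))

print(label([("banana", 3), ("banana", 2), ("lemon", 4)]))  # 1 (5 bananas)
print(label([("strawberry", 4), ("grape", 2)]))             # 0
```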

  11. Fraud detection models

    • kaggle.com
    Updated Mar 12, 2025
    Cite
    ShyamSUBEDI (2025). Fraud detection models [Dataset]. https://www.kaggle.com/datasets/shyamsubedi/fraud-detection-models
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Mar 12, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    ShyamSUBEDI
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by ShyamSUBEDI

    Released under MIT

    Contents

  12. Suicidal Ideation - Reddit Dataset

    • kaggle.com
    Updated Dec 13, 2023
    Cite
    Varun (2023). Suicidal Ideation - Reddit Dataset [Dataset]. https://www.kaggle.com/datasets/rvarun11/suicidal-ideation-reddit-dataset
    Explore at:
    Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 13, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Varun
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    All posts were manually annotated as Suicidal or Non-Suicidal based on the following rules:
    1. Suicidal text
    - Posts that conveyed definite signs of suicidal ideation, or showed signs of extreme suffering from mental health illnesses like depression, were marked in this category due to their relation to suicidal intent.
    - Posts that included detailed planning of suicide or asked questions related to committing suicide, e.g. "Hello, hypothetically what would be a good way to go without loved ones knowing?".
    - Posts like "The weather today is so awful that it makes me want to kill myself hahaha" were carefully removed.
    - These posts were marked as "1".
    2. Non-suicidal text
    - Posts that did not have anything related to suicide or self-harm were marked in this category.
    - Posts that used words related to suicide or self-harm in the context of news or information.
    - Posts that talked about the suicide of some other person at some other time.
    - These posts were marked as "0". This was the default category.

    Our annotators included one university professor and three university students, who were carefully instructed on how to annotate each post. The instructions are given below:

    1. Select only one of the two categories mentioned above.
    2. Select the default category in case of any doubt.
    3. Remove any ambiguous posts which seemed very confusing, after discussing them with the other annotators.
    4. Annotate a maximum of 100-200 posts in one session to avoid mental fatigue.
    5. Since the majority of posts in the dataset were extremely long (more than 1,000 words), allow a maximum of two annotation sessions in a day.

    Once the annotators completed their tasks, they were divided into pairs, and each annotator verified the annotations of the other. Any disagreement was carefully resolved, and the final annotation was mutually agreed upon by the pair. This helped validate each annotation.

  13. NYC Open Data

    • kaggle.com
    zip
    Updated Mar 20, 2019
    NYC Open Data (2019). NYC Open Data [Dataset]. https://www.kaggle.com/datasets/nycopendata/new-york
    Dataset updated
    Mar 20, 2019
    Dataset authored and provided by
    NYC Open Data
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    NYC Open Data is an opportunity to engage New Yorkers in the information that is produced and used by City government. We believe that every New Yorker can benefit from Open Data, and Open Data can benefit from every New Yorker. Source: https://opendata.cityofnewyork.us/overview/

    Content

    Thanks to NYC Open Data, which makes public data generated by city agencies available for public use, and Citi Bike, we've incorporated over 150 GB of data in 5 open datasets into Google BigQuery Public Datasets, including:

    • Over 8 million 311 service requests from 2012-2016

    • More than 1 million motor vehicle collisions 2012-present

    • Citi Bike stations and 30 million Citi Bike trips 2013-present

    • Over 1 billion Yellow and Green Taxi rides from 2009-present

    • Over 500,000 sidewalk trees surveyed decennially in 1995, 2005, and 2015

    This dataset is deprecated and not being updated.

    Fork this kernel to get started with this dataset.

    Acknowledgements

    https://opendata.cityofnewyork.us/

    https://cloud.google.com/blog/big-data/2017/01/new-york-city-public-datasets-now-available-on-google-bigquery

    This dataset is publicly available for anyone to use under the following terms provided by the Dataset Source - https://data.cityofnewyork.us/ - and is provided "AS IS" without any warranty, express or implied, from Google. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

    By accessing datasets and feeds available through NYC Open Data, the user agrees to all of the Terms of Use of NYC.gov as well as the Privacy Policy for NYC.gov. The user also agrees to any additional terms of use defined by the agencies, bureaus, and offices providing data. Public data sets made available on NYC Open Data are provided for informational purposes. The City does not warranty the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set made available on NYC Open Data, nor are any such warranties to be implied or inferred with respect to the public data sets furnished therein.

    The City is not liable for any deficiencies in the completeness, accuracy, content, or fitness for any particular purpose or use of any public data set, or application utilizing such data set, provided by any third party.

    Banner Photo by @bicadmedia from Unsplash.

    Inspiration

    On which New York City streets are you most likely to find a loud party?

    Can you find the Virginia Pines in New York City?

    Where was the only collision caused by an animal that injured a cyclist?

    What’s the Citi Bike record for the Longest Distance in the Shortest Time (on a route with at least 100 rides)?

    image: https://cloud.google.com/blog/big-data/2017/01/images/148467900588042/nyc-dataset-6.png

  14. SNN embedded model

    • kaggle.com
    Updated Jul 24, 2023
    Megakawa (2023). SNN embedded model [Dataset]. https://www.kaggle.com/datasets/megakawa/snn-embedded-model/versions/3
    Dataset updated
    Jul 24, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Megakawa
    Description

    Dataset

    This dataset was created by Megakawa

  15. Retail Store Star Schema Dataset

    • kaggle.com
    Updated Apr 22, 2025
    Shrinivas Vishnupurikar (2025). Retail Store Star Schema Dataset [Dataset]. https://www.kaggle.com/datasets/shrinivasv/retail-store-star-schema-dataset
    Dataset updated
    Apr 22, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Shrinivas Vishnupurikar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    🛍️ Retail Star Schema (Normalized & Denormalized) – Synthetic Dataset

    This dataset provides a simulated retail data warehouse designed using star schema modeling principles.

    It includes both normalized and denormalized versions of a retail sales star schema, making it a valuable resource for data engineers, analysts, and data warehouse enthusiasts who want to explore real-world scenarios, performance tuning, and modeling strategies.

    📁 Dataset Structure

    This dataset has two fact tables:

    • fact_sales_normalized.csv – No columns from the dim_* tables have been denormalised into the fact table. (Schema diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12492162%2F11f3c0350acd609e6b9d9336d0abb448%2FNormalized-Retail-Star-Schema.png?generation=1745327115564885&alt=media)

    • fact_sales_denormalized.csv – Specific columns from certain dim_* tables have been denormalised into the fact table. (Schema diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F12492162%2Fb567c752c7bc8bc55d9d6142d6ac40cf%2FDenormalized-Retial-Star-Schema.png?generation=1745327148166677&alt=media)

    However, the dim_* tables stay the same for both:

    • Dim_Customers.csv
    • Dim_Products.csv
    • Dim_Stores.csv
    • Dim_Dates.csv
    • Dim_Salesperson
    • Dim_Campaign
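
    The practical difference between the two fact layouts can be illustrated with a minimal Python sketch; the table contents and column names below are hypothetical stand-ins, not the dataset's actual schema:

    ```python
    # Hypothetical dimension lookups keyed by surrogate key.
    dim_products = {101: {"product_name": "Mouse", "category": "Electronics"}}
    dim_stores = {7: {"store_city": "Pune"}}

    # Normalized fact row: foreign keys and measures only; attributes need joins.
    fact_normalized = {"product_id": 101, "store_id": 7, "amount": 24.99}

    # Denormalized fact row: selected dim_* columns folded in at load time,
    # trading redundancy for join-free reads.
    fact_denormalized = {
        **fact_normalized,
        "product_name": dim_products[101]["product_name"],
        "store_city": dim_stores[7]["store_city"],
    }

    print(fact_denormalized["product_name"])  # available without a join
    ```

    Queries against the denormalized table skip the join at the cost of repeating dimension values on every fact row.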

    🧠 Use Cases

    • Practice star schema design and dimensional modeling
    • Learn how to denormalize dimensions for BI and analytics performance
    • Benchmark analytical queries (joins, aggregations, filtering)
    • Test data pipelines, ETL/ELT transformations, and query optimization strategies

    • Explore how denormalization affects storage, redundancy, and performance

    📌 Notes

    All data is synthetic and randomly generated via Python scripts that use the Polars library for data manipulation; no real customer or business data is included.

    Ideal for use with tools like SQL engines, Redshift, BigQuery, Snowflake, or even DuckDB.

    📎 Credits

    Shrinivas Vishnupurikar, Data Engineer @Velotio Technologies.

  16. HMS ensemble models

    • kaggle.com
    Updated Feb 11, 2024
    Danial Zakaria (2024). HMS ensemble models [Dataset]. https://www.kaggle.com/datasets/nartaa/hms-ensemble-models/code
    Dataset updated
    Feb 11, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Danial Zakaria
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by Danial Zakaria

    Released under MIT

  17. hms-model-88-data

    • kaggle.com
    Updated Mar 18, 2024
    greySnow (2024). hms-model-88-data [Dataset]. https://www.kaggle.com/datasets/shlomoron/hms-model-88-data
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    greySnow
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by greySnow

    Released under Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

  18. hms-model-93-data

    • kaggle.com
    Updated Mar 18, 2024
    greySnow (2024). hms-model-93-data [Dataset]. https://www.kaggle.com/datasets/shlomoron/hms-model-93-data/code
    Dataset updated
    Mar 18, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    greySnow
    License

    Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) (https://creativecommons.org/licenses/by-nc-nd/4.0/)
    License information was derived automatically

    Description

    Dataset

    This dataset was created by greySnow

    Released under Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

  19. Simplified D&D Rules

    • kaggle.com
    Updated Apr 10, 2025
    WeirdSal (2025). Simplified D&D Rules [Dataset]. https://www.kaggle.com/datasets/salimoradi/simplified-d-and-d-rules
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Kaggle
    Authors
    WeirdSal
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    I used the ChatGPT O3-Mini model to generate a simplified guide for beginners in playing Dungeons & Dragons. This dataset is used in a chatbot I developed for a Q&A regarding the rules for playing D&D at the beginner level. This project was a capstone for Google's 5-day AI program.

  20. patent-autoDL-init-model

    • kaggle.com
    Updated Apr 29, 2022
    medicine-wave (2022). patent-autoDL-init-model [Dataset]. https://www.kaggle.com/datasets/medicinewave/patentautodlinitmodel/suggestions
    Dataset updated
    Apr 29, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    medicine-wave
    Description

    Dataset

    This dataset was created by medicine-wave

Aslan Ahmedov (2021). Market Basket Analysis [Dataset]. https://www.kaggle.com/datasets/aslanahmedov/market-basket-analysis

Market Basket Analysis

Analyzing Consumer Behaviour Using MBA Association Rule Mining

2 scholarly articles cite this dataset.
Dataset updated
Dec 9, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Aslan Ahmedov
Description

Market Basket Analysis

Market basket analysis with Apriori algorithm

The retailer wants to target customers with suggestions for the itemsets they are most likely to purchase. I was given a retailer's dataset; the transaction data covers all the transactions that happened over a period of time. The retailer will use the results to grow its business and to suggest itemsets to customers, so that we can increase customer engagement, improve the customer experience, and identify customer behavior. I will solve this problem using Association Rules, a type of unsupervised learning technique that checks for the dependency of one data item on another.

Introduction

Association rule mining is most often used when you want to build associations between different objects in a set, i.e., to find frequent patterns in a transaction database. It can tell you which items customers frequently buy together, and it allows the retailer to identify relationships between the items.

An Example of Association Rules

Assume there are 100 customers; 10 of them bought a computer mouse, 9 bought a mouse mat, and 8 bought both.

  • Rule: bought computer mouse => bought mouse mat
  • support = P(mouse & mat) = 8/100 = 0.08
  • confidence = support / P(computer mouse) = 0.08/0.10 = 0.80
  • lift = confidence / P(mouse mat) = 0.80/0.09 ≈ 8.9

This is just a simple example. In practice, a rule needs the support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.
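
The arithmetic above can be verified with a few lines of plain Python:

```python
# Toy counts from the example: 100 customers, 10 bought a mouse,
# 9 bought a mat, 8 bought both.
n_customers = 100
n_mouse, n_mat, n_both = 10, 9, 8

support = n_both / n_customers                  # P(mouse & mat) = 0.08
confidence = support / (n_mouse / n_customers)  # 0.08 / 0.10 = 0.80
lift = confidence / (n_mat / n_customers)       # 0.80 / 0.09 ≈ 8.9

print(round(support, 2), round(confidence, 2), round(lift, 1))  # 0.08 0.8 8.9
```

A lift well above 1 indicates the two items are bought together far more often than chance would predict.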

Strategy

  • Data Import
  • Data Understanding and Exploration
  • Transformation of the data – so that it is ready to be consumed by the association rules algorithm
  • Running association rules
  • Exploring the rules generated
  • Filtering the generated rules
  • Visualization of Rule
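
The walkthrough below runs these steps in R with arules; purely as an illustration of the counting that "Running association rules" performs, here is a minimal pure-Python sketch with made-up transactions and a made-up support threshold:

```python
from collections import Counter
from itertools import combinations

# Toy transactions standing in for the retail invoices.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

min_support = 0.5  # pairs seen in fewer than half the invoices are discarded
n = len(transactions)

# Count item pairs across transactions (the "running association rules" step).
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)

# Keep pairs meeting the support threshold (the "filtering" step).
frequent_pairs = {
    pair: c / n for pair, c in pair_counts.items() if c / n >= min_support
}
print(frequent_pairs)  # {('bread', 'milk'): 0.5, ('bread', 'butter'): 0.5}
```

Real algorithms like Apriori extend this idea to itemsets of any size while pruning candidates whose subsets are already infrequent.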

Dataset Description

  • File name: Assignment-1_Data
  • List name: retaildata
  • File format: .xlsx
  • Number of Rows: 522065
  • Number of Attributes: 7

    • BillNo: 6-digit number assigned to each transaction. Nominal.
    • Itemname: Product name. Nominal.
    • Quantity: The quantities of each product per transaction. Numeric.
    • Date: The day and time when each transaction was generated. Numeric.
    • Price: Product price. Numeric.
    • CustomerID: 5-digit number assigned to each customer. Nominal.
    • Country: Name of the country where each customer resides. Nominal.

image: https://user-images.githubusercontent.com/91852182/145270162-fc53e5a3-4ad1-4d06-b0e0-228aabcf6b70.png

Libraries in R

First, we need to load the required libraries. Below, I briefly describe each library.

  • arules - Provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules).
  • arulesViz - Extends package 'arules' with various visualization techniques for association rules and itemsets. The package also includes several interactive visualizations for rule exploration.
  • tidyverse - An opinionated collection of R packages designed for data science; it also makes it easy to install and load multiple 'tidyverse' packages in a single step.
  • readxl - Read Excel Files in R.
  • plyr - Tools for Splitting, Applying and Combining Data.
  • ggplot2 - A system for 'declaratively' creating graphics, based on "The Grammar of Graphics". You provide the data, tell 'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
  • knitr - Dynamic Report generation in R.
  • magrittr- Provides a mechanism for chaining commands with a new forward-pipe operator, %>%. This operator will forward a value, or the result of an expression, into the next function call/expression. There is flexible support for the type of right-hand side expressions.
  • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.

image: https://user-images.githubusercontent.com/91852182/145270210-49c8e1aa-9753-431b-a8d5-99601bc76cb5.png

Data Pre-processing

Next, we need to load Assignment-1_Data.xlsx into R to read the dataset. Now we can see our data in R.

image: https://user-images.githubusercontent.com/91852182/145270229-514f0983-3bbb-4cd3-be64-980e92656a02.png
image: https://user-images.githubusercontent.com/91852182/145270251-6f6f6472-8817-435c-a995-9bc4bfef10d1.png

Next, we will clean our data frame by removing missing values.

image: https://user-images.githubusercontent.com/91852182/145270286-05854e1a-2b6c-490e-ab30-9e99e731eacb.png

To apply association rule mining, we need to convert the dataframe into transaction data, so that all items bought together in one invoice will be in ...
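
As an illustration of this transformation step (in Python rather than the R used in the walkthrough, and with made-up rows), grouping items by invoice looks like:

```python
from collections import defaultdict

# Hypothetical rows mirroring the dataset's BillNo / Itemname columns.
rows = [
    ("536365", "WHITE HANGING HEART T-LIGHT HOLDER"),
    ("536365", "WHITE METAL LANTERN"),
    ("536366", "HAND WARMER UNION JACK"),
]

# Group item names by invoice number so each basket is one transaction.
baskets = defaultdict(list)
for bill_no, item in rows:
    baskets[bill_no].append(item)

transactions = list(baskets.values())
print(len(transactions))  # 2 baskets
```

In R, the equivalent result is achieved by splitting items on the invoice column and coercing the list to the arules "transactions" class.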
