100+ datasets found

SSH login attempts on a Raspberry Pi
kaggle.com
Updated Mar 21, 2022
Cite
Thomas (2022). SSH login attempts on a Raspberry Pi [Dataset]. https://www.kaggle.com/datasets/booroom/ssh-login-attempts-on-my-raspberry-pi
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 21, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Thomas
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

I was scared when I saw the large number of connection attempts to my little Raspberry Pi, so I decided to share some of them with you. One small mistake on my part: I can't recover the exact date (month, year).

Content

The CSV file contains:
- The month
- The hour of the login attempt
- The username that was used
- The IP address that was used
- The port that was used

Inspiration

What can you tell me about all these connections? Where do they come from? What are the most used usernames? Are there days when it is better to cut off the internet? At what time are the bots most active? Which port do I have to use?
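
The questions above (most used usernames, most active hours, most frequent sources) can be approached with simple counting. The following is a minimal sketch, assuming header names `month`, `hour`, `username`, `ip`, `port` and an `HH:MM:SS` hour format; the real file may differ, and the sample rows are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; header names and the hour format are assumptions.
SAMPLE_CSV = """month,hour,username,ip,port
Mar,03:14:07,root,203.0.113.5,22
Mar,03:14:09,admin,203.0.113.5,22
Mar,04:01:55,root,198.51.100.7,22
Apr,04:30:12,pi,192.0.2.44,2222
Apr,15:02:40,root,198.51.100.7,22
"""

def summarize(csv_text):
    """Count attempts per username, per hour of day, and per source IP."""
    usernames, hours, ips = Counter(), Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        usernames[row["username"]] += 1
        hours[int(row["hour"].split(":")[0])] += 1  # keep only the hour part
        ips[row["ip"]] += 1
    return usernames, hours, ips

usernames, hours, ips = summarize(SAMPLE_CSV)
print(usernames.most_common(3))
print(sorted(hours.items()))
print(ips.most_common(2))
```

To run this on the actual file, replace `io.StringIO(csv_text)` with an opened file handle and adjust the column names to match its header.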
Data from: Spam email Dataset
kaggle.com
Updated Sep 1, 2023
Cite
_w1998 (2023). Spam email Dataset [Dataset]. https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 1, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
_w1998
License
http://www.gnu.org/licenses/lgpl-3.0.html
Description
Dataset Name: Spam Email Dataset

Description: This dataset contains a collection of email text messages, labeled as either spam or not spam. Each email message is associated with a binary label, where "1" indicates that the email is spam, and "0" indicates that it is not spam. The dataset is intended for use in training and evaluating spam email classification models.

Columns:

text (Text): This column contains the text content of the email messages. It includes the body of the emails along with any associated subject lines or headers.

spam_or_not (Binary): This column contains binary labels to indicate whether an email is spam or not. "1" represents spam, while "0" represents not spam.

Usage: This dataset can be used for various Natural Language Processing (NLP) tasks, such as text classification and spam detection. Researchers and data scientists can train and evaluate machine learning models using this dataset to build effective spam email filters.
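
As an illustration of the classification task described above, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing, written with the standard library only. The toy rows are invented and merely follow the (`text`, `spam_or_not`) layout; this is not the dataset's own baseline.

```python
import math
import re
from collections import Counter

# Hypothetical toy rows in the (text, spam_or_not) layout described above.
ROWS = [
    ("Subject: win a free prize now", 1),
    ("Subject: free money click now", 1),
    ("Subject: meeting agenda for monday", 0),
    ("Subject: lunch on monday with the team", 0),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train(rows):
    """Collect per-class word counts, per-class document counts, and the vocabulary."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab

def predict(model, text):
    """Return the label (0 or 1) with the higher smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

model = train(ROWS)
print(predict(model, "free prize inside"))
print(predict(model, "agenda for the team meeting"))
```

On the real data, a held-out test split would be needed before drawing any conclusion about accuracy.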
Daily website users
kaggle.com
Updated Feb 19, 2022
Cite
Bertie (2022). Daily website users [Dataset]. https://www.kaggle.com/bertiemackie/daily-website-users
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 19, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Bertie
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This data set contains the number of unique customers who logged in to their accounts on a website. The value column shows this count.

Potential use cases:
- time series modelling
- in-month targeting
Iris Species
kaggle.com
zip
Updated Sep 27, 2016
+ more versions
Cite
UCI Machine Learning (2016). Iris Species [Dataset]. https://www.kaggle.com/datasets/uciml/iris
Explore at:
Available download formats: zip (3687 bytes)
Dataset updated
Sep 27, 2016
Dataset authored and provided by
UCI Machine Learning
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are:

Id

SepalLengthCm

SepalWidthCm

PetalLengthCm

PetalWidthCm

Species
Pii-Mistral-2k-fit-competition
kaggle.com
Updated Feb 24, 2024
Cite
Silvestre Bahi (2024). Pii-Mistral-2k-fit-competition [Dataset]. https://www.kaggle.com/datasets/mandrilator/pii-mistral-2k-fit-competition
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Feb 24, 2024
Dataset provided by
Kaggle
Authors
Silvestre Bahi
Description
Probabilities of getting a certain label (on v2; v1 are reverted):
- name = 0.8
- email = 0.5
- phone_num = 0.3
- address = 0.3
- url = 0.5
- username = 0.5

The subject of the essay varies with the same probability among the following (I specified that it was a design thinking tool):
- "Visualization tool"
- "Storytelling tool"
- "Mind Mapping tool"
- "Learning launch tool"

Model:
- mistralai/Mistral-7B-Instruct-v0.2

Warnings:
- I-USERNAME and I-EMAIL appear to be in the dataset
- Some addresses and other entities can be split by some punctuation
Cost of Living | +144k Tweets - ENG | Aug/Sep 2022
kaggle.com
Updated Sep 9, 2022
Cite
Tleonel (2022). Cost of Living | +144k Tweets - ENG | Aug/Sep 2022 [Dataset]. http://doi.org/10.34740/kaggle/ds/2438280
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Unique identifier
https://doi.org/10.34740/kaggle/ds/2438280
Dataset updated
Sep 9, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Tleonel
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
💸💸💸 Cost of Living - 144k Tweets in English | Aug - Sept 2022 💸💸💸

UPDATED Sept 9th

The cost of living is a scorching topic. This dataset is composed of tweets sent from August 20 to Sept 9 2022, with over 144k tweets. All tweets are in English and are from different countries. Below is a breakdown of columns and the data in them.

Columns Description

[x] date_time - Date and Time tweet was sent

[x] username - Username that sent the tweet

[x] user_location - Location entered in the account location info on Twitter

[x] user_description - Text added to "about" in account

[x] verified - If the user has the "verified by Twitter" blue tick

[x] followers_count - Number of Followers

[x] following_count - Number of accounts followed by the person who sent the tweet

[x] tweet_like_count - How many people liked the tweet

[x] tweet_retweet_count - How many people retweeted the tweet

[x] tweet_reply_count - How many people replied to that tweet

[x] source - Where was the tweet sent from. The link has info if using iPhone, Android and others

[x] tweet_text - Text sent in the tweet
Unicorn Startups (Cleaned)
kaggle.com
Updated Dec 12, 2021
Cite
Niek van der Zwaag (2021). Unicorn Startups (Cleaned) [Dataset]. https://www.kaggle.com/datasets/niekvanderzwaag/unicorn-startups-cleaned
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Dec 12, 2021
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Niek van der Zwaag
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
In business, a unicorn is a privately held startup company valued at over $1 billion. The term was first popularised in 2013 by venture capitalist Aileen Lee, choosing the mythical animal to represent the statistical rarity of such successful ventures.

This dataset is a tidied up version of https://www.kaggle.com/ramjasmaurya/unicorn-startups/ shared by @ramjasmaurya
🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️
kaggle.com
Updated Mar 30, 2025
Cite
Sandeep SD (2025). 🎸🎹🎙️Speakers Sales Conversion Dataset🎸🎹🎙️ [Dataset]. https://www.kaggle.com/datasets/sandeep1080/bassburst
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 30, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sandeep SD
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
🌟 Enjoying the Dataset? 🌟

If this dataset helped you uncover new insights or made your day a little brighter, thanks a ton for checking it out! Let’s keep those insights rolling! 🔥📈

Dataset Description:

This dataset contains website conversion data for Bluetooth speaker sales. The dataset tracks user sessions on different landing page variants, with the primary goal of analyzing conversion rates, user behavior, and other factors influencing sales. It includes detailed user engagement metrics such as time spent, pages visited, device type, sign-in methods, and geographical information.

Use Case:

This dataset can be used for various analytical tasks including:

A/B testing and multivariate analysis to compare landing page designs.
User segmentation by demographics (age, gender, location, etc.).
Conversion rate optimization (CRO) analysis.
Predictive modeling for conversion likelihood based on session characteristics.
Revenue and payment analysis.
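
For the A/B testing use case listed above, one common approach is a two-proportion z-test on the conversion rates of two landing page variants. The sketch below uses the standard library only; the counts are hypothetical and are not taken from the dataset.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal, computed via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical session and conversion counts for variants A and B.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The test assumes independent sessions and reasonably large samples; with several variants compared at once, a multiple-comparison correction would also be needed.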

Stroke Risk Prediction Dataset based on Literature

kaggle.com

Updated Mar 1, 2025

Cite

Mahatir Ahmed Tusher (2025). Stroke Risk Prediction Dataset based on Literature [Dataset]. http://doi.org/10.34740/kaggle/dsv/10892812

Explore at:

Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)

Unique identifier

https://doi.org/10.34740/kaggle/dsv/10892812

Dataset updated

Mar 1, 2025

Dataset provided by

Kaggle (http://kaggle.com/)

Authors

Mahatir Ahmed Tusher

License

MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically

Description

Stroke Risk Prediction Dataset (Version 2)

Medically Validated, Age-Accurate, and Balanced
Samples: 35,000 | Features: 16 | Targets: 2 (Binary + Regression)

📌 Overview

This dataset is designed for predicting stroke risk using symptoms, demographics, and medical literature-inspired risk modeling. Version 2 significantly improves upon Version 1 by incorporating age-dependent symptom probabilities, gender-specific risk modifiers, and medically validated feature engineering.

Key Enhancements in Version 2:

Age-Accurate Risk Modeling:
- Stroke risk now follows a sigmoidal curve (sharp increase after age 50), reflecting real-world epidemiological trends.
- Symptom probabilities (e.g., hypertension, chest pain) scale with age (see Medical Validity).
Gender-Specific Risk:
- Males under 60 have 1.5× higher risk, while females over 60 have 1.8× higher risk (post-menopausal hormonal changes).
Balanced and Expanded Data:
- 35,000 samples (vs. 10,000 in Version 1) to improve model generalizability and capture rare symptom combinations.
- 50% at-risk (stroke risk ≥50%) and 50% not-at-risk (stroke risk <50%).
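
To make the "sigmoidal curve" and the gender modifiers concrete, here is a minimal sketch. The logistic `midpoint` and `steepness` values are illustrative guesses, not the author's parameters, and the description does not say how the age curve and the multipliers are combined, so the two functions are kept separate.

```python
import math

def age_risk(age, midpoint=50.0, steepness=0.15):
    """Logistic curve in (0, 1); midpoint and steepness are assumed values."""
    return 1.0 / (1.0 + math.exp(-steepness * (age - midpoint)))

def gender_multiplier(age, gender):
    """Multipliers as stated above: 1.5x for males under 60, 1.8x for females over 60."""
    if gender == "Male" and age < 60:
        return 1.5
    if gender == "Female" and age > 60:
        return 1.8
    return 1.0

for age in (30, 50, 70):
    print(age, round(age_risk(age), 3))
```

With these assumed parameters the curve is flat at young ages, passes 0.5 at age 50, and saturates toward 1 at older ages, which matches the qualitative shape the description mentions.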

📊 Dataset Statistics

Column	Type	Description
`age`	Integer	Age (18–90)
`gender`	String	Male/Female
`chest_pain`	Binary	1 = Present, 0 = Absent
`shortness_of_breath`	Binary	1 = Present, 0 = Absent
`irregular_heartbeat`	Binary	1 = Present, 0 = Absent
`fatigue_weakness`	Binary	1 = Present, 0 = Absent
`dizziness`	Binary	1 = Present, 0 = Absent
`swelling_edema`	Binary	1 = Present, 0 = Absent
`neck_jaw_pain`	Binary	1 = Present, 0 = Absent
`excessive_sweating`	Binary	1 = Present, 0 = Absent
`persistent_cough`	Binary	1 = Present, 0 = Absent
`nausea_vomiting`	Binary	1 = Present, 0 = Absent
`high_blood_pressure`	Binary	1 = Present, 0 = Absent
`chest_discomfort`	Binary	1 = Present, 0 = Absent
`cold_hands_feet`	Binary	1 = Present, 0 = Absent
`snoring_sleep_apnea`	Binary	1 = Present, 0 = Absent
`anxiety_doom`	Binary	1 = Present, 0 = Absent
`at_risk`	Binary	Target for classification (1 = At Risk, 0 = Not At Risk)
`stroke_risk_percentage`	Float	Target for regression (0–100%)

Figure: Age distribution in Version 2 vs. Version 1 (https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F21100322%2F6317df05bc7526268853e24a5ce831ba%2FAge%20Distribution%20Plot.png?generation=1740875866152537&alt=media)

🔬 Medical Validity

This dataset is grounded in peer-reviewed medical literature, with symptom probabilities, risk weights, and demographic relationships directly derived from clinical guidelines and epidemiological studies. Below is a detailed breakdown of how medical knowledge was translated into dataset parameters:

1. Age-Dependent Symptom Probabilities

The prevalence of symptoms increases with age, reflecting real-world clinical observations. Probabilities are calibrated using population-level data from medical literature:

Hypertension (High Blood Pressure)

Probability by Age: 10% (18–30), 25% (31–50), 45% (51–70), 60% (71–90).
Source: WHO Global Report on Stroke (2023) identifies hypertension as the leading modifiable stroke risk factor, with prevalence rising from ~12% in adults <30 to ~65% in adults >70.
Clinical Basis: Arterial stiffness and cumulative vascular damage over time explain the age-dependent increase (Chapter 4, Harrison’s Principles of Internal Medicine).
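
The age-bracket probabilities for hypertension quoted above can be read as a lookup table from which a binary flag is sampled. The sketch below assumes an independent Bernoulli draw per record, which the description does not confirm; the function names are hypothetical.

```python
import random

# Age brackets and probabilities copied from the hypertension entry above.
HYPERTENSION_BY_AGE = [
    ((18, 30), 0.10),
    ((31, 50), 0.25),
    ((51, 70), 0.45),
    ((71, 90), 0.60),
]

def hypertension_probability(age):
    """Return the bracket probability for an age in the 18-90 range."""
    for (low, high), prob in HYPERTENSION_BY_AGE:
        if low <= age <= high:
            return prob
    raise ValueError(f"age {age} is outside the 18-90 range")

def sample_high_blood_pressure(age, rng):
    """Draw a 0/1 flag with the bracket's probability (assumed sampling scheme)."""
    return 1 if rng.random() < hypertension_probability(age) else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
flags = [sample_high_blood_pressure(75, rng) for _ in range(10_000)]
print(sum(flags) / len(flags))
```

The printed share should sit close to the 60% stated for ages 71 to 90, up to sampling noise.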

Chest Pain

Probability by Age: 5% (18–30), 15% (31–50), 25% (51–70), 35% (71–90).
Source: The Stroke Book (Cambridge Medicine) notes that chest pain is rare in young adults but becomes prevalent in older populations due to atherosclerosis and coronary artery disease.
Clinical Basis: Atherosclerotic plaque buildup accelerates after age ...

List of all the skills
kaggle.com
Updated Aug 28, 2020
Cite
ask9 (2020). List of all the skills [Dataset]. https://www.kaggle.com/datasets/arbazkhan971/allskillandnonskill
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 28, 2020
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
ask9
Description
Context

This dataset contains all the skills from LinkedIn, GitHub, and Stack Overflow, along with all the skills from job descriptions across platforms such as Naukri, Indeed, and Monster.com.

This is the world's largest dataset collection for skills, covering all of them.

ICR-integer-data
kaggle.com
Updated May 27, 2023
Cite
raddar (2023). ICR-integer-data [Dataset]. https://www.kaggle.com/datasets/raddar/icr-integer-data
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 27, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
raddar
Description
This dataset contains the https://www.kaggle.com/competitions/icr-identify-age-related-conditions competition data converted to integers. A common denominator was found for each column, and the distribution of even and odd numbers was examined to identify whether some values should be fractions.

Columns 'FL' and 'GL' were left untouched, since they are probably floats by nature.

Please refer to notebook for exact transformations: https://www.kaggle.com/code/raddar/convert-icr-data-to-integers
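
The idea of finding a common denominator per column can be illustrated as follows. This is a generic sketch, not the transformation from the linked notebook: it converts each value to a fraction, finds the largest step that divides all of them, and rescales the column to integers. The `max_denominator` bound and the sample column are assumptions.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def to_fractions(values, max_denominator=10_000):
    """Convert floats to exact fractions; the denominator bound is an assumed limit."""
    return [Fraction(str(v)).limit_denominator(max_denominator) for v in values]

def common_step(fracs):
    """Largest fraction s such that every value is an integer multiple of s."""
    lcm_den = reduce(lambda a, b: a * b // gcd(a, b),
                     (f.denominator for f in fracs), 1)
    scaled = [f.numerator * (lcm_den // f.denominator) for f in fracs]
    return Fraction(reduce(gcd, scaled), lcm_den)

def integerize(values):
    """Return (step, integer column); fails if every value is zero."""
    fracs = to_fractions(values)
    step = common_step(fracs)
    return step, [int(f / step) for f in fracs]

column = [0.75, 1.5, 2.25, 6.0]  # hypothetical column values
step, ints = integerize(column)
print(step, ints)
```

Real measurement columns usually carry rounding noise, so a practical version would need a tolerance when deciding whether a value is a multiple of the step.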
Social Media Dataset
kaggle.com
Updated Apr 17, 2025
Cite
Nixie6254 (2025). Social Media Dataset [Dataset]. https://www.kaggle.com/datasets/nixie6254/social-media-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Apr 17, 2025
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Nixie6254
Description
This dataset consists of 734 entries representing social media activity and performance from a local MSME (Micro, Small, and Medium Enterprise; UMKM in Indonesian) across the TikTok, Instagram, and Twitter platforms. It captures key metrics related to audience interaction and content strategy effectiveness, and is valuable for evaluating and optimizing digital marketing efforts for small businesses.

- Area: Target location or customer region where the MSME's content is directed.
- Category: The business content category (e.g., product promotion, education, seasonal campaign).
- Day: The day of the week the content was published.
- Month: The month the post went live.
- Platform: The social media platform used by the MSME (TikTok, Instagram, or Twitter).
- Post Type: The format of the content posted: image, video, carousel, or text.
- Timestamp: The exact date and time when the content was posted.
- User: The username or business account that posted the content.
- Week: Week number within the year for time-based analysis.
- Year: The year the content was posted.
- Comments: Total number of comments received on the post.
- Engagement Rate: A calculated metric showing how engaging the content is (based on likes, comments, and shares vs. reach/impressions).
- Hour: Hour of the day the post was published.
- Impressions: Number of times the content appeared on users' feeds.
- Likes: Number of likes the post received.
- Reach: Number of unique users who saw the content.
- Shares: Number of times users shared the content.
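
The description does not give the exact formula behind Engagement Rate. One common definition, assumed here purely for illustration, divides total interactions by reach and expresses the result as a percentage.

```python
def engagement_rate(likes, comments, shares, reach):
    """Interactions divided by reach, as a percentage (assumed definition).

    Returns 0.0 when reach is zero or negative to avoid division by zero.
    """
    if reach <= 0:
        return 0.0
    return 100.0 * (likes + comments + shares) / reach

# Hypothetical post metrics, not taken from the dataset.
print(engagement_rate(likes=120, comments=15, shares=5, reach=2800))
```

Dividing by impressions instead of reach is an equally common variant; comparing a recomputed value with the dataset's own column would show which one, if either, was used.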
All ISIC Data 20240629
kaggle.com
Updated Jul 17, 2024
Cite
tomoo inubushi (2024). All ISIC Data 20240629 [Dataset]. https://www.kaggle.com/datasets/tomooinubushi/all-isic-data-20240629
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 17, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
tomoo inubushi
Description
All images and metadata in ISIC archive.

!pip install isic-cli
!isic image download images/

image.hdf: Images in hdf5 format with no postprocessing

image_256sq.hdf: Images in hdf5 format with square cropping and resizing to 256x256

See also:
- https://www.kaggle.com/competitions/isic-2024-challenge/discussion/515356
- https://www.kaggle.com/competitions/siim-isic-melanoma-classification/discussion/171801
- https://www.kaggle.com/competitions/siim-isic-melanoma-classification/discussion/161943
Google Maps Restaurant Reviews
kaggle.com
Updated Aug 19, 2023
Cite
Deniz Bilgin (2023). Google Maps Restaurant Reviews [Dataset]. https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Aug 19, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Deniz Bilgin
License
Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
License information was derived automatically
Description
Data includes reviews of different restaurants on Google Maps. There are 1100 comments in total and pictures of each comment in the data set. The data is labeled according to 4 classes (Taste, Menu, Indoor atmosphere, Outdoor atmosphere) for the artificial intelligence to predict. The dataset has been prepared in a way that can be used in both text processing and image processing fields.

The dataset contains the following columns: business_name, author_name, text, photo, rating, rating_category

IMPORTANT: The rating_category column is related to the photo of the review. If you want to use this dataset for NLP, you need to label it yourself. I will label it for you when I am available.
Heart Disease Prediction Dataset
kaggle.com
Updated Jul 28, 2024
Cite
Krish Ujeniya (2024). Heart Disease Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/krishujeniya/heart-diseae
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Jul 28, 2024
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Krish Ujeniya
License
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Description
Overview

This dataset contains medical data used for predicting heart disease. The data includes various attributes such as age, sex, chest pain type (cp), resting blood pressure (trestbps), cholesterol (chol), fasting blood sugar (fbs), resting electrocardiographic results (restecg), maximum heart rate achieved (thalach), exercise-induced angina (exang), and ST depression induced by exercise relative to rest (oldpeak).

Columns

- age: Age of the patient (in years)
- sex: Sex of the patient (1 = male, 0 = female)
- cp: Chest pain type (1-4)
- trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
- chol: Serum cholesterol in mg/dl
- fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: Resting electrocardiographic results (0-2)
- thalach: Maximum heart rate achieved
- exang: Exercise-induced angina (1 = yes; 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
Multimodal Single-Cell Integration Related Data 01
kaggle.com
Updated Sep 9, 2022
Cite
Alexander Chervov (2022). Multimodal Single-Cell Integration Related Data 01 [Dataset]. https://www.kaggle.com/datasets/alexandervc/multimodal-singlecell-integration-related-data-01
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Sep 9, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Alexander Chervov
Description
The preprocessed and "denoised" data for competition: https://www.kaggle.com/competitions/open-problems-multimodal

For further information see discussion: https://www.kaggle.com/competitions/open-problems-multimodal/discussion/350856
Incident_event_log_dataset
kaggle.com
Updated Mar 24, 2022
Cite
winmedals (2022). Incident_event_log_dataset [Dataset]. https://www.kaggle.com/datasets/winmedals/incident-event-log-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Mar 24, 2022
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
winmedals
Description
Source: https://archive.ics.uci.edu/ml/datasets/Incident+management+process+enriched+event+log

Reposting as kaggle dataset for convenience and fast usage
Multilabel classification music emotions dataset
kaggle.com
zip
Updated Oct 25, 2018
Cite
srinivas365 (2018). Multilabel classification music emotions dataset [Dataset]. https://www.kaggle.com/datasets/srinivas365/multilabel-classification-emotions
Explore at:
Available download formats: zip (322050 bytes)
Dataset updated
Oct 25, 2018
Authors
srinivas365
Description
Dataset

This dataset was created by srinivas365

Contents
Customer Shopping Trends Dataset
kaggle.com
Updated Oct 5, 2023
+ more versions
Cite
Sourav Banerjee (2023). Customer Shopping Trends Dataset [Dataset]. https://www.kaggle.com/datasets/iamsouravbanerjee/customer-shopping-trends-dataset
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
Oct 5, 2023
Dataset provided by
Kaggle (http://kaggle.com/)
Authors
Sourav Banerjee
Description
Context

The Customer Shopping Preferences Dataset offers valuable insights into consumer behavior and purchasing patterns. Understanding customer preferences and trends is critical for businesses to tailor their products, marketing strategies, and overall customer experience. This dataset captures a wide range of customer attributes including age, gender, purchase history, preferred payment methods, frequency of purchases, and more. Analyzing this data can help businesses make informed decisions, optimize product offerings, and enhance customer satisfaction. The dataset stands as a valuable resource for businesses aiming to align their strategies with customer needs and preferences. It's important to note that this dataset is a Synthetic Dataset Created for Beginners to learn more about Data Analysis and Machine Learning.

Content

This dataset encompasses various features related to customer shopping preferences, gathering essential information for businesses seeking to enhance their understanding of their customer base. The features include customer age, gender, purchase amount, preferred payment methods, frequency of purchases, and feedback ratings. Additionally, data on the type of items purchased, shopping frequency, preferred shopping seasons, and interactions with promotional offers is included. With a collection of 3900 records, this dataset serves as a foundation for businesses looking to apply data-driven insights for better decision-making and customer-centric strategies.

Dataset Glossary (Column-wise)

Customer ID - Unique identifier for each customer

Age - Age of the customer

Gender - Gender of the customer (Male/Female)

Item Purchased - The item purchased by the customer

Category - Category of the item purchased

Purchase Amount (USD) - The amount of the purchase in USD

Location - Location where the purchase was made

Size - Size of the purchased item

Color - Color of the purchased item

Season - Season during which the purchase was made

Review Rating - Rating given by the customer for the purchased item

Subscription Status - Indicates if the customer has a subscription (Yes/No)

Shipping Type - Type of shipping chosen by the customer

Discount Applied - Indicates if a discount was applied to the purchase (Yes/No)

Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No)

Previous Purchases - The total count of transactions concluded by the customer at the store, excluding the ongoing transaction

Payment Method - Customer's most preferred payment method

Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly)

Structure of the Dataset

Image: https://i.imgur.com/6UEqejq.png

Acknowledgement

This dataset is a synthetic creation generated using ChatGPT to simulate a realistic customer shopping experience. Its purpose is to provide a platform for beginners and data enthusiasts, allowing them to create, enjoy, practice, and learn from a dataset that mirrors real-world customer shopping behavior. The aim is to foster learning and experimentation in a simulated environment, encouraging a deeper understanding of data analysis and interpretation in the context of consumer preferences and retail scenarios.

Cover Photo by: Freepik

Thumbnail by: Clothing icons created by Flat Icons - Flaticon
E-commerce Business Transaction
kaggle.com
Updated May 14, 2022
Cite
Gabriel Ramos (2022). E-commerce Business Transaction [Dataset]. https://www.kaggle.com/datasets/gabrielramos87/an-online-shop-business
Explore at:
Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
Dataset updated
May 14, 2022
Dataset provided by
Kaggle
Authors
Gabriel Ramos
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Context

E-commerce has become a new channel to support business development. Through e-commerce, businesses can gain access to and establish a wider market presence by providing cheaper and more efficient distribution channels for their products or services. E-commerce has also changed the way people shop and consume products and services. Many people are turning to their computers or smart devices to order goods, which can easily be delivered to their homes.

Content

This is a sales transaction data set of UK-based e-commerce (online retail) for one year. This London-based shop has been selling gifts and homewares for adults and children through the website since 2007. Their customers come from all over the world and usually make direct purchases for themselves. There are also small businesses that buy in bulk and sell to other customers through retail outlet channels.

The data set contains 500K rows and 8 columns. The following is a description of each column.
1. TransactionNo (categorical): a six-digit unique number that defines each transaction. The letter “C” in the code indicates a cancellation.
2. Date (numeric): the date when each transaction was generated.
3. ProductNo (categorical): a five- or six-digit unique character string used to identify a specific product.
4. Product (categorical): product/item name.
5. Price (numeric): the price of each product per unit in pound sterling (£).
6. Quantity (numeric): the quantity of each product per transaction. Negative values relate to cancelled transactions.
7. CustomerNo (categorical): a five-digit unique number that defines each customer.
8. Country (categorical): name of the country where the customer resides.

There is a small percentage of order cancellations in the data set. Most of these cancellations were due to out-of-stock conditions on some products. In this situation, customers tend to cancel an order because they want all products delivered at once.
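
The cancellation rules above (a “C” in TransactionNo, negative Quantity) suggest filtering those rows out before computing revenue. The following is a minimal sketch with invented rows that reuse the column names from the description; treating either signal as a cancellation is an assumption.

```python
from collections import defaultdict

# Hypothetical rows; only the columns needed for the calculation are included.
ROWS = [
    {"TransactionNo": "581482", "CustomerNo": "17490", "Price": 2.5, "Quantity": 4},
    {"TransactionNo": "581483", "CustomerNo": "17490", "Price": 1.25, "Quantity": 8},
    {"TransactionNo": "C581484", "CustomerNo": "13069", "Price": 6.0, "Quantity": -2},
    {"TransactionNo": "581485", "CustomerNo": "13069", "Price": 3.0, "Quantity": 5},
]

def is_cancellation(row):
    """Flag a row as cancelled if its code contains 'C' or its quantity is negative."""
    return "C" in row["TransactionNo"] or row["Quantity"] < 0

def revenue_by_customer(rows):
    """Sum Price * Quantity per customer, skipping cancelled rows."""
    totals = defaultdict(float)
    for row in rows:
        if is_cancellation(row):
            continue
        totals[row["CustomerNo"]] += row["Price"] * row["Quantity"]
    return dict(totals)

def cancellation_rate(rows):
    """Share of rows flagged as cancellations."""
    return sum(is_cancellation(r) for r in rows) / len(rows)

print(revenue_by_customer(ROWS))
print(cancellation_rate(ROWS))
```

Whether cancelled rows should be dropped or netted against the original order depends on the analysis; dropping them is the simpler choice shown here.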

Inspiration

Information is a main asset of businesses nowadays. The success of a business in a competitive environment depends on its ability to acquire, store, and utilize information. Data is one of the main sources of information. Therefore, data analysis is an important activity for acquiring new and useful information. Analyze this dataset and try to answer the following questions.
1. How did sales trend over the months?
2. What are the most frequently purchased products?
3. How many products does the customer purchase in each transaction?
4. Which customer segments are the most profitable?
5. Based on your findings, what strategy could you recommend to the business to gain more profit?

Photo by CardMapr on Unsplash

SSH login attempts on a Raspberry Pi

All the failed authentications on my little strawberry
