41 datasets found
  1. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Explore at:
    Available download formats: zip (2,028,853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:
    • CID (Customer ID): A unique identifier for each customer.
    • TID (Transaction ID): A unique identifier for each transaction.
    • Gender: The gender of the customer, categorized as Male or Female.
    • Age Group: Age group of the customer, divided into several ranges.
    • Purchase Date: The timestamp of when the transaction took place.
    • Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    • Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    • Discount Name: Name of the discount applied (e.g., FESTIVE50).
    • Discount Amount (INR): The amount of discount availed by the customer.
    • Gross Amount: The total amount before applying any discount.
    • Net Amount: The final amount after applying the discount.
    • Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    • Location: The city where the purchase took place.

    Use Cases:
    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data.
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
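
    The EDA use case can be sketched quickly in pandas. The column names below follow the description in this listing, and the three rows are invented for illustration (a sketch, not the dataset's official loader):

```python
import pandas as pd

# Hypothetical sample rows mimicking the schema described above; the real
# file has 55,000 rows.
df = pd.DataFrame({
    "CID": [101, 102, 103],
    "Gender": ["Male", "Female", "Female"],
    "Product Category": ["Electronics", "Apparel", "Electronics"],
    "Discount Availed": ["Yes", "No", "Yes"],
    "Discount Amount (INR)": [500.0, 0.0, 250.0],
    "Gross Amount": [5000.0, 1200.0, 3000.0],
    "Net Amount": [4500.0, 1200.0, 2750.0],
})

# Consistency check implied by the column definitions: Net = Gross - Discount.
assert ((df["Gross Amount"] - df["Discount Amount (INR)"])
        == df["Net Amount"]).all()

# A typical first EDA cut: average spend by category and discount usage.
summary = (df.groupby(["Product Category", "Discount Availed"])["Net Amount"]
             .mean())
print(summary)
```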

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.

  2. BI intro to data cleaning eda and machine learning

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions
    Explore at:
    Available download formats: zip (9,961 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy – Student Success Optimization

    Hosted by: Walsoft Computer Institute | Download dataset | Kaggle profile

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you, a Data Science Consultant, to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    "Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?"

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    • fNAME, lNAME – Student first and last name
    • Age – Student age (21–71 years)
    • gender – Gender (standardized as "Male"/"Female")
    • country – Student's country of origin
    • residence – Student housing/residence type
    • entryEXAM – Entry test score (28–98)
    • prevEducation – Prior education (High School, Diploma, etc.)
    • studyHOURS – Total study hours logged
    • Python – Final Python exam score
    • DB – Final Database exam score

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day, from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor, DIPLOMA, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as:
    • String standardization
    • Null value imputation
    • Type correction (e.g., scores as float)
    • Validation and visual verification
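
    A minimal pandas sketch of these cleaning steps, using the example inconsistencies listed in this description (the rows are invented; the real file is bi.csv):

```python
import pandas as pd

# Hypothetical raw rows illustrating the quality issues described above.
df = pd.DataFrame({
    "country": ["Norge", "RSA", "UK", "Kenya"],
    "residence": ["BI-Residence", "BIResidence", "BI_Residence", "Private"],
    "gender": ["M", "F", "female", "Male"],
    "Python": [80.0, None, 60.0, 70.0],
})

# String standardization via explicit mappings.
df["country"] = df["country"].replace(
    {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})
df["residence"] = df["residence"].replace(
    {"BI-Residence": "BI Residence", "BIResidence": "BI Residence",
     "BI_Residence": "BI Residence"})
df["gender"] = df["gender"].replace(
    {"M": "Male", "F": "Female", "female": "Female"})

# Null value imputation (column mean) and type correction (scores as float).
df["Python"] = df["Python"].fillna(df["Python"].mean()).astype(float)
```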

    ✅ Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed:
    • All fields cleaned and renamed consistently
    • Missing Python scores filled with th...

  3. Cleaned Auto Dataset 1985

    • kaggle.com
    zip
    Updated Oct 3, 2021
    Cite
    Faisal Moiz Hussain (2021). Cleaned Auto Dataset 1985 [Dataset]. https://www.kaggle.com/faisalmoizhussain/cleaned-auto-dataset-1985
    Explore at:
    Available download formats: zip (10,027 bytes)
    Dataset updated
    Oct 3, 2021
    Authors
    Faisal Moiz Hussain
    Description

    Context

    Tailor-made data for applying machine learning models, where newcomers can easily perform their EDA.

    The data consists of the features of the four-wheelers available on the market in 1985. The task is to predict the **price of the car** using Linear Regression, SVM-R (SVR), or similar models, optionally with PCA for dimensionality reduction.
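
    Since the stated task is price prediction, the core of a single-feature linear regression fits in a few lines of standard-library Python (the engine-size/price pairs below are invented, not taken from the dataset):

```python
from statistics import mean

# Hypothetical (engine size, price) pairs for illustration.
x = [97, 109, 130, 152, 183]
y = [5499, 7999, 9995, 13950, 17450]

# Ordinary least squares for one feature: slope = cov(x, y) / var(x).
mx, my = mean(x), mean(y)
slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
intercept = my - slope * mx

# Predict the price of a car with engine size 120.
predicted = intercept + slope * 120
```

    In practice one would fit all features at once with a library such as scikit-learn, but the closed form above is the same idea.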

  4. Pakistan Online Product Sales Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    Aliza Brand (2025). Pakistan Online Product Sales Dataset [Dataset]. https://www.kaggle.com/datasets/shahzadi786/pakistan-online-product-sales-dataset
    Explore at:
    Available download formats: zip (13,739 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    Aliza Brand
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Pakistan
    Description

    Context

    Online e-commerce is rapidly growing in Pakistan. Sellers list thousands of products across multiple categories, each with different prices, ratings, and sales numbers. Understanding the patterns of product sales, pricing, and customer feedback is crucial for businesses and data scientists alike.

    This dataset simulates a realistic snapshot of online product sales in Pakistan, including diverse categories like Electronics, Clothing, Home & Kitchen, Books, Beauty, and Sports.

    Source

    Generated synthetically using Python and NumPy for learning and practice purposes.

    No real personal or private data is included.

    Designed specifically for Kaggle competitions, notebooks, and ML/EDA exercises.

    About the File

    File name: Pakistan_Online_Product_Sales.csv
    Rows: 1000+
    Columns: 6

    Purpose:
    • Train machine learning models (regression/classification)
    • Explore data through EDA and visualizations
    • Practice feature engineering and data preprocessing

  5. Datasets for manuscript "Tracking end-of-life stage of chemicals: a scalable...

    • catalog.data.gov
    • s.cnmilf.com
    Updated May 30, 2023
    Cite
    U.S. EPA Office of Research and Development (ORD) (2023). Datasets for manuscript "Tracking end-of-life stage of chemicals: a scalable data-centric and chemical-centric approach" [Dataset]. https://catalog.data.gov/dataset/datasets-for-manuscript-tracking-end-of-life-stage-of-chemicals-a-scalable-data-centric-an
    Explore at:
    Dataset updated
    May 30, 2023
    Dataset provided by
    United States Environmental Protection Agency (http://www.epa.gov/)
    Description

    As described in the README.md file, the GitHub repository PRTR_transfers contains Python scripts that run a data-centric and chemical-centric framework for tracking end-of-life (EoL) chemical flow transfers, identifying potential EoL exposure scenarios, and performing Chemical Flow Analysis (CFA). The accompanying Extract, Transform, and Load (ETL) pipeline leverages publicly accessible Pollutant Release and Transfer Register (PRTR) systems belonging to Organization for Economic Cooperation and Development (OECD) member countries. The Life Cycle Inventory (LCI) data obtained by the ETL is stored in a Structured Query Language (SQL) database called PRTR_transfers that can be connected to Machine Learning Operations (MLOps) in production environments, making the framework scalable for real-world applications. The data ingestion pipeline can supply data at an annual rate, ensuring labeled data can be ingested into data-driven models if retraining is needed, especially to address problems like data and concept drift that could drastically affect the performance of data-driven models. The README also describes the Python libraries required for running the code, how to use it, the output files obtained after running the Python scripts, and how to obtain all manuscript figures (file Manuscript Figures-EDA.ipynb) and results. This dataset is associated with the following publication: Hernandez-Betancur, J.D., G.J. Ruiz-Mercado, and M. Martín. Tracking end-of-life stage of chemicals: A scalable data-centric and chemical-centric approach. Resources, Conservation and Recycling. Elsevier Science BV, Amsterdam, NETHERLANDS, 196: 107031, (2023).

  6. Shopping Mall Customer Data Segmentation Analysis

    • kaggle.com
    zip
    Updated Aug 4, 2024
    Cite
    DataZng (2024). Shopping Mall Customer Data Segmentation Analysis [Dataset]. https://www.kaggle.com/datasets/datazng/shopping-mall-customer-data-segmentation-analysis
    Explore at:
    Available download formats: zip (5,890,828 bytes)
    Dataset updated
    Aug 4, 2024
    Authors
    DataZng
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demographic Analysis of Shopping Behavior: Insights and Recommendations

    Dataset Information: The Shopping Mall Customer Segmentation Dataset comprises 15,079 unique entries, featuring Customer ID, age, gender, annual income, and spending score. This dataset assists in understanding customer behavior for strategic marketing planning.

    Cleaned Data Details: The data was cleaned and standardized into 15,079 unique entries with attributes including Customer ID, age, gender, annual income, and spending score. Marketing analysts can use it to build better mall-specific marketing strategies.

    Challenges Faced:
    1. Data Cleaning: Overcoming inconsistencies and missing values required meticulous attention.
    2. Statistical Analysis: Interpreting demographic data accurately demanded collaborative effort.
    3. Visualization: Crafting informative visuals to convey insights effectively posed design challenges.

    Research Topics:
    1. Consumer Behavior Analysis: Exploring psychological factors driving purchasing decisions.
    2. Market Segmentation Strategies: Investigating effective targeting based on demographic characteristics.

    Suggestions for Project Expansion:
    1. Incorporate External Data: Integrate social media analytics or geographic data to enrich customer insights.
    2. Advanced Analytics Techniques: Explore advanced statistical methods and machine learning algorithms for deeper analysis.
    3. Real-Time Monitoring: Develop tools for agile decision-making through continuous customer behavior tracking.

    This summary outlines the demographic analysis of shopping behavior, highlighting key insights, dataset characteristics, team contributions, challenges, research topics, and suggestions for project expansion. Leveraging these insights can enhance marketing strategies and drive business growth in the retail sector.

    References:
    • OpenAI. (2022). ChatGPT [Computer software]. https://openai.com/chatgpt
    • Mustafa, Z. (2022). Shopping Mall Customer Segmentation Data [Data set]. Kaggle. https://www.kaggle.com/datasets/zubairmustafa/shopping-mall-customer-segmentation-data
    • Donkeys. (n.d.). Kaggle Python API [Jupyter Notebook]. Kaggle. https://www.kaggle.com/code/donkeys/kaggle-python-api/notebook
    • pandas-datareader. (n.d.). https://pypi.org/project/pandas-datareader/

  7. Medical Clean Dataset

    • kaggle.com
    zip
    Updated Jul 6, 2025
    Cite
    Aamir Shahzad (2025). Medical Clean Dataset [Dataset]. https://www.kaggle.com/datasets/aamir5659/medical-clean-dataset
    Explore at:
    Available download formats: zip (1,262 bytes)
    Dataset updated
    Jul 6, 2025
    Authors
    Aamir Shahzad
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This is the cleaned version of a real-world medical dataset that was originally noisy, incomplete, and contained various inconsistencies. The dataset was cleaned through a structured and well-documented data preprocessing pipeline using Python and Pandas. Key steps in the cleaning process included:

    • Handling missing values using statistical techniques such as median imputation and mode replacement
    • Converting categorical values to consistent formats (e.g., gender formatting, yes/no standardization)
    • Removing duplicate entries to ensure data accuracy
    • Parsing and standardizing date fields
    • Creating new derived features such as age groups
    • Detecting and reviewing outliers based on IQR
    • Removing irrelevant or redundant columns
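
    The IQR-based outlier step above can be sketched with the standard library (the values below are invented, since this listing does not name the dataset's columns):

```python
import statistics

# Hypothetical numeric column with one suspicious value.
values = [4.1, 4.8, 5.0, 5.2, 5.5, 6.0, 6.3, 14.9]

# Quartiles -> IQR -> Tukey fences at 1.5 * IQR.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the fences for review (not automatic deletion).
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # the extreme value 14.9 is flagged
```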

    The purpose of cleaning this dataset was to prepare it for further exploratory data analysis (EDA), data visualization, and machine learning modeling.

    This cleaned dataset is now ready for training predictive models, generating visual insights, or conducting healthcare-related research. It provides a high-quality foundation for anyone interested in medical analytics or data science practice.

  8. Phishing URL Content Dataset

    • kaggle.com
    zip
    Updated Nov 25, 2024
    Cite
    Aaditey Pillai (2024). Phishing URL Content Dataset [Dataset]. https://www.kaggle.com/datasets/aaditeypillai/phishing-website-content-dataset
    Explore at:
    Available download formats: zip (62,701 bytes)
    Dataset updated
    Nov 25, 2024
    Authors
    Aaditey Pillai
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Phishing URL Content Dataset

    Executive Summary

    Motivation:
    Phishing attacks are one of the most significant cyber threats in today's digital era, tricking users into divulging sensitive information like passwords, credit card numbers, and personal details. This dataset aims to support research and development of machine learning models that can classify URLs as phishing or benign.

    Applications:
    - Building robust phishing detection systems.
    - Enhancing security measures in email filtering and web browsing.
    - Training cybersecurity practitioners in identifying malicious URLs.

    The dataset contains diverse features extracted from URL structures, HTML content, and website metadata, enabling deep insights into phishing behavior patterns.

    Description of Data

    This dataset comprises two types of URLs:
    1. Phishing URLs: Malicious URLs designed to deceive users.
    2. Benign URLs: Legitimate URLs posing no harm to users.

    Key Features:
    - URL-based features: Domain, protocol type (HTTP/HTTPS), and IP-based links.
    - Content-based features: Link density, iframe presence, external/internal links, and metadata.
    - Certificate-based features: SSL/TLS details like validity period and organization.
    - WHOIS data: Registration details like creation and expiration dates.
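
    A few of the URL-based features can be derived with the standard library alone; this is an illustrative sketch, not the dataset's actual extraction code (which lives in its repository):

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Extract simple URL-based features: domain, protocol type, IP-based link."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "domain": host,
        "uses_https": parsed.scheme == "https",
        # IP-based link: the host is a bare IPv4 address instead of a name.
        "ip_based": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        "url_length": len(url),
    }

print(url_features("http://192.168.0.1/login"))
print(url_features("https://example.com/account"))
```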

    Statistics:
    - Total Samples: 800 (400 phishing, 400 benign).
    - Features: 22 including URL, domain, link density, and SSL attributes.

    Power Analysis

    To ensure statistical reliability, a power analysis was conducted to determine the minimum sample size required for binary classification with 22 features. Using a medium effect size (0.15), alpha = 0.05, and power = 0.80, the analysis indicated a minimum sample size of ~325 per class. Our dataset exceeds this requirement with 400 examples per class, ensuring robust model training.
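
    The exact minimum depends on which test and effect-size convention the power analysis assumed; as one illustration (not necessarily the authors' calculation), a standard normal-approximation for detecting a standardized effect of 0.15 at alpha = 0.05 (two-sided) with power 0.80 gives a sample size in the same ballpark:

```python
from math import ceil
from statistics import NormalDist

alpha, power, effect_size = 0.05, 0.80, 0.15

# n is approximately ((z_{1-alpha/2} + z_{power}) / effect)^2
# under the normal approximation.
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_power = NormalDist().inv_cdf(power)
n = ceil(((z_alpha + z_power) / effect_size) ** 2)
print(n)
```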

    Exploratory Data Analysis (EDA)

    Insights from EDA:
    • Distribution Plots: Histograms and density plots for numerical features like link density, URL length, and iframe counts.
    • Bar Plots: Class distribution and protocol usage trends.
    • Correlation Heatmap: Highlights relationships between numerical features to identify multicollinearity or strong patterns.
    • Box Plots: For SSL certificate validity and URL lengths, comparing phishing versus benign URLs.

    EDA visualizations are provided in the repository.

    Link to Publicly Available Data and Code

    The repository contains the Python code used to extract features, conduct EDA, and build the dataset.

    Ethics Statement

    Phishing detection datasets must balance the need for security research with the risk of misuse. This dataset:
    1. Protects User Privacy: No personally identifiable information is included.
    2. Promotes Ethical Use: Intended solely for academic and research purposes.
    3. Avoids Reinforcement of Bias: Balanced class distribution ensures fairness in training models.

    Risks:
    - Misuse of the dataset for creating more deceptive phishing attacks.
    - Over-reliance on outdated features as phishing tactics evolve.

    Researchers are encouraged to pair this dataset with continuous updates and contextual studies of real-world phishing.

    Open Source License

    This dataset is shared under the MIT License, allowing free use, modification, and distribution for academic and non-commercial purposes. License details can be found here.

  9. Demon Slayer Dataset-EDA Ready

    • kaggle.com
    zip
    Updated Dec 1, 2025
    Cite
    Laksh Lukhi (2025). Demon Slayer Dataset-EDA Ready [Dataset]. https://www.kaggle.com/datasets/lukhilaksh/demon-slayer-dataset
    Explore at:
    Available download formats: zip (6,063 bytes)
    Dataset updated
    Dec 1, 2025
    Authors
    Laksh Lukhi
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Demon Slayer Character Dataset (Kimetsu no Yaiba)

    Explore a comprehensive dataset of all known characters from the popular anime and manga series Kimetsu no Yaiba (Demon Slayer). This dataset includes detailed information on main, supporting, and side characters, capturing their attributes, affiliations, abilities, and story roles.

    ✨ Key Features:

    • Demographics: Gender, Age, Height, Weight, Race
    • Affiliations & Family: Demon Slayer Corps, Hashira, Demon family, Mentors
    • Abilities & Combat: Breathing styles, Weapons, Demon Arts, Special Abilities
    • Story Information: First and Last Appearance in manga/anime, Allies, Enemies, Role in Story
    • Personality & Traits: Personality description, Voice actors (Japanese & English)

    💡 Use Cases:

    • Data analysis and visualization for anime fandom projects
    • Exploratory data analysis (EDA) for demographics, abilities, and affiliations
    • Network analysis of character relationships, allies, and enemies
    • Machine learning or AI projects exploring anime character patterns

    This dataset is ideal for anime enthusiasts, data scientists, and researchers who want to analyze character traits, relationships, and progression across the Demon Slayer universe. All data is structured and EDA-ready, making it easy to integrate into Python, R, or any data analysis tool.

    📌 Note: All characters included are from both the anime and manga series, ensuring comprehensive coverage.

  10. Bird Migration Dataset (Data Visualization / EDA)

    • kaggle.com
    zip
    Updated May 13, 2025
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Sahir Maharaj (2025). Bird Migration Dataset (Data Visualization / EDA) [Dataset]. https://www.kaggle.com/datasets/sahirmaharajj/bird-migration-dataset-data-visualization-eda
    Explore at:
    Available download formats: zip (3,249,826 bytes)
    Dataset updated
    May 13, 2025
    Authors
    Sahir Maharaj
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains 10,000 synthetic records simulating the migratory behavior of various bird species across global regions. Each entry represents a single bird tagged with a tracking device and includes detailed information such as flight distance, speed, altitude, weather conditions, tagging information, and migration outcomes.

    The data was entirely synthetically generated using randomized yet realistic values based on known ranges from ornithological studies. It is ideal for practicing data analysis and visualization techniques without privacy concerns or real-world data access restrictions. Because it's artificial, the dataset can be freely used in education, portfolio projects, demo dashboards, machine learning pipelines, or business intelligence training.

    With over 40 columns, this dataset supports a wide array of analysis types. Analysts can explore questions like "Do certain species migrate in larger flocks?", "How does weather impact nesting success?", or "What conditions lead to migration interruptions?". Users can also perform geospatial mapping of start and end locations, cluster birds by behavior, or build time series models based on migration months and environmental factors.

    For data visualization, tools like Power BI, Python (Matplotlib/Seaborn/Plotly), or Excel can be used to create insightful dashboards and interactive charts.

    Join the Fabric Community DataViz Contest | May 2025: https://community.fabric.microsoft.com/t5/Power-BI-Community-Blog/%EF%B8%8F-Fabric-Community-DataViz-Contest-May-2025/ba-p/4668560

  11. AI Assistant Usage in Student Life

    • kaggle.com
    Updated Jun 25, 2025
    Cite
    Ayesha Saleem (2025). AI Assistant Usage in Student Life [Dataset]. https://www.kaggle.com/datasets/ayeshasal89/ai-assistant-usage-in-student-life-synthetic/code
    Explore at:
    Croissant – a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jun 25, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ayesha Saleem
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description
    If you find this dataset useful, a quick upvote would be greatly appreciated 🙌 It helps more learners discover it!

    AI Assistant Usage in Student Life

    Explore how students at different academic levels use AI tools like ChatGPT for tasks such as coding, writing, studying, and brainstorming. Designed for learning, EDA, and ML experimentation.

    What is this dataset?

    This dataset simulates 10,000 sessions of students interacting with an AI assistant (like ChatGPT or similar tools) for various academic tasks. Each row represents a single session, capturing the student's level, discipline, type of task, session length, AI effectiveness, satisfaction rating, and whether they reused the AI tool later.

    Why was this dataset created?

    As AI tools become mainstream in education, there's a need to analyze and model how students interact with them. However, no public datasets exist for this behavior. This dataset fills that gap by providing a safe, fully synthetic yet realistic simulation for:

    • EDA and visualization practice
    • Machine learning modeling
    • Feature engineering workflows
    • Educational data science exploration

    It's ideal for students, data science learners, and researchers who want real-world use cases without privacy or copyright constraints.

    How is the dataset structured?

    • SessionID – Unique session identifier
    • StudentLevel – Academic level: High School, Undergraduate, Graduate
    • Discipline – Student's field of study (e.g., CS, Psychology, etc.)
    • SessionDate – Date of the session
    • SessionLengthMin – Length of AI interaction in minutes
    • TotalPrompts – Number of prompts/messages used
    • TaskType – Nature of the task (e.g., Coding, Writing, Research)
    • AI_AssistanceLevel – 1–5 scale on how helpful the AI was perceived to be
    • FinalOutcome – What the student achieved: Assignment Completed, Idea Drafted, etc.
    • UsedAgain – Whether the student returned to use the assistant again
    • SatisfactionRating – 1–5 rating of overall satisfaction with the session

    All data is synthetically generated using controlled distributions, real-world logic, and behavioral modeling to reflect realistic usage patterns.

    Possible Use Cases

    This dataset is rich with potential for:

    • EDA: Visualize session behavior across levels, tasks, or disciplines
    • Classification: Predict likelihood of reuse (UsedAgain) or final outcome
    • Regression: Model satisfaction or session length based on context
    • Clustering: Segment students by AI interaction behavior
    • Feature engineering practice: Derive prompt density, session efficiency, or task difficulty
    • Survey-style analysis: Discover what makes students satisfied or frustrated
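
    The feature-engineering bullet can be made concrete with plain Python; the two sessions below are invented, following the documented columns:

```python
# Hypothetical sessions using the columns described in this listing.
sessions = [
    {"SessionID": "S1", "SessionLengthMin": 30, "TotalPrompts": 15},
    {"SessionID": "S2", "SessionLengthMin": 10, "TotalPrompts": 2},
]

# Derived feature: prompt density = prompts per minute of interaction.
for s in sessions:
    s["PromptDensity"] = s["TotalPrompts"] / s["SessionLengthMin"]

densities = [s["PromptDensity"] for s in sessions]
print(densities)  # [0.5, 0.2]
```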

    Key Features

    • Clean and ready-to-use CSV
    • Balanced and realistic distributions
    • No missing values
    • Highly relatable academic context

  12. 2025 Kaggle Machine Learning & Data Science Survey

    • kaggle.com
    Updated Jan 28, 2025
    Cite
    Hina Ismail (2025). 2025 Kaggle Machine Learning & Data Science Survey [Dataset]. https://www.kaggle.com/datasets/sonialikhan/2025-kaggle-machine-learning-and-data-science-survey
    Explore at:
    Croissant – a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jan 28, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Hina Ismail
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    Welcome to Kaggle's second annual Machine Learning and Data Science Survey – and our first-ever survey data challenge.

    This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!

    There's a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We've published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.

    Challenge

    This year Kaggle is launching the first Data Science Survey Challenge, where we will be awarding a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.

    In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities represented within the survey. For that reason, we're inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.

    The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A "story" could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

    Submissions will be evaluated on the following:

    Composition – Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
    Originality – Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.
    Documentation – Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.
    To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science Survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.

    While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.

    How to Participate To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.

    No submission is necessary for the Weekly Kernel Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.

    Timeline All deadlines are 11:59 PM UTC.

    Submission deadline: December 3rd

    Winners announced: December 10th

    Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th

    All kernels are evaluated after the deadline.

    Rules To be eligible to win a prize in either of the above prize tracks, you must be:

    a registered account holder at Kaggle.com; the older of 18 years old or the age of majority in your jurisdiction of residence; and not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan. Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.

    Unfortunately employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.

    Survey Methodology ...

  13. Toy Dataset

    • kaggle.com
    zip
    Updated Dec 10, 2018
    Carlo Lepelaars (2018). Toy Dataset [Dataset]. https://www.kaggle.com/datasets/carlolepelaars/toy-dataset
    Explore at:
    zip (1184308 bytes). Available download formats
    Dataset updated
    Dec 10, 2018
    Authors
    Carlo Lepelaars
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    A fictional dataset for exploratory data analysis (EDA) and to test simple prediction models.

    This toy dataset features 150000 rows and 6 columns.

    Columns

    Note: All data is fictional. The data has been generated so that the distributions are convenient for statistical analysis.

    Number: A simple index number for each row

    City: The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)

    Gender: Gender of a person (Male or Female)

    Age: The age of a person (Ranging from 25 to 65 years)

    Income: Annual income of a person (Ranging from -674 to 177175)

    Illness: Is the person Ill? (Yes or No)
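
    A minimal first-pass EDA sketch for a frame with these six columns; a few synthetic rows stand in for the downloaded CSV, and the values are illustrative only:

```python
import pandas as pd

# A few synthetic rows with the same six columns as the toy dataset
df = pd.DataFrame({
    "Number": [1, 2, 3, 4],
    "City": ["Dallas", "Boston", "Austin", "San Diego"],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [34, 52, 29, 61],
    "Income": [72000, 98000, 45000, 120000],
    "Illness": ["No", "Yes", "No", "No"],
})

# Typical starting points: summary statistics and group comparisons
print(df["Income"].describe())
print(df.groupby("City")["Income"].mean())
print(df["Illness"].value_counts(normalize=True))
```

    The same calls apply unchanged once the real 150000-row CSV is loaded with pd.read_csv.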

    Acknowledgements

    Stock photo by Mika Baumeister on Unsplash.

  14. Preventive Maintenance for Marine Engines

    • kaggle.com
    zip
    Updated Feb 12, 2025
    Fijabi J. Adekunle (2025). Preventive Maintenance for Marine Engines [Dataset]. https://www.kaggle.com/datasets/jeleeladekunlefijabi/preventive-maintenance-for-marine-engines
    Explore at:
    zip (436025 bytes). Available download formats
    Dataset updated
    Feb 12, 2025
    Authors
    Fijabi J. Adekunle
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Preventive Maintenance for Marine Engines: Data-Driven Insights

    Introduction:

    Marine engine failures can lead to costly downtime, safety risks, and operational inefficiencies. This project leverages machine learning to predict maintenance needs, helping ship operators prevent unexpected breakdowns. Using a simulated dataset, we analyze key engine parameters and develop predictive models to classify maintenance status into three categories: Normal, Requires Maintenance, and Critical.

    Overview This project explores preventive maintenance strategies for marine engines by analyzing operational data and applying machine learning techniques.

    Key steps include: 1. Data Simulation: Creating a realistic dataset with engine performance metrics. 2. Exploratory Data Analysis (EDA): Understanding trends and patterns in engine behavior. 3. Model Training & Evaluation: Comparing machine learning models (Decision Tree, Random Forest, XGBoost) to predict maintenance needs. 4. Hyperparameter Tuning: Using GridSearchCV to optimize model performance.
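
    Steps 3 and 4 can be sketched with scikit-learn; the features and labels below are random stand-ins for the engine parameters, and the parameter grid is illustrative (XGBoost follows the same fit/predict pattern):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-ins for engine parameters (e.g. temperature, vibration)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)  # 0=Normal, 1=Requires Maintenance, 2=Critical

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 4: hyperparameter tuning with GridSearchCV over a small grid
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))
```

    On purely random labels the accuracy hovers near chance, which echoes the ~35% figure reported below for this hard prediction task.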

    Tools Used 1. Python: Data processing, analysis and modeling 2. Pandas & NumPy: Data manipulation 3. Scikit-Learn & XGBoost: Machine learning model training 4. Matplotlib & Seaborn: Data visualization

    Skills Demonstrated ✔ Data Simulation & Preprocessing ✔ Exploratory Data Analysis (EDA) ✔ Feature Engineering & Encoding ✔ Supervised Machine Learning (Classification) ✔ Model Evaluation & Hyperparameter Tuning

    Key Insights & Findings 📌 Engine Temperature & Vibration Level: Strong indicators of potential failures. 📌 Random Forest vs. XGBoost: After hyperparameter tuning, both models achieved comparable performance, with Random Forest performing slightly better. 📌 Maintenance Status Distribution: Balanced dataset ensures unbiased model training. 📌 Failure Modes: The most common issues were Mechanical Wear & Oil Leakage, aligning with real-world engine failure trends.

    Challenges Faced 🚧 Simulating Realistic Data: Ensuring the dataset reflects real-world marine engine behavior was a key challenge. 🚧 Model Performance: The accuracy was limited (~35%) due to the complexity of failure prediction. 🚧 Feature Selection: Identifying the most impactful features required extensive analysis.

    Call to Action 🔍 Explore the Dataset & Notebook: Try running different models and tweaking hyperparameters. 📊 Extend the Analysis: Incorporate additional sensor data or alternative machine learning techniques. 🚀 Real-World Application: This approach can be adapted for industrial machinery, aircraft engines, and power plants.

  15. Newborn Health Monitoring Dataset

    • kaggle.com
    Updated Aug 21, 2025
    Arif Miah (2025). Newborn Health Monitoring Dataset [Dataset]. https://www.kaggle.com/datasets/miadul/newborn-health-monitoring-dataset
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📌 Introduction

    This dataset is a synthetic yet realistic simulation of newborn baby health monitoring.
    It is designed for healthcare analytics, machine learning, and app development, especially for early detection of newborn health risks.

    The dataset mimics daily health records of newborn babies, including vital signs, growth parameters, feeding patterns, and risk classification labels.

    🎯 Motivation

    Newborn health is one of the most sensitive areas of healthcare.
    Monitoring newborns can help detect jaundice, infections, dehydration, and respiratory issues early.

    Since real newborn data is private and hard to access, this dataset provides a safe and realistic alternative for researchers, students, and developers to build and test:
    - 📊 Exploratory Data Analysis (EDA)
    - 🤖 Machine Learning classification models
    - 📱 Healthcare monitoring apps (Streamlit, Flask, Django, etc.)
    - 🏥 Predictive healthcare systems

    📂 Dataset Overview

    • Total Babies: 100
    • Monitoring Period: 30 days per baby
    • Total Records: 3,000
    • File Format: CSV
    • Synthetic Data: Generated using Python (pandas, numpy, faker) with medically informed rules

    📑 Column Description

    🔹 Demographics

    • baby_id → Unique identifier for each baby (e.g., B001).
    • name → Randomly generated baby first name (for realism).
    • gender → Male / Female.
    • gestational_age_weeks → Gestational age at birth (normal: 37–42 weeks).
    • birth_weight_kg → Birth weight (normal range: 2.5–4.5 kg).
    • birth_length_cm → Length at birth (avg: 48–52 cm).
    • birth_head_circumference_cm → Head circumference at birth (avg: 33–35 cm).

    🔹 Daily Monitoring

    • date → Monitoring date.
    • age_days → Age of baby in days since birth.
    • weight_kg → Daily updated weight (growth trend ~25–30 g/day).
    • length_cm → Daily updated body length (slow increase).
    • head_circumference_cm → Daily updated head circumference.
    • temperature_c → Body temperature in °C (normal: 36.5–37.5 °C).
    • heart_rate_bpm → Heart rate (normal: 120–160 bpm).
    • respiratory_rate_bpm → Breathing rate (normal: 30–60 breaths/min).
    • oxygen_saturation → SpO₂ level (normal >95%).

    🔹 Feeding & Hydration

    • feeding_type → Breastfeeding / Formula / Mixed.
    • feeding_frequency_per_day → Number of feeds per day (normal: 8–12).
    • urine_output_count → Wet diapers/day (normal: 6–8+).
    • stool_count → Bowel movements per day (0–5 is common).

    🔹 Medical Screening

    • jaundice_level_mg_dl → Bilirubin level (normal <5, mild 5–12, severe >15).
    • apgar_score → 0–10 score at birth (only day 1).
    • immunizations_done → Yes/No (BCG, HepB, OPV on Day 1 & 30).
    • reflexes_normal → Newborn reflex check (Yes/No).

    🔹 Risk Classification

    • risk_level → Automatically assigned health status:
      • ✅ Healthy → All vitals normal.
      • ⚠️ At Risk → Mild abnormalities (e.g., mild jaundice, slight fever, SpO₂ 92–95%).
      • 🚨 Critical → Severe abnormalities (e.g., jaundice >15, SpO₂ <92, HR >180, temp >39 °C).
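
    The thresholds above can be sketched as a rule function. The Critical cutoffs are the ones listed; the At Risk boundaries (especially the fever cutoff) are assumptions filled in from the stated normal ranges, since the exact generator rules are not published:

```python
def classify_risk(jaundice_mg_dl, spo2, heart_rate_bpm, temperature_c):
    """Rule-based risk label mirroring the dataset's described thresholds.

    The At Risk boundaries are illustrative assumptions, not the
    dataset's exact generation logic.
    """
    # Critical: severe abnormalities listed above
    if jaundice_mg_dl > 15 or spo2 < 92 or heart_rate_bpm > 180 or temperature_c > 39:
        return "Critical"
    # At Risk: mild jaundice (>=5), SpO2 92-95%, or a slight fever (assumed >37.5)
    if jaundice_mg_dl >= 5 or spo2 <= 95 or temperature_c > 37.5:
        return "At Risk"
    return "Healthy"

print(classify_risk(3.0, 98, 140, 37.0))   # Healthy
print(classify_risk(8.0, 96, 140, 37.0))   # At Risk
print(classify_risk(16.0, 98, 140, 37.0))  # Critical
```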

    📊 How Data Was Generated

    The dataset was generated in Python using:
    - numpy and pandas for data simulation.
    - faker for generating baby names and dates.
    - Medically realistic rules for vitals, growth, jaundice progression, and risk classification.

    💡 Potential Applications

    • Machine Learning: Train classification models to predict newborn health risks.
    • Streamlit/Dash Apps: Build real-time newborn monitoring dashboards.
    • Healthcare Research: Study growth and vital sign patterns.
    • Education: Practice EDA, visualization, and predictive modeling on health datasets.

    📬 Author & Contact

    Created by [Arif Miah]
    I am passionate about AI, Healthcare Analytics, and App Development.
    You can connect with me:

    ⚠️ Disclaimer

    This is a synthetic dataset created for educational and research purposes only.
    It should NOT be used for actual medical diagnosis or treatment decisions.

  16. Titanic-json-format

    • kaggle.com
    zip
    Updated Sep 21, 2025
    Abdul Basit AI (2025). Titanic-json-format [Dataset]. https://www.kaggle.com/datasets/engrbasit62/titanic-json-format
    Explore at:
    zip (33844 bytes). Available download formats
    Dataset updated
    Sep 21, 2025
    Authors
    Abdul Basit AI
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    πŸ›³οΈ Titanic Dataset (JSON Format) πŸ“Œ Overview

    This is the classic Titanic: Machine Learning from Disaster dataset, converted into JSON format for easier use in APIs, data pipelines, and Python projects. It contains the same passenger details as the original CSV version, but stored as JSON for convenience.

    πŸ“‚ Dataset Contents

    File: titanic.json

    Columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked

    Use Cases: Exploratory Data Analysis (EDA), feature engineering, machine learning model training, web app backends, JSON parsing practice.

    πŸ› οΈ How to Use πŸ”Ή 1. Load with kagglehub import kagglehub

    Download the latest version of the dataset

    path = kagglehub.dataset_download("engrbasit62/titanic-json-format") print("Path to dataset files:", path)

    πŸ”Ή 2. Load into Pandas import pandas as pd

    Read the JSON file into a DataFrame

    df = pd.read_json(f"{path}/titanic.json")

    print(df.head())

    💡 Notes

    Preview truncation: Kaggle may show only part of the JSON in the preview panel because of its size. ✅ Don't worry – the full dataset is available when loaded via code.

    Benefits of JSON format: Ideal for web apps, APIs, or projects that work with structured data. Easily convertible back to CSV if needed.
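
    Converting back to CSV is a one-liner with pandas; a tiny inline record (with the dataset's column names) stands in for titanic.json here:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for titanic.json (same column names as the dataset)
records = (
    '[{"PassengerId": 1, "Survived": 0, "Pclass": 3,'
    ' "Name": "Braund, Mr. Owen Harris", "Sex": "male", "Age": 22}]'
)
df = pd.read_json(StringIO(records))

# Convert back to CSV if needed (pass a filename to write to disk instead)
csv_text = df.to_csv(index=False)
print(csv_text)
```

    Note that to_csv quotes the Name field automatically, since it contains a comma.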

  17. Amazone book data

    • kaggle.com
    zip
    Updated Jul 22, 2025
    Md Kaif Raza (2025). Amazone book data [Dataset]. https://www.kaggle.com/datasets/mdkaifraza123/amazone-book-data
    Explore at:
    zip (16707 bytes). Available download formats
    Dataset updated
    Jul 22, 2025
    Authors
    Md Kaif Raza
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset has been cleaned and preprocessed from its raw form using Python in a Jupyter Notebook. The following steps were taken during cleaning:

    1. Removed missing or null values
    2. Standardized column names and data formats
    3. Filtered out outliers or irrelevant rows
    4. Converted categorical variables where needed

    This file is ready for further exploratory data analysis (EDA), visualization, or machine learning tasks.
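
    The four cleaning steps can be sketched with pandas; the column names and values below are hypothetical stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Illustrative raw frame; these column names are hypothetical
raw = pd.DataFrame({
    " Book Title ": ["A", "B", None, "D"],
    "Price": [250.0, 9999.0, 300.0, None],
    "Genre": ["fiction", "Fiction", "history", "History"],
})

# 1. Remove missing or null values
df = raw.dropna()

# 2. Standardize column names and data formats
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# 3. Filter out outliers or irrelevant rows (here: an arbitrary price cap)
df = df[df["price"] < 5000].copy()

# 4. Convert categorical variables where needed
df["genre"] = df["genre"].str.title().astype("category")

print(df)
```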

  18. List Of Beverages with Recipe

    • kaggle.com
    zip
    Updated Jan 5, 2025
    khansahab24 (2025). List Of Beverages with Recipe [Dataset]. https://www.kaggle.com/datasets/khansahab24/list-of-beverages-with-recipe/discussion
    Explore at:
    zip (3029 bytes). Available download formats
    Dataset updated
    Jan 5, 2025
    Authors
    khansahab24
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context: The data was created for building machine learning projects and for analysis. The dataset is useful for performing exploratory data analysis (EDA).

    Source: The CSV data was scraped from https://www.vegrecipesofindia.com/recipes/beverages/ and the dataset was created using the BeautifulSoup Python library.

    Inspiration: You can use the dataset for analysis.

  19. RAPIDO_DATA_2025

    • kaggle.com
    zip
    Updated Oct 9, 2025
    vengatesh vengat (2025). RAPIDO_DATA_2025 [Dataset]. https://www.kaggle.com/datasets/vengateshvengat/rapido-all-data
    Explore at:
    zip (1022138 bytes). Available download formats
    Dataset updated
    Oct 9, 2025
    Authors
    vengatesh vengat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🚖 Rapido Ride Data – July 2025

    📘 Overview

    This dataset contains simulated Rapido ride data for July 2025, designed for data analysis, business intelligence, and machine learning use cases. It represents daily ride operations including customer bookings, driver performance, revenue generation, and service quality insights.

    🎯 Purpose

    The goal of this dataset is to help analysts and learners explore real-world mobility analytics. You can use it to:

    Build interactive dashboards (Power BI, Tableau, Excel)

    Perform exploratory data analysis (EDA)

    Create KPI reports and trend visualizations

    Train models for demand forecasting or cancellation prediction

    📂 Dataset Details

    The dataset includes realistic, time-based entries covering one month of operations.

    Column Name – Description
    ride_id – Unique ID for each ride
    ride_date – Date of the ride (July 2025)
    pickup_time – Ride start time
    drop_time – Ride end time
    ride_duration – Duration of the ride (minutes)
    distance_km – Distance travelled (in kilometers)
    fare_amount – Fare charged to customer
    payment_mode – Type of payment (Cash, UPI, Card)
    driver_id – Unique driver identifier
    customer_id – Unique customer identifier
    driver_rating – Rating given by customer
    customer_rating – Rating given by driver
    ride_status – Completed, Cancelled by Driver, Cancelled by Customer
    city – City where ride took place
    ride_type – Bike, Auto, or Cab
    waiting_time – Waiting time before ride started
    promo_used – Yes/No for discount applied
    cancellation_reason – Reason if ride cancelled
    revenue – Net revenue earned per ride

    📊 Key Insights You Can Explore

    🕒 Ride demand patterns by day & hour

    📅 Cancellations by weekday/weekend

    🚦 Driver performance & customer satisfaction

    💰 Revenue trends and top-performing drivers

    🌆 City-wise ride distribution
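
    The first two insights reduce to pandas groupbys over the columns listed above; a few synthetic rows stand in for the CSV here:

```python
import pandas as pd

# Synthetic rows using the column names listed above
df = pd.DataFrame({
    "ride_date": pd.to_datetime(["2025-07-01", "2025-07-01", "2025-07-05", "2025-07-06"]),
    "pickup_time": ["08:15", "18:40", "09:05", "18:55"],
    "ride_status": ["Completed", "Cancelled by Customer", "Completed", "Completed"],
    "revenue": [85.0, 0.0, 120.0, 95.0],
})

# Ride demand by pickup hour
df["pickup_hour"] = df["pickup_time"].str.slice(0, 2).astype(int)
print(df.groupby("pickup_hour").size())

# Cancellation rate, weekday vs weekend
df["is_weekend"] = df["ride_date"].dt.dayofweek >= 5
cancel_rate = (
    df["ride_status"].str.startswith("Cancelled").groupby(df["is_weekend"]).mean()
)
print(cancel_rate)
```

    The same pattern extends to the other insights (group by city, driver_id, or ride_type and aggregate revenue or ratings).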

    🧠 Suitable For

    Data cleaning & transformation practice

    Power BI / Excel dashboard building

    SQL analysis & reporting

    Predictive modeling (e.g., cancellation prediction, fare forecasting)

    βš™οΈ Tools You Can Use

    Power BI – For KPI dashboards & visuals

    Excel – For pivot tables & charts

    Python / Pandas – For EDA and ML

    SQL – For query-based insights

    💡 Acknowledgment

    This dataset is synthetically generated for educational and analytical purposes. It does not represent actual Rapido data.

  20. Movies Dataset β€” Ratings, Release Dates & Origins

    • kaggle.com
    zip
    Updated Nov 4, 2025
    purv ghediya (2025). Movies Dataset β€” Ratings, Release Dates & Origins [Dataset]. https://www.kaggle.com/datasets/purvghediya/movies-dataset-ratings-release-dates-and-origins
    Explore at:
    zip (2539 bytes). Available download formats
    Dataset updated
    Nov 4, 2025
    Authors
    purv ghediya
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains cleaned and structured information about popular movies. It was processed using Python and Pandas to remove null values, fix inconsistent formats, and convert date columns to proper datetime types.

    The dataset includes attributes such as:

    🎬 Movie title

    ⭐ Average rating

    πŸ—“οΈ Release date (converted to datetime)

    🌍 Country of origin

    πŸ—£οΈ Spoken languages

    This cleaned dataset can be used for:

    Exploratory Data Analysis (EDA)

    Visualization practice

    Machine Learning experiments

    Data cleaning and preprocessing tutorials
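
    The cleaning described above (null removal, datetime conversion) might look like this in pandas; the column names and rows are illustrative stand-ins, not the dataset's actual schema:

```python
import pandas as pd

# Illustrative raw rows with the attributes listed above
raw = pd.DataFrame({
    "title": ["Inception", "Parasite", None],
    "rating": [8.8, 8.5, 7.0],
    "release_date": ["2010-07-16", "2019-05-30", "2001-01-01"],
    "country": ["USA", "South Korea", "UK"],
})

# Remove null values, then convert the date column to a proper datetime type
df = raw.dropna().copy()
df["release_date"] = pd.to_datetime(df["release_date"])

print(df.dtypes["release_date"])
print(df["release_date"].dt.year.tolist())
```

    With a real datetime dtype, the .dt accessor unlocks release-year and seasonality analysis directly.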

    Source: IMDb Top Movies (via API / educational purpose)

    Last Updated: November 2025


Ecommerce Dataset for Data Analysis

Exploratory Data Analysis, Data Visualisation and Machine Learning


This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

This is not a real dataset; it was generated using Python's Faker library for the sole purpose of learning.
