16 datasets found
  1. Data Cleaning Sample

    • borealisdata.ca
    • dataone.org
    Updated Jul 13, 2023
    Cite
    Rong Luo (2023). Data Cleaning Sample [Dataset]. http://doi.org/10.5683/SP3/ZCN177
    Explore at:
    175 scholarly articles cite this dataset (Google Scholar)
    Croissant: a format for machine-learning datasets. Learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 13, 2023
    Dataset provided by
    Borealis
    Authors
    Rong Luo
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Sample data for exercises in Further Adventures in Data Cleaning.

  2. asssing and cleaning datasets pandas

    • kaggle.com
    zip
    Updated Sep 5, 2025
    Cite
    naresh1502 (2025). asssing and cleaning datasets pandas [Dataset]. https://www.kaggle.com/datasets/naresh1502/asssing-and-cleaning-datasets-pandas
    Explore at:
    Available download formats: zip (37261 bytes)
    Dataset updated
    Sep 5, 2025
    Authors
    naresh1502
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset contains synthetic (fake) clinical data created solely for educational purposes. It is designed to help learners practice data cleaning, preprocessing, and exploratory data analysis using Python, Pandas, or other data science tools.

    **⚠️ Disclaimer:** All the records in this dataset are randomly generated and do NOT represent any real individuals, patients, or organizations. Any resemblance to actual persons, living or dead, is purely coincidental. This dataset is safe to use publicly for tutorials, projects, and demonstrations.

    Use Cases:

    • Data cleaning and transformation practice with Pandas or Excel
    • Exploratory Data Analysis (EDA)
    • Learning how to handle missing values, duplicates, and data inconsistencies
    • Practice for academic projects or YouTube tutorials
    • Building machine learning pipelines with safe dummy data

    Dataset Structure:

    • patient_id: Unique ID for each dummy patient
    • assigned_sex: Gender (Male/Female)
    • given_name: Randomly generated first name
    • surname: Randomly generated last name
    • address: Fake street address for demonstration
    • city: Random synthetic city name
    • state: State code (e.g., CA, TX, NY)
    • zip_code: Fake 5-digit ZIP code
    • country: Country (set as "United States" or similar placeholder)
    • contact: Fake phone number + email format
    • birthdate: Randomly generated birthdate (1970–2000)
    • weight: Weight of patient (kg)
    • height: Height of patient (inches/cm)
    • bmi: Calculated Body Mass Index
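    As a quick illustration, here is a minimal Pandas cleaning sketch for data shaped like this; the file name is hypothetical and only the columns listed above are assumed:

```python
import pandas as pd

df = pd.read_csv("patients.csv")  # hypothetical file name

# Drop exact duplicate rows, then duplicate patient IDs
df = df.drop_duplicates()
df = df.drop_duplicates(subset="patient_id")

# Parse birthdates into datetimes; unparseable entries become NaT
df["birthdate"] = pd.to_datetime(df["birthdate"], errors="coerce")

# Normalize the sex column and report what is still missing
df["assigned_sex"] = df["assigned_sex"].str.strip().str.title()
print(df.isna().sum())
```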

  3. Video tutorial on data literacy training | gimi9.com

    • gimi9.com
    Updated Mar 23, 2025
    Cite
    (2025). Video tutorial on data literacy training | gimi9.com [Dataset]. https://gimi9.com/dataset/mekong_video-tutorial-on-data-literacy-training/
    Explore at:
    Dataset updated
    Mar 23, 2025
    Description

    This video series, organized by the Open Development Cambodia Organization (ODC), presents an introduction and 11 lessons on data literacy and the use of data in data storytelling. The 12 videos cover the following sessions:

    • Introduction to the data literacy course
    • Lesson 1: Understanding data
    • Lesson 2: Explore data tables and data products
    • Lesson 3: Advanced Google Search
    • Lesson 4: Navigating data portals and validating data
    • Lesson 5: Common data formats
    • Lesson 6: Data standards
    • Lesson 7: Data cleaning with Google Sheets
    • Lesson 8: Basic statistics
    • Lesson 9: Basic data analysis using Google Sheets
    • Lesson 10: Data visualization
    • Lesson 11: Data visualization with Flourish

  4. Surveys of Data Professionals (Alex the Analyst)

    • kaggle.com
    zip
    Updated Nov 27, 2023
    Cite
    Stewie (2023). Surveys of Data Professionals (Alex the Analyst) [Dataset]. https://www.kaggle.com/datasets/alexenderjunior/surveys-of-data-professionals-alex-the-analyst
    Explore at:
    Available download formats: zip (81050 bytes)
    Dataset updated
    Nov 27, 2023
    Authors
    Stewie
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    [Dataset Name] - About This Dataset

    Overview

    This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.

    Context

    The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.

    Content

    The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].

    Acknowledgements

    The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.

    Citation

    If you use this dataset in your work, please cite it as follows:

    How to Use

    1. Download the dataset from this link.
    2. Explore the Jupyter Notebook in the associated repository for insights into the data cleaning process.

    Feel free to reach out for any additional information or clarification. Happy analyzing!

  5. Bookstore Financial Dataset 2019-2024 Calgary

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Gabrielle Charlton-Wells (2025). Bookstore Financial Dataset 2019-2024 Calgary [Dataset]. https://www.kaggle.com/datasets/gabriellecharlton/bookstore-financial-dataset-2019-2024-calgary
    Explore at:
    Available download formats: zip (1673906 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Gabrielle Charlton-Wells
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Area covered
    Calgary
    Description

    📖 Overview

    This dataset represents a medium-sized Canadian bookstore business operating three retail locations across Calgary (Downtown, NW, SE) and a central warehouse.

    It covers 2019 to 2024, including the COVID-19 impact years (2020-2021) and the post-pandemic recovery with inflation-adjusted growth. The data integrates finance, operations, HR, and customer analytics, making it well suited to data portfolio projects, KPI tracking, and realistic bookkeeping simulations.

    🧾 Files -> 12

    The dataset contains 10 CSV files of business data, a data_dictionary.csv describing every column, and a markdown README file:

    1. Bookstore Checking Balanced Dataset.csv: Daily bank account transactions (deposits, withdrawals, rolling balance).
    2. Bookstore Credit Balance Dataset.csv: Daily credit-card transactions (charges, payments, rolling balance).
    3. bookstore_sales.csv: Daily revenue by store and sales channel (gross / net / GST breakdown).
    4. bookstore_inventory.csv: Monthly warehouse-to-store inventory transfers and reorder levels.
    5. bookstore_employees_expanded.csv: Employee roster with department, role, employment type, and wages.
    6. bookstore_payroll_expanded.csv: Detailed payroll records with gross / net pay, deductions, and taxes.
    7. bookstore_loans.csv: Quarterly business loan balances, interest, and repayments (CEBA-style + LOC).
    8. bookstore_customers.csv: Clean customer file for customer-lifetime-value (LTV) analysis.
    9. bookstore_customers_expanded.csv: Expanded version of the customer dataset, including customer ratings.
    10. bookstore_customers_expanded_raw.csv: Messy version of the expanded customer dataset (duplicates, NA values) for data-cleaning exercises.
    11. data_dictionary.csv: Definitions of every column across all CSVs.
    12. README.md: Narrative summary and generation notes.

    🧮 Key Features

    Time span: 2019 – 2024

    Locations: Calgary -> Downtown (DT), NW, SE

    Currency: Canadian Dollars (CAD)

    Tax context: Alberta GST 5 %, no provincial PST

    Inflation factor: 1.00 → 1.18 (2019 → 2024) applied to payroll, sales, and loan interest

    💡 Example Analyses

    • Financial forecasting: Model cash flow trends using rolling bank balances.
    • Payroll cost analysis: Visualize seasonal vs. permanent staff expenses.
    • Sales forecasting: Fit time-series models by store/channel (e.g., ARIMA, Prophet).
    • Customer analytics: Segment LTV, churn probability, or satisfaction scores.
    • Data-cleaning demonstration: Compare bookstore_customers_expanded.csv vs. bookstore_customers_expanded_raw.csv (see the sketch below).
    • Loan amortization & interest modeling: Analyze repayment structures over time.
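    A minimal sketch of that cleaning comparison, using only the file names listed above (the concrete cleaning steps are an assumption; inspect the raw file first):

```python
import pandas as pd

clean = pd.read_csv("bookstore_customers_expanded.csv")
raw = pd.read_csv("bookstore_customers_expanded_raw.csv")

print(f"raw: {len(raw)} rows, clean: {len(clean)} rows")
print("duplicate rows in raw:", raw.duplicated().sum())
print("missing values per column:")
print(raw.isna().sum())

# One plausible first pass: drop duplicates, then rows that are entirely empty;
# column-specific imputation would follow from inspection
tidied = raw.drop_duplicates().dropna(how="all")
```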

    🧠 Intended Use

    This dataset is fully synthetic and designed for:

    - Business intelligence dashboards
    - Machine learning demos (forecasting, regression, clustering)
    - Financial and accounting analysis training
    - Data-cleaning and EDA (Exploratory Data Analysis) tutorials

    📜 License

    This dataset is released under the MIT License, free to use for research, learning, or commercial purposes.

    Photo: by Pixabay, free to use.

  6. Amazon Books Dataset (20K Books + 727K Reviews)

    • kaggle.com
    zip
    Updated Oct 21, 2025
    Cite
    Hadi Fariborzi (2025). Amazon Books Dataset (20K Books + 727K Reviews) [Dataset]. https://www.kaggle.com/datasets/hadifariborzi/amazon-books-dataset-20k-books-727k-reviews
    Explore at:
    Available download formats: zip (233373889 bytes)
    Dataset updated
    Oct 21, 2025
    Authors
    Hadi Fariborzi
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    A comprehensive Amazon books dataset featuring 20,000 books and 727,876 reviews spanning 26 years (1997-2023), paired with a complete step-by-step data science tutorial. Perfect for learning data analytics from scratch or conducting advanced book market analysis.

    What's Included:

    • Raw Data: 20K book metadata (titles, authors, prices, ratings, descriptions) + 727K detailed reviews
    • Complete Tutorial Series: 4 progressive Python scripts covering data loading, cleaning, exploratory analysis, and visualization
    • Ready-to-Run Code: Fully documented scripts with practice exercises
    • Educational Focus: Designed for ENTR 3901 coursework but suitable for all skill levels

    Key Features:

    • Real-world e-commerce data (pre-filtered for quality: 200+ reviews, $5+ price; see the sketch below)
    • Comprehensive documentation and setup instructions
    • Generates 6+ professional visualizations
    • Includes bonus analysis challenges (sentiment analysis, price optimization, time patterns)
    • Perfect for business analytics, market research, and data science education

    Use Cases:

    • Learning data analytics fundamentals
    • Book market analysis and trends
    • Customer behavior insights
    • Price optimization studies
    • Review sentiment analysis
    • Academic coursework and projects

    This dataset bridges the gap between raw data and practical learning, making it ideal for both beginners and experienced analysts looking to explore e-commerce patterns in the publishing industry.
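    For example, the quality pre-filter described above could be reproduced roughly like this (file and column names are assumptions, not taken from the actual archive):

```python
import pandas as pd

books = pd.read_csv("books.csv")  # hypothetical file and column names

# Keep books matching the stated quality bar: 200+ reviews and a $5+ price
quality = books[(books["review_count"] >= 200) & (books["price"] >= 5)]
print(len(quality), "books pass the 200+ reviews, $5+ price filter")
```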

  7. DHS EdData Survey 2010 - Nigeria

    • catalog.ihsn.org
    Updated Mar 29, 2019
    Cite
    National Population Commission (2019). DHS EdData Survey 2010 - Nigeria [Dataset]. https://catalog.ihsn.org/index.php/catalog/3344
    Explore at:
    Dataset updated
    Mar 29, 2019
    Dataset authored and provided by
    National Population Commission
    Time period covered
    2009 - 2010
    Area covered
    Nigeria
    Description

    Abstract

    The 2010 NEDS is similar to the 2004 Nigeria DHS EdData Survey (NDES) in that it was designed to provide information on education for children age 4–16, focusing on factors influencing household decisions about children’s schooling. The survey gathers information on adult educational attainment, children’s characteristics and rates of school attendance, absenteeism among primary school pupils and secondary school students, household expenditures on schooling and other contributions to schooling, and parents’/guardians’ perceptions of schooling, among other topics. The 2010 NEDS was linked to the 2008 Nigeria Demographic and Health Survey (NDHS) in order to collect additional education data on a subset of the households (those with children age 2–14) surveyed in the 2008 Nigeria DHS survey. The 2008 NDHS, for which data collection was carried out from June to October 2008, was the fourth DHS conducted in Nigeria (previous surveys were implemented in 1990, 1999, and 2003).

    The goal of the 2010 NEDS was to follow up with a subset of approximately 30,000 households from the 2008 NDHS survey. However, the 2008 NDHS sample shows that of the 34,070 households interviewed, only 20,823 had eligible children age 2–14. To make statistically significant observations at the State level, 1,700 children per State and the Federal Capital Territory (FCT) were needed. It was estimated that an additional 7,300 households would be required to meet the total number of eligible children needed. To bring the sample size up to the required target, additional households were screened and added to the overall sample. However, these households did not have the NDHS questionnaire administered. Thus, the two surveys were statistically linked to create some data used to produce the results presented in this report, but for some households, data were imputed or not included.

    Geographic coverage

    National

    Analysis unit

    Households
    Individuals

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The eligible households for the 2010 NEDS are the same as those households in the 2008 NDHS sample for which interviews were completed and in which there is at least one child age 2-14, inclusive. In the 2008 NDHS, 34,070 households were successfully interviewed, and the goal here was to perform a follow-up NEDS on a subset of approximately 30,000 households. However, records from the 2008 NDHS sample showed that only 20,823 had children age 4-16. Therefore, to bring the sample size up to the required number of children, additional households were screened from the NDHS clusters.

    The first step was to use the NDHS data to determine eligibility based on the presence of a child age 2-14. Second, based on a series of precision and power calculations, RTI determined that the final sample size should yield approximately 790 households per State to allow statistical significance for reporting at the State level, resulting in a total completed sample size of 790 × 37 = 29,230. This calculation was driven by desired estimates of precision, analytic goals, and available resources. To achieve the target number of households with completed interviews, we increased the final number of desired interviews to accommodate expected attrition factors such as unlocatable addresses, eligibility issues, and non-response or refusal. Third, to reach the target sample size, we selected additional samples from households that had been listed by NDHS but had not been sampled and visited for interviews. The final number of households with completed interviews was 26,934, slightly lower than the original target but sufficient to yield interview data for 71,567 children, well above the targeted number of 1,700 children per State.

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The four questionnaires used in the 2004 Nigeria DHS EdData Survey (NDES) formed the basis for the 2010 NEDS questionnaires: 1. Household Questionnaire, 2. Parent/Guardian Questionnaire, 3. Eligible Child Questionnaire, and 4. Independent Child Questionnaire. These are all available in Appendix D of the survey report available under External Resources.

    More than 90 percent of the questionnaires remained the same; for cases where there was a clear justification or a need for a change in item formulation or a specific requirement for additional items, these were updated accordingly. A one day workshop was convened with the NEDS Implementation Team and the NDES Advisory Committee to review the instruments and identify any needed revisions, additions, or deletions. Efforts were made to collect data to ease integration of the 2010 NEDS data into the FMOE’s national education management information system. Instrument issues that were identified as being problematic in the 2004 NDES as well as items identified as potentially confusing or difficult were proposed for revision. Issues that USAID, DFID, FMOE, and other stakeholders identified as being essential but not included in the 2004 NDES questionnaires were proposed for incorporation into the 2010 NEDS instruments, with USAID serving as the final arbiter regarding questionnaire revisions and content.

    General revisions accepted into the questionnaires included the following:
    - A separation of all questions related to secondary education into junior secondary and senior secondary to reflect the UBE policy
    - Administration of school-based questions for children identified as attending pre-school
    - Inclusion of questions on disabilities of children and parents
    - Additional questions on Islamic schooling
    - Revision to the literacy question administration to assess English literacy for children attending school
    - Some additional questions on delivery of UBE under the financial questions section

    Upon completion of revisions to the English-language questionnaires, the instruments were translated and adapted by local translators into three languages—Hausa, Igbo, and Yoruba—and then back-translated into English to ensure accuracy of the translation. After the questionnaires were finalized, training materials used in the 2004 NDES and developed by Macro International, which included training guides, data collection manuals, and field observation materials, were reviewed. The materials were updated to reflect changes in the questionnaires. In addition, the procedures as described in the manuals and guides were carefully reviewed. Adjustments were made, where needed, based on experience with large-scale surveys and lessons learned from the 2004 NDES and the 2008 NDHS, to ensure the highest quality data capture.

    Cleaning operations

    Data processing for the 2010 NEDS occurred concurrently with data collection. Completed questionnaires were retrieved by the field coordinators/trainers and delivered to NPC in standard envelopes, labeled with the sample identification, team, and State name. The shipment also contained a written summary of any issues detected during the data collection process. The questionnaire administrators logged the receipt of the questionnaires, acknowledged the list of issues, and acted upon them if required. The editors performed an initial check on the questionnaires, performed any coding of open-ended questions (with possible assistance from the data entry operators), and left them available to be assigned to the data entry operators. The data entry operators entered the data into the system, with the support of the editors for erroneous or unclear data.

    Experienced data entry personnel were recruited from those who had performed data entry activities for NPC on previous studies. Each data entry team comprised a data entry coordinator, a supervisor, and operators. Data entry coordinators oversaw the entire data entry process from programming and training to final data cleaning, made assignments, tracked progress, and ensured the quality and timeliness of the data entry process. Data entry supervisors were on hand at all times to ensure that proper procedures were followed and to help editors resolve any uncovered inconsistencies. The supervisors controlled incoming questionnaires, assigned batches of questionnaires to the data entry operators, and managed their progress. Approximately 30 clerks were recruited and trained as data entry operators to enter all completed questionnaires and to perform the secondary entry for data verification. Editors worked with the data entry operators to review information flagged as “erroneous” or “dubious” in the data entry process and provided follow-up and resolution for those anomalies.

    The data entry program developed for the 2004 NDES was revised to reflect the revisions in the 2010 NEDS questionnaire. The electronic data entry and reporting system enforced internal consistency checks and flagged inconsistencies.

    Response rate

    A very high overall response rate of 97.9 percent was achieved with interviews completed in 26,934 households out of a total of 27,512 occupied households from the original sample of 28,624 households. The response rates did not vary significantly by urban–rural (98.5 percent versus 97.6 percent, respectively). The response rates for parent/guardians and children were even higher, and the rate for independent children was slightly lower than the overall sample rate, 97.4 percent. In all these cases, the urban/rural differences were negligible.

    Sampling error estimates

    Estimates derived from a sample survey are affected by two types of errors: (1) non-sampling errors and (2) sampling errors. Non-sampling errors are the results of mistakes made in implementing data collection and data processing, such as

  8. Movies Dataset — Ratings, Release Dates & Origins

    • kaggle.com
    zip
    Updated Nov 4, 2025
    Cite
    purv ghediya (2025). Movies Dataset — Ratings, Release Dates & Origins [Dataset]. https://www.kaggle.com/datasets/purvghediya/movies-dataset-ratings-release-dates-and-origins
    Explore at:
    Available download formats: zip (2539 bytes)
    Dataset updated
    Nov 4, 2025
    Authors
    purv ghediya
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This dataset contains cleaned and structured information about popular movies. It was processed using Python and Pandas to remove null values, fix inconsistent formats, and convert date columns to proper datetime types.
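    Roughly, the processing steps described above look like this in Pandas (file and column names are assumptions):

```python
import pandas as pd

movies = pd.read_csv("movies.csv")  # hypothetical file and column names

movies = movies.dropna()                       # remove null values
movies["title"] = movies["title"].str.strip()  # fix inconsistent formats
movies["release_date"] = pd.to_datetime(       # convert dates to datetime
    movies["release_date"], errors="coerce"
)
```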

    The dataset includes attributes such as:

    🎬 Movie title

    ⭐ Average rating

    🗓️ Release date (converted to datetime)

    🌍 Country of origin

    🗣️ Spoken languages

    This cleaned dataset can be used for:

    Exploratory Data Analysis (EDA)

    Visualization practice

    Machine Learning experiments

    Data cleaning and preprocessing tutorials

    Source: IMDb Top Movies (via API, for educational purposes)

    Last Updated: November 2025

  9. Weather_dataset

    • kaggle.com
    zip
    Updated Nov 13, 2025
    Cite
    Ahmed Sayed (2025). Weather_dataset [Dataset]. https://www.kaggle.com/datasets/ahmedsayed0007/weather-dataset
    Explore at:
    Available download formats: zip (291 bytes)
    Dataset updated
    Nov 13, 2025
    Authors
    Ahmed Sayed
    Description

    Overview

    This dataset contains daily weather observations, including temperature, wind speed, and weather events recorded over multiple days. It is a simple and clean dataset suitable for beginners and intermediate users who want to practice data cleaning, handling missing values, exploratory data analysis (EDA), visualization, and basic predictive modeling.

    Dataset Structure

    Each row represents a single day's weather record.

    Columns

    • day: Date of the observation
    • temperature: Recorded temperature of the day (in °F)
    • windspeed: Wind speed of the day (in mph)
    • event: Weather event such as Rain, Sunny, or Snow

    Key Characteristics

    Contains missing values in the temperature, windspeed, and event columns. Useful for practicing (see the sketch below):

    • Data cleaning and imputation
    • Time-series formatting
    • Handling categorical data
    • Basic statistical analysis
    • Simple forecasting tasks
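    For instance, a minimal imputation pass might look like this (the file name is an assumption; the columns are those listed above):

```python
import pandas as pd

weather = pd.read_csv("weather.csv", parse_dates=["day"])  # hypothetical file name
weather = weather.sort_values("day")

# Interpolate numeric gaps along the date order
weather[["temperature", "windspeed"]] = (
    weather[["temperature", "windspeed"]].interpolate()
)

# Fill missing events with the most frequent category
weather["event"] = weather["event"].fillna(weather["event"].mode().iloc[0])
```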

    Intended Use

    This dataset is suitable for educational and demonstration purposes, including:

    • Data preprocessing tutorials
    • Pandas practice notebooks
    • Visualization exercises
    • Introductory machine learning tasks

  10. Infosys (INFY) Last 5 years Dataset

    • kaggle.com
    zip
    Updated Oct 13, 2025
    Cite
    Kumar Siddhant (2025). Infosys (INFY) Last 5 years Dataset [Dataset]. https://www.kaggle.com/datasets/imsiddhant/infosys-infy-last-5-years-dataset/data
    Explore at:
    Available download formats: zip (47575 bytes)
    Dataset updated
    Oct 13, 2025
    Authors
    Kumar Siddhant
    Description

    This dataset provides five years of daily stock market data for Infosys Ltd. (INFY) — one of India’s largest multinational IT services and consulting firms.

    It contains key daily metrics such as Open, High, Low, Close prices, and Trading Volume, covering the period from Oct 2020 to Oct 2025.

    The dataset is ideal for financial time series analysis, machine learning forecasts, algorithmic trading strategies, and investment research.

    📅 Dataset Summary

    • Date: Trading date in YYYY-MM-DD format
    • Ticker: Stock symbol (INFY) representing Infosys Ltd.
    • Open: Opening price of the stock on the given day
    • High: Highest price reached during the trading session
    • Low: Lowest price reached during the trading session
    • Close: Closing price at the end of the trading day
    • Volume: Number of shares traded on that day

    File name: INFY_5years_data.csv
    Format: CSV (UTF-8 encoded)
    Period covered: ~2020–2025
    Records: ~1,250 rows (approx. 250 trading days per year × 5 years)
    Columns: 7 (Date, Ticker, Open, High, Low, Close, Volume)

    🔍 Potential Use Cases

    You can use this dataset for:

    • 📊 Trend analysis: identify price patterns and seasonality
    • 🤖 Machine learning: build stock price prediction or volatility models
    • 💡 Investment strategy testing: simulate buy/sell signals using moving averages, RSI, etc. (see the sketch below)
    • 🧩 Time-series forecasting: using ARIMA, LSTM, or Prophet models
    • 🎓 Educational projects: financial analytics and data cleaning tutorials
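    As a starting point for the moving-average signals mentioned above, a sketch using the documented file name and columns:

```python
import pandas as pd

infy = pd.read_csv("INFY_5years_data.csv", parse_dates=["Date"])
infy = infy.sort_values("Date").set_index("Date")

# 20- and 50-day simple moving averages of the closing price
infy["SMA20"] = infy["Close"].rolling(20).mean()
infy["SMA50"] = infy["Close"].rolling(50).mean()

# Naive crossover signal: True where the short average is above the long one
signal = infy["SMA20"] > infy["SMA50"]
```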

  11. Bank Customer Analysis Done Using Power Bi

    • kaggle.com
    zip
    Updated Sep 11, 2023
    Cite
    Srividya Uppalur (2023). Bank Customer Analysis Done Using Power Bi [Dataset]. https://www.kaggle.com/srividyauppalur/bank-customer-analysis-done-using-power-bi
    Explore at:
    Available download formats: zip (83758 bytes)
    Dataset updated
    Sep 11, 2023
    Authors
    Srividya Uppalur
    Description

    Bank Data Analysis | Real World Project | Power BI

    In this visualization, I analyzed a bank dataset using Microsoft Power BI. I started by importing the data into Power BI and then performed data cleaning, transformation, and visualization on the given data to gain insights and create a comprehensive analysis report.

    Here I created insightful visualizations and interactive reports that can be used for business intelligence and decision-making purposes.

    Data set: based on a tutorial by Data Visionary.

    YouTube video referenced: https://www.youtube.com/watch?v=GZqBefbNP10&t=1581s

    Analysis and visualizations cover:

    1. Balance by Age and Gender
    2. Number of Customers by Age and Gender
    3. Number of Customers by Region
    4. Balance by Region
    5. Number of Customers by JobType
    6. Balance by Gender
    7. Total Customers Joined
    8. Cards: i) Max Balance by Age, ii) Min Balance by Age, iii) Max Customers by Gender

    Dear all, kindly go through the report and share any suggestions or corrections where I can improve.

  12. RTEM Hackaton API and Data Science Tutorials

    • kaggle.com
    zip
    Updated Apr 14, 2022
    Cite
    Pony Biam (2022). RTEM Hackaton API and Data Science Tutorials [Dataset]. https://www.kaggle.com/datasets/ponybiam/onboard-api-intro
    Explore at:
    Available download formats: zip (14011904 bytes)
    Dataset updated
    Apr 14, 2022
    Authors
    Pony Biam
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    RTEM Hackathon Tutorials

    This data set and associated notebooks are meant to give you a head start in accessing the RTEM Hackathon by showing some examples of data extraction, processing, cleaning, and visualisation. The data available on this Kaggle page is only a selected part of the whole data set extracted for the tutorials. A series of video tutorials is associated with this dataset and notebooks and can be found on the Onboard YouTube channel.

    Part 1 - Onboard API and Onboard API Wrapper Introduction

    An introduction to API usage and how to retrieve data from it. This notebook is outlined in several YouTube videos that discuss:
    - how to get started with your account and get oriented to the Kaggle environment,
    - how to get acquainted with the Onboard API,
    - and how to start using the Onboard API wrapper to extract and explore data.

    Part 2 - Meta-data and Point Exploration Demo

    How to query data-point metadata, process it, and visually explore it. This notebook is outlined in several YouTube videos that discuss:
    - how to get started exploring building metadata/points,
    - how to select/merge point lists and export them as CSV,
    - and how to visualize and explore the point lists.

    Part 3 - Time-series Data Extraction and Exploration Demo

    How to query time-series data from points, process it, and visually explore it. This notebook is outlined in several YouTube videos that discuss:
    - how to load and filter time-series data from sensors,
    - how to resample and transform time-series data,
    - and how to create heat maps and boxplots of the data for exploration.
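    The notebooks use the Onboard API wrapper for extraction; the short Pandas sketch below shows only the resample-and-transform step described here, on an assumed stand-in frame of 15-minute sensor readings (all names hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for sensor data pulled via the Onboard API wrapper
idx = pd.date_range("2022-01-01", periods=96 * 7, freq="15min")
ts = pd.DataFrame(
    {"zone_temp": np.random.default_rng(0).normal(21, 1, len(idx))},
    index=idx,
)

hourly = ts.resample("1h").mean()                    # downsample to hourly means
by_hour = hourly.groupby(hourly.index.hour).mean()   # hour-of-day profile, heat-map input
```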

    Part 4 - Example starting point for RTEM analysis and possible directions

    A quick example of a starting point for analyzing the data toward some sort of solution, with a reference to a paper that might help give an overview of the possible directions your team can take. This notebook is outlined in several YouTube videos that discuss:
    - an overview of use cases and judging criteria,
    - an example of a real-world hypothesis,
    - and further development of that simple example.

    More information about the data and competition can be found on the RTEM Hackathon website.

  13. Online Bookstore Dataset

    • kaggle.com
    zip
    Updated May 17, 2025
    Cite
    Hamidreza Naderbeygi (2025). Online Bookstore Dataset [Dataset]. https://www.kaggle.com/datasets/benjnb/online-bookstore-dataset/code
    Explore at:
    Available download formats: zip (46685 bytes)
    Dataset updated
    May 17, 2025
    Authors
    Hamidreza Naderbeygi
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    This dataset contains information about 1,000 books from an online bookstore, including their titles, prices, availability, ratings, categories, and product page URLs. It is ideal for projects involving:

    Web scraping and data extraction tutorials

    Natural Language Processing (e.g. analyzing book titles)

    E-commerce data analysis and visualization

    Recommendation systems based on category or price

    Data cleaning and preprocessing practice

  14. Jimrealtex customer dataset

    • kaggle.com
    zip
    Updated Nov 22, 2025
    Cite
    JIMOH YEKINI (2025). Jimrealtex customer dataset [Dataset]. https://www.kaggle.com/datasets/jimohyekini/jimrealtex-customer-dataset
    Explore at:
    Available download formats: zip (1591 bytes)
    Dataset updated
    Nov 22, 2025
    Authors
    JIMOH YEKINI
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    Dataset Description: Jimrealtex Customer Dataset

    This dataset contains customer demographic and behavioral information designed for exploring segmentation, clustering, and predictive analytics in retail and marketing contexts. It provides a simple yet powerful foundation for practicing data science techniques such as K-Means clustering, customer profiling, and recommendation systems.

    ### Dataset Features

    - CustomerID: Unique identifier for each customer
    - Genre: Gender of the customer (Male/Female)
    - Age: Age of the customer (years)
    - Annual Income (k$): Annual income in thousands of dollars
    - Spending Score: A score assigned by the business based on customer behavior and spending patterns

    ### Notes

    - Some records contain missing values (NaN) in Age, Annual Income, or Spending Score. These can be handled using imputation, removal, or advanced techniques depending on the analysis.
    - Spending Score is an arbitrary metric often used in clustering exercises to simulate customer engagement.

    ### Potential Use Cases

    - Customer Segmentation: Apply clustering algorithms (e.g., K-Means, DBSCAN) to group customers by income and spending habits (see the sketch below).
    - Marketing Strategy: Identify high-value customers and tailor promotions.
    - Predictive Modeling: Build models to predict spending behavior based on demographics.
    - Data Cleaning Practice: Handle missing values and prepare the dataset for machine learning tasks.
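    A minimal K-Means sketch under the feature names listed above; the file name is hypothetical:

```python
import pandas as pd
from sklearn.cluster import KMeans

customers = pd.read_csv("jimrealtex_customers.csv")  # hypothetical file name

# Cluster on income and spending score after dropping missing values
features = customers[["Annual Income (k$)", "Spending Score"]].dropna()
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
features["segment"] = kmeans.fit_predict(features)
print(features.groupby("segment").mean())  # per-segment centers for profiling
```

    Five clusters is the conventional choice for this kind of income-vs-spending exercise, but the count is worth validating with an elbow or silhouette check.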

    **Why This Dataset?**

    This dataset is widely used in machine learning tutorials and business analytics projects because it is small, interpretable, and directly applicable to real-world scenarios like retail customer analysis. It’s ideal for beginners learning clustering and for professionals prototyping segmentation strategies.

  15. Titanic

    • kaggle.com
    zip
    Updated Aug 1, 2025
    Cite
    Anand Kumar (2025). Titanic [Dataset]. https://www.kaggle.com/datasets/anandambastha/titanic
    Explore at:
    Available download formats: zip (22552 bytes)
    Dataset updated
    Aug 1, 2025
    Authors
    Anand Kumar
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    The Titanic dataset is one of the most iconic and frequently used datasets in the data science and machine learning community. It originates from the tragic sinking of the RMS Titanic on April 15, 1912, after it struck an iceberg during its maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history.

    This dataset provides detailed information on a subset of the passengers aboard the Titanic and is primarily used to build predictive models to determine whether a passenger survived or not, based on the available features. It is a supervised learning problem, specifically a binary classification task, where the target variable is Survived (1 = Yes, 0 = No).

    Purpose and Use Cases

    The Titanic dataset is commonly used for:

    - Learning data preprocessing techniques such as handling missing values, encoding categorical variables, and feature scaling
    - Performing exploratory data analysis (EDA) and creating visualizations
    - Engineering new features from existing data to enhance model performance
    - Training and evaluating various classification models such as Logistic Regression, Decision Trees, Random Forests, and XGBoost
    - Benchmarking classification pipelines in data science competitions, especially on platforms like Kaggle

    Key Features / Columns

    • PassengerId: A unique identifier for each passenger
    • Survived: Survival (0 = No, 1 = Yes); the target variable
    • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) indicating socio-economic status
    • Name: Full name of the passenger
    • Sex: Gender (male, female)
    • Age: Age of the passenger (in years); may contain missing values
    • SibSp: Number of siblings or spouses aboard the Titanic
    • Parch: Number of parents or children aboard
    • Ticket: Ticket number
    • Fare: Passenger fare (British pounds)
    • Cabin: Cabin number; many entries are missing
    • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

    Challenges and Considerations

    • Missing Values: Columns such as Age, Cabin, and Embarked contain missing entries that need to be addressed during preprocessing
    • Imbalanced Features: Certain features, like passenger class, exhibit survival-rate bias that may affect the model
    • Non-numerical Data: Features like Name, Sex, Cabin, and Embarked must be transformed or encoded for modeling
    • Feature Engineering Opportunities: Additional features can be derived, such as extracting titles from names, computing family size from SibSp and Parch, or analyzing cabin or ticket patterns (see the sketch below)
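    Two of those feature-engineering ideas, sketched in Pandas (assuming the standard Kaggle train.csv column names):

```python
import pandas as pd

df = pd.read_csv("train.csv")  # standard Kaggle Titanic file name (assumed)

# Extract the honorific title from Name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Family size: siblings/spouses + parents/children + the passenger themselves
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Impute missing ages with the median age of each title group
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))
```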

    Why It's Popular

    The Titanic dataset is based on a real-world historical event, making it intuitive and engaging for learners. It is especially suitable for beginners looking to understand the end-to-end machine learning pipeline. The dataset's moderate size and feature variety encourage experimentation in data cleaning, transformation, visualization, and modeling. It is frequently used in online tutorials, courses, and machine learning competitions to demonstrate model development and evaluation practices.

  16. small_itsm_dataset

    • kaggle.com
    zip
    Updated Mar 30, 2023
    Cite
    Nikola Greb (2023). small_itsm_dataset [Dataset]. https://www.kaggle.com/datasets/nikolagreb/small-itsm-dataset
    Explore at:
    Available download formats: zip (21489 bytes)
    Dataset updated
    Mar 30, 2023
    Authors
    Nikola Greb
    License

    GNU Free Documentation License v1.3 (http://www.gnu.org/licenses/fdl-1.3.html)

    Description

    Small toy data inspired by ITSM (IT service management) tickets, deliberately including noisy labels, multiple languages, and missing data. Here is one data examination and cleaning procedure written by me:

    Feel free to add yours!

