19 datasets found
  1. Ecommerce Dataset for Data Analysis

    • kaggle.com
    zip
    Updated Sep 19, 2024
    Cite
    Shrishti Manja (2024). Ecommerce Dataset for Data Analysis [Dataset]. https://www.kaggle.com/datasets/shrishtimanja/ecommerce-dataset-for-data-analysis/code
    Available download formats: zip (2028853 bytes)
    Dataset updated
    Sep 19, 2024
    Authors
    Shrishti Manja
    Description

    This dataset contains 55,000 entries of synthetic customer transactions, generated using Python's Faker library. The goal behind creating this dataset was to provide a resource for learners like myself to explore, analyze, and apply various data analysis techniques in a context that closely mimics real-world data.

    About the Dataset:
    • CID (Customer ID): A unique identifier for each customer.
    • TID (Transaction ID): A unique identifier for each transaction.
    • Gender: The gender of the customer, categorized as Male or Female.
    • Age Group: Age group of the customer, divided into several ranges.
    • Purchase Date: The timestamp of when the transaction took place.
    • Product Category: The category of the product purchased, such as Electronics, Apparel, etc.
    • Discount Availed: Indicates whether the customer availed any discount (Yes/No).
    • Discount Name: Name of the discount applied (e.g., FESTIVE50).
    • Discount Amount (INR): The amount of discount availed by the customer.
    • Gross Amount: The total amount before applying any discount.
    • Net Amount: The final amount after applying the discount.
    • Purchase Method: The payment method used (e.g., Credit Card, Debit Card, etc.).
    • Location: The city where the purchase took place.

    Use Cases:
    1. Exploratory Data Analysis (EDA): This dataset is ideal for conducting EDA, allowing users to practice techniques such as summary statistics, visualizations, and identifying patterns within the data (see the sketch below).
    2. Data Preprocessing and Cleaning: Learners can work on handling missing data, encoding categorical variables, and normalizing numerical values to prepare the dataset for analysis.
    3. Data Visualization: Use tools like Python’s Matplotlib, Seaborn, or Power BI to visualize purchasing trends, customer demographics, or the impact of discounts on purchase amounts.
    4. Machine Learning Applications: After applying feature engineering, this dataset is suitable for supervised learning models, such as predicting whether a customer will avail a discount or forecasting purchase amounts based on the input features.
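    A minimal EDA sketch along these lines, assuming the file is exported as ecommerce.csv and the column names match the list above:

    import pandas as pd

    # Load the transactions (the filename is an assumption)
    df = pd.read_csv("ecommerce.csv")

    # Summary statistics for the monetary columns
    print(df[["Gross Amount", "Discount Amount (INR)", "Net Amount"]].describe())

    # Average net spend per product category
    print(df.groupby("Product Category")["Net Amount"].mean().sort_values(ascending=False))

    # Do customers who availed a discount differ in gross spend?
    print(df.groupby("Discount Availed")["Gross Amount"].mean())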

    This dataset provides an excellent sandbox for honing skills in data analysis, machine learning, and visualization in a structured but flexible manner.

    This is not a real dataset. It was generated using Python's Faker library for the sole purpose of learning.

  2. BI intro to data cleaning eda and machine learning

    • kaggle.com
    zip
    Updated Nov 17, 2025
    Cite
    Walekhwa Tambiti Leo Philip (2025). BI intro to data cleaning eda and machine learning [Dataset]. https://www.kaggle.com/datasets/walekhwatlphilip/intro-to-data-cleaning-eda-and-machine-learning/suggestions
    Available download formats: zip (9961 bytes)
    Dataset updated
    Nov 17, 2025
    Authors
    Walekhwa Tambiti Leo Philip
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Real-World Data Science Challenge

    Business Intelligence Program Strategy — Student Success Optimization

    Hosted by: Walsoft Computer Institute 📁 Download dataset 👤 Kaggle profile

    Background

    Walsoft Computer Institute runs a Business Intelligence (BI) training program for students from diverse educational, geographical, and demographic backgrounds. The institute has collected detailed data on student attributes, entry exams, study effort, and final performance in two technical subjects: Python Programming and Database Systems.

    As part of an internal review, the leadership team has hired you — a Data Science Consultant — to analyze this dataset and provide clear, evidence-based recommendations on how to improve:

    • Admissions decision-making
    • Academic support strategies
    • Overall program impact and ROI

    Your Mission

    Answer this central question:

    “Using the BI program dataset, how can Walsoft strategically improve student success, optimize resources, and increase the effectiveness of its training program?”

    Key Strategic Areas

    You are required to analyze and provide actionable insights for the following three areas:

    1. Admissions Optimization

    Should entry exams remain the primary admissions filter?

    Your task is to evaluate the predictive power of entry exam scores compared to other features such as prior education, age, gender, and study hours.

    ✅ Deliverables:

    • Feature importance ranking for predicting Python and DB scores
    • Admission policy recommendation (e.g., retain exams, add screening tools, adjust thresholds)
    • Business rationale and risk analysis

    2. Curriculum Support Strategy

    Are there at-risk student groups who need extra support?

    Your task is to uncover whether certain backgrounds (e.g., prior education level, country, residence type) correlate with poor performance and recommend targeted interventions.

    ✅ Deliverables:

    • At-risk segment identification
    • Support program design (e.g., prep course, mentoring)
    • Expected outcomes, costs, and KPIs

    3. Resource Allocation & Program ROI

    How can we allocate resources for maximum student success?

    Your task is to segment students by success profiles and suggest differentiated teaching/facility strategies.

    ✅ Deliverables:

    • Performance drivers
    • Student segmentation
    • Resource allocation plan and ROI projection

    🛠️ Dataset Overview

    Column – Description
    fNAME, lNAME – Student first and last name
    Age – Student age (21–71 years)
    gender – Gender (standardized as "Male"/"Female")
    country – Student’s country of origin
    residence – Student housing/residence type
    entryEXAM – Entry test score (28–98)
    prevEducation – Prior education (High School, Diploma, etc.)
    studyHOURS – Total study hours logged
    Python – Final Python exam score
    DB – Final Database exam score

    📊 Dataset

    You are provided with a real-world messy dataset that reflects the types of issues data scientists face every day — from inconsistent formatting to missing values.

    Raw Dataset (Recommended for Full Project)

    Download: bi.csv

    This dataset includes common data quality challenges:

    • Country name inconsistencies
      e.g. Norge → Norway, RSA → South Africa, UK → United Kingdom

    • Residence type variations
      e.g. BI-Residence, BIResidence, BI_Residence → unify to BI Residence

    • Education level typos and casing issues
      e.g. Barrrchelors → Bachelor; DIPLOMA, Diplomaaa → Diploma

    • Gender value noise
      e.g. M, F, female → standardize to Male / Female

    • Missing scores in Python subject
      Fill NaN values using column mean or suitable imputation strategy

    Participants using this dataset are expected to apply data cleaning techniques such as:
    • String standardization
    • Null value imputation
    • Type correction (e.g., scores as float)
    • Validation and visual verification

    A pandas sketch of these steps appears after the bonus note below.

    Bonus: Submissions that use and clean this dataset will earn additional Technical Competency points.
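    A minimal cleaning sketch, assuming the raw file is bi.csv and using only the example mappings listed above (any value not listed there would need its own rule):

    import pandas as pd

    df = pd.read_csv("bi.csv")

    # Standardize country names (mappings taken from the examples above)
    df["country"] = df["country"].replace(
        {"Norge": "Norway", "RSA": "South Africa", "UK": "United Kingdom"})

    # Unify residence type variations
    df["residence"] = df["residence"].replace(
        {"BI-Residence": "BI Residence", "BIResidence": "BI Residence", "BI_Residence": "BI Residence"})

    # Fix education typos and casing
    df["prevEducation"] = df["prevEducation"].replace(
        {"Barrrchelors": "Bachelor", "DIPLOMA": "Diploma", "Diplomaaa": "Diploma"})

    # Standardize gender noise (M, F, female, ...)
    df["gender"] = df["gender"].str.strip().str.lower().map(
        {"m": "Male", "male": "Male", "f": "Female", "female": "Female"})

    # Type correction, then mean imputation for missing Python scores
    df["Python"] = pd.to_numeric(df["Python"], errors="coerce")
    df["Python"] = df["Python"].fillna(df["Python"].mean())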

    Cleaned Dataset (Optional Shortcut)

    Download: cleaned_bi.csv

    This version has been fully standardized and preprocessed: - All fields cleaned and renamed consistently - Missing Python scores filled with th...

  3. EDA on Car Sales Dataset in Ukraine

    • kaggle.com
    zip
    Updated Jan 13, 2023
    Cite
    Swati Khedekar (2023). EDA on Car Sales Dataset in Ukraine [Dataset]. https://www.kaggle.com/datasets/swatikhedekar/eda-on-car-sales-dataset-in-ukraine
    Available download formats: zip (508971 bytes)
    Dataset updated
    Jan 13, 2023
    Authors
    Swati Khedekar
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Ukraine
    Description

    1. Problem statement:

    This dataset contains data on more than 9.5k car sales in Ukraine. Most of them are used cars, which opens up the possibility of analyzing features related to car operation. This is a subset of all car data in Ukraine. Using it, we will analyze various parameters of used car sales in Ukraine.

    1.1 Introduction: This exploratory data analysis is an exercise in applying Python skills to a structured dataset, including loading, inspecting, wrangling, exploring, and drawing conclusions from the data. The notebook records observations at each step in order to explain thoroughly how to approach the dataset. Based on these observations, some questions are also answered in the notebook for reference, though not all of them are explored in the analysis.

    1.2 Data Source and Dataset:

    a. How was it collected? Name: Car Sales. Sponsoring organization: unknown. Year: 2019. Description: a case study of more than 9.5k car sales in Ukraine.

    b. Is it a sample? If yes, was it properly sampled? Yes, it is a sample. We don't have official information about the data collection method, but it appears not to be a random sample, so we can assume it is not representative.

  4. Pakistan Online Product Sales Dataset

    • kaggle.com
    zip
    Updated Nov 16, 2025
    Cite
    Aliza Brand (2025). Pakistan Online Product Sales Dataset [Dataset]. https://www.kaggle.com/datasets/shahzadi786/pakistan-online-product-sales-dataset
    Available download formats: zip (13739 bytes)
    Dataset updated
    Nov 16, 2025
    Authors
    Aliza Brand
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Pakistan
    Description

    Context

    Online e-commerce is rapidly growing in Pakistan. Sellers list thousands of products across multiple categories, each with different prices, ratings, and sales numbers. Understanding the patterns of product sales, pricing, and customer feedback is crucial for businesses and data scientists alike.

    This dataset simulates a realistic snapshot of online product sales in Pakistan, including diverse categories like Electronics, Clothing, Home & Kitchen, Books, Beauty, and Sports.

    Source

    Generated synthetically using Python and NumPy for learning and practice purposes.

    No real personal or private data is included.

    Designed specifically for Kaggle competitions, notebooks, and ML/EDA exercises.

    About the File

    File name: Pakistan_Online_Product_Sales.csv

    Rows: 1000+

    Columns: 6

    Purpose:

    Train Machine Learning models (regression/classification)

    Explore data through EDA and visualizations

    Practice feature engineering and data preprocessing

  5. Bank Loan Case Study Dataset

    • kaggle.com
    zip
    Updated May 4, 2023
    + more versions
    Cite
    Shreshth Vashisht (2023). Bank Loan Case Study Dataset [Dataset]. https://www.kaggle.com/datasets/shreshthvashisht/bank-loan-case-study-dataset/discussion
    Available download formats: zip (117814223 bytes)
    Dataset updated
    May 4, 2023
    Authors
    Shreshth Vashisht
    Description

    This case study aims to give you an idea of applying EDA in a real business scenario. In this case study, apart from applying the techniques that you have learnt in the EDA module, you will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimize the risk of losing money while lending to customers.

    Business Understanding: Loan-providing companies find it hard to give loans to people with insufficient or non-existent credit history. Because of that, some consumers take advantage by becoming defaulters. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that applicants capable of repaying the loan are not rejected.

    When the company receives a loan application, it has to decide on loan approval based on the applicant’s profile. Two types of risk are associated with this decision:

    • If the applicant is likely to repay the loan, not approving the loan results in a loss of business to the company.
    • If the applicant is not likely to repay the loan, i.e. he/she is likely to default, approving the loan may lead to a financial loss for the company.

    The data given below contains information about the loan applications at the time of applying. It covers two types of scenarios:

    • The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.
    • All other cases: the payments were made on time.

    When a client applies for a loan, there are four types of decisions that can be taken by the client/company:

    • Approved: the company has approved the loan application.
    • Cancelled: the client cancelled the application sometime during approval, either because he/she changed his/her mind or, in some cases, because a higher-risk client received worse pricing which he/she did not want.
    • Refused: the company rejected the loan (e.g., because the client did not meet its requirements).
    • Unused Offer: the loan was cancelled by the client, but at a different stage of the process.

    In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency to default.

    Business Objectives: The aim is to identify patterns which indicate whether a client is likely to have difficulty paying their installments. These patterns may be used to take actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate. This will ensure that consumers capable of repaying the loan are not rejected. Identifying such applicants using EDA is the aim of this case study.

    In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilize this knowledge for its portfolio and risk assessment.

    To develop your understanding of the domain, you are advised to independently research a little about risk analytics; understanding the types of variables and their significance should be enough.

    Data Understanding: Download the dataset using the link given under the dataset section on the right.

    • application_data.csv contains all the information about the client at the time of application, including whether the client has payment difficulties.
    • previous_application.csv contains information about the client’s previous loan applications, including whether each previous application was Approved, Cancelled, Refused, or an Unused offer.
    • columns_descrption.csv is the data dictionary which describes the meaning of the variables.

    You are required to provide a detailed report on the data, answering the questions that follow:

    1. Present the overall approach of the analysis. Mention the problem statement and the analysis approach briefly.
    2. Identify the missing data and use an appropriate method to deal with it (remove columns or replace values appropriately). Hint: in EDA it is not necessary to replace missing values, but if you have to replace them, clearly mention your approach.
    3. Identify if there are outliers in the dataset, and explain why you think each is an outlier. Again, remember that for this exercise it is not necessary to remove any data points.
    4. Identify if there is data imbalance in the data, and find the ratio of the imbalance (a starter sketch follows this list). Hint: since there are a lot of columns, you can run your analysis in loops over the appropriate columns to find the insights.
    5. Explain the results of univariate, segmented univariate, bivariate analysis, etc. in business terms.
    6. Find the top 10 c...
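    A starter sketch for the missing-data and imbalance questions, assuming the flag for payment difficulties is a binary column named TARGET (confirm the real name in the data dictionary):

    import pandas as pd

    app = pd.read_csv("application_data.csv")

    # Percentage of missing values per column, highest first
    missing_pct = app.isna().mean().sort_values(ascending=False) * 100
    print(missing_pct.head(20))

    # One possible rule: drop columns that are mostly empty (the threshold is a judgment call)
    app = app.drop(columns=missing_pct[missing_pct > 50].index)

    # Data imbalance ratio between the two scenarios
    counts = app["TARGET"].value_counts()  # column name is an assumption
    print("Imbalance ratio: %.1f : 1" % (counts.max() / counts.min()))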

  6. RAPIDO_DATA_2025

    • kaggle.com
    zip
    Updated Oct 9, 2025
    Cite
    vengatesh vengat (2025). RAPIDO_DATA_2025 [Dataset]. https://www.kaggle.com/datasets/vengateshvengat/rapido-all-data
    Available download formats: zip (1022138 bytes)
    Dataset updated
    Oct 9, 2025
    Authors
    vengatesh vengat
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🚖 Rapido Ride Data — July 2025 📘 Overview

    This dataset contains simulated Rapido ride data for July 2025, designed for data analysis, business intelligence, and machine learning use cases. It represents daily ride operations including customer bookings, driver performance, revenue generation, and service quality insights.

    🎯 Purpose

    The goal of this dataset is to help analysts and learners explore real-world mobility analytics. You can use it to:

    Build interactive dashboards (Power BI, Tableau, Excel)

    Perform exploratory data analysis (EDA)

    Create KPI reports and trend visualizations

    Train models for demand forecasting or cancellation prediction

    📂 Dataset Details

    The dataset includes realistic, time-based entries covering one month of operations.

    Column Name – Description

    ride_id – Unique ID for each ride
    ride_date – Date of the ride (July 2025)
    pickup_time – Ride start time
    drop_time – Ride end time
    ride_duration – Duration of the ride (minutes)
    distance_km – Distance travelled (in kilometers)
    fare_amount – Fare charged to customer
    payment_mode – Type of payment (Cash, UPI, Card)
    driver_id – Unique driver identifier
    customer_id – Unique customer identifier
    driver_rating – Rating given by customer
    customer_rating – Rating given by driver
    ride_status – Completed, Cancelled by Driver, Cancelled by Customer
    city – City where ride took place
    ride_type – Bike, Auto, or Cab
    waiting_time – Waiting time before ride started
    promo_used – Yes/No for discount applied
    cancellation_reason – Reason if ride cancelled
    revenue – Net revenue earned per ride

    📊 Key Insights You Can Explore

    🕒 Ride demand patterns by day & hour

    📅 Cancellations by weekday/weekend

    🚦 Driver performance & customer satisfaction

    💰 Revenue trends and top-performing drivers

    🌆 City-wise ride distribution
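    A small pandas sketch for the first two explorations above, assuming the file is named rapido_rides.csv and pickup_time is an HH:MM string (both are assumptions):

    import pandas as pd

    df = pd.read_csv("rapido_rides.csv")  # filename is an assumption
    df["ride_date"] = pd.to_datetime(df["ride_date"])

    # Ride demand by day of week
    print(df["ride_date"].dt.day_name().value_counts())

    # Ride demand by hour of day
    hours = pd.to_datetime(df["pickup_time"], format="%H:%M", errors="coerce").dt.hour
    print(hours.value_counts().sort_index())

    # Cancellation share on weekends vs weekdays
    df["is_weekend"] = df["ride_date"].dt.dayofweek >= 5
    df["cancelled"] = df["ride_status"].str.startswith("Cancelled")
    print(df.groupby("is_weekend")["cancelled"].mean())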

    🧠 Suitable For

    Data cleaning & transformation practice

    Power BI / Excel dashboard building

    SQL analysis & reporting

    Predictive modeling (e.g., cancellation prediction, fare forecasting)

    ⚙️ Tools You Can Use

    Power BI – For KPI dashboards & visuals

    Excel – For pivot tables & charts

    Python / Pandas – For EDA and ML

    SQL – For query-based insights

    💡 Acknowledgment

    This dataset is synthetically generated for educational and analytical purposes. It does not represent actual Rapido data.

  7. Saudi Arabia Events & Crowding Impact Dataset

    • kaggle.com
    zip
    Updated Feb 12, 2025
    Cite
    Mohamed Samy (2025). Saudi Arabia Events & Crowding Impact Dataset [Dataset]. https://www.kaggle.com/datasets/mohamedsamy16/saudi-arabia-events-and-crowding-impact-dataset
    Available download formats: zip (22590 bytes)
    Dataset updated
    Feb 12, 2025
    Authors
    Mohamed Samy
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Saudi Arabia
    Description

    📦 Saudi Arabia Events & Crowding Impact Dataset

    Unlock insights into crowding, sales trends, and delivery optimization using public events, weather, and paydays.

    📝 Dataset Overview

    This dataset captures public events, holidays, weather conditions, and financial factors that influence crowding, consumer behavior, and online deliveries across Saudi Arabia.

    Key Highlights:
    ✅ Covers multiple Saudi cities with rich event data.
    ✅ Includes weather conditions affecting business & logistics.
    ✅ Tracks paydays & school schedules for demand forecasting.
    ✅ Ideal for crowding prediction, sales analysis, and delivery optimization.

    📊 Data Description

    Each row represents a daily snapshot of city conditions with the following variables:

    📆 Date & Calendar Information

    • DateG – Gregorian date (YYYY-MM-DD).
    • DateH – Hijri date.
    • Day – Day of the week (Sunday, Monday, etc.).

    🎉 Public Holidays & Events

    • Holiday Name – Name of the holiday (if applicable).
    • Type of Public Holiday – National, Religious, or School-related holidays.
    • Event – Major events (e.g., festivals, matches, etc.).
    • Match – Includes Premier League & KSA League games.

    🌦 Weather Conditions

    • Cloudy, Fog, Rain, Widespread Dust, Blowing Dust, etc.
    • Useful for studying weather impact on mobility & sales.

    🏙 Crowding & City Impact

    • City – Name of the city.
    • Effect on City – Expected impact (e.g., increased traffic, reduced mobility).

    💰 Economic & Financial Impact

    • Pay Day – Indicates whether it was a salary payout day.
    • days till next payday – How many days until the next salary payout.
    • days after payday – How many days after the last payday.

    🎓 Education & School Impact

    • days after school – Number of days since school ended.
    • days before school – Number of days until school resumes.

    🚀 Potential Use Cases

    This dataset can be leveraged for:

    📌 Crowding Prediction – Identify peak congestion periods based on holidays, weather, and events.
    📌 Sales & Demand Forecasting – Analyze payday effects on consumer spending & delivery volumes.
    📌 Delivery Optimization – Find the best times for online deliveries to avoid congestion.
    📌 Weather Impact Analysis – Study how dust storms & rain affect mobility & e-commerce.
    📌 Event-driven Business Planning – Plan logistics around national events & sports matches.

    📈 Exploratory Data Analysis (EDA)

    🔍 Ideas for Data Exploration

    • Visualize order volume trends across paydays, school terms, & holidays.
    • Analyze correlations between weather conditions & delivery delays.
    • Find seasonal trends in crowding & online shopping behavior.

    🔥 Example Analysis in Python

    import pandas as pd
    import matplotlib.pyplot as plt
    
    # Load the dataset
    df = pd.read_csv("saudi_events.csv")
    
    # Convert date column to datetime format
    df['DateG'] = pd.to_datetime(df['DateG'])
    
    # Plot the average days-after-payday value over time
    plt.figure(figsize=(10,5))
    df.groupby('DateG')['days after payday'].mean().plot()
    plt.title("Effect of Payday on Consumer Activity")
    plt.xlabel("Date")
    plt.ylabel("Days After Payday")
    plt.show()
    

    📌 Getting Started

    How to Use the Dataset:

    1️⃣ Download the dataset and load it into Python or R.
    2️⃣ Perform EDA to uncover insights into crowding & spending patterns.
    3️⃣ Use classification models to predict crowding based on weather, holidays & city impact.
    4️⃣ Apply time-series forecasting for sales & delivery demand projections.
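    A hedged starting point for step 3️⃣, treating "Effect on City" as the label to predict; the feature choice is illustrative and should be adjusted to the actual column values in the file:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("saudi_events.csv")

    # A few illustrative categorical features, one-hot encoded
    features = pd.get_dummies(df[["City", "Type of Public Holiday", "Pay Day"]].astype(str))
    target = df["Effect on City"].astype(str)

    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("Held-out accuracy:", clf.score(X_test, y_test))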

    🏆 Why This Dataset is Valuable

    📊 Multidimensional Insights – Combines weather, paydays, and events for a complete picture of crowding & sales trends.
    📌 Business & Logistics Applications – Helps companies plan deliveries, optimize marketing, and predict demand.
    Unique & Rich Data – A rare dataset covering Saudi Arabia's socio-economic events & crowd impact.

    📜 License & Acknowledgments

    • 📖 License: CC BY 4.0 – Free to use with attribution.

    Conclusion

    This dataset is a powerful tool for online delivery companies, businesses, and city planners looking to optimize operations. By analyzing external factors like holidays, paydays, weather, and events, we can predict crowding, improve delivery timing, and forecast sales trends.

    🚀 We welcome feedback and contributions! If you find this dataset useful, please ⭐ it on Kaggle and share your insights!

  8. All Lending Club loan data

    • kaggle.com
    zip
    Updated Apr 10, 2019
    Cite
    Nathan George (2019). All Lending Club loan data [Dataset]. https://www.kaggle.com/wordsforthewise/lending-club
    Available download formats: zip (1356507910 bytes)
    Dataset updated
    Apr 10, 2019
    Authors
    Nathan George
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    Update: I probably won't be able to update the data anymore, as LendingClub now has a scary 'TOS' popup when downloading the data. Worst case, they will ask me/Kaggle to take it down from here.

    This dataset contains the full LendingClub data available from their site. There are separate files for accepted and rejected loans. The accepted loans also include the FICO scores, which can only be downloaded when you are signed in to LendingClub and download the data.

    See the Python and R getting started kernels to get started:

    I created a git repo for the code which is used to create this data: https://github.com/nateGeorge/preprocess_lending_club_data

    Background

    I wanted an easy way to share all the lending club data with others. Unfortunately, the data on their site is fragmented into many smaller files. There is another lending club dataset on Kaggle, but it hasn't been updated in years. It seems like the "Kaggle Team" is updating it now. I think it also doesn't include the full rejected loans, which are included here. It seems like the other dataset confusingly has some of the rejected loans mixed into the accepted ones. Now there are a ton of other LendingClub datasets on here too, most of which seem to have no documentation or explanation of what the data actually is.

    Content

    The definitions for the fields are on the LendingClub site, at the bottom of the page. Kaggle won't let me upload the .xlsx file for some reason since it seems to be in multiple other data repos. This file seems to be in the other main repo, but again, it's better to get it directly from the source.

    Unfortunately, there is (maybe "was" now?) a limit of 500MB for dataset files, so I had to compress the files with gzip in the Python pandas package.

    I cleaned the data a tiny bit: I removed percent symbols (%) from int_rate and revol_util columns in the accepted loans and converted those columns to floats.
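    In pandas, that cleanup step looks roughly like this (a sketch of the operation described, not the author's exact code; the filename is an assumption):

    import pandas as pd

    df = pd.read_csv("accepted_loans.csv.gz")  # filename is an assumption

    # Strip percent symbols and convert the two columns to floats
    for col in ["int_rate", "revol_util"]:
        df[col] = df[col].astype(str).str.rstrip("%").astype(float)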

    Update

    The URL column is in the dataset for completeness, as of 2018 Q2.

  9. Data Science Jobs on JobsDB Hong Kong

    • kaggle.com
    zip
    Updated Dec 9, 2022
    Cite
    Aster Fung (2022). Data Science Jobs on JobsDB Hong Kong [Dataset]. https://www.kaggle.com/datasets/asterfung/ds-obsdbhk
    Available download formats: zip (240474 bytes)
    Dataset updated
    Dec 9, 2022
    Authors
    Aster Fung
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Area covered
    Hong Kong
    Description

    Since we are here, we can probably agree that data science is a cool profession. When "Data Scientist: The Sexiest Job of the 21st Century" by Harvard Business Review took the internet by storm in 2012, Hong Kong was still starting out on the big data trend. So, now that 10 years have passed, what is the status quo of the data industry? To gain insights into this question, I scraped job postings from one of the popular job posting platforms in Hong Kong and performed some data analysis.

    the Dataset

    A Python script was written (here) to scrape job postings under the "Information Technology\Data Scientist" category. The script was implemented so that for each job posting read, an observation (a row of data) is added to the CSV file. Preliminary data cleaning was incorporated into the script to make the dataset easier for downstream processing.

    The columns are

    column | null placeholder | non-null example
    title | (not applicable) | Data Analyst - Top ranked Virtual Bank
    salary | "salary" | HK$35,000 - HK$55,000 /month
    company | "company" | CGP
    posted | (not applicable) | 2022-11-18
    District | "district" | Shatin district
    job description | (not applicable) | Job Description: Research, collate, obtain and analyze data ...
    Career level | empty | Entry Level
    Years of Experience | empty | N/A
    Company Website | empty | www.companyname.com
    Qualification | empty | Degree
    Job Type | empty | Full Time, Permanent
    Job Functions | empty | Banking / Finance, Others, Information Technology (IT), Others, Data Scientist
    url | empty | https://hk.jobsdb.com/hk/en/job/data-analyst-data-governance-...
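    Because the null placeholders differ by column (as documented above), one way to normalize them on load is (a sketch; the filename is an assumption):

    import numpy as np
    import pandas as pd

    df = pd.read_csv("jobsdb_hk.csv")  # filename is an assumption

    # Replace the documented per-column placeholders with real NaN values
    for col, token in {"salary": "salary", "company": "company", "District": "district"}.items():
        df[col] = df[col].replace(token, np.nan)

    # Columns whose placeholder is an empty string
    df = df.replace("", np.nan)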

    the Notebook

    I have also written a kaggle notebook to analyse this dataset (click here)

    License : Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

    Everyone is welcome to use this dataset as long as it is not for commercial purposes. You may cite this dataset as: Aster Fung. (2022) Data Science Jobs on JobsDB Hong Kong (1) Retrieved from https://www.kaggle.com/datasets/asterfung/ds-obsdbhk

    Thanks for stopping by :D

  10. Newborn Health Monitoring Dataset

    • kaggle.com
    Updated Aug 21, 2025
    Cite
    Arif Miah (2025). Newborn Health Monitoring Dataset [Dataset]. https://www.kaggle.com/datasets/miadul/newborn-health-monitoring-dataset
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2025
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors
    Arif Miah
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📌 Introduction

    This dataset is a synthetic yet realistic simulation of newborn baby health monitoring.
    It is designed for healthcare analytics, machine learning, and app development, especially for early detection of newborn health risks.

    The dataset mimics daily health records of newborn babies, including vital signs, growth parameters, feeding patterns, and risk classification labels.

    🎯 Motivation

    Newborn health is one of the most sensitive areas of healthcare.
    Monitoring newborns can help detect jaundice, infections, dehydration, and respiratory issues early.

    Since real newborn data is private and hard to access, this dataset provides a safe and realistic alternative for researchers, students, and developers to build and test:
    - 📊 Exploratory Data Analysis (EDA)
    - 🤖 Machine Learning classification models
    - 📱 Healthcare monitoring apps (Streamlit, Flask, Django, etc.)
    - 🏥 Predictive healthcare systems

    📂 Dataset Overview

    • Total Babies: 100
    • Monitoring Period: 30 days per baby
    • Total Records: 3,000
    • File Format: CSV
    • Synthetic Data: Generated using Python (pandas, numpy, faker) with medically-informed rules

    📑 Column Description

    🔹 Demographics

    • baby_id → Unique identifier for each baby (e.g., B001).
    • name → Randomly generated baby first name (for realism).
    • gender → Male / Female.
    • gestational_age_weeks → Gestational age at birth (normal: 37–42 weeks).
    • birth_weight_kg → Birth weight (normal range: 2.5–4.5 kg).
    • birth_length_cm → Length at birth (avg: 48–52 cm).
    • birth_head_circumference_cm → Head circumference at birth (avg: 33–35 cm).

    🔹 Daily Monitoring

    • date → Monitoring date.
    • age_days → Age of baby in days since birth.
    • weight_kg → Daily updated weight (growth trend ~25–30g/day).
    • length_cm → Daily updated body length (slow increase).
    • head_circumference_cm → Daily updated head circumference.
    • temperature_c → Body temperature in °C (normal: 36.5–37.5°C).
    • heart_rate_bpm → Heart rate (normal: 120–160 bpm).
    • respiratory_rate_bpm → Breathing rate (normal: 30–60 breaths/min).
    • oxygen_saturation → SpO₂ level (normal >95%).

    🔹 Feeding & Hydration

    • feeding_type → Breastfeeding / Formula / Mixed.
    • feeding_frequency_per_day → Number of feeds per day (normal: 8–12).
    • urine_output_count → Wet diapers/day (normal: 6–8+).
    • stool_count → Bowel movements per day (0–5 is common).

    🔹 Medical Screening

    • jaundice_level_mg_dl → Bilirubin level (normal <5, mild 5–12, severe >15).
    • apgar_score → 0–10 score at birth (only day 1).
    • immunizations_done → Yes/No (BCG, HepB, OPV on Day 1 & 30).
    • reflexes_normal → Newborn reflex check (Yes/No).

    🔹 Risk Classification

    • risk_level → Automatically assigned health status:
      • ✅ Healthy → All vitals normal.
      • ⚠️ At Risk → Mild abnormalities (e.g., mild jaundice, slight fever, SpO₂ 92–95%).
      • 🚨 Critical → Severe abnormalities (e.g., jaundice >15, SpO₂ <92, HR >180, temp >39°C).
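    The documented thresholds translate directly into a rule function. A minimal sketch (only the cut-offs listed above are encoded; the unlisted 12–15 mg/dL jaundice band is grouped with "At Risk" here as an assumption):

    def classify_risk(jaundice_mg_dl, spo2, heart_rate_bpm, temperature_c):
        """Rule-based risk label following the dataset's documented thresholds."""
        # Critical: severe abnormalities
        if jaundice_mg_dl > 15 or spo2 < 92 or heart_rate_bpm > 180 or temperature_c > 39:
            return "Critical"
        # At Risk: mild abnormalities (mild jaundice, slight fever, borderline SpO2)
        if jaundice_mg_dl >= 5 or spo2 <= 95 or temperature_c > 37.5:
            return "At Risk"
        return "Healthy"

    print(classify_risk(jaundice_mg_dl=4.0, spo2=98, heart_rate_bpm=140, temperature_c=37.0))  # Healthy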

    📊 How Data Was Generated

    The dataset was generated in Python using:
    - numpy and pandas for data simulation.
    - faker for generating baby names and dates.
    - Medically realistic rules for vitals, growth, jaundice progression, and risk classification.

    💡 Potential Applications

    • Machine Learning: Train classification models to predict newborn health risks.
    • Streamlit/Dash Apps: Build real-time newborn monitoring dashboards.
    • Healthcare Research: Study growth and vital sign patterns.
    • Education: Practice EDA, visualization, and predictive modeling on health datasets.

    📬 Author & Contact

    Created by [Arif Miah]
    I am passionate about AI, Healthcare Analytics, and App Development.
    You can connect with me:

    ⚠️ Disclaimer

    This is a synthetic dataset created for educational and research purposes only.
    It should NOT be used for actual medical diagnosis or treatment decisions.

  11. INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

    • kaggle.com
    zip
    Updated Nov 23, 2025
    Cite
    Bimal Kumar Saini (2025). INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT [Dataset]. https://www.kaggle.com/datasets/bimalkumarsaini/india-electricity-and-energy-analysis-project
    Available download formats: zip (4986654 bytes)
    Dataset updated
    Nov 23, 2025
    Authors
    Bimal Kumar Saini
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    India
    Description

    ⚡ INDIA ELECTRICITY & ENERGY ANALYSIS PROJECT

    This repository presents an extensive data engineering, cleaning, and analytical study on India’s electricity ecosystem using Python. The project covers coal stock status, thermal power generation, renewable energy trends, energy requirements & availability, and installed capacity across states.

    The goal is to identify operational bottlenecks, resource deficits, energy trends, and support data-driven decisions in the power sector.

    📊 Electricity Data Insights & System Analysis

    The project leverages five government datasets:

    🔹 Daily Coal Stock Data

    🔹 Daily Power Generation

    🔹 Renewable Energy Production

    🔹 State-wise Energy Requirement vs Availability

    🔹 Installed Capacity Across Fuel Types

    The final analysis includes EDA, heatmaps, trend analysis, outlier detection, data-cleaning automation, and visual summaries.

    🔹 Key Features ✅ 1. Comprehensive Data Cleaning Pipeline

    Null value treatment using median/mode strategies

    Standardizing categorical inconsistencies

    Filling missing regions, states, and production values

    Date format standardization

    Removing duplicates across all datasets

    Large-scale outlier detection using custom 5×IQR logic (to preserve real-world operational variance)
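    A sketch of the 5×IQR rule (the common convention is 1.5×IQR; widening the factor to 5 keeps more of the real-world operational variance, as noted above):

    import pandas as pd

    def iqr_outliers(series: pd.Series, factor: float = 5.0) -> pd.Series:
        """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        return (series < q1 - factor * iqr) | (series > q3 + factor * iqr)

    # Usage with a hypothetical numeric column from coal_stock.csv:
    # coal = pd.read_csv("coal_stock.csv")
    # print(coal[iqr_outliers(coal["stock_days"])])  # "stock_days" is an assumed column name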

    ✅ 2. Exploratory Data Analysis (EDA)

    Includes:

    Coal stock trends over years

    Daily power generation patterns

    Solar, wind, and renewable growth

    State-wise energy shortage & surplus

    Installed capacity distribution across India

    Correlation maps for all major datasets

    ✅ 3. Trend Visualizations

    📈 Coal Stock Time-Series

    🔥 Thermal Power Daily Output

    🌞 Solar & Wind Contribution Over Time

    🇮🇳 State-wise Energy Deficit Bar Chart

    🗺️ MOM Energy Requirement Heatmap

    ⚙️ Installed Capacity Share of Each State

    📌 Dashboard & Analysis Components

    Section – Description
    🔹 Coal Stock Dashboard – Daily stock, consumption, transport mode, critical plants
    🔹 Power Generation – Capacity, planned vs actual generation
    🔹 Renewable Mix – Solar, wind, hydro & total RE contributions
    🔹 Energy Shortfall – Requirement vs availability across states
    🔹 Installed Capacity – Coal, Gas, Hydro, Nuclear & RES capacity stacks

    🧠 Insights & Findings

    🔥 Coal Stock

    Critical coal stock days observed for multiple stations

    Seasonal dips in stock days & indigenous supply shocks

    Import dependency minimal but volatile

    ⚡ Power Generation

    Thermal stations show fluctuating PLF (Plant Load Factor)

    Many states underperform planned generation

    🌞 Renewable Energy

    Solar shows continuous year-over-year growth

    Wind output peaks around monsoon months

    🔌 Energy Requirement vs Availability

    States like Delhi, Bihar, Jharkhand show intermittent deficits

    MOM heatmap highlights major seasonal spikes

    ⚙️ Installed Capacity

    Southern & Western regions dominate national capacity

    Coal remains the largest but renewable share rising rapidly

    📁 Files in This Repository

    File – Description
    coal_stock.csv – Cleaned coal stock dataset
    power_gen.csv – Daily power generation data
    renewable_engy.csv – State-wise renewable energy dataset
    engy_reqmt.csv – Monthly requirement & availability dataset
    install_cpty.csv – Installed capacity across fuel types
    electricity.ipynb – Full Python EDA notebook
    electricity.pdf – Export of full Colab notebook (code + visuals)
    README.md – GitHub project summary

    🛠️ Technologies Used 📊 Data Analysis

    Python (Pandas, NumPy, Matplotlib, Seaborn)

    🧹 Data Cleaning

    Null Imputation

    Outlier Detection (5×IQR)

    Standardization & Encoding

    Handling Large Multi-year Datasets

    🔧 System Concepts

    Modular Python Code

    Data Pipelines & Feature Engineering

    Version Control (Git/GitHub)

    Cloud Concepts (Google Colab + Drive Integration)

    📈 Core Metrics & KPIs

    Total Stock Days

    PLF% (Plant Load Factor)

    Renewable Energy Contribution

    Energy Deficit (%)

    National Installed Capacity Share

    📚 Future Enhancements

    Build a Power BI dashboard for visual storytelling

    Integrate forecasting models (ARIMA / Prophet)

    Automate coal shortage alerts

    Add state-level energy prediction for seasonality

    Deploy the analysis as a web dashboard (Streamlit)

  12. Articles sharing and reading from CI&T DeskDrop

    • kaggle.com
    zip
    Updated Aug 27, 2017
    Cite
    Gabriel Moreira (2017). Articles sharing and reading from CI&T DeskDrop [Dataset]. https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop
    Available download formats: zip (8594639 bytes)
    Dataset updated
    Aug 27, 2017
    Authors
    Gabriel Moreira
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Context

    Deskdrop is an internal communications platform developed by CI&T, focused on companies using Google G Suite. Among other features, the platform allows companies' employees to share relevant articles with their peers and collaborate around them.

    Content

    This rich and rare dataset contains a real sample of 12 months of logs (Mar. 2016 - Feb. 2017) from CI&T's Internal Communication platform (DeskDrop).
    It contains about 73k logged users' interactions on more than 3k public articles shared on the platform.

    This dataset features some distinctive characteristics:

    • Item attributes: Articles' original URL, title, and content plain text are available in two languages (English and Portuguese).
    • Contextual information: Context of the users' visits, like date/time, client (mobile native app / browser), and geolocation.
    • Logged users: All users are required to log in to the platform, providing long-term tracking of user preferences (not depending on cookies in devices).
    • Rich implicit feedback: Different interaction types were logged, making it possible to infer the user's level of interest in the articles (e.g. comments > likes > views).
    • Multi-platform: Users' interactions were tracked on different platforms (web browsers and mobile native apps).

    If you like it, please upvote!

    Take a look in these featured Python kernels:
    - Deskdrop datasets EDA: Exploratory analysis of the articles and interactions in the dataset
    - DeskDrop Articles Topic Modeling: A statistical analysis of the main articles topics using LDA
    - Recommender Systems in Python 101: A practical introduction of the main Recommender Systems approaches: Popularity model, Collaborative Filtering, Content-Based Filtering and Hybrid Filtering.

    Acknowledgements

    We thank CI&T for the support and permission to share a sample of real usage data from its internal communication platform: Deskdrop.

    Inspiration

    The two main approaches for Recommender Systems are Collaborative Filtering and Content-Based Filtering.

    In the RecSys community, there are some popular datasets available with users ratings on items (explicit feedback), like MovieLens and Netflix Prize, which are useful for Collaborative Filtering techniques.

    On the other hand, it is very difficult to find open datasets with additional item attributes, which would allow the application of Content-Based Filtering techniques or Hybrid approaches, especially in the domain of ephemeral textual items (e.g. articles and news).

    News datasets are also reported in the academic literature as very sparse, in the sense that, as users are usually not required to log in to news portals, IDs are based on device cookies, making it hard to track a user's page visits across different portals, browsing sessions, and devices.

    This difficult scenario for research and experiments on Content Recommender Systems was the main motivation for the sharing of this dataset.

  13. Most Popular Python Projects on GitHub (2018-)

    • kaggle.com
    zip
    Updated Feb 3, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    bogoconic1 (2024). Most Popular Python Projects on GitHub (2018-) [Dataset]. https://www.kaggle.com/yeoyunsianggeremie/most-popular-python-projects-on-github-2018-2023
    Available download formats: zip (14095145 bytes)
    Dataset updated
    Feb 3, 2024
    Authors
    bogoconic1
    Description

    [UPDATED EVERY WEEK]

    Have you ever wondered how popular the Python libraries you use regularly on Kaggle (such as pandas and numpy) are?

    This dataset lists the top 100 Python projects (or libraries) PER DAY, ranked based on the number of Github Stars, starting from 18 December 2018, almost 5 years back!

    Attributes

    date: Date where the record was collected

    rank: 1-100, rank based on number of Github stars, sorted in decreasing order

    item: Python

    repo_name: Name of the Github repository of the Python project (library)

    stars: Number of stars of the github repo

    forks: Number of forks of the github repo

    language: The language the repository is written in

    repo_url: The link to the github repository

    username: Creator of the github repository

    issues: Number of active issues raised in the github repository

    last_commit: The time of the most recent commit

    description: Description of the Python project (library)
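    For example, charting one repository's star growth over the collection period (a sketch; the CSV filename is an assumption, the column names come from the attribute list above):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("github_ranking.csv")  # filename is an assumption
    df["date"] = pd.to_datetime(df["date"])

    # Star growth of a single repository over time
    repo = df[df["repo_name"] == "pandas"]
    repo.set_index("date")["stars"].plot(title="GitHub stars over time")
    plt.show()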

    Reference https://github.com/EvanLi/Github-Ranking

    EDA: https://www.kaggle.com/code/yeoyunsianggeremie/eda-of-popular-python-libraries-used-in-kaggle

  14. Phone Price Predict 2020-2024

    • kaggle.com
    zip
    Updated Dec 10, 2024
    Cite
    Jerowai (2024). Phone Price Predict 2020-2024 [Dataset]. https://www.kaggle.com/datasets/jerowai/phone-price-predict-2020-2024
    Available download formats: zip (1002 bytes)
    Dataset updated
    Dec 10, 2024
    Authors
    Jerowai
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset Overview

    This dataset provides a curated, example-based snapshot of selected Samsung smartphones released (or expected to be released) between 2020 and 2024. It includes various technical specifications such as camera details, processor type, RAM, internal storage, display size, GPU, battery capacity, operating system, and pricing. Note that these values are illustrative and may not reflect actual market data.

    What’s Inside?

    Phone Name & Release Year: Quickly reference the time frame and model.
    Camera Specs: Understand the rear camera configurations (e.g., “108+10+10+12 MP”) and compare imaging capabilities across models.
    Processor & GPU: Gain insights into the performance capabilities by checking the processor and graphics chip.
    Memory & Storage: Review RAM and internal storage options (e.g., “8 GB RAM” and “128 GB Internal Storage”).
    Display & Battery: Compare screen sizes (from 6.1 to over 7 inches) and battery capacities (e.g., 5000 mAh) to gauge device longevity and usability.
    Operating System: Note the Android version at release.
    Price (USD): Examine relative pricing trends over the years.

    How to Use This Dataset

    Exploratory Data Analysis (EDA): Use Python libraries like Pandas and Matplotlib to explore pricing trends over time, changes in camera configurations, or the evolution of battery capacities. Example: df.groupby('Release Year')['Price (USD)'].mean().plot(kind='bar') can show how average prices have fluctuated year to year.

    Feature Comparison & Filtering: Easily filter models based on specs. For instance, query phones with at least 8 GB RAM and a 5000 mAh battery to identify devices suitable for power users. Example: df[(df['RAM (GB)'] >= 8) & (df['Battery Capacity (mAh)'] >= 5000)]

    Machine Learning & Predictive Analysis: Although this dataset is example-based and not suitable for precise forecasting, you could still practice predictive modeling. For example, train a simple regression model (e.g., LinearRegression in scikit-learn) to see if increasing RAM or battery capacity correlates with higher prices (a runnable sketch follows below).

    Comparing Release Trends: Investigate how flagship and mid-range specifications have evolved. See if there’s a noticeable shift towards larger displays, bigger batteries, or higher camera megapixels over the years.
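    A runnable version of that regression idea, assuming the CSV is named phones.csv and uses the column names quoted in the examples above:

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.read_csv("phones.csv")  # filename is an assumption

    # Predict price from RAM and battery capacity, dropping rows with missing values
    cols = ["RAM (GB)", "Battery Capacity (mAh)", "Price (USD)"]
    data = df[cols].dropna()
    X, y = data[cols[:2]], data[cols[2]]

    model = LinearRegression().fit(X, y)
    # Positive coefficients suggest the spec correlates with a higher price
    print(dict(zip(X.columns, model.coef_)))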

    Recommended Tools & Libraries

    Python & Pandas: For data cleaning, manipulation, and initial analysis.
    Matplotlib & Seaborn: For creating visualizations to understand trends and distributions.
    scikit-learn: For modeling and basic predictive tasks, if you choose to use these example values as a training ground.
    Jupyter Notebooks or Kaggle Kernels: For interactive analysis and iterative exploration.

    Disclaimer

    This dataset is a synthetic, illustrative example and may not match real-world specifications, prices, or release timelines. It’s intended for learning, experimentation, and demonstration of various data analysis and machine learning techniques rather than as a factual source.

  15. AndroidAppReviews

    • huggingface.co
    Cite
    Harshitha, AndroidAppReviews [Dataset]. https://huggingface.co/datasets/NovaNightshade/AndroidAppReviews
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Harshitha
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    The function to read the data is in the first cell of the Python notebook eda.ipynb.

    Data Structure

    The JSON data is organized as follows:

    • App Names: Top-level keys represent the names of the apps (e.g., "DoorDash", "McDonald's").
    • Score Categories: Under each app, reviews are grouped by score categories (e.g., "1", "2", "3", "4", "5").
    • Review Lists: Each score category contains a list of reviews.
    • Review Details: Each review includes:
      - content: The… See the full description on the dataset page: https://huggingface.co/datasets/NovaNightshade/AndroidAppReviews.
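    A sketch of walking that structure with the standard json module, assuming a local file named reviews.json laid out as described (the actual loader is in eda.ipynb):

    import json

    with open("reviews.json") as f:  # filename is an assumption
        data = json.load(f)

    # apps -> score categories -> lists of reviews
    for app_name, score_categories in data.items():
        for score, reviews in score_categories.items():
            print(app_name, score, len(reviews))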
    
  16. COVID-19_IE_from_Literature_Processed_Triplets

    • kaggle.com
    zip
    Updated Apr 16, 2020
    Cite
    Enrique Martín-López (2020). COVID-19_IE_from_Literature_Processed_Triplets [Dataset]. https://www.kaggle.com/datasets/enriquemartinlopez/covid19-ie-from-literature-processed-triplets
    Available download formats: zip (5000276 bytes)
    Dataset updated
    Apr 16, 2020
    Authors
    Enrique Martín-López
    Description

    Context

    This dataset is one of the products of our contribution (notebook link) to the COVID-19 Open Research Dataset Challenge (CORD-19).

    Content

    Starting from the original CORD-19 dataset of scientific articles, we have first filtered for all the articles that mention different terms for coronavirus disease and the virus that causes it. Then we have split these articles into more manageable sections, and for each section, we applied Information Extraction using Stanford's OpenIE to extract IE triplets: (object, relation, subject).

    Acknowledgements

    Inspiration

    Originally, we used this dataset to answer the questions in the CORD-19 challenge. However, this same dataset can be used to answer any other question relating to knowledge about the coronavirus disease from the scientific literature to date.

  17. 160k Spotify songs from 1921 to 2020 (Sorted)

    • kaggle.com
    Updated Sep 17, 2022
    Cite
    FCPercival (2022). 160k Spotify songs from 1921 to 2020 (Sorted) [Dataset]. https://www.kaggle.com/datasets/fcpercival/160k-spotify-songs-sorted
    Available download formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Sep 17, 2022
    Dataset provided by: Kaggle (http://kaggle.com/)
    Authors
    FCPercival
    Description

    This is an analysis of the data on Spotify tracks from 1921-2020 with Jupyter Notebook and Python Data Science tools.

    About the Dataset

    The Spotify dataset (titled data.csv) consists of 160,000+ tracks sorted by name, from 1921-2020, found on Spotify as of June 2020. Collected by Kaggle user and Turkish Data Scientist Yamaç Eren Ay, the data was retrieved and tabulated from the Spotify Web API. Each row in the dataset corresponds to a track, with variables such as the title, artist, and year located in their respective columns. Aside from the fundamental variables, musical elements of each track, such as the tempo, danceability, and key, were likewise extracted; these values were generated by Spotify's algorithm based on a range of technical parameters.

    Exploratory Data Analysis (EDA)

    1. Studying the correlations between the variables in the Spotify data.
    2. The evolution of different musical elements through the years.
    3. The divide between explicit and non-explicit songs through the years.

    Further Investigation and Inference (FII)

    1. Determining if there is a significant difference in popularity between explicit and non-explicit songs.
    2. Finding the most frequent emotions in Spotify tracks and analyzing their musical elements based on the track's mode and key.
    3. Determining the classifications of the Spotify tracks through K-Means Clustering.
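    A minimal K-Means sketch over two of the documented musical features, assuming data.csv uses the column names tempo and danceability; the cluster count is arbitrary:

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("data.csv")

    # Scale the features before clustering so neither dominates the distance metric
    features = StandardScaler().fit_transform(df[["tempo", "danceability"]])
    df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)
    print(df.groupby("cluster")[["tempo", "danceability"]].mean())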

    Project Directory Guide

    1. Spotify Data.ipynb is the main notebook where the data is imported for EDA and FII.
    2. data.csv is the dataset downloaded from Kaggle.
    3. spotify_eda.html is the HTML file for the comprehensive EDA done using the Pandas Profiling module.

    Project Notes

    1. This is in partial fulfillment of the course Statistical Modelling and Simulation (CSMODEL).

    Credits to gabminamedez for the original dataset.

  18. Snitch Clothing Sales

    • kaggle.com
    zip
    Updated Jul 23, 2025
    Cite
    NayakGanesh007 (2025). Snitch Clothing Sales [Dataset]. https://www.kaggle.com/datasets/nayakganesh007/snitch-clothing-sales/discussion
    Available download formats: zip (62616 bytes)
    Dataset updated
    Jul 23, 2025
    Authors
    NayakGanesh007
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🧥 Snitch Fashion Sales (Uncleaned) Dataset 📌 Context This is a synthetic dataset representing sales transactions from Snitch, a fictional Indian clothing brand. The dataset simulates real-world retail sales data with uncleaned records, designed for learners and professionals to practice data cleaning, exploratory data analysis (EDA), and dashboard building using tools like Python, Power BI, or Excel.

    📊 What You’ll Find The dataset includes over 2,500 records of fashion product sales across various Indian cities. It contains common data issues such as:

    Missing values

    Incorrect date formats

    Duplicates

    Typos in categories and city names

    Unrealistic discounts and profit values

    🧾 Columns Explained

    Column – Description
    Order_ID – Unique ID for each sale (some duplicates)
    Customer_Name – Name of the customer (inconsistent formatting)
    Product_Category – Clothing category (e.g., T-Shirts, Jeans; includes typos)
    Product_Name – Specific product sold
    Units_Sold – Quantity sold (some negative or null)
    Unit_Price – Price per unit (some missing or zero)
    Discount_% – Discount applied (some >100% or missing)
    Sales_Amount – Total revenue after discount (some miscalculations)
    Order_Date – Order date (multiple formats or missing)
    City – Indian city (includes typos like "Hyd", "bengaluru")
    Segment – Market segment (B2C, B2B, or missing)
    Profit – Profit made on the sale (some unrealistic/negative)

    💡 How to Use This Dataset

    Clean and standardize messy data (a starter sketch follows this list)

    Convert dates and correct formats

    Perform EDA to find:

    Top-selling categories

    Impact of discounts on sales and profits

    Monthly/quarterly trends

    Segment-based performance

    Create dashboards in Power BI or Excel Pivot Table

    Document findings in a PDF/Markdown report
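    A cleaning starter sketch for the issues listed in the column table, assuming the file is named snitch_sales.csv (the "Hyd" → "Hyderabad" expansion is an assumption beyond the documented typo):

    import pandas as pd

    df = pd.read_csv("snitch_sales.csv")  # filename is an assumption

    # Drop duplicate orders
    df = df.drop_duplicates(subset="Order_ID")

    # Parse mixed date formats; unparseable values become NaT
    df["Order_Date"] = pd.to_datetime(df["Order_Date"], errors="coerce")

    # Standardize city names ("Hyd" -> "Hyderabad" is an assumption)
    df["City"] = df["City"].str.strip().str.title().replace({"Hyd": "Hyderabad"})

    # Remove impossible values flagged in the column table
    df["Discount_%"] = pd.to_numeric(df["Discount_%"], errors="coerce")
    df = df[(df["Units_Sold"] > 0) & (df["Discount_%"].between(0, 100))]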

    🎯 Ideal For Aspiring data analysts and data scientists

    Excel / Power BI dashboard learners

    Portfolio project creators

    Kaggle competitions or practice

    📌 License This is a synthetic dataset created for educational use only. No real customer or business data is included.

  19. IMDb - List of movies by Genre

    • kaggle.com
    zip
    Updated Sep 25, 2022
    Cite
    Anu selvamathi.J.B (2022). IMDb - List of movies by Genre [Dataset]. https://www.kaggle.com/datasets/anuselvamathi/imdb-list-of-movies-by-genre
    Available download formats: zip (1420888 bytes)
    Dataset updated
    Sep 25, 2022
    Authors
    Anu selvamathi.J.B
    License

    Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
    License information was derived automatically

    Description

    The dataset contains 5 lists of movies by genre - Action, Sci-Fi, Comedy, Animation, and Adventure - taken from the IMDb site. The Beautiful Soup Python library was used to scrape the data from the HTML. See how by clicking here!

    Fields/columns: Movie title (name of the movie); Released year (year it was released); IMDb rating (movie rating given by IMDb).

    This dataset can be used for exploratory data analysis (EDA) or for model building. Try it!

    Note: Need slight data cleaning before use.
