CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset contains synthetic (fake) clinical data created solely for educational purposes. It is designed to help learners practice data cleaning, preprocessing, and exploratory data analysis using Python, Pandas, or other data science tools.
⚠️ Disclaimer: All records in this dataset are randomly generated and do NOT represent any real individuals, patients, or organizations. Any resemblance to actual persons, living or dead, is purely coincidental. This dataset is safe to use publicly for tutorials, projects, and demonstrations.
Use Cases:
Exploratory Data Analysis (EDA)
Learning how to handle missing values, duplicates, and data inconsistencies
Practice for academic projects or YouTube tutorials
Building machine learning pipelines with safe dummy data
Dataset Structure:
- patient_id: Unique ID for each dummy patient
- assigned_sex: Gender (Male/Female)
- given_name: Randomly generated first name
- surname: Randomly generated last name
- address: Fake street address for demonstration
- city: Random synthetic city name
- state: State code (e.g., CA, TX, NY)
- zip_code: Fake 5-digit ZIP code
- country: Country (set as "United States" or similar placeholder)
- contact: Fake phone number + email format
- birthdate: Randomly generated birthdate (1970–2000)
- weight: Weight of the patient (kg)
- height: Height of the patient (inches/cm)
- bmi: Calculated Body Mass Index
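As an example of the cleaning this dataset is built for, here is a minimal Pandas sketch; the file name patients.csv and the assumption that height is stored in centimetres are illustrative, not part of the dataset description:

```python
import pandas as pd

# Load the synthetic clinical records (file name is an assumption)
df = pd.read_csv("patients.csv")

# Drop exact duplicate rows and normalize a text column
df = df.drop_duplicates()
df["assigned_sex"] = df["assigned_sex"].str.strip().str.title()

# Parse birthdate and keep only plausible values (1970-2000 per the description)
df["birthdate"] = pd.to_datetime(df["birthdate"], errors="coerce")
df = df[df["birthdate"].dt.year.between(1970, 2000)]

# Recompute BMI from weight (kg) and height, assuming height is stored in cm
height_m = df["height"] / 100
df["bmi"] = (df["weight"] / height_m**2).round(1)

print(df.isna().sum())  # remaining missing values to handle
```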
This video series presents an introduction to data literacy and 11 lessons, organized by the Open Development Cambodia Organization (ODC) to provide video tutorials on data literacy and the use of data in data storytelling. The 12 videos cover the following sessions:
* Introduction to the data literacy course
* Lesson 1: Understanding data
* Lesson 2: Explore data tables and data products
* Lesson 3: Advanced Google Search
* Lesson 4: Navigating data portals and validating data
* Lesson 5: Common data formats
* Lesson 6: Data standards
* Lesson 7: Data cleaning with Google Sheets
* Lesson 8: Basic statistics
* Lesson 9: Basic data analysis using Google Sheets
* Lesson 10: Data visualization
* Lesson 11: Data visualization with Flourish
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset is used in a data cleaning project based on the raw data from Alex the Analyst's Power BI tutorial series. The original dataset can be found here.
The dataset is employed in a mini project that involves cleaning and preparing data for analysis. It is part of a series of exercises aimed at enhancing skills in data cleaning using Pandas.
The dataset contains information related to [provide a brief description of the data, e.g., sales, customer information, etc.]. The columns cover various aspects such as [list key columns and their meanings].
The original dataset is sourced from Alex the Analyst's Power BI tutorial series. Special thanks to [provide credit or acknowledgment] for making the dataset available.
If you use this dataset in your work, please cite it as follows:
Feel free to reach out for any additional information or clarification. Happy analyzing!
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This dataset represents a medium-sized Canadian bookstore business operating three retail locations across Calgary (Downtown, NW, SE) and a central warehouse.
It covers 2019 to 2024, including the COVID-19 impact years (2020-2021) and the post-pandemic recovery with inflation-adjusted growth. The data integrates finance, operations, HR, and customer analytics, making it well suited to data portfolio projects, KPI tracking, and realistic bookkeeping simulations.
Time span: 2019 – 2024
Locations: Calgary -> Downtown (DT), NW, SE
Currency: Canadian Dollars (CAD)
Tax context: Alberta GST 5 %, no provincial PST
Inflation factor: 1.00 → 1.18 (2019 → 2024) applied to payroll, sales, and loan interest
This dataset is fully synthetic and designed for: - Business intelligence dashboards - Machine learning demos (forecasting, regression, clustering) - Financial and accounting analysis training - Data-cleaning and EDA (Exploratory Data Analysis) tutorials
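For illustration, a small Pandas sketch of the inflation and GST context described above; the file name, column names, and the interpolated per-year factors are assumptions based on the summary, not part of the dataset itself:

```python
import pandas as pd

# Hypothetical monthly sales extract from the bookstore dataset
sales = pd.read_csv("sales.csv", parse_dates=["date"])

# Assumed linear interpolation of the stated inflation factor (1.00 in 2019 -> 1.18 in 2024)
factors = {2019: 1.00, 2020: 1.036, 2021: 1.072, 2022: 1.108, 2023: 1.144, 2024: 1.18}
sales["inflation_factor"] = sales["date"].dt.year.map(factors)

# Deflate nominal CAD amounts back to 2019 dollars and add Alberta GST (5%, no PST)
sales["amount_2019_cad"] = sales["amount_cad"] / sales["inflation_factor"]
sales["gst"] = sales["amount_cad"] * 0.05

print(sales.groupby(sales["date"].dt.year)["amount_2019_cad"].sum())
```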
This dataset is released under the MIT License, free to use for research, learning, or commercial purposes.
Photo: by Pixabay, free to use.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
A comprehensive Amazon books dataset featuring 20,000 books and 727,876 reviews spanning 26 years (1997-2023), paired with a complete step-by-step data science tutorial. Perfect for learning data analytics from scratch or conducting advanced book market analysis.
What's Included:
Raw Data: 20K book metadata (titles, authors, prices, ratings, descriptions) + 727K detailed reviews
Complete Tutorial Series: 4 progressive Python scripts covering data loading, cleaning, exploratory analysis, and visualization
Ready-to-Run Code: Fully documented scripts with practice exercises
Educational Focus: Designed for ENTR 3901 coursework but suitable for all skill levels
Key Features:
Real-world e-commerce data (pre-filtered for quality: 200+ reviews, $5+ price)
Comprehensive documentation and setup instructions
Generates 6+ professional visualizations
Includes bonus analysis challenges (sentiment analysis, price optimization, time patterns)
Perfect for business analytics, market research, and data science education
Use Cases:
Learning data analytics fundamentals
Book market analysis and trends
Customer behavior insights
Price optimization studies
Review sentiment analysis
Academic coursework and projects
This dataset bridges the gap between raw data and practical learning, making it ideal for both beginners and experienced analysts looking to explore e-commerce patterns in the publishing industry.
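As a quick taste of this kind of analysis, here is a hedged Pandas sketch of a first exploratory pass over the book metadata; the file name and column names are assumptions and this is not one of the tutorial scripts:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the book metadata (file and column names are assumed)
books = pd.read_csv("books.csv")

# Basic cleaning: drop duplicate titles and rows without a price or rating
books = books.drop_duplicates(subset="title").dropna(subset=["price", "rating"])

# Simple exploratory summaries
print(books[["price", "rating"]].describe())
print(books.groupby("author")["rating"].mean().sort_values(ascending=False).head(10))

# Price distribution plot
books["price"].plot(kind="hist", bins=40, title="Price distribution (USD)")
plt.xlabel("Price")
plt.tight_layout()
plt.savefig("price_distribution.png")
```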
The 2010 NEDS is similar to the 2004 Nigeria DHS EdData Survey (NDES) in that it was designed to provide information on education for children age 4–16, focusing on factors influencing household decisions about children's schooling. The survey gathers information on adult educational attainment, children's characteristics and rates of school attendance, absenteeism among primary school pupils and secondary school students, household expenditures on schooling and other contributions to schooling, and parents'/guardians' perceptions of schooling, among other topics. The 2010 NEDS was linked to the 2008 Nigeria Demographic and Health Survey (NDHS) in order to collect additional education data on a subset of the households (those with children age 2–14) surveyed in the 2008 Nigeria DHS survey. The 2008 NDHS, for which data collection was carried out from June to October 2008, was the fourth DHS conducted in Nigeria (previous surveys were implemented in 1990, 1999, and 2003).
The goal of the 2010 NEDS was to follow up with a subset of approximately 30,000 households from the 2008 NDHS survey. However, the 2008 NDHS sample shows that of the 34,070 households interviewed, only 20,823 had eligible children age 2–14. To make statistically significant observations at the State level, 1,700 children per State and the Federal Capital Territory (FCT) were needed. It was estimated that an additional 7,300 households would be required to meet the total number of eligible children needed. To bring the sample size up to the required target, additional households were screened and added to the overall sample. However, these households did not have the NDHS questionnaire administered. Thus, the two surveys were statistically linked to create some data used to produce the results presented in this report, but for some households, data were imputed or not included.
National
Households Individuals
Sample survey data [ssd]
The eligible households for the 2010 NEDS are the same as those households in the 2008 NDHS sample for which interviews were completed and in which there is at least one child age 2-14, inclusive. In the 2008 NDHS, 34,070 households were successfully interviewed, and the goal here was to perform a follow-up NEDS on a subset of approximately 30,000 households. However, records from the 2008 NDHS sample showed that only 20,823 had children age 4-16. Therefore, to bring the sample size up to the required number of children, additional households were screened from the NDHS clusters.
The first step was to use the NDHS data to determine eligibility based on the presence of a child age 2-14. Second, based on a series of precision and power calculations, RTI determined that the final sample size should yield approximately 790 households per State to allow statistical significance for reporting at the State level, resulting in a total completed sample size of 790 × 37 = 29,230. This calculation was driven by desired estimates of precision, analytic goals, and available resources. To achieve the target number of households with completed interviews, we increased the final number of desired interviews to accommodate expected attrition factors such as unlocatable addresses, eligibility issues, and non-response or refusal. Third, to reach the target sample size, we selected additional samples from households that had been listed by NDHS but had not been sampled and visited for interviews. The final number of households with completed interviews was 26,934, slightly lower than the original target, but sufficient to yield interview data for 71,567 children, well above the targeted number of 1,700 children per State.
Face-to-face [f2f]
The four questionnaires used in the 2004 Nigeria DHS EdData Survey (NDES)— 1. Household Questionnaire 2. Parent/Guardian Questionnaire 3. Eligible Child Questionnaire 4. Independent Child Questionnaire—formed the basis for the 2010 NEDS questionnaires. These are all available in Appendix D of the survey report available under External Resources.
More than 90 percent of the questionnaires remained the same; where there was a clear justification or need for a change in item formulation, or a specific requirement for additional items, these were updated accordingly. A one-day workshop was convened with the NEDS Implementation Team and the NDES Advisory Committee to review the instruments and identify any needed revisions, additions, or deletions. Efforts were made to collect data to ease integration of the 2010 NEDS data into the FMOE's national education management information system. Instrument issues identified as problematic in the 2004 NDES, as well as items identified as potentially confusing or difficult, were proposed for revision. Issues that USAID, DFID, FMOE, and other stakeholders identified as essential but not included in the 2004 NDES questionnaires were proposed for incorporation into the 2010 NEDS instruments, with USAID serving as the final arbiter regarding questionnaire revisions and content.
General revisions accepted into the questionnaires included the following: - A separation of all questions related to secondary education into junior secondary and senior secondary to reflect the UBE policy - Administration of school-based questions for children identified as attending pre-school - Inclusion of questions on disabilities of children and parents - Additional questions on Islamic schooling - Revision to the literacy question administration to assess English literacy for children attending school - Some additional questions on delivery of UBE under the financial questions section
Upon completion of revisions to the English-language questionnaires, the instruments were translated and adapted by local translators into three languages—Hausa, Igbo, and Yoruba—and then back-translated into English to ensure accuracy of the translation. After the questionnaires were finalized, training materials used in the 2004 NDES and developed by Macro International, which included training guides, data collection manuals, and field observation materials, were reviewed. The materials were updated to reflect changes in the questionnaires. In addition, the procedures described in the manuals and guides were carefully reviewed. Adjustments were made, where needed, based on experience with large-scale surveys and lessons learned from the 2004 NDES and the 2008 NDHS, to ensure the highest quality data capture.
Data processing for the 2010 NEDS occurred concurrently with data collection. Completed questionnaires were retrieved by the field coordinators/trainers and delivered to NPC in standard envelopes, labeled with the sample identification, team, and State name. The shipment also contained a written summary of any issues detected during the data collection process. The questionnaire administrators logged the receipt of the questionnaires, acknowledged the list of issues, and acted upon them if required. The editors performed an initial check on the questionnaires, performed any coding of open-ended questions (with possible assistance from the data entry operators), and left them available to be assigned to the data entry operators. The data entry operators entered the data into the system, with the support of the editors for erroneous or unclear data.
Experienced data entry personnel were recruited from those who had performed data entry activities for NPC on previous studies. Each data entry team comprised a data entry coordinator, a supervisor, and operators. Data entry coordinators oversaw the entire data entry process from programming and training to final data cleaning, made assignments, tracked progress, and ensured the quality and timeliness of the data entry process. Data entry supervisors were on hand at all times to ensure that proper procedures were followed and to help editors resolve any uncovered inconsistencies. The supervisors controlled incoming questionnaires, assigned batches of questionnaires to the data entry operators, and managed their progress. Approximately 30 clerks were recruited and trained as data entry operators to enter all completed questionnaires and to perform the secondary entry for data verification. Editors worked with the data entry operators to review information flagged as "erroneous" or "dubious" in the data entry process and provided follow-up and resolution for those anomalies.
The data entry program developed for the 2004 NDES was revised to reflect the revisions in the 2010 NEDS questionnaire. The electronic data entry and reporting system performed internal consistency checks and flagged inconsistencies.
A very high overall response rate of 97.9 percent was achieved, with interviews completed in 26,934 households out of a total of 27,512 occupied households from the original sample of 28,624 households. The response rates did not vary significantly between urban and rural areas (98.5 percent versus 97.6 percent, respectively). The response rates for parents/guardians and children were even higher, and the rate for independent children, 97.4 percent, was slightly lower than the overall sample rate. In all these cases, the urban/rural differences were negligible.
Estimates derived from a sample survey are affected by two types of errors: (1) non-sampling errors and (2) sampling errors. Non-sampling errors are the result of mistakes made in implementing data collection and data processing.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset contains cleaned and structured information about popular movies. It was processed using Python and Pandas to remove null values, fix inconsistent formats, and convert date columns to proper datetime types.
The dataset includes attributes such as:
🎬 Movie title
⭐ Average rating
🗓️ Release date (converted to datetime)
🌍 Country of origin
🗣️ Spoken languages
This cleaned dataset can be used for:
Exploratory Data Analysis (EDA)
Visualization practice
Machine Learning experiments
Data cleaning and preprocessing tutorials
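As a quick illustration, a minimal Pandas sketch of loading and exploring the cleaned table; the file name movies_cleaned.csv and the exact column names (release_date, average_rating, title) are assumptions:

```python
import pandas as pd

# Load the cleaned movie table (file and column names are assumed)
movies = pd.read_csv("movies_cleaned.csv", parse_dates=["release_date"])

# Quick checks that the cleaning held: no nulls, proper dtypes
print(movies.isna().sum())
print(movies.dtypes)

# Example analysis: average rating by release decade
movies["decade"] = (movies["release_date"].dt.year // 10) * 10
print(movies.groupby("decade")["average_rating"].mean().round(2))

# Top 10 highest-rated titles
print(movies.sort_values("average_rating", ascending=False)[["title", "average_rating"]].head(10))
```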
Source: IMDb Top Movies (via API, for educational purposes)
Last Updated: November 2025
Overview
This dataset contains daily weather observations, including temperature, wind speed, and weather events recorded over multiple days. It is a simple and clean dataset suitable for beginners and intermediate users who want to practice data cleaning, handling missing values, exploratory data analysis (EDA), visualization, and basic predictive modeling.
Dataset Structure
Each row represents a single day's weather record.
Columns
day — Date of the observation.
temperature — Recorded temperature of the day (in °F).
windspeed — Wind speed of the day (in mph).
event — Weather event such as Rain, Sunny, or Snow.
Key Characteristics
Contains missing values in temperature, windspeed, and event columns. Useful for practicing:
Data cleaning and imputation
Time-series formatting
Handling categorical data
Basic statistical analysis
Simple forecasting tasks
Intended Use
This dataset is suitable for educational and demonstration purposes, including:
Data preprocessing tutorials
Pandas practice notebooks
Visualization exercises
Introductory machine learning tasks
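For example, a small Pandas sketch of the imputation practice this dataset supports; the file name weather.csv is an assumption, while the columns follow the description above:

```python
import pandas as pd

# Load the daily weather observations (file name is assumed)
weather = pd.read_csv("weather.csv", parse_dates=["day"])

# Fill numeric gaps: forward-fill temperature, interpolate wind speed
weather["temperature"] = weather["temperature"].ffill()
weather["windspeed"] = weather["windspeed"].interpolate()

# Fill missing events with an explicit placeholder category
weather["event"] = weather["event"].fillna("No event")

# Set the date as the index for time-series style analysis
weather = weather.set_index("day").sort_index()
print(weather.describe(include="all"))
```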
This dataset provides five years of daily stock market data for Infosys Ltd. (INFY) — one of India’s largest multinational IT services and consulting firms.
It contains key daily metrics such as Open, High, Low, Close prices, and Trading Volume, covering the period from Oct 2020 to Oct 2025.
The dataset is ideal for financial time series analysis, machine learning forecasts, algorithmic trading strategies, and investment research.
📅 Dataset Summary
Date: Trading date in YYYY-MM-DD format
Ticker: Stock symbol (INFY) representing Infosys Ltd.
Open: Opening price of the stock on the given day
High: Highest price reached during the trading session
Low: Lowest price reached during the trading session
Close: Closing price at the end of the trading day
Volume: Number of shares traded on that day
File name: INFY_5years_data.csv
Format: CSV (UTF-8 encoded)
Period covered: ~2020–2025
Records: ~1,250 rows (approx. 250 trading days per year × 5 years)
Columns: 7 (Date, Ticker, Open, High, Low, Close, Volume)
🔍 Potential Use Cases
You can use this dataset for:
📊 Trend analysis – identify price patterns and seasonality
🤖 Machine learning – build stock price prediction or volatility models
💡 Investment strategy testing – simulate buy/sell signals (using moving averages, RSI, etc.)
🧩 Time-series forecasting – using ARIMA, LSTM, or Prophet models
🎓 Educational projects – financial analytics and data cleaning tutorials
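As a starting point, here is a minimal Pandas sketch of the moving-average signal idea mentioned above; the signal logic is purely illustrative and not part of the dataset:

```python
import pandas as pd

# Load the daily OHLCV data for INFY
prices = pd.read_csv("INFY_5years_data.csv", parse_dates=["Date"]).set_index("Date").sort_index()

# 50- and 200-day simple moving averages on the closing price
prices["SMA50"] = prices["Close"].rolling(50).mean()
prices["SMA200"] = prices["Close"].rolling(200).mean()

# A naive golden-cross style signal: long when the 50-day SMA is above the 200-day SMA
prices["signal"] = (prices["SMA50"] > prices["SMA200"]).astype(int)

# Daily returns and annualized volatility as a quick risk summary
returns = prices["Close"].pct_change()
print("Annualized volatility:", round(returns.std() * (252 ** 0.5), 4))
print(prices[["Close", "SMA50", "SMA200", "signal"]].tail())
```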
Bank Data Analysis | Real World Project | Power BI
In this visualization, I followed the process of analyzing a bank dataset using Microsoft Power BI. I started by importing the data into Power BI and then performed data cleaning, transformation, and visualization on the given data to gain insights and create a comprehensive analysis report.
Here I created insightful visualizations and interactive reports that can be used for business intelligence and decision-making purposes.
Data set: based on the tutorial by Data Visionary.
YouTube video referenced: https://www.youtube.com/watch?v=GZqBefbNP10&t=1581s
Analysis done and visualizations shown for:
1. Balance by Age and Gender
2. Number of Customers by Age and Gender
3. Number of Customers by Region
4. Balance by Region
5. Number of Customers by JobType
6. Balance by Gender
7. Total Customers Joined
8. Cards: (i) Max Balance by Age, (ii) Min Balance by Age, (iii) Max Customers by Gender
Dear all, kindly go through the report and share your suggestions; please guide me on any changes required and correct me where I need to improve.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
This data set and associated notebooks are meant to give you a head start in accessing the RTEM Hackathon by showing some examples of data extraction, processing, cleaning, and visualisation. The data available on this Kaggle page is only a selected part of the whole data set extracted for the tutorials. A series of video tutorials is associated with this dataset and notebooks and can be found on the Onboard YouTube channel.
An introduction to the API usage and how to retrieve data from it. This notebook is outlined in several YouTube videos that discuss: - how to get started with your account and get oriented to the Kaggle environment, - get acquainted with the Onboard API, - and start using the Onboard API wrapper to extract and explore data.
How to query data points meta-data, process them and visually explore them. This notebook is outlined in several YouTube videos that discuss: - how to get started exploring building metadata/points, - select/merge point lists and export as CSV - and visualize and explore the point lists
How to query time-series from data points, process and visually explore them. This notebook is outlined in several YouTube videos that discuss: - how to load and filter time-series data from sensors - resample and transform time-series data - and create heat maps and boxplots of data for exploration
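As a rough illustration of that workflow outside the notebooks, here is a generic Pandas sketch on a hypothetical CSV export of sensor time-series; the real tutorials use the Onboard API wrapper instead, and the file name, columns, and point id below are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical CSV export of sensor time-series (timestamp, point_id, value);
# the actual tutorials retrieve this through the Onboard API wrapper.
ts = pd.read_csv("timeseries_export.csv", parse_dates=["timestamp"])

# Keep one point (assumed id) and resample to hourly means
zone_temp = (
    ts[ts["point_id"] == 101]
    .set_index("timestamp")["value"]
    .resample("1h")
    .mean()
)

# Heat map of hour-of-day vs. date, a common way to spot scheduling patterns
pivot = zone_temp.to_frame("value").assign(
    date=lambda d: d.index.date, hour=lambda d: d.index.hour
).pivot_table(index="hour", columns="date", values="value")

plt.imshow(pivot, aspect="auto", origin="lower")
plt.colorbar(label="Value")
plt.xlabel("Day")
plt.ylabel("Hour of day")
plt.title("Hourly profile heat map")
plt.savefig("heatmap.png")
```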
A quick example of a starting point towards the analysis of the data for some sort of solution and reference to a paper that might help get an overview of the possible directions your team can go in. This notebook is outlined in several YouTube videos that discuss: - overview of use cases and judging criteria - an example of a real-world hypothesis - further development of that simple example
More information about the data and competition can be found on the RTEM Hackathon website.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
License information was derived automatically
This dataset contains information about 1,000 books from an online bookstore, including their titles, prices, availability, ratings, categories, and product page URLs. It is ideal for projects involving:
Web scraping and data extraction tutorials
Natural Language Processing (e.g. analyzing book titles)
E-commerce data analysis and visualization
Recommendation systems based on category or price
Data cleaning and preprocessing practice
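For instance, a minimal Pandas sketch of the cleaning practice this dataset supports; the file name and exact column names (price, availability, category, title) are assumptions about the export format:

```python
import pandas as pd

# Load the scraped book listings (file and column names are assumed)
books = pd.read_csv("books_scraped.csv")

# Clean the price column: strip any currency symbol and cast to float
books["price"] = pd.to_numeric(
    books["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)

# Availability as a simple boolean flag
books["in_stock"] = books["availability"].str.contains("In stock", case=False, na=False)

# Average price and count of titles per category
summary = books.groupby("category").agg(titles=("title", "count"), avg_price=("price", "mean"))
print(summary.sort_values("titles", ascending=False).head(10))
```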
Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This dataset contains customer demographic and behavioral information designed for exploring segmentation, clustering, and predictive analytics in retail and marketing contexts. It provides a simple yet powerful foundation for practicing data science techniques such as K-Means clustering, customer profiling, and recommendation systems.
### Dataset Features
- CustomerID: Unique identifier for each customer
- Genre: Gender of the customer (Male/Female)
- Age: Age of the customer (years)
- Annual Income (k$): Annual income in thousands of dollars
- Spending Score: A score assigned by the business based on customer behavior and spending patterns
Notes
- Some records contain missing values (NaN) in Age, Annual Income, or Spending Score. These can be handled using imputation, removal, or advanced techniques depending on the analysis.
- Spending Score is an arbitrary metric often used in clustering exercises to simulate customer engagement.
### Potential Use Cases
- Customer Segmentation: Apply clustering algorithms (e.g., K-Means, DBSCAN) to group customers by income and spending habits.
- Marketing Strategy: Identify high-value customers and tailor promotions.
- Predictive Modeling: Build models to predict spending behavior based on demographics.
- Data Cleaning Practice: Handle missing values and prepare the dataset for machine learning tasks.
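As an example, here is a minimal scikit-learn sketch of the K-Means segmentation described above; the file name Mall_Customers.csv and the choice of five clusters are assumptions, while the feature names follow the column list:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the customer records (file name is assumed)
customers = pd.read_csv("Mall_Customers.csv")

# Simple imputation for the missing values noted above
features = customers[["Annual Income (k$)", "Spending Score"]].copy()
features = features.fillna(features.median())

# Scale, then cluster into five segments (a common choice for this kind of data)
scaled = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["segment"] = kmeans.fit_predict(scaled)

# Profile each segment by average income and spending score
print(customers.groupby("segment")[["Annual Income (k$)", "Spending Score"]].mean().round(1))
```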
This dataset is widely used in machine learning tutorials and business analytics projects because it is small, interpretable, and directly applicable to real-world scenarios like retail customer analysis. It’s ideal for beginners learning clustering and for professionals prototyping segmentation strategies.
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
The Titanic dataset is one of the most iconic and frequently used datasets in the data science and machine learning community. It originates from the tragic sinking of the RMS Titanic on April 15, 1912, after it struck an iceberg during its maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making it one of the deadliest commercial peacetime maritime disasters in modern history.
This dataset provides detailed information on a subset of the passengers aboard the Titanic and is primarily used to build predictive models to determine whether a passenger survived or not, based on the available features. It is a supervised learning problem, specifically a binary classification task, where the target variable is Survived (1 = Yes, 0 = No).
Purpose and Use Cases
The Titanic dataset is commonly used for: - Learning data preprocessing techniques such as handling missing values, encoding categorical variables, and feature scaling - Performing exploratory data analysis (EDA) and creating visualizations - Engineering new features from existing data to enhance model performance - Training and evaluating various classification models such as Logistic Regression, Decision Trees, Random Forests, and XGBoost - Benchmarking classification pipelines in data science competitions, especially on platforms like Kaggle
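For instance, a minimal scikit-learn sketch of such a baseline pipeline; the file name train.csv (the standard Kaggle split) is an assumption:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the passenger records (the standard train.csv split is assumed)
titanic = pd.read_csv("train.csv")

# Minimal preprocessing: impute Age and Embarked, one-hot encode categoricals
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic["Embarked"] = titanic["Embarked"].fillna(titanic["Embarked"].mode()[0])
X = pd.get_dummies(
    titanic[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]], drop_first=True
)
y = titanic["Survived"]

# Train/test split and a simple logistic regression baseline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Accuracy:", round(accuracy_score(y_test, model.predict(X_test)), 3))
```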
Key Features / Columns
The standard columns include PassengerId, Survived (target: 1 = survived, 0 = did not), Pclass (ticket class), Name, Sex, Age, SibSp (siblings/spouses aboard), Parch (parents/children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation).
Challenges and Considerations
Typical challenges include substantial missing values in Age and Cabin, categorical variables (Sex, Embarked) that need encoding, and the modest size of the standard training split (891 passengers), which limits model complexity and makes careful validation important.
Why It's Popular
The Titanic dataset is based on a real-world historical event, making it intuitive and engaging for learners. It is especially suitable for beginners looking to understand the end-to-end machine learning pipeline. The dataset's moderate size and feature variety encourage experimentation in data cleaning, transformation, visualization, and modeling. It is frequently used in online tutorials, courses, and machine learning competitions to demonstrate model development and evaluation practices.
GNU Free Documentation License v1.3 (http://www.gnu.org/licenses/fdl-1.3.html)
Small toy data inspired by ITSM (IT service management) tickets. It includes noisy labels, multiple languages, and missing data on purpose. Here is one data examination and cleaning procedure written by me:
Feel free to add yours!
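As a generic starting point (separate from the author's procedure mentioned above; the column names description and label are assumptions), a minimal Pandas examination might look like this:

```python
import pandas as pd

# Toy ITSM ticket data (column names are assumptions for this sketch)
tickets = pd.read_csv("itsm_tickets.csv")

# Examine the mess first: missing values and label noise
print(tickets.isna().sum())
print(tickets["label"].value_counts(dropna=False))

# Basic cleaning: drop exact duplicates, normalize label casing/whitespace
tickets = tickets.drop_duplicates()
tickets["label"] = tickets["label"].str.strip().str.lower()

# Flag rows whose description is missing or too short to be useful
tickets["needs_review"] = tickets["description"].isna() | (tickets["description"].str.len() < 10)
print(tickets["needs_review"].mean())
```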