32 datasets found
  1. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of titles added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. The dataset was cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:

    1. Treat the Nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns

    Extra steps and further explanation of the process are provided through the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)                                                                                      
    FROM netflix 
    GROUP BY show_id                                                                                              
    ORDER BY show_id DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLS. 
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    The director column's nulls amount to about 30% of the column, so I will not delete them; instead, I will find another column to populate them from. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.

    -- Below, we find out if some directors are likely to work with a particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate the NULL director rows using their matching movie_cast records.
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --While doing this, I found a less complex and faster way to populate a column, which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to the director column, we are going to populate country using director.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;

    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULL in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  2. Stock Market Dashboard Build (Python + Tableau)

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    jackmnob (2025). Stock Market Dashboard Build (Python + Tableau) [Dataset]. https://www.kaggle.com/datasets/jackmnob/stock-market-dashboard-build-python-tableau
    Explore at:
    zip (549379249 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    jackmnob
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original Credit goes to: Oleh Onyshchak

    Original Owner: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?resource=download

    rawData (.CSVs) Information:

    "This dataset contains historical data of daily prices for each ticker (minus a few incompatible tickers, such as CARR# and UTX#) - currently trading on NASDAQ. The up to date list is available from nasdaqtrader.com.

    The historic data was retrieved from Yahoo finance via yfinance python package."

    Each file contains data from 01/04/2016 to 04/01/2020.

    cleanData (.CSVs) & .ipynb (Python code) Information:

    This edition contains my .ipynb notebook for replication within JupyterLab and for code transparency via Kaggle. The dataset is cleaned with Python and pandas and then used to create the final Tableau dashboard linked below:

    My Tableau Dashboard: https://public.tableau.com/app/profile/jack3951/viz/TopStocksAnalysisPythonpandas/Dashboard1
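
    For a feel of the kind of pandas cleaning described above, here is a minimal sketch; the authoritative steps are in the included .ipynb, and the file layout and column names below are hypothetical placeholders.

    ```python
    # Illustrative-only sketch of combining and cleaning per-ticker CSVs with
    # pandas; the real steps live in the included .ipynb, and the paths and
    # column names here are hypothetical placeholders.
    import glob
    import pandas as pd

    frames = []
    for path in glob.glob("rawData/stocks/*.csv"):
        df = pd.read_csv(path, parse_dates=["Date"])
        df["Ticker"] = path.split("/")[-1].removesuffix(".csv")  # tag rows by source file
        frames.append(df)

    prices = pd.concat(frames, ignore_index=True)
    prices = prices.dropna(subset=["Close"]).sort_values(["Ticker", "Date"])
    prices.to_csv("cleanData/prices.csv", index=False)
    ```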

    Enjoy!

  3. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  4. Visualizing Chicago Crime Data

    • kaggle.com
    zip
    Updated Jul 1, 2022
    Cite
    Elijah Toumoua (2022). Visualizing Chicago Crime Data [Dataset]. https://www.kaggle.com/datasets/elijahtoumoua/chicago-analysis-of-crime-data-dashboard
    Explore at:
    zip (94861784 bytes)
    Dataset updated
    Jul 1, 2022
    Authors
    Elijah Toumoua
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Prelude

    This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

    About the Dataset

    Important Facts

    The dataset comes directly from the City of Chicago's website, under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to keep the information accurate, which means a record for a given crime may later be revised to better reflect the case. The dataset covers crimes from 2001 up to seven days prior to the current date.

    Reliability

    Using the ROCCC method, we can see that:

    * The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not know how big the sample size is, I believe the dataset has high reliability since it geographically covers the entirety of Chicago.
    * The data has high originality: The dataset was obtained directly from the Chicago Police Department's database, so we can say this dataset is original.
    * The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic locations, I do not think this gives us proper insight into why these crimes take place. We can pinpoint the location of a crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a low-income neighborhood? These missing factors prevent us from getting proper insights into why these crimes take place, so I would say the dataset is subpar in how comprehensive it is.
    * The data is current: The dataset is updated frequently to display crimes that took place up to seven days prior to today's date, and past crimes may be updated as more information comes to light. Due to the frequent updates, I believe the data is current.
    * The data is cited: As mentioned above, the data is collected directly from the police's CLEAR system, so we can say that the data is cited.

    Processing the Data

    Cleaning the Dataset

    The purpose of this step is to clean the dataset so that there are no outliers in the dashboard. To do this, we are going to:

    * Check for any null values and determine whether we should remove them.
    * Update any values where there may be typos.
    * Check for outliers and determine whether we should remove them.

    The following steps are explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)

    ```sql
    -- Examining the dataset
    -- There are over 7.5 million rows of data
    -- Putting a limit so it does not take a long time to run

    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    LIMIT 1000;

    -- Seeing which points are null
    -- There are 85,000 null points, so we can exclude them; that is not a
    -- significant amount since it is only ~1.3% of the dataset
    -- Most of the null points are in the lat and long, which we will need later
    -- Because we don't have the full address, we can't estimate the lat and
    -- long in SQL, so we will have to delete the rows with null data

    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;

    -- Deleting all null rows

    DELETE FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;

    -- Checking for any duplicates in the unique keys
    -- None to be found

    SELECT unique_key, COUNT(unique_key)
    FROM `portfolioproject-350601.ChicagoCrime....
    ```

  5. Data Preparation Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Sep 20, 2025
    Cite
    Data Insights Market (2025). Data Preparation Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/data-preparation-platform-1368457
    Explore at:
    doc, pdf, ppt
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Preparation Platform market is poised for substantial growth, estimated to reach $15,600 million by the study's end in 2033, up from $6,000 million in the base year of 2025. This trajectory is fueled by a Compound Annual Growth Rate (CAGR) of approximately 12.5% over the forecast period. The proliferation of big data and the increasing need for clean, usable data across all business functions are primary drivers. Organizations are recognizing that effective data preparation is foundational to accurate analytics, informed decision-making, and successful AI/ML initiatives. This has led to a surge in demand for platforms that can automate and streamline the complex, time-consuming process of data cleansing, transformation, and enrichment. The market's expansion is further propelled by the growing adoption of cloud-based solutions, offering scalability, flexibility, and cost-efficiency, particularly for Small & Medium Enterprises (SMEs).

    Key trends shaping the Data Preparation Platform market include the integration of AI and machine learning for automated data profiling and anomaly detection, enhanced collaboration features to facilitate teamwork among data professionals, and a growing focus on data governance and compliance. While the market exhibits robust growth, certain restraints may temper its pace. These include the complexity of integrating data preparation tools with existing IT infrastructures, the shortage of skilled data professionals capable of leveraging advanced platform features, and concerns around data security and privacy. Despite these challenges, the market is expected to witness continuous innovation and strategic partnerships among leading companies like Microsoft, Tableau, and Alteryx, aiming to provide more comprehensive and user-friendly solutions to meet the evolving demands of a data-driven world.

  6. To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040

    • borealisdata.ca
    • dataone.org
    Updated Feb 28, 2019
    Cite
    Shahram Yarmand (2019). To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040 [Dataset]. http://doi.org/10.5683/SP2/6KU4I7
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2019
    Dataset provided by
    Borealis
    Authors
    Shahram Yarmand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 2017 - Nov 2017
    Area covered
    Metro Vancouver
    Description

    The population of Metro Vancouver (20110729 Regional Growth Strategy Projections: Population, Housing and Employment 2006-2041 file) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015 Water Consumption Statistics file) will be essential. This drinking-water supply needs to be estimated and optimized (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report file) gives the amount of drinking water supplied by each reservoir to each region. The B.C. government has a specific strategy for the growing population until 2040 that guides it toward this goal. In addition, the new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible water source until 2040; as such, the government will have to decide how much groundwater is used. The goal of the project is two steps: (1) an optimization model for the three water reservoirs, and (2) estimating the new source of water to 2040.

    The data analysis process for the project uses six tools: Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, and SQL, with the results visualized in Tableau.
    1. Trifacta Wrangler: clean the data (Data Mining file).
    2. AMPL and Excel Solver: optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
    3. ArcMap (ArcGIS): combine the raw data with the reservoir optimization results and the population estimate to 2040 (GIS Map for Tableau file).
    4. Tableau, with SQL: visualize, estimate, and optimize the source of drinking water for Metro Vancouver until 2040 (export tableau data file).
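
    To make the LP step concrete, here is a minimal sketch in Python with scipy.optimize.linprog; the project itself used AMPL and Excel Solver, and every number below is a hypothetical placeholder rather than a value from the dataset.

    ```python
    # Minimal linear-programming sketch (the project used AMPL and Excel Solver;
    # scipy is a stand-in here). All numbers are hypothetical placeholders.
    from scipy.optimize import linprog

    # Decision variables: water drawn from Capilano, Seymour, Coquitlam (ML/day).
    cost = [1.0, 1.1, 0.9]        # hypothetical cost per megalitre supplied
    capacity = [370, 380, 350]    # hypothetical reservoir capacities
    total_demand = 900            # hypothetical regional demand

    # Minimize total cost subject to meeting demand within each capacity.
    res = linprog(
        c=cost,
        A_eq=[[1.0, 1.0, 1.0]], b_eq=[total_demand],  # supply must equal demand
        bounds=[(0, cap) for cap in capacity],        # per-reservoir limits
    )
    print(res.x)  # optimal draw from each reservoir
    ```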

  7. Bellabeat Case Study Supplement

    • kaggle.com
    zip
    Updated Oct 28, 2022
    Cite
    Britta Smith (2022). Bellabeat Case Study Supplement [Dataset]. https://www.kaggle.com/datasets/brittasmith/bellabeat-casestudy-sql-tableau-excel
    Explore at:
    zip (65670 bytes)
    Dataset updated
    Oct 28, 2022
    Authors
    Britta Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data, clean data, and SQL query output tables as spreadsheets, supporting the Tableau story and the GitHub repository available at https://github.com/brittabeta/Bellabeat-Case-Study-SQL-Excel-Tableau

  8. HrDashboardTableauProject

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    Kusamdeep Sran (2025). HrDashboardTableauProject [Dataset]. https://www.kaggle.com/datasets/kusamdeepsran/hrdashboardtableauproject
    Explore at:
    zip (6163326 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    Kusamdeep Sran
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    An interactive Tableau dashboard analyzing key HR metrics—attrition, recruitment, performance, and diversity—to empower data-driven workforce decisions. Includes clean datasets, Tableau workbook (.twb/.twbx), and step-by-step insights.

  9. Rural Route Nomad Photo and Video Collection Dataset

    • zenodo.org
    csv
    Updated Jul 12, 2022
    Cite
    Alan Webber; Alan Webber (2022). Rural Route Nomad Photo and Video Collection Dataset [Dataset]. http://doi.org/10.5281/zenodo.6818292
    Explore at:
    csv
    Dataset updated
    Jul 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alan Webber; Alan Webber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset encompasses the metadata drawn from preserving and visualizing the Rural Route Nomad Photo and Video Collection. The collection consists of 14,058 born-digital objects shot on over a dozen digital cameras in over 30 countries, on seven continents from the end of 2008 through 2009. Metadata was generated using ExifTool, along with manual means, utilizing OpenRefine and Excel to parse and clean.
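
    As a rough illustration of the extraction step, a sketch along these lines could dump the EXIF metadata to CSV before the OpenRefine/Excel cleaning; the directory and tag names are hypothetical, and the collection's actual workflow also involved manual steps.

    ```python
    # Hedged sketch of metadata extraction with ExifTool's CSV output plus
    # pandas; the directory and tag names are hypothetical placeholders.
    import subprocess
    import pandas as pd

    # `exiftool -csv -r DIR` emits one CSV row of tags per file, recursively.
    with open("metadata.csv", "w") as out:
        subprocess.run(["exiftool", "-csv", "-r", "photos/"], stdout=out, check=True)

    df = pd.read_csv("metadata.csv")
    print(df[["SourceFile", "Model", "CreateDate"]].head())  # common EXIF tags
    ```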

    The dataset was a result of an overriding project to preserve the digital content of the Rural Route Nomad Collection, and then visualize photographic specs and geographic details with charts, graphs and maps in Tableau. A description of the project as a whole is publicly forthcoming. Visualizations can be found at https://public.tableau.com/app/profile/alan.webber5364.

  10. Embedded Analytics Solutions Market Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Mar 12, 2025
    + more versions
    Cite
    Data Insights Market (2025). Embedded Analytics Solutions Market Report [Dataset]. https://www.datainsightsmarket.com/reports/embedded-analytics-solutions-market-13061
    Explore at:
    ppt, doc, pdf
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Embedded Analytics Solutions market is experiencing robust growth, projected to reach $68.88 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 13.90%. This expansion is fueled by several key drivers. The increasing need for data-driven decision-making across various industries, coupled with the rising adoption of cloud-based solutions and the proliferation of big data, are significantly contributing to market growth. Furthermore, the growing demand for real-time business intelligence and the ease of integrating analytics directly into applications are fostering wider adoption. The market is segmented by solution (software and services), organization size (SMEs and large enterprises), deployment (cloud and on-premise), and end-user vertical (BFSI, IT & Telecommunications, Healthcare, Retail, Energy & Utilities, Manufacturing, and others). The competitive landscape is populated by established players like SAS, IBM, and Microsoft, alongside emerging innovative companies. Growth is expected to be particularly strong in North America and Europe initially, followed by increasing penetration in the Asia-Pacific region driven by technological advancements and rising digital adoption rates. The on-premise deployment model, while still significant, is gradually yielding to the cloud, driven by scalability, cost-effectiveness, and accessibility benefits.

    The continued growth trajectory is expected to be influenced by advancements in artificial intelligence (AI) and machine learning (ML), which will further enhance the capabilities of embedded analytics solutions. However, challenges such as data security concerns, the complexity of implementation, and the need for skilled professionals to manage and interpret data could act as potential restraints. Nevertheless, the overall market outlook remains positive, with significant opportunities for growth across all segments. The increasing emphasis on data visualization and user-friendly dashboards is also expected to further fuel market adoption, particularly amongst smaller organizations that traditionally lacked access to sophisticated analytical tools. The competitive landscape will likely witness mergers, acquisitions, and strategic partnerships as players strive to enhance their product offerings and expand their market share.

    Recent developments include: August 2022 - SAS and SingleStore announced a collaboration to help organizations remove barriers to data access, maximize performance and scalability, and uncover key data-driven insights. SAS Viya with SingleStore enables the use of SAS analytics and AI technology on data stored in SingleStore's cloud-native real-time database. The integration provides flexible, open access to curated data to help accelerate value for cloud, hybrid, and on-premises deployments. July 2022 - TIBCO announced the launch of TIBCO ModelOps, which helps customers simplify and scale cloud-based analytic model management, deployment, monitoring, and governance. TIBCO ModelOps addresses the requirement for speed in deploying AI and draws from TIBCO's leadership in data science, data visualization, and business intelligence. This aids AI teams in confronting critical deployment hurdles like ease of applying analytics to applications, identification and mitigation of bias, and transparency and manageability of an algorithm's behavior within business-critical applications.

    Key drivers for this market are: Increasing Demand for Advanced Analytical Techniques for Business Data; Increasing Number of Data-Driven Organizations; Increasing Adoption of Mobile BI and Big Data Analytics; Increasing Use of Mobile Devices and Cloud Computing Technologies. Potential restraints include: Licensing Challenges and Higher Associated Costs. Notable trends are: Increasing Use of Mobile Devices and Cloud Computing Technologies to Witness Significant Growth.

  11. Cyclisitic Trip Data 2019 (Google)

    • kaggle.com
    zip
    Updated Aug 4, 2022
    Cite
    Shaine Pepper (2022). Cyclisitic Trip Data 2019 (Google) [Dataset]. https://www.kaggle.com/datasets/shainepepper/divvy-2019-trip-data-clean
    Explore at:
    zip (27551971 bytes)
    Dataset updated
    Aug 4, 2022
    Authors
    Shaine Pepper
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Intro

    Cleaning this data took some time due to many NULL values, typos, and unorganized collection. My first step was to put the dataset into R and work my magic there. After analyzing and cleaning the data, I moved the data to Tableau to create easily understandable and helpful graphs. This step was a learning curve because there are so many potential options inside Tableau. Finding the correct graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.

    RStudio

    First, I needed to combine the four datasets into one, which I did using the rbind() function.

    Step two was to rename columns with typos or poor names.

    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"
    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"

    The next step was to remove all NULL values and implausibly large numbers, such as trip durations of more than 10 hours.

    library(dplyr)
    Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>%
      filter(across(where(is.character), ~ . != "NULL")) %>%
      type.convert(as.is = TRUE)

    Once the NULL data was removed, it was time to remove potential typos and poorly collected data. I could only identify exaggerated values in the trip_duration column, finding multiple cases of trips longer than 2,000,000 seconds. To find these large values, I used the count() function.

    Cyclistic_Clean_v2 %>% count(trip_duration > "30000")  # the column is still character at this point

    After finding multiple instances of this, I hit a snag: the trip_duration column was classed as character when it needed to be numeric to be cleaned further. It took me quite a while to realize this was the issue, and then I remembered the class() function. With it, I was easily able to confirm that the classification was wrong.

    class(Cyclistic_Clean_v2$trip_duration)

    Once I identified the classification, I still had some work to do before converting the column to numeric, as the values contained quotation marks, periods, and a trailing 0. To remove these I used the gsub() function.

    # Escape the dot so only a literal trailing ".0" is removed
    Cyclistic_Clean_v2$trip_duration <- gsub("\\.0$", "", Cyclistic_Clean_v2$trip_duration)
    Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)

    Now that the unwanted characters are gone, we can convert the column to numeric.

    Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)

    Doing this allows Tableau and R to read the data properly to create graphs without error.

    Next, I created a backup dataset in case there was any issue while exporting.

    Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
    write.csv(Cyclistic_Clean_v2, "Folder.Path/Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)

    After exporting I came to the conclusion that I should have put together a more accurate change log rather than brief notes. That is one major learning lesson I will take away from this project.

    All around, I had a lot of fun using R to transform and analyze the data. I learned many different ways to efficiently clean data.

    Tableau

    Now onto the fun part! Tableau is a very good tool to learn. There are so many different ways to bring your data to life and show your creativity inside your work. After a few guides and errors, I could finally start building graphs to bring the stakeholders' tasks to fruition.

    Charts

    Please note these are all made in Tableau and are meant to be interactive.

    Here you can find the relation between male and female riders.


    Male vs. female trip duration by user type.


    Busiest stations, filtered by month.


    Most popular starting stations.


    Most popular ending stations.


    Conclusion

    My main goal was to help find out how Cyclistic can convert casual riders into subscribers. Here are my findings.

    1. Casual riders take much longer trips than subscribers.
    2. Although there are many more male riders, females tend to ride longer than males.
    3. Stations #562 & #568 are the most busy by a h...
  12. Steam Games from 2013 to 2023

    • kaggle.com
    zip
    Updated Jan 7, 2024
    Cite
    Terenci Claramunt (2024). Steam Games from 2013 to 2023 [Dataset]. https://www.kaggle.com/terencicp/steam-games-december-2023
    Explore at:
    zip (6442898 bytes)
    Dataset updated
    Jan 7, 2024
    Authors
    Terenci Claramunt
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a derivative dataset created for my Tableau visualisation project. It's derived from two other datasets on Kaggle:

    Steam Games Dataset by Martin Bustos

    Video Games on Steam [in JSON] by Sujay Kapadnis

    From the Martin Bustos dataset, I removed the games without reviews and selected the most relevant features to create the following dashboard:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2Fce81900b3761554ce9acfc7ef25189b6%2Fsteam-dashboard.png?generation=1704630691045231&alt=media

    From the Sujay Kapadnis dataset, I added the data on game duration from HowLongToBeat.com.
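
    A minimal sketch of those two derivation steps might look like the following; the file and column names are hypothetical stand-ins, and the linked notebooks are the authoritative source.

    ```python
    # Hedged sketch of deriving the dataset: drop games without reviews, then
    # merge in HowLongToBeat durations. All names are hypothetical placeholders.
    import pandas as pd

    games = pd.read_csv("steam_games.csv")   # Martin Bustos dataset (placeholder name)
    hltb = pd.read_json("steam_hltb.json")   # Sujay Kapadnis dataset (placeholder name)

    games = games[(games["positive"] + games["negative"]) > 0]  # keep reviewed games
    merged = games.merge(hltb[["appid", "main_story_hours"]], on="appid", how="left")
    merged.to_csv("steam_2013_2023.csv", index=False)
    ```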

    The following notebooks contain exploratory data analysis and the transformations I used to generate this dataset from the two original datasets:

    Steam Games - Exploratory Data Analysis

    Steam Games - Data Transformation

    View the live dashboard on Tableau Public:

    Steam tag explorer

  13. Industry Layoffs 2020 - 2023

    • kaggle.com
    zip
    Updated Feb 4, 2023
    Cite
    Jake Clarke (2023). Industry Layoffs 2020 - 2023 [Dataset]. https://www.kaggle.com/datasets/clarkj37/layoffs2023cleaned
    Explore at:
    zip (64862 bytes)
    Dataset updated
    Feb 4, 2023
    Authors
    Jake Clarke
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset showcases my Google Data Analytics Capstone project, using Excel to clean the data, R to analyze it for insights, and Tableau to create visualizations.

  14. Superstore Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Cite
    Shivam Amrutkar (2023). Superstore Dataset [Dataset]. https://www.kaggle.com/datasets/yesshivam007/superstore-dataset
    Explore at:
    zip (2119716 bytes)
    Dataset updated
    Sep 25, 2023
    Authors
    Shivam Amrutkar
    License

    https://cdla.io/sharing-1-0/

    Description

    The Superstore Sales Data dataset, available in Excel format as "Superstore.xlsx," is a comprehensive collection of sales and customer-related information from a retail superstore. This dataset comprises three distinct tables, each providing specific insights into the store's operations and customer interactions.

  15. DA Analyst Capstone Project

    • kaggle.com
    zip
    Updated May 18, 2024
    Cite
    Tara Jacobs (2024). DA Analyst Capstone Project [Dataset]. https://www.kaggle.com/datasets/tarajacobs/mock-user-profiles-from-social-networks
    Explore at:
    zip (8714 bytes)
    Dataset updated
    May 18, 2024
    Authors
    Tara Jacobs
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


    BigQuery | BigQuery data cleaning

    Tableau | Creating visuals with Tableau

    Sheets | Cleaning NULL values, creating data tables

    RStudio | Organizing and cleaning data to create visuals

    SQL SSMS | Transforming, cleaning, and manipulating data

    LinkedIn | Survey poll


    Source for the mock dating site: pH7-Social-Dating-CMS. Source for the mock social site: tailwhip99/social_media_site.


  16. USA Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). USA Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/usa-weekly-real-estate-listings
    Explore at:
    zip (66961155 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    United States
    Description

    This Kaggle dataset offers a comprehensive view of the US real estate market, leveraging data sourced from Redfin via an unofficial API. It contains weekly snapshots stored in CSV files, reflecting the dynamic nature of property listings, prices, and market trends across various states and cities (except Wyoming, Montana, and North Dakota, and with specific data generation for Texas cities). Notably, it includes a prepared version, USA_clean_unique, which has undergone the initial cleaning steps outlined in my thesis; the other two countries covered by the thesis were France and the UK.

    These steps include:

    - Removal of irrelevant features for statistical analysis.
    - Renaming variables for consistency across international datasets.
    - Adjustment of variable value ranges for a more refined analysis.
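
    A minimal pandas sketch of those three steps, with hypothetical column names and thresholds (see the thesis for the actual cleaning):

    ```python
    # Hedged sketch of the three cleaning steps above; column names and
    # thresholds are hypothetical, not taken from USA_clean_unique.csv.
    import pandas as pd

    df = pd.read_csv("usa_listings_raw.csv")                    # hypothetical input file
    df = df.drop(columns=["mls_id", "listing_url"])             # 1. drop irrelevant features
    df = df.rename(columns={"PRICE": "price", "SQFT": "sqft"})  # 2. consistent variable names
    df = df[df["price"].between(10_000, 5_000_000)]             # 3. trim extreme value ranges
    ```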

    Unique aspects such as Redfin’s “hot” label algorithm, property search status, and detailed categorizations of property types (e.g., single-family residences, condominiums/co-ops, multi-family homes, townhouses) provide deep insights into the market. Additionally, external factors like interest rates, stock market volatility, unemployment rates, and crime rates have been integrated to enrich the dataset and offer a multifaceted view of the real estate market's drivers.

    The USA_clean_unique dataset represents a key step before data normalization/trimming, containing variables both in their raw form and categorized based on predefined criteria, such as property size, year of construction, and number of bathrooms/bedrooms. This structured approach aims to capture the non-linear relationships between various features and property prices, enhancing the dataset's utility for predictive modeling and market analysis.

    See columns from USA_clean_unique.csv and my Thesis (Table 2.8) for exact column descriptions.

    Table 2.4 and Section 2.2.3, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to USA_clean_unique.csv. Multiple steps include cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming columns for consistency.

  17. Top 100 TV Shows

    • kaggle.com
    zip
    Updated Jun 27, 2021
    Cite
    Jack Jae Hwan Kim (2021). Top 100 TV Shows [Dataset]. https://www.kaggle.com/jackjaehwankim/top-100-tv-shows
    Explore at:
    zip (2581 bytes)
    Dataset updated
    Jun 27, 2021
    Authors
    Jack Jae Hwan Kim
    Description

    Context

    This is a personal project in which I analyzed the main factors that lead me to select a TV show. I used Python for web scraping (also known as crawling) the data from IMDb.com and used a spreadsheet to clean the dataset. Finally, I used Tableau to visualize the data.

    For this project, I used web crawling to build up the database, gathering data on the top 100 TV shows listed by the IMDb user 'carlosotsubo'.
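
    A minimal sketch of that scraping pattern, with a placeholder list URL and CSS selector (the real IMDb markup and my actual script will differ):

    ```python
    # Hedged sketch of scraping a list of titles with requests + BeautifulSoup;
    # the URL and selector are hypothetical placeholders, not IMDb's real markup.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "https://www.imdb.com/list/ls000000000/",   # placeholder list URL
        headers={"User-Agent": "Mozilla/5.0"},      # some sites reject bare clients
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    titles = [h3.get_text(strip=True) for h3 in soup.select("h3.lister-item-header")]
    print(titles[:10])
    ```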

    Content

    1. tv_show: titles
    2. season_years: it ranges from the beginning year to the ending year.
      • Note: some TV shows are still ongoing.
    3. first_season_yr: the beginning year of the season
    4. last_season_yr: the final or ending year of the season
    5. running_time_min: the running time of the TV show per episode
    6. genre: in this dataset, it would be the main genre
    7. subgenre1: subgenre #1
    8. subgenre2: subgenre #2
    9. imdb_rating: ratings by IMDb members
    10. watched_yn: whether or not I've watched the show

    Acknowledgements

    I sincerely thank the IMDb user named, 'carlosotsubo,' for providing the list of top 100 TV shows.

    Inspiration

    The following questions need to be answered:

    1. How do I choose which TV show to watch?
    2. Does running time also affect my decision to watch the show?
    3. If not, would the genre be the main factor that affects my decision?

    Data Visualization

    After my own analysis, I've created the data visualization:

    https://public.tableau.com/app/profile/jae.hwan.kim/viz/HowdoIchoosewhichTVshowtowatch/Dashboard1

    If you guys give me feedback, I will be glad to hear! Thanks!

  18. Divvy Trips Clean Dataset (Nov 2024 – Oct 2025)

    • kaggle.com
    zip
    Updated Nov 14, 2025
    Cite
    Yeshang Upadhyay (2025). Divvy Trips Clean Dataset (Nov 2024 – Oct 2025) [Dataset]. https://www.kaggle.com/datasets/yeshangupadhyay/divvy-trips-clean-dataset-nov-2024-oct-2025
    Explore at:
    zip (170259034 bytes)
    Dataset updated
    Nov 14, 2025
    Authors
    Yeshang Upadhyay
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    📌 Overview

    This dataset contains a cleaned and transformed version of the public Divvy Bicycle Sharing Trip Data covering the period November 2024 to October 2025.

    The original raw data is publicly released by the Chicago Open Data Portal, and has been cleaned using Pandas (Python) and DuckDB SQL for faster analysis.
    This dataset is now ready for direct use in:
    • Exploratory Data Analysis (EDA)
    • SQL analytics
    • Machine learning
    • Time-series/trend analysis
    • Dashboard creation (Power BI / Tableau)

    📂 Source

    Original Data Provider:
    Chicago Open Data Portal – Divvy Trips
    License: Open Data Commons Public Domain Dedication (PDDL)
    This cleaned dataset only contains transformations; no proprietary or restricted data is included.

    🔧 Cleaning & Transformations Performed

    • Combined monthly CSVs (Nov 2024 → Oct 2025); see the sketch after this list
    • Removed duplicates
    • Standardized datetime formats
    • Created new fields:
      • ride_length
      • day_of_week
      • hour_of_day
    • Handled missing or null values
    • Cleaned inconsistent station names
    • Filtered invalid ride durations (negative or zero-length rides)
    • Exported as a compressed .csv for optimized performance
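
    A minimal sketch of a few of these steps with pandas and DuckDB; the file names are hypothetical placeholders, and the published dataset itself is the authoritative result.

    ```python
    # Hedged sketch of a few of the cleaning steps above; file names are
    # hypothetical placeholders.
    import glob
    import pandas as pd
    import duckdb

    # Combine monthly CSVs and standardize datetimes.
    df = pd.concat((pd.read_csv(f) for f in glob.glob("divvy/*.csv")), ignore_index=True)
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["ended_at"] = pd.to_datetime(df["ended_at"])

    # Derive the new fields, drop duplicates, and filter invalid rides.
    df["ride_length"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
    df["day_of_week"] = df["started_at"].dt.day_name()
    df["hour_of_day"] = df["started_at"].dt.hour
    df = df.drop_duplicates(subset="ride_id")
    df = df[df["ride_length"] > 0]

    # DuckDB can query the DataFrame in place for quick SQL analytics.
    print(duckdb.query("SELECT member_casual, COUNT(*) FROM df GROUP BY 1").df())
    ```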

    📊 Columns in the Dataset

    • ride_id
    • rideable_type
    • started_at
    • ended_at
    • start_station_name
    • end_station_name
    • start_lat
    • start_lng
    • end_lat
    • end_lng
    • member_casual
    • ride_length (minutes)
    • day_of_week
    • hour_of_day

    💡 Use Cases

    This dataset is suitable for:
    • DuckDB + SQL analytics
    • Pandas EDA
    • Visualization in Power BI, Tableau, Looker
    • Statistical analysis
    • Member vs. Casual rider behavioral analysis
    • Peak usage prediction

    📝 Notes

    This dataset is not the official Divvy dataset, but a cleaned, transformed, and analysis-ready version created for educational and analytical use.

  19. Artstation

    • kaggle.com
    zip
    Updated May 28, 2021
    Cite
    Dmitriy Zub (2021). Artstation [Dataset]. https://www.kaggle.com/dimitryzub/artstation
    Explore at:
    zip (4067138 bytes)
    Dataset updated
    May 28, 2021
    Authors
    Dmitriy Zub
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Contains links only, as the script used to extract the data was written for a freelance project.

    Content

    100,000 artwork links (just links). 50,000 artworks were scraped and contain data; ~40,000+ are unique (some artworks are from the same artist).

    Context

    While transitioning from 3D modeling to Data Analytics and Python programming, I decided to create a personal project to analyze something I have a close connection with.

    The dataset includes the following columns:

    - Role
    - Company worked at (if mentioned or extracted)
    - Date the artwork was posted
    - Number of views
    - Number of likes
    - Number of comments
    - Which software was used
    - Which tags were used
    - Artwork title
    - Artwork URL

    As a disclaimer, this is the first time I'm doing this. I want anyone who uses this dataset to respect artists' privacy by not using artists' email addresses in any way, even though this is publicly available data published by them. Correct me if I said something wrong here.

    Code

    The code used to extract data from ArtStation can be found here, in the GitHub repository.
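
    For illustration only, a fetch along these lines shows the requests + json pattern; the endpoint shape and field names are assumptions on my part, and the repository linked above holds the real code.

    ```python
    # Illustrative-only sketch of the requests + json scraping pattern; the
    # endpoint and field names are assumptions, not ArtStation's documented API.
    import requests

    resp = requests.get(
        "https://www.artstation.com/users/someartist/projects.json",  # assumed endpoint
        params={"page": 1},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    for project in resp.json().get("data", []):
        print(project.get("title"), project.get("permalink"))
    ```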

    Inspiration

    While transitioning from 3D modeling to Data Analytics and Python programming, I decided to create a personal project to analyze something I have a close connection with. I really enjoyed seeing progression in the 3D world (games, feature films, etc.).

    Goals

    The goal of this project was to better understand the process of gathering, processing, cleaning, analyzing, and visualizing data. Besides that, I wanted to learn which software, tags, and affiliations are most popular among artists.

    Tools used

    To scrape the data, these Python libraries/packages were used:

    - requests
    - json
    - Google Sheets API
    - selenium
    - regex

    To clean, analyze, and visualize the data:

    - Google Sheets
    - Tableau

    Visualization

    Note: the following visualizations contain data bias. Not every tag and affiliation was taken into account, due to the difficulties of data extraction and the mistakes I made.

    Tableau public dashboard

    https://user-images.githubusercontent.com/78694043/119978304-23cb0380-bfc2-11eb-8b70-e84100fa7630.png

    https://user-images.githubusercontent.com/78694043/119978269-1ada3200-bfc2-11eb-981f-b8ad2c2c0ff1.png

    https://user-images.githubusercontent.com/78694043/119978237-101f9d00-bfc2-11eb-9285-e0d9bcf688ee.png

  20. France Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). France Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/france-weekly-real-estate-listings-2022-2023
    Explore at:
    zip (2750497 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    France
    Description

    This Kaggle dataset provides real estate listings downloaded from the French market, capturing data from a leading platform in France (Seloger), mirroring the approach taken for the US dataset (from Redfin) and the UK dataset (from Zoopla). It encompasses detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.

    The cleaning process mirrored that of the US dataset, involving removing irrelevant features, normalizing variable names for dataset consistency with USA and UK, and adjusting variable value ranges to get rid of extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.

    For exact column descriptions, see columns for France_clean_unique.csv and my thesis.

    Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
