License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of titles added to Netflix from 2008 to 2021, with release years ranging from 1925 to 2021. The dataset was cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.
We are going to:
1. Treat the nulls
2. Treat the duplicates
3. Populate missing rows
4. Drop unneeded columns
5. Split columns
Extra steps and further explanation of the process are given in the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY COUNT(*) DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are nulls:
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
Nulls in the director column make up about 30% of the column, so I will not delete them; instead, I will populate them from another column. To do that, we first check whether there is a relationship between the movie_cast and director columns.
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records.
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL ;
--Repeat this step to populate the rest of the director nulls
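Rather than repeating the single-pair UPDATE by hand for every director and cast combination, the same set-based UPDATE ... FROM pattern used later for the country column could fill all of them in one pass. A sketch, under the assumption that an exact movie_cast match reliably identifies the same director:
--Set-based alternative (sketch): fill director from any other row with the same cast
UPDATE netflix
SET director = nt2.director
FROM netflix AS nt2
WHERE netflix.movie_cast = nt2.movie_cast
AND netflix.show_id <> nt2.show_id
AND netflix.director IS NULL
AND nt2.director IS NOT NULL;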
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--While doing this, I found a simpler and faster way to populate a column, which I will use next
Just like the director column, I will not delete the nulls in country. Since country is related to director and cast, we are going to populate the country column using the director column.
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director AND netflix.show_id <> nt2.show_id
AND netflix.country IS NULL;
--Confirm whether any rows still have a NULL country (directors with no matching country)
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULLs in country as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--DELETE nulls
DELETE FROM netflix
WHERE date_added IS NULL;
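The remaining steps from the plan (dropping unneeded columns and splitting columns) are cut off here. A sketch of what they might look like in PostgreSQL; the column choices below are illustrative assumptions, not the author's actual steps:
--Drop a column not needed for the dashboard (example column only)
ALTER TABLE netflix
DROP COLUMN description;
--Split duration (e.g. '90 min', '2 Seasons') into a numeric value and a unit
SELECT show_id,
split_part(duration, ' ', 1)::int AS duration_value,
split_part(duration, ' ', 2) AS duration_unit
FROM netflix;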
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
The reference for the dataset and the dashboard was the YouTube channel codebasics. I used a fictitious company called Atlix, whose Sales Director wants the sales data in a proper format that can help in decision making.
We have a total of 5 tables, namely customers, products, markets, date, and transactions. The data is exported from MySQL to Tableau.
In Tableau, inner joins were used.
In the transactions table, we notice that some sales amount figures are either negative or zero while the sales quantity is 1 or more. This cannot be right. Therefore, we filter the sales amount in Tableau so that the minimum sales amount is 1.
When the currency column from the transactions table was grouped in MySQL, we could see both 'USD' and 'INR' showing up. We cannot have sales data in two currencies. This was rectified by converting the USD sales amounts into INR using the latest exchange rate of Rs. 81.
We make the above change in Tableau by creating a new calculated field called 'Normalised Sales Amount': IF [Currency] = 'USD' THEN [Sales Amount] * 81 ELSE [Sales Amount] END.
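The same normalisation, together with the minimum-amount filter described above, could alternatively be pushed down to MySQL before the data reaches Tableau. A sketch, assuming the transactions table carries currency and sales_amount columns (names assumed from the description):
SELECT t.*,
CASE WHEN t.currency = 'USD' THEN t.sales_amount * 81 -- rate of Rs. 81 from the text
ELSE t.sales_amount
END AS normalised_sales_amount
FROM transactions AS t
WHERE t.sales_amount >= 1; -- exclude the zero and negative amounts noted above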
Conclusion: The dashboard prepared is an interactive dashboard with filters. For example, by clicking on Mumbai under "Sales by Markets", the other charts change as well, since they now show only the results pertaining to Mumbai. This can be done by year, month, customers, products, etc. A parameter with a filter has also been created for top customers and top products; this produces a slider that can be used to view, for example, the top 10 customers and products, adjusting the count as needed.
The following information can be passed on to the sales team or director.
Total Sales: From Jun '17 to Feb '20, total sales were INR 12.83 million. There was a drop of 57% in sales revenue from 2018 to 2019. The year 2020 has not been considered, as it only accounts for two months of data.
Markets: Mumbai, the top-performing market, accounts for 51% of total sales and saw a drop in sales of almost 64% from 2018 to 2019.
Top Customers: Path was in 2nd position by sales in 2018, accounting for 19% of total sales, after Electricalslytical, which accounted for 21%. But in 2019, Electricalslytical and Path were the 2nd and 4th highest customers by sales, respectively.
By targeting specific markets and customers with new ideas such as promotions and discounts, we can look to reverse the trend of decreasing sales.
Discover the booming Data Preparation Tools market! Learn about its 18.5% CAGR, key players (Microsoft, Tableau, IBM), and regional growth trends from our comprehensive analysis. Explore market segments, drivers, and restraints shaping this crucial sector for businesses of all sizes.
License: MIT (https://opensource.org/licenses/MIT)
Dataset Description:
The myusabank.csv dataset contains daily financial data for a fictional bank (MyUSA Bank) over a two-year period. It includes various key financial metrics such as interest income, interest expense, average earning assets, net income, total assets, shareholder equity, operating expenses, operating income, market share, and stock price. The data is structured to simulate realistic scenarios in the banking sector, including outliers, duplicates, and missing values for educational purposes.
Potential Student Tasks:
• Data cleaning and preprocessing
• Exploratory data analysis (EDA)
• Calculating key performance indicators (KPIs) (see the SQL sketch after this list)
• Building Tableau dashboards
• Forecasting and predictive modeling
• Business insights and reporting
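As one illustration of the KPI task above, a minimal SQL sketch, assuming the CSV has been loaded into a table named myusabank and that the columns carry the names below (both are assumptions; the actual file may differ):
-- Standard banking KPIs from the metrics listed in the description
SELECT
(interest_income - interest_expense) / NULLIF(avg_earning_assets, 0) AS net_interest_margin,
net_income / NULLIF(total_assets, 0) AS return_on_assets,
net_income / NULLIF(shareholder_equity, 0) AS return_on_equity
FROM myusabank;
-- NULLIF guards against division by zero in rows with missing or zero denominators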
Educational Goals:
The dataset aims to provide hands-on experience in data preprocessing, analysis, and visualization within the context of banking and finance. It encourages students to apply data science techniques to real-world financial data, enhancing their skills in data-driven decision-making and strategic analysis.
The global Data Preparation Software market is poised for substantial growth, projected to reach an estimated $613 million in 2025 with a compelling Compound Annual Growth Rate (CAGR) of 8.5% through 2033. This robust expansion is fueled by the escalating volume and complexity of data generated across all industries, necessitating efficient tools for cleaning, transforming, and enriching raw data into usable formats for analytics and decision-making. Large enterprises, in particular, are significant adopters, leveraging these solutions to manage vast datasets and derive actionable insights. However, the Small and Medium-sized Enterprises (SMEs) segment is emerging as a key growth driver, as more businesses recognize the competitive advantage that well-prepared data offers, even with limited IT resources. The prevalent trend towards cloud-based solutions further democratizes access to advanced data preparation capabilities, offering scalability and flexibility that are crucial in today's dynamic business environment.
Key market drivers include the increasing demand for data-driven decision-making, the growing adoption of business intelligence and advanced analytics, and the need for regulatory compliance. Trends such as the integration of AI and machine learning within data preparation tools to automate repetitive tasks, the rise of self-service data preparation for business users, and the focus on data governance and quality are shaping the market landscape. While the market exhibits strong growth, potential restraints include the high initial cost of some sophisticated solutions and the need for skilled personnel to fully leverage their capabilities.
Geographically, North America and Europe are expected to continue their dominance, driven by established technological infrastructure and a strong analytics culture. However, the Asia Pacific region is anticipated to witness the fastest growth due to rapid digital transformation and increasing data generation.
This report provides an in-depth analysis of the global Data Preparation Software market, projecting a robust growth trajectory from a Base Year of 2025 through a Forecast Period of 2025-2033. The Study Period covers 2019-2033, with a particular focus on the Estimated Year of 2025 and the Historical Period of 2019-2024. We project the market to reach substantial valuations, with the global market size estimated to be over $500 million in 2025, and poised for significant expansion in the coming decade.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
Discover the booming Data Preparation Platform market! Our analysis reveals a projected $30B market by 2033, driven by cloud adoption, AI integration, and growing data volumes. Explore key trends, leading companies (Microsoft, Tableau, Alteryx), and regional insights in this comprehensive market report.
The data preparation tools market is experiencing robust growth, driven by the exponential increase in data volume and velocity across various industries. The rising need for data quality and consistency, coupled with the increasing adoption of advanced analytics and business intelligence solutions, fuels this expansion. Assuming a CAGR of 15% (a reasonable estimate given the rapid technological advancements in this space) between 2019 and 2024, the market has expanded significantly. This growth is further amplified by the increasing demand for self-service data preparation tools that empower business users to access and prepare data without needing extensive technical expertise. Major players like Microsoft, Tableau, and Alteryx are leading the charge, continuously innovating and expanding their offerings to cater to diverse industry needs. The market is segmented based on deployment type (cloud, on-premise), organization size (small, medium, large enterprises), and industry vertical (BFSI, healthcare, retail, etc.), creating lucrative opportunities across various segments.
However, challenges remain. The complexity of integrating data preparation tools with existing data infrastructures can pose implementation hurdles for certain organizations. Furthermore, the need for skilled professionals to manage and utilize these tools effectively presents a potential restraint to wider adoption. Despite these obstacles, the long-term outlook for the data preparation tools market remains highly positive, with continuous innovation in areas like automated data preparation, machine learning-powered data cleansing, and enhanced collaboration features driving further growth throughout the forecast period (2025-2033). We project a market size of approximately $15 billion in 2025, considering a realistic growth trajectory and the significant investment made by both established players and emerging startups.
The global Data Preparation Platform market is poised for substantial growth, estimated to reach $15,600 million by the study's end in 2033, up from $6,000 million in the base year of 2025. This trajectory is fueled by a Compound Annual Growth Rate (CAGR) of approximately 12.5% over the forecast period. The proliferation of big data and the increasing need for clean, usable data across all business functions are primary drivers. Organizations are recognizing that effective data preparation is foundational to accurate analytics, informed decision-making, and successful AI/ML initiatives. This has led to a surge in demand for platforms that can automate and streamline the complex, time-consuming process of data cleansing, transformation, and enrichment. The market's expansion is further propelled by the growing adoption of cloud-based solutions, offering scalability, flexibility, and cost-efficiency, particularly for Small & Medium Enterprises (SMEs).
Key trends shaping the Data Preparation Platform market include the integration of AI and machine learning for automated data profiling and anomaly detection, enhanced collaboration features to facilitate teamwork among data professionals, and a growing focus on data governance and compliance. While the market exhibits robust growth, certain restraints may temper its pace. These include the complexity of integrating data preparation tools with existing IT infrastructures, the shortage of skilled data professionals capable of leveraging advanced platform features, and concerns around data security and privacy. Despite these challenges, the market is expected to witness continuous innovation and strategic partnerships among leading companies like Microsoft, Tableau, and Alteryx, aiming to provide more comprehensive and user-friendly solutions to meet the evolving demands of a data-driven world.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Original Credit goes to: Oleh Onyshchak
Original Owner: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?resource=download
rawData (.CSVs) Information:
"This dataset contains historical data of daily prices for each ticker (minus a few incompatible tickers, such as CARR# and UTX#) - currently trading on NASDAQ. The up to date list is available from nasdaqtrader.com.
The historic data was retrieved from Yahoo finance via yfinance python package."
Each file contains data from 01/04/2016 to 04/01/2020.
cleanData (.CSVs) & .ipynb (Python code) Information:
This edition contains my .ipynb notebook for user replication within JupyterLab and for code transparency via Kaggle. The dataset is cleaned with Python and pandas and used to create the final Tableau dashboard linked below:
My Tableau Dashboard: https://public.tableau.com/app/profile/jack3951/viz/TopStocksAnalysisPythonpandas/Dashboard1
Enjoy!
The Data Preparation Tools market is booming, projected to reach $3 billion by 2025 with a 17.7% CAGR. Discover key trends, drivers, and restraints shaping this dynamic industry, including regional market share and leading companies like Microsoft, Tableau, and Alteryx. Explore the impact of self-service tools and cloud adoption.
Discover the booming Data Preparation Platform market! Learn about its $15 billion valuation (2025), 18% CAGR, key drivers, trends, and leading players like Microsoft, Tableau, and Alteryx. Explore regional market share and growth projections to 2033. Get your insights now!
ES Encampment Cleaning Tracking Public is a hosted layer view intended for sharing encampment cleanup information with the public and for use within Tableau dashboards. Homeless encampment cleanup data is collected by contractors, and tracking info relates to cleanup efforts within encampments and perimeters. The data informs the Tidy-Up Tacoma Data Dashboard and aids in analysis of trends. Data is updated daily. For more information contact: Vicky Tirrell, Business Services Analyst, ES SW Operations Support Services, vtirrell@cityoftacoma.org
The booming data preparation tools market, projected to reach $33.2 billion by 2033 with a 15% CAGR, is reshaping data analytics. Learn about key drivers, market segmentation (self-service, data integration, applications), leading vendors (Microsoft, Tableau, Alteryx), and regional trends influencing this rapidly evolving landscape.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualization and creating dashboards. Specifically, I will attempt to create a dashboard that allows users to see metrics for a specific crime within a given year using filters. Due to this, there will not be much focus on analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below; the query (which utilized BigQuery) can be found here, and the Tableau dashboard can be found here.
The dataset comes directly from the City of Chicago's website under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to keep the information accurate. This means that a crime recorded on a specific date may later be changed to better reflect the case. The dataset covers crimes from 2001 up to seven days prior to today's date.
Using the ROCCC method, we can see that:
* The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not know how big the sample size is, I believe the dataset has high reliability since it geographically covers the entirety of Chicago.
* The data has high originality: The dataset was obtained directly from the Chicago Police Department's database, so we can say this dataset is original.
* The data is somewhat comprehensive: While we have important information such as the types of crimes committed and their geographic locations, I do not think this gives us proper insight into why these crimes take place. We can pinpoint the location of a crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a low-income neighborhood? These missing factors prevent us from getting proper insights into why these crimes take place, so I would say the dataset is subpar in how comprehensive it is.
* The data is current: The dataset is updated frequently to display crimes that took place up to seven days prior to today's date, and past crimes may be updated as more information comes to light. Due to the frequent updates, I believe the data is current.
* The data is cited: As mentioned, the data is collected directly from the police's CLEAR system, so we can say the data is cited.
The purpose of this step is to clean the dataset such that there are no outliers in the dashboard. To do this, we are going to:
* Check for any null values and determine whether we should remove them.
* Update any values where there may be typos.
* Check for outliers and determine if we should remove them.
The following steps will be explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)
-- Preview the dataset
SELECT
*
FROM
`portfolioproject-350601.ChicagoCrime.Crime`
LIMIT 1000;
-- Identify rows with NULLs in the key columns
SELECT
*
FROM
`portfolioproject-350601.ChicagoCrime.Crime`
WHERE
unique_key IS NULL OR
case_number IS NULL OR
date IS NULL OR
primary_type IS NULL OR
location_description IS NULL OR
arrest IS NULL OR
longitude IS NULL OR
latitude IS NULL;
-- Remove rows with NULLs in the key columns
DELETE FROM
`portfolioproject-350601.ChicagoCrime.Crime`
WHERE
unique_key IS NULL OR
case_number IS NULL OR
date IS NULL OR
primary_type IS NULL OR
location_description IS NULL OR
arrest IS NULL OR
longitude IS NULL OR
latitude IS NULL;
-- Check for duplicate unique_key values
SELECT unique_key, COUNT(unique_key)
FROM `portfolioproject-350601.ChicagoCrime.Crime`
GROUP BY unique_key
HAVING COUNT(unique_key) > 1;
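The typo and outlier checks from the plan are truncated in the text above. A hedged sketch of what an outlier check might look like in BigQuery, using the date, latitude, and longitude columns seen earlier and a rough bounding box for Chicago (the bounds are illustrative assumptions, not the author's values):
-- Flag rows dated before the dataset's stated start year (2001)
-- or located outside a rough Chicago bounding box
SELECT
*
FROM
`portfolioproject-350601.ChicagoCrime.Crime`
WHERE
EXTRACT(YEAR FROM date) < 2001 OR
latitude NOT BETWEEN 41.6 AND 42.1 OR
longitude NOT BETWEEN -88.0 AND -87.5;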
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Domain-Specific Dataset and Visualization Guide
This package contains 20 realistic datasets in CSV format across different industries, along with 20 text files suggesting visualization ideas. Each dataset includes about 300 rows of synthetic but domain-appropriate data. They are designed for data analysis, visualization practice, machine learning projects, and dashboard building.
What’s inside
20 CSV files, one for each domain:
20 TXT files, each listing 10 relevant graphing options for the dataset.
MASTER_INDEX.csv, which summarizes all domains with their column names.
Use cases
Example
Education dataset has columns like StudentName, Class, Subject, Marks, AttendancePercent. Suggested graphs: bar chart of average marks by subject, scatter plot of marks vs attendance percent, line chart of attendance over time.
E-Commerce dataset has columns like OrderDate, Product, Category, Price, Quantity, Total. Suggested graphs: line chart of revenue trend, bar chart of revenue by category, pie chart of payment mode share.
The Data Prep market is booming, projected to reach $12 Billion by 2033 with a 13.7% CAGR. Discover key trends, leading companies (Alteryx, Informatica, IBM), and regional insights in this comprehensive market analysis. Learn how self-service tools and cloud solutions are transforming data preparation.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Project Introduction and Goals
This project is focused on analyzing a sales dataset using Google Sheets for data cleaning and Tableau for visualizations. The main objective is to uncover actionable insights such as top-performing countries, best-selling products, and monthly sales trends. I aim to present these findings through an interactive dashboard that can be used by business stakeholders for decision making.
Process Overview
Data Cleaning (Google Sheets)
• Removed blank rows and filtered out missing values.
• Standardized product and region names for consistency.
• Split combined columns (e.g., date & time) for easier analysis.
• Replaced missing or incorrect values with relevant estimates (e.g., average or "unknown").
Exploratory Analysis
• Calculated total sales by country.
• Identified the best-selling products and frequent buyers.
• Tracked monthly sales trends.
Visualization (Tableau)
Created a dynamic sales dashboard including:
• Line chart showing sales over time
• Pie chart of product categories
• Bar chart of top 10 customers by revenue
• Country-wise sales comparison
Conclusion
The analysis reveals key patterns in sales distribution, highlights top contributors to revenue, and suggests areas needing attention (e.g., low-performing countries). The dashboard enables real-time filtering and deeper insight for users.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Purpose. This dataset contains anonymised raw responses (n = 55, 31 variables) from a cross-sectional survey investigating factors that influence the adoption of data-analytics tools (Excel/Sheets, Power BI/Tableau, Python notebooks, Google Analytics) among graduate students and early-career professionals in Uzbekistan.
Instrument. Items operationalise seven UTAUT/TAM-based constructs: Performance Expectancy, Effort Expectancy, Behavioural Intention, Familiarity & Usage, Task–Technology Fit, Barriers to Adoption, plus Demographics (age, gender, study programme, prior stats courses, work experience). All Likert items use a five-point scale.
Collection & cleaning. Data were collected via Google Forms between 02 Apr 2025 and 22 Apr 2025 through university e-mail lists, Telegram study channels, and LinkedIn posts. Five partial records (> 20% missing) were removed; remaining open-text answers were lower-cased, spell-checked, and stemmed. The file is provided exactly as analysed in the accompanying thesis; no further processing (e.g., recoding) has been performed.
File contents. survey_responses.xlsx contains one worksheet ("Form Responses 1") with 55 rows × 31 columns. Column A ("Timestamp") shows submission time in UTC+5. Variable names follow the original question stems for transparency.
Ethics & privacy. All participants gave informed e-consent; no personal identifiers (names, e-mails, IPs) are included. Ethical approval: Silk Road University REC # 2025-DX-012.
License: CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0/)
Dashboard preview: https://github.com/ssrAiLab/IMDB-2020-Tableau-Dashboard/blob/main/Dashboard%20Screenshot.png?raw=true
The IMDB Top 1000 Movies of 2020 dataset provides a rich canvas for exploring the world of cinema — and this Tableau project transforms that data into stunning visuals and insights.
I’ve designed a dynamic and visually appealing dashboard using Tableau that highlights movie trends, ratings, genres, and key metrics from 2020’s cinematic landscape.
✅ Top 20 Movies by IMDB Rating
✅ Distribution of Movies by Genre
✅ Top Directors with Most Hits
✅ Language & Country-wise Movie Count
✅ Gross Earnings vs Ratings
✅ Runtime Distribution Analysis
✅ Certificate-wise Movie Breakdown
✅ Year-wise Trend in Popularity
| File | Description |
|---|---|
| IMDB_2020_Dashboard.twb | Tableau workbook file |
| imdb_top_1000.csv | Cleaned dataset used |
| Dashboard Screenshot.png | Snapshot of the final dashboard |
| archive.zip | Contains all the files in one place |
.twb file from this dataset
Sahil Raj
Data Analyst | Tableau Storyteller | Movie Enthusiast 🎥
🔗 LinkedIn | GitHub | Kaggle
“Cinema is more than entertainment — it’s culture, storytelling, and data waiting to be visualized.”
📌 This project is for educational and portfolio purposes only. IMDB data is publicly available and curated for non-commercial use.