32 datasets found
  1. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of titles added to Netflix from 2008 to 2021; the oldest title dates from 1925 and the newest from 2021. The dataset was cleaned with PostgreSQL and visualized with Tableau. Its purpose is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:

    1. Treat the Nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns

    Extra steps and further explanation of the process are provided through the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
                                      
    SELECT show_id, COUNT(*)                                                                                      
    FROM netflix 
    GROUP BY show_id                                                                                              
    ORDER BY show_id DESC;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLS. 
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    The director column's nulls amount to about 30% of the column, so I will not delete them; instead, I will find another column to populate them from. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.

    -- Below, we find out if some directors are likely to work with a particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate the NULL director rows using their matching movie_cast records.
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --While doing this, I found a less complex and faster way to populate a column, which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to the director column, we are going to populate country using director.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;

    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULL in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    Only 10 of the more than 8,000 rows have a NULL date_added, so deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  2. Stock Market Dashboard Build (Python + Tableau)

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    jackmnob (2025). Stock Market Dashboard Build (Python + Tableau) [Dataset]. https://www.kaggle.com/datasets/jackmnob/stock-market-dashboard-build-python-tableau
    Explore at:
    zip (549379249 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    jackmnob
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original Credit goes to: Oleh Onyshchak

    Original Owner: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?resource=download

    rawData (.CSVs) Information:

    "This dataset contains historical data of daily prices for each ticker (minus a few incompatible tickers, such as CARR# and UTX#) - currently trading on NASDAQ. The up to date list is available from nasdaqtrader.com.

    The historic data was retrieved from Yahoo finance via yfinance python package."

    Each file contains data from 01/04/2016 to 04/01/2020.

    cleanData (.CSVs) & .ipynb (Python code) Information:

    This edition contains my .ipynb notebook for replication within JupyterLab and for code transparency via Kaggle. The dataset is cleaned with Python and pandas and then used to create the final Tableau dashboard linked below:

    My Tableau Dashboard: https://public.tableau.com/app/profile/jack3951/viz/TopStocksAnalysisPythonpandas/Dashboard1
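
    For a feel of the kind of pandas cleaning described above, here is a minimal sketch; the authoritative steps are in the included .ipynb, and the file layout and column names below are hypothetical placeholders.

    ```python
    # Illustrative-only sketch of combining and cleaning per-ticker CSVs with
    # pandas; the real steps live in the included .ipynb, and the paths and
    # column names here are hypothetical placeholders.
    import glob
    import pandas as pd

    frames = []
    for path in glob.glob("rawData/stocks/*.csv"):
        df = pd.read_csv(path, parse_dates=["Date"])
        df["Ticker"] = path.split("/")[-1].removesuffix(".csv")  # tag rows by source file
        frames.append(df)

    prices = pd.concat(frames, ignore_index=True)
    prices = prices.dropna(subset=["Close"]).sort_values(["Ticker", "Date"])
    prices.to_csv("cleanData/prices.csv", index=False)
    ```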

    Enjoy!

  3. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  4. Visualizing Chicago Crime Data

    • kaggle.com
    zip
    Updated Jul 1, 2022
    Cite
    Elijah Toumoua (2022). Visualizing Chicago Crime Data [Dataset]. https://www.kaggle.com/datasets/elijahtoumoua/chicago-analysis-of-crime-data-dashboard
    Explore at:
    zip (94861784 bytes)
    Dataset updated
    Jul 1, 2022
    Authors
    Elijah Toumoua
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Prelude

    This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

    About the Dataset

    Important Facts

    The dataset comes directly from the City of Chicago's website, under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to keep the information accurate, which means a record for a given crime may later be revised to better reflect the case. The dataset covers crimes from 2001 up to seven days prior to the current date.

    Reliability

    Using the ROCCC method, we can see that:

    * The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not know how big the sample size is, I believe the dataset has high reliability since it geographically covers the entirety of Chicago.
    * The data has high originality: The dataset was obtained directly from the Chicago Police Department's database, so we can say this dataset is original.
    * The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic locations, I do not think this gives us proper insight into why these crimes take place. We can pinpoint the location of a crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a low-income neighborhood? These missing factors prevent us from getting proper insights into why these crimes take place, so I would say the dataset is subpar in how comprehensive it is.
    * The data is current: The dataset is updated frequently to display crimes that took place up to seven days prior to today's date, and past crimes may be updated as more information comes to light. Due to the frequent updates, I believe the data is current.
    * The data is cited: As mentioned above, the data is collected directly from the police's CLEAR system, so we can say that the data is cited.

    Processing the Data

    Cleaning the Dataset

    The purpose of this step is to clean the dataset so that there are no outliers in the dashboard. To do this, we are going to:

    * Check for any null values and determine whether we should remove them.
    * Update any values where there may be typos.
    * Check for outliers and determine whether we should remove them.

    The following steps are explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)

    ```sql
    -- Examining the dataset
    -- There are over 7.5 million rows of data
    -- Putting a limit so it does not take a long time to run

    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    LIMIT 1000;

    -- Seeing which points are null
    -- There are 85,000 null points, so we can exclude them; that is not a
    -- significant amount since it is only ~1.3% of the dataset
    -- Most of the null points are in the lat and long, which we will need later
    -- Because we don't have the full address, we can't estimate the lat and
    -- long in SQL, so we will have to delete the rows with null data

    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;

    -- Deleting all null rows

    DELETE FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;

    -- Checking for any duplicates in the unique keys
    -- None to be found

    SELECT unique_key, COUNT(unique_key)
    FROM `portfolioproject-350601.ChicagoCrime....
    ```

  5. Data Preparation Platform Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Sep 20, 2025
    Cite
    Data Insights Market (2025). Data Preparation Platform Report [Dataset]. https://www.datainsightsmarket.com/reports/data-preparation-platform-1368457
    Explore at:
    doc, pdf, ppt
    Dataset updated
    Sep 20, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Data Preparation Platform market is poised for substantial growth, estimated to reach $15,600 million by the study's end in 2033, up from $6,000 million in the base year of 2025. This trajectory is fueled by a Compound Annual Growth Rate (CAGR) of approximately 12.5% over the forecast period. The proliferation of big data and the increasing need for clean, usable data across all business functions are primary drivers. Organizations are recognizing that effective data preparation is foundational to accurate analytics, informed decision-making, and successful AI/ML initiatives. This has led to a surge in demand for platforms that can automate and streamline the complex, time-consuming process of data cleansing, transformation, and enrichment. The market's expansion is further propelled by the growing adoption of cloud-based solutions, offering scalability, flexibility, and cost-efficiency, particularly for Small & Medium Enterprises (SMEs).

    Key trends shaping the Data Preparation Platform market include the integration of AI and machine learning for automated data profiling and anomaly detection, enhanced collaboration features to facilitate teamwork among data professionals, and a growing focus on data governance and compliance. While the market exhibits robust growth, certain restraints may temper its pace. These include the complexity of integrating data preparation tools with existing IT infrastructures, the shortage of skilled data professionals capable of leveraging advanced platform features, and concerns around data security and privacy. Despite these challenges, the market is expected to witness continuous innovation and strategic partnerships among leading companies like Microsoft, Tableau, and Alteryx, aiming to provide more comprehensive and user-friendly solutions to meet the evolving demands of a data-driven world.

  6. To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040

    • borealisdata.ca
    • dataone.org
    Updated Feb 28, 2019
    Cite
    Shahram Yarmand (2019). To Estimate and Optimize the Source of Drinking Water for Metro Vancouver until 2040 [Dataset]. http://doi.org/10.5683/SP2/6KU4I7
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 28, 2019
    Dataset provided by
    Borealis
    Authors
    Shahram Yarmand
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Oct 2017 - Nov 2017
    Area covered
    Metro Vancouver
    Description

    The population of Metro Vancouver (20110729 Regional Growth Strategy Projections: Population, Housing and Employment 2006-2041 file) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015 Water Consumption Statistics file) will be essential. This drinking-water supply needs to be estimated and optimized (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report file) gives the amount of drinking water supplied by each reservoir to each region. The B.C. government has a specific strategy for the growing population until 2040 that guides it toward this goal. In addition, the new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible water source until 2040; as such, the government will have to decide how much groundwater is used. The goal of the project is two steps: (1) an optimization model for the three water reservoirs, and (2) estimating the new source of water to 2040.

    The data analysis process for the project uses six tools: Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, and SQL, with the results visualized in Tableau.
    1. Trifacta Wrangler: clean the data (Data Mining file).
    2. AMPL and Excel Solver: optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
    3. ArcMap (ArcGIS): combine the raw data with the reservoir optimization results and the population estimate to 2040 (GIS Map for Tableau file).
    4. Tableau, with SQL: visualize, estimate, and optimize the source of drinking water for Metro Vancouver until 2040 (export tableau data file).
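
    To make the LP step concrete, here is a minimal sketch in Python with scipy.optimize.linprog; the project itself used AMPL and Excel Solver, and every number below is a hypothetical placeholder rather than a value from the dataset.

    ```python
    # Minimal linear-programming sketch (the project used AMPL and Excel Solver;
    # scipy is a stand-in here). All numbers are hypothetical placeholders.
    from scipy.optimize import linprog

    # Decision variables: water drawn from Capilano, Seymour, Coquitlam (ML/day).
    cost = [1.0, 1.1, 0.9]        # hypothetical cost per megalitre supplied
    capacity = [370, 380, 350]    # hypothetical reservoir capacities
    total_demand = 900            # hypothetical regional demand

    # Minimize total cost subject to meeting demand within each capacity.
    res = linprog(
        c=cost,
        A_eq=[[1.0, 1.0, 1.0]], b_eq=[total_demand],  # supply must equal demand
        bounds=[(0, cap) for cap in capacity],        # per-reservoir limits
    )
    print(res.x)  # optimal draw from each reservoir
    ```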

  7. Bellabeat Case Study Supplement

    • kaggle.com
    zip
    Updated Oct 28, 2022
    Cite
    Britta Smith (2022). Bellabeat Case Study Supplement [Dataset]. https://www.kaggle.com/datasets/brittasmith/bellabeat-casestudy-sql-tableau-excel
    Explore at:
    zip (65670 bytes)
    Dataset updated
    Oct 28, 2022
    Authors
    Britta Smith
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Raw data, clean data, and SQL query output tables as spreadsheets, supporting the Tableau story and the GitHub repository available at https://github.com/brittabeta/Bellabeat-Case-Study-SQL-Excel-Tableau

  8. HrDashboardTableauProject

    • kaggle.com
    zip
    Updated Apr 6, 2025
    Cite
    Kusamdeep Sran (2025). HrDashboardTableauProject [Dataset]. https://www.kaggle.com/datasets/kusamdeepsran/hrdashboardtableauproject
    Explore at:
    zip (6163326 bytes)
    Dataset updated
    Apr 6, 2025
    Authors
    Kusamdeep Sran
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    An interactive Tableau dashboard analyzing key HR metrics—attrition, recruitment, performance, and diversity—to empower data-driven workforce decisions. Includes clean datasets, Tableau workbook (.twb/.twbx), and step-by-step insights.

  9. Rural Route Nomad Photo and Video Collection Dataset

    • zenodo.org
    csv
    Updated Jul 12, 2022
    Cite
    Alan Webber; Alan Webber (2022). Rural Route Nomad Photo and Video Collection Dataset [Dataset]. http://doi.org/10.5281/zenodo.6818292
    Explore at:
    csv
    Dataset updated
    Jul 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alan Webber; Alan Webber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset encompasses the metadata drawn from preserving and visualizing the Rural Route Nomad Photo and Video Collection. The collection consists of 14,058 born-digital objects shot on over a dozen digital cameras in over 30 countries, on seven continents from the end of 2008 through 2009. Metadata was generated using ExifTool, along with manual means, utilizing OpenRefine and Excel to parse and clean.
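
    As a rough illustration of the extraction step, a sketch along these lines could dump the EXIF metadata to CSV before the OpenRefine/Excel cleaning; the directory and tag names are hypothetical, and the collection's actual workflow also involved manual steps.

    ```python
    # Hedged sketch of metadata extraction with ExifTool's CSV output plus
    # pandas; the directory and tag names are hypothetical placeholders.
    import subprocess
    import pandas as pd

    # `exiftool -csv -r DIR` emits one CSV row of tags per file, recursively.
    with open("metadata.csv", "w") as out:
        subprocess.run(["exiftool", "-csv", "-r", "photos/"], stdout=out, check=True)

    df = pd.read_csv("metadata.csv")
    print(df[["SourceFile", "Model", "CreateDate"]].head())  # common EXIF tags
    ```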

    The dataset was a result of an overriding project to preserve the digital content of the Rural Route Nomad Collection, and then visualize photographic specs and geographic details with charts, graphs and maps in Tableau. A description of the project as a whole is publicly forthcoming. Visualizations can be found at https://public.tableau.com/app/profile/alan.webber5364.

  10. Embedded Analytics Solutions Market Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Mar 12, 2025
    + more versions
    Cite
    Data Insights Market (2025). Embedded Analytics Solutions Market Report [Dataset]. https://www.datainsightsmarket.com/reports/embedded-analytics-solutions-market-13061
    Explore at:
    ppt, doc, pdf
    Dataset updated
    Mar 12, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Embedded Analytics Solutions market is experiencing robust growth, projected to reach $68.88 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 13.90%. This expansion is fueled by several key drivers. The increasing need for data-driven decision-making across various industries, coupled with the rising adoption of cloud-based solutions and the proliferation of big data, are significantly contributing to market growth. Furthermore, the growing demand for real-time business intelligence and the ease of integrating analytics directly into applications are fostering wider adoption. The market is segmented by solution (software and services), organization size (SMEs and large enterprises), deployment (cloud and on-premise), and end-user vertical (BFSI, IT & Telecommunications, Healthcare, Retail, Energy & Utilities, Manufacturing, and others). The competitive landscape is populated by established players like SAS, IBM, and Microsoft, alongside emerging innovative companies. Growth is expected to be particularly strong in North America and Europe initially, followed by increasing penetration in the Asia-Pacific region driven by technological advancements and rising digital adoption rates. The on-premise deployment model, while still significant, is gradually yielding to the cloud, driven by scalability, cost-effectiveness, and accessibility benefits.

    The continued growth trajectory is expected to be influenced by advancements in artificial intelligence (AI) and machine learning (ML), which will further enhance the capabilities of embedded analytics solutions. However, challenges such as data security concerns, the complexity of implementation, and the need for skilled professionals to manage and interpret data could act as potential restraints. Nevertheless, the overall market outlook remains positive, with significant opportunities for growth across all segments. The increasing emphasis on data visualization and user-friendly dashboards is also expected to further fuel market adoption, particularly amongst smaller organizations that traditionally lacked access to sophisticated analytical tools. The competitive landscape will likely witness mergers, acquisitions, and strategic partnerships as players strive to enhance their product offerings and expand their market share.

    Recent developments include: August 2022 - SAS and SingleStore announced a collaboration to help organizations remove barriers to data access, maximize performance and scalability, and uncover key data-driven insights. SAS Viya with SingleStore enables the use of SAS analytics and AI technology on data stored in SingleStore's cloud-native real-time database. The integration provides flexible, open access to curated data to help accelerate value for cloud, hybrid, and on-premises deployments. July 2022 - TIBCO announced the launch of TIBCO ModelOps, which helps customers simplify and scale cloud-based analytic model management, deployment, monitoring, and governance. TIBCO ModelOps addresses the requirement for speed in deploying AI and draws from TIBCO's leadership in data science, data visualization, and business intelligence. This aids AI teams in confronting critical deployment hurdles like ease of applying analytics to applications, identification and mitigation of bias, and transparency and manageability of an algorithm's behavior within business-critical applications.

    Key drivers for this market are: Increasing Demand for Advanced Analytical Techniques for Business Data; Increasing Number of Data-Driven Organizations; Increasing Adoption of Mobile BI and Big Data Analytics; Increasing Use of Mobile Devices and Cloud Computing Technologies. Potential restraints include: Licensing Challenges and Higher Associated Costs. Notable trends are: Increasing Use of Mobile Devices and Cloud Computing Technologies to Witness Significant Growth.

  11. Cyclisitic Trip Data 2019 (Google)

    • kaggle.com
    zip
    Updated Aug 4, 2022
    Cite
    Shaine Pepper (2022). Cyclisitic Trip Data 2019 (Google) [Dataset]. https://www.kaggle.com/datasets/shainepepper/divvy-2019-trip-data-clean
    Explore at:
    zip (27551971 bytes)
    Dataset updated
    Aug 4, 2022
    Authors
    Shaine Pepper
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Intro

    Cleaning this data took some time due to many NULL values, typos, and unorganized collection. My first step was to put the dataset into R and work my magic there. After analyzing and cleaning the data, I moved the data to Tableau to create easily understandable and helpful graphs. This step was a learning curve because there are so many potential options inside Tableau. Finding the correct graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.

    RStudio

    First, I needed to combine the four datasets into one, which I did using the rbind() function.

    Step two was to rename columns with typos or poor names.

    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"
    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
    colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"

    The next step was to remove all NULL values and implausibly large numbers, such as trip durations of more than 10 hours.

    library(dplyr)
    Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>%
      filter(across(where(is.character), ~ . != "NULL")) %>%
      type.convert(as.is = TRUE)

    Once the NULL data was removed, it was time to remove potential typos and poorly collected data. I could only identify exaggerated values in the trip_duration column, finding multiple cases of trips longer than 2,000,000 seconds. To find these large values, I used the count() function.

    Cyclistic_Clean_v2 %>% count(trip_duration > "30000")  # the column is still character at this point

    After finding multiple instances of this, I hit a snag: the trip_duration column was classed as character when it needed to be numeric to be cleaned further. It took me quite a while to realize this was the issue, and then I remembered the class() function. With it, I was easily able to confirm that the classification was wrong.

    class(Cyclistic_Clean_v2$trip_duration)

    Once I identified the classification, I still had some work to do before converting the column to numeric, as the values contained quotation marks, periods, and a trailing 0. To remove these I used the gsub() function.

    # Escape the dot so only a literal trailing ".0" is removed
    Cyclistic_Clean_v2$trip_duration <- gsub("\\.0$", "", Cyclistic_Clean_v2$trip_duration)
    Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)

    Now that the unwanted characters are gone, we can convert the column to numeric.

    Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)

    Doing this allows Tableau and R to read the data properly to create graphs without error.

    Next, I created a backup dataset in case there was any issue while exporting.

    Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
    write.csv(Cyclistic_Clean_v2, "Folder.Path/Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)

    After exporting I came to the conclusion that I should have put together a more accurate change log rather than brief notes. That is one major learning lesson I will take away from this project.

    All around, I had a lot of fun using R to transform and analyze the data. I learned many different ways to efficiently clean data.

    Tableau

    Now onto the fun part! Tableau is a very good tool to learn. There are so many different ways to bring your data to life and show your creativity inside your work. After a few guides and errors, I could finally start building graphs to bring the stakeholders' tasks to fruition.

    Charts

    Please note these are all made in Tableau and are meant to be interactive.

    Here you can find the relation between male and female riders.


    Male vs. female trip duration by user type.


    Busiest stations, filtered by month.


    Most popular starting stations.


    Most popular ending stations.


    Conclusion

    My main goal was to help find out how Cyclistic can convert casual riders into subscribers. Here are my findings.

    1. Casual riders take much longer trips than subscribers.
    2. Although there are many more male riders, females tend to ride longer than males.
    3. Stations #562 & #568 are the most busy by a h...
  12. Steam Games from 2013 to 2023

    • kaggle.com
    zip
    Updated Jan 7, 2024
    Cite
    Terenci Claramunt (2024). Steam Games from 2013 to 2023 [Dataset]. https://www.kaggle.com/terencicp/steam-games-december-2023
    Explore at:
    zip (6442898 bytes)
    Dataset updated
    Jan 7, 2024
    Authors
    Terenci Claramunt
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a derivative dataset created for my Tableau visualisation project. It's derived from two other datasets on Kaggle:

    Steam Games Dataset by Martin Bustos

    Video Games on Steam [in JSON] by Sujay Kapadnis

    From the Martin Bustos dataset, I removed the games without reviews and selected the most relevant features to create the following dashboard:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2Fce81900b3761554ce9acfc7ef25189b6%2Fsteam-dashboard.png?generation=1704630691045231&alt=media

    From the Sujay Kapadnis dataset, I added the data on game duration from HowLongToBeat.com.
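
    A minimal sketch of those two derivation steps might look like the following; the file and column names are hypothetical stand-ins, and the linked notebooks are the authoritative source.

    ```python
    # Hedged sketch of deriving the dataset: drop games without reviews, then
    # merge in HowLongToBeat durations. All names are hypothetical placeholders.
    import pandas as pd

    games = pd.read_csv("steam_games.csv")   # Martin Bustos dataset (placeholder name)
    hltb = pd.read_json("steam_hltb.json")   # Sujay Kapadnis dataset (placeholder name)

    games = games[(games["positive"] + games["negative"]) > 0]  # keep reviewed games
    merged = games.merge(hltb[["appid", "main_story_hours"]], on="appid", how="left")
    merged.to_csv("steam_2013_2023.csv", index=False)
    ```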

    The following notebooks contain exploratory data analysis and the transformations I used to generate this dataset from the two original datasets:

    Steam Games - Exploratory Data Analysis

    Steam Games - Data Transformation

    View the live dashboard on Tableau Public:

    Steam tag explorer

  13. Industry Layoffs 2020 - 2023

    • kaggle.com
    zip
    Updated Feb 4, 2023
    Cite
    Jake Clarke (2023). Industry Layoffs 2020 - 2023 [Dataset]. https://www.kaggle.com/datasets/clarkj37/layoffs2023cleaned
    Explore at:
    zip (64862 bytes)
    Dataset updated
    Feb 4, 2023
    Authors
    Jake Clarke
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset showcases my Google Data Analytics Capstone project, using Excel to clean the data, R to analyze it for insights, and Tableau to create visualizations.

  14. Superstore Dataset

    • kaggle.com
    zip
    Updated Sep 25, 2023
    Cite
    Shivam Amrutkar (2023). Superstore Dataset [Dataset]. https://www.kaggle.com/datasets/yesshivam007/superstore-dataset
    Explore at:
    zip (2119716 bytes)
    Dataset updated
    Sep 25, 2023
    Authors
    Shivam Amrutkar
    License

    https://cdla.io/sharing-1-0/

    Description

    The Superstore Sales Data dataset, available in Excel format as "Superstore.xlsx," is a comprehensive collection of sales and customer-related information from a retail superstore. This dataset comprises three distinct tables, each providing specific insights into the store's operations and customer interactions.

  15. DA Analyst Capstone Project

    • kaggle.com
    zip
    Updated May 18, 2024
    Cite
    Tara Jacobs (2024). DA Analyst Capstone Project [Dataset]. https://www.kaggle.com/datasets/tarajacobs/mock-user-profiles-from-social-networks
    Explore at:
    zip (8714 bytes)
    Dataset updated
    May 18, 2024
    Authors
    Tara Jacobs
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description


    BigQuery | BigQuery data cleaning

    Tableau | Creating visuals with Tableau

    Sheets | Cleaning NULL values, creating data tables

    RStudio | Organizing and cleaning data to create visuals

    SQL SSMS | Transforming, cleaning, and manipulating data

    LinkedIn | Survey poll


    Source for the mock dating site: pH7-Social-Dating-CMS. Source for the mock social site: tailwhip99/social_media_site.


  16. USA Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). USA Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/usa-weekly-real-estate-listings
    Explore at:
    zip (66961155 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    United States
    Description

    This Kaggle dataset offers a comprehensive view of the US real estate market, leveraging data sourced from Redfin via an unofficial API. It contains weekly snapshots stored in CSV files, reflecting the dynamic nature of property listings, prices, and market trends across various states and cities (except Wyoming, Montana, and North Dakota, and with specific data generation for Texas cities). Notably, it includes a prepared version, USA_clean_unique, which has undergone the initial cleaning steps outlined in my thesis; the other two countries covered by the thesis were France and the UK.

    These steps include:

    - Removal of irrelevant features for statistical analysis.
    - Renaming variables for consistency across international datasets.
    - Adjustment of variable value ranges for a more refined analysis.
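
    A minimal pandas sketch of those three steps, with hypothetical column names and thresholds (see the thesis for the actual cleaning):

    ```python
    # Hedged sketch of the three cleaning steps above; column names and
    # thresholds are hypothetical, not taken from USA_clean_unique.csv.
    import pandas as pd

    df = pd.read_csv("usa_listings_raw.csv")                    # hypothetical input file
    df = df.drop(columns=["mls_id", "listing_url"])             # 1. drop irrelevant features
    df = df.rename(columns={"PRICE": "price", "SQFT": "sqft"})  # 2. consistent variable names
    df = df[df["price"].between(10_000, 5_000_000)]             # 3. trim extreme value ranges
    ```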

    Unique aspects such as Redfin’s “hot” label algorithm, property search status, and detailed categorizations of property types (e.g., single-family residences, condominiums/co-ops, multi-family homes, townhouses) provide deep insights into the market. Additionally, external factors like interest rates, stock market volatility, unemployment rates, and crime rates have been integrated to enrich the dataset and offer a multifaceted view of the real estate market's drivers.

    The USA_clean_unique dataset represents a key step before data normalization/trimming, containing variables both in their raw form and categorized based on predefined criteria, such as property size, year of construction, and number of bathrooms/bedrooms. This structured approach aims to capture the non-linear relationships between various features and property prices, enhancing the dataset's utility for predictive modeling and market analysis.

    See columns from USA_clean_unique.csv and my Thesis (Table 2.8) for exact column descriptions.

    Table 2.4 and Section 2.2.3, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to USA_clean_unique.csv. Multiple steps include cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming columns for consistency.

  17. Top 100 TV Shows

    • kaggle.com
    zip
    Updated Jun 27, 2021
    Cite
    Jack Jae Hwan Kim (2021). Top 100 TV Shows [Dataset]. https://www.kaggle.com/jackjaehwankim/top-100-tv-shows
    Explore at:
    zip (2581 bytes)
    Dataset updated
    Jun 27, 2021
    Authors
    Jack Jae Hwan Kim
    Description

    Context

    This is a personal project in which I analyzed the main factors that lead me to select a TV show. I used Python for web scraping (also known as crawling) the data from IMDb.com and used a spreadsheet to clean the dataset. Finally, I used Tableau to visualize the data.

    For this project, I used web crawling to build up the database, gathering data on the top 100 TV shows listed by the IMDb user 'carlosotsubo'.
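
    A minimal sketch of that scraping pattern, with a placeholder list URL and CSS selector (the real IMDb markup and my actual script will differ):

    ```python
    # Hedged sketch of scraping a list of titles with requests + BeautifulSoup;
    # the URL and selector are hypothetical placeholders, not IMDb's real markup.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(
        "https://www.imdb.com/list/ls000000000/",   # placeholder list URL
        headers={"User-Agent": "Mozilla/5.0"},      # some sites reject bare clients
    )
    soup = BeautifulSoup(resp.text, "html.parser")
    titles = [h3.get_text(strip=True) for h3 in soup.select("h3.lister-item-header")]
    print(titles[:10])
    ```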

    Content

    1. tv_show: titles
    2. season_years: it ranges from the beginning year to the ending year.
      • Note: some TV shows are still ongoing.
    3. first_season_yr: the beginning year of the season
    4. last_season_yr: the final or ending year of the season
    5. running_time_min: the running time of the TV show per episode
    6. genre: in this dataset, it would be the main genre
    7. subgenre1: subgenre #1
    8. subgenre2: subgenre #2
    9. imdb_rating: ratings by IMDb members
    10. watched_yn: whether or not I've watched the show

    Acknowledgements

    I sincerely thank the IMDb user named, 'carlosotsubo,' for providing the list of top 100 TV shows.

    Inspiration

    The following questions need to be answered:

    1. How do I choose which TV show to watch?
    2. Does running time also affect my decision to watch the show?
    3. If not, would the genre be the main factor that affects my decision?

    Data Visualization

    After my own analysis, I've created the data visualization:

    https://public.tableau.com/app/profile/jae.hwan.kim/viz/HowdoIchoosewhichTVshowtowatch/Dashboard1

    If you guys give me feedback, I will be glad to hear! Thanks!

  18. Divvy Trips Clean Dataset (Nov 2024 – Oct 2025)

    • kaggle.com
    zip
    Updated Nov 14, 2025
    Cite
    Yeshang Upadhyay (2025). Divvy Trips Clean Dataset (Nov 2024 – Oct 2025) [Dataset]. https://www.kaggle.com/datasets/yeshangupadhyay/divvy-trips-clean-dataset-nov-2024-oct-2025
    Explore at:
    zip (170259034 bytes)
    Dataset updated
    Nov 14, 2025
    Authors
    Yeshang Upadhyay
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    📌 Overview

    This dataset contains a cleaned and transformed version of the public Divvy Bicycle Sharing Trip Data covering the period November 2024 to October 2025.

    The original raw data is publicly released by the Chicago Open Data Portal, and has been cleaned using Pandas (Python) and DuckDB SQL for faster analysis.
    This dataset is now ready for direct use in:
    • Exploratory Data Analysis (EDA)
    • SQL analytics
    • Machine learning
    • Time-series/trend analysis
    • Dashboard creation (Power BI / Tableau)

    📂 Source

    Original Data Provider:
    Chicago Open Data Portal – Divvy Trips
    License: Open Data Commons Public Domain Dedication (PDDL)
    This cleaned dataset only contains transformations; no proprietary or restricted data is included.

    🔧 Cleaning & Transformations Performed

    • Combined monthly CSVs (Nov 2024 → Oct 2025); see the sketch after this list
    • Removed duplicates
    • Standardized datetime formats
    • Created new fields:
      • ride_length
      • day_of_week
      • hour_of_day
    • Handled missing or null values
    • Cleaned inconsistent station names
    • Filtered invalid ride durations (negative or zero-length rides)
    • Exported as a compressed .csv for optimized performance
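
    A minimal sketch of a few of these steps with pandas and DuckDB; the file names are hypothetical placeholders, and the published dataset itself is the authoritative result.

    ```python
    # Hedged sketch of a few of the cleaning steps above; file names are
    # hypothetical placeholders.
    import glob
    import pandas as pd
    import duckdb

    # Combine monthly CSVs and standardize datetimes.
    df = pd.concat((pd.read_csv(f) for f in glob.glob("divvy/*.csv")), ignore_index=True)
    df["started_at"] = pd.to_datetime(df["started_at"])
    df["ended_at"] = pd.to_datetime(df["ended_at"])

    # Derive the new fields, drop duplicates, and filter invalid rides.
    df["ride_length"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
    df["day_of_week"] = df["started_at"].dt.day_name()
    df["hour_of_day"] = df["started_at"].dt.hour
    df = df.drop_duplicates(subset="ride_id")
    df = df[df["ride_length"] > 0]

    # DuckDB can query the DataFrame in place for quick SQL analytics.
    print(duckdb.query("SELECT member_casual, COUNT(*) FROM df GROUP BY 1").df())
    ```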

    📊 Columns in the Dataset

    • ride_id
    • rideable_type
    • started_at
    • ended_at
    • start_station_name
    • end_station_name
    • start_lat
    • start_lng
    • end_lat
    • end_lng
    • member_casual
    • ride_length (minutes)
    • day_of_week
    • hour_of_day

    💡 Use Cases

    This dataset is suitable for:
    • DuckDB + SQL analytics
    • Pandas EDA
    • Visualization in Power BI, Tableau, Looker
    • Statistical analysis
    • Member vs. Casual rider behavioral analysis
    • Peak usage prediction

    📝 Notes

    This dataset is not the official Divvy dataset, but a cleaned, transformed, and analysis-ready version created for educational and analytical use.

  19. Artstation

    • kaggle.com
    zip
    Updated May 28, 2021
    Cite
    Dmitriy Zub (2021). Artstation [Dataset]. https://www.kaggle.com/dimitryzub/artstation
    Explore at:
    zip (4067138 bytes)
    Dataset updated
    May 28, 2021
    Authors
    Dmitriy Zub
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Contains links only, as the script used to extract the data was written for a freelance project.

    Content

    100,000 artwork links (just links). 50,000 artworks were scraped and contain data; ~40,000+ are unique (some artworks are from the same artist).

    Context

    While transitioning from 3D modeling to Data Analytics and Python programming, I decided to create a personal project to analyze something I have a close connection with.

    The dataset includes the following columns:

    - Role
    - Company worked at (if mentioned or extracted)
    - Date the artwork was posted
    - Number of views
    - Number of likes
    - Number of comments
    - Which software was used
    - Which tags were used
    - Artwork title
    - Artwork URL

    As a disclaimer, this is the first time I'm doing this. I want anyone who uses this dataset to respect artists' privacy by not using artists' email addresses in any way, even though this is publicly available data published by them. Correct me if I said something wrong here.

    Code

    The code used to extract data from ArtStation can be found here, in the GitHub repository.
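
    For illustration only, a fetch along these lines shows the requests + json pattern; the endpoint shape and field names are assumptions on my part, and the repository linked above holds the real code.

    ```python
    # Illustrative-only sketch of the requests + json scraping pattern; the
    # endpoint and field names are assumptions, not ArtStation's documented API.
    import requests

    resp = requests.get(
        "https://www.artstation.com/users/someartist/projects.json",  # assumed endpoint
        params={"page": 1},
        headers={"User-Agent": "Mozilla/5.0"},
    )
    for project in resp.json().get("data", []):
        print(project.get("title"), project.get("permalink"))
    ```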

    Inspiration

    While transitioning from 3D modeling to Data Analytics and Python programming, I decided to create a personal project to analyze something I have a close connection with. I really enjoyed seeing progression in the 3D world (games, feature films, etc.).

    Goals

    The goal of this project was to better understand the process of gathering, processing, cleaning, analyzing, and visualizing data. Besides that, I wanted to learn which software, tags, and affiliations are most popular among artists.

    Tools used

    To scrape the data, these Python libraries/packages were used:

    - requests
    - json
    - Google Sheets API
    - selenium
    - regex

    To clean, analyze, and visualize the data:

    - Google Sheets
    - Tableau

    Visualization

    Note: the following visualizations contain data bias. Not every tag and affiliation was taken into account, due to the difficulties of data extraction and the mistakes I made.

    Tableau public dashboard

    https://user-images.githubusercontent.com/78694043/119978304-23cb0380-bfc2-11eb-8b70-e84100fa7630.png

    https://user-images.githubusercontent.com/78694043/119978269-1ada3200-bfc2-11eb-981f-b8ad2c2c0ff1.png

    https://user-images.githubusercontent.com/78694043/119978237-101f9d00-bfc2-11eb-9285-e0d9bcf688ee.png

  20. France Weekly Real Estate Listings 2022-2023

    • kaggle.com
    zip
    Updated Apr 3, 2024
    Cite
    Artur Dragunov (2024). France Weekly Real Estate Listings 2022-2023 [Dataset]. https://www.kaggle.com/datasets/arturdragunov/france-weekly-real-estate-listings-2022-2023
    Explore at:
    zip (2750497 bytes)
    Dataset updated
    Apr 3, 2024
    Authors
    Artur Dragunov
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Area covered
    France
    Description

    This Kaggle dataset provides real estate listings downloaded from the French market, capturing data from a leading platform in France (Seloger), mirroring the approach taken for the US dataset (from Redfin) and the UK dataset (from Zoopla). It encompasses detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.

    The cleaning process mirrored that of the US dataset, involving removing irrelevant features, normalizing variable names for dataset consistency with USA and UK, and adjusting variable value ranges to get rid of extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.

    For exact column descriptions, see columns for France_clean_unique.csv and my thesis.

    Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.

    If you want to continue generating datasets yourself, see my Github Repository for code inspiration.

    Let me know if you want to see how I got from raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.
