16 datasets found
  1. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original dataset, which can be found here. The data consists of content added to Netflix from 2008 to 2021; the oldest title dates back to 1925 and the newest to 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:
    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns
    Extra steps and further explanation of the process are provided in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
    
    SELECT show_id, COUNT(*)
    FROM netflix
    GROUP BY show_id
    HAVING COUNT(*) > 1;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLs:
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    Nulls make up about 30% of the director column, so I will not delete them; instead, I will use another column to populate them. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate the NULL rows in director using their matching movie_cast records:
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --When I was doing this, I found a less complex and faster way to populate a column, which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column using the director column.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULLs in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    The date_added column has only 10 nulls out of over 8,000 rows, so deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  2. Stock Market Dashboard Build (Python + Tableau)

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    jackmnob (2025). Stock Market Dashboard Build (Python + Tableau) [Dataset]. https://www.kaggle.com/datasets/jackmnob/stock-market-dashboard-build-python-tableau
    Explore at:
    zip (549379249 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    jackmnob
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original Credit goes to: Oleh Onyshchak

    Original Owner: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?resource=download

    rawData (.CSVs) Information:

    "This dataset contains historical data of daily prices for each ticker (minus a few incompatible tickers, such as CARR# and UTX#) - currently trading on NASDAQ. The up to date list is available from nasdaqtrader.com.

    The historic data was retrieved from Yahoo finance via yfinance python package."

    Each file contains data from 01/04/2016 to 04/01/2020.
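
    For context, here is a minimal sketch of how such data can be pulled with the yfinance package. This is an illustration under assumptions, not the author's actual script; the ticker symbol and output file name are examples only.

    import yfinance as yf

    # Download daily price history for one ticker over the dataset's window.
    # "AAPL" is illustrative; the raw data covers all compatible NASDAQ tickers.
    history = yf.download("AAPL", start="2016-01-04", end="2020-04-01")

    # Save in the one-CSV-per-ticker layout described above
    history.to_csv("AAPL.csv")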

    cleanData (.CSVs) & .ipynb (Python code) Information:

    This edition contains my .ipynb notebook for user replication within JupyterLab and code transparency via Kaggle. The dataset is cleaned with Python and pandas, then used to create the final Tableau dashboard linked below:

    My Tableau Dashboard: https://public.tableau.com/app/profile/jack3951/viz/TopStocksAnalysisPythonpandas/Dashboard1

    Enjoy!

  3. Visualizing Chicago Crime Data

    • kaggle.com
    zip
    Updated Jul 1, 2022
    Cite
    Elijah Toumoua (2022). Visualizing Chicago Crime Data [Dataset]. https://www.kaggle.com/datasets/elijahtoumoua/chicago-analysis-of-crime-data-dashboard
    Explore at:
    zip (94861784 bytes)
    Dataset updated
    Jul 1, 2022
    Authors
    Elijah Toumoua
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Prelude

    This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

    About the Dataset

    Important Facts

    The dataset comes directly from the City of Chicago's website, under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to present the information accurately. This means that a crime on a specific date may be changed to better reflect the case. The dataset represents crimes from 2001 up to seven days prior to today's date.

    Reliability

    Using the ROCCC method, we can see that:
    * The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not know how big the sample size is, I do believe the dataset has high reliability since it geographically covers the entirety of Chicago.
    * The data has high originality: The dataset was obtained directly from the Chicago Police Dept. using their database, so we can say this dataset is original.
    * The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic locations, I do not think this gives us proper insight into why these crimes take place. We can pinpoint the location of the crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a low-income neighborhood? These missing factors prevent us from getting proper insight into why these crimes take place, so I would say this dataset is subpar in how comprehensive it is.
    * The data is current: The dataset is updated frequently to display crimes that took place up to seven days prior to today's date, and past crimes may even be updated as more information comes to light. Due to the frequent updates, I do believe the data is current.
    * The data is cited: As mentioned prior, the data is collected directly from the police's CLEAR system, so we can say the data is cited.

    Processing the Data

    Cleaning the Dataset

    The purpose of this step is to clean the dataset such that there are no outliers in the dashboard. To do this, we are going to do the following:
    * Check for any null values and determine whether we should remove them.
    * Update any values where there may be typos.
    * Check for outliers and determine if we should remove them.

    The following steps are explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)
    
    -- Examining the dataset
    -- There are over 7.5 million rows of data,
    -- so put a limit on exploratory queries so they do not take a long time to run
    
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    LIMIT 1000;
    
    -- Seeing which points are null
    -- There are 85,000 null points, so we can exclude them; it's not a significant
    -- amount, since it is only ~1.3% of the dataset.
    -- Most of the null points are in the lat and long, which we will need later.
    -- Because we don't have the full address, we can't estimate the lat and long
    -- in SQL, so we will have to delete the rows with null data.
    
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;
    
    -- Deleting all null rows
    
    DELETE FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;
    
    -- Checking for any duplicates in the unique keys
    -- None to be found
    
    SELECT unique_key, COUNT(unique_key) FROM `portfolioproject-350601.ChicagoCrime....

  4. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  5. Steam Games from 2013 to 2023

    • kaggle.com
    zip
    Updated Jan 7, 2024
    Cite
    Terenci Claramunt (2024). Steam Games from 2013 to 2023 [Dataset]. https://www.kaggle.com/terencicp/steam-games-december-2023
    Explore at:
    zip (6442898 bytes)
    Dataset updated
    Jan 7, 2024
    Authors
    Terenci Claramunt
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a derivative dataset created for my Tableau visualisation project. It's derived from two other datasets on Kaggle:

    Steam Games Dataset by Martin Bustos

    Video Games on Steam [in JSON] by Sujay Kapadnis

    From the Martin Bustos dataset, I removed the games without reviews and selected the most relevant features to create the following dashboard:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2Fce81900b3761554ce9acfc7ef25189b6%2Fsteam-dashboard.png?generation=1704630691045231&alt=media

    From the Sujay Kapadnis dataset, I added the data on game duration from HowLongToBeat.com.

    The following notebooks contain exploratory data analysis and the transformations I used to generate this dataset from the two original datasets:

    Steam Games - Exploratory Data Analysis

    Steam Games - Data Transformation

    View the live dashboard on Tableau Public:

    Steam tag explorer

  6. Rural Route Nomad Photo and Video Collection Dataset

    • zenodo.org
    csv
    Updated Jul 12, 2022
    Cite
    Alan Webber (2022). Rural Route Nomad Photo and Video Collection Dataset [Dataset]. http://doi.org/10.5281/zenodo.6818292
    Explore at:
    csv
    Dataset updated
    Jul 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alan Webber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset encompasses the metadata drawn from preserving and visualizing the Rural Route Nomad Photo and Video Collection. The collection consists of 14,058 born-digital objects shot on over a dozen digital cameras in over 30 countries on seven continents, from the end of 2008 through 2009. Metadata was generated using ExifTool along with manual means, then parsed and cleaned with OpenRefine and Excel.
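
    The project itself used ExifTool; purely as an illustration of the same idea, here is a small Python sketch (using Pillow, with an assumed folder name and field selection) that dumps a few EXIF fields to a CSV for cleaning in OpenRefine or Excel.

    import csv
    from pathlib import Path

    from PIL import Image
    from PIL.ExifTags import TAGS

    rows = []
    for path in Path("photos").glob("*.jpg"):  # assumed folder of images
        exif = Image.open(path).getexif()
        named = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
        rows.append({"file": path.name,
                     "camera": named.get("Model"),
                     "taken": named.get("DateTime")})

    with open("metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "camera", "taken"])
        writer.writeheader()
        writer.writerows(rows)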

    The dataset was a result of an overriding project to preserve the digital content of the Rural Route Nomad Collection, and then visualize photographic specs and geographic details with charts, graphs and maps in Tableau. A description of the project as a whole is publicly forthcoming. Visualizations can be found at https://public.tableau.com/app/profile/alan.webber5364.

  7. Artstation

    • kaggle.com
    zip
    Updated May 28, 2021
    Cite
    Dmitriy Zub (2021). Artstation [Dataset]. https://www.kaggle.com/dimitryzub/artstation
    Explore at:
    zip (4067138 bytes)
    Dataset updated
    May 28, 2021
    Authors
    Dmitriy Zub
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Contains links only, as the script used to extract the data was written for a freelance project.

    Content

    100,000 artwork links (just links). 50,000 artworks were scraped and contain data; ~40,000+ of them are unique (the rest are artworks from the same artists).

    Context

    While transitioning from 3D modeling to Data Analytics and Python Programming I decided to create a personal project to analyze something I have a close connection with.

    The dataset includes the following columns:
    - Role
    - Company the artist works at (if mentioned or extracted)
    - Date the artwork was posted
    - Number of views
    - Number of likes
    - Number of comments
    - Which software was used
    - Which tags were used
    - Artwork title
    - Artwork URL

    As you can see from the disclaimer, this is the first time I'm doing this. I want anyone who will be using this dataset to respect artists' privacy by not using artists' email addresses in any way, even though it is publicly available data published by them. Correct me if I said something wrong here.

    Code

    The code used to extract data from ArtStation can be found here, in the GitHub repository.

    Inspiration

    While transitioning from 3D modeling to Data Analytics and Python Programming, I decided to create a personal project to analyze something I have a close connection with. I really enjoyed seeing the progression in the 3D world (games, feature films, etc.).

    Goals

    The goal of this project was to better understand the process of gathering, processing, cleaning, analyzing, and visualizing data. Besides that, I wanted to understand which software, tags, and affiliations are most popular among artists.

    Tools used

    To scrape the data, these Python libraries/packages were used:
    - requests
    - json
    - Google Sheets API
    - selenium
    - regex

    To clean, analyze and visualize the data:
    - Google Sheets
    - Tableau

    Visualization

    Note: the following visualizations contain data bias. Not every tag and affiliation has been taken into account, due to the difficulties of data extraction and the mistakes I made.

    Tableau public dashboard

    https://user-images.githubusercontent.com/78694043/119978304-23cb0380-bfc2-11eb-8b70-e84100fa7630.png
    
    https://user-images.githubusercontent.com/78694043/119978269-1ada3200-bfc2-11eb-981f-b8ad2c2c0ff1.png
    
    https://user-images.githubusercontent.com/78694043/119978237-101f9d00-bfc2-11eb-9285-e0d9bcf688ee.png

  8. DA Analyst Capstone Project

    • kaggle.com
    zip
    Updated May 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tara Jacobs (2024). DA Analyst Capstone Project [Dataset]. https://www.kaggle.com/datasets/tarajacobs/mock-user-profiles-from-social-networks
    Explore at:
    zip (8714 bytes)
    Dataset updated
    May 18, 2024
    Authors
    Tara Jacobs
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    [Project screenshots]

    BigQuery | Data cleaning
    
    Tableau | Creating visuals with Tableau
    
    Sheets | Cleaning NULL values, creating data tables
    
    RStudio | Organizing and cleaning data to create visuals
    
    SQL (SSMS) | Transforming, cleaning and manipulating data
    
    LinkedIn | Survey poll


    Source for mock dating site: pH7-Social-Dating-CMS. Source for mock social site: tailwhip99 / social_media_site.


  9. Divvy Trips Clean Dataset (Nov 2024 – Oct 2025)

    • kaggle.com
    zip
    Updated Nov 14, 2025
    Cite
    Yeshang Upadhyay (2025). Divvy Trips Clean Dataset (Nov 2024 – Oct 2025) [Dataset]. https://www.kaggle.com/datasets/yeshangupadhyay/divvy-trips-clean-dataset-nov-2024-oct-2025
    Explore at:
    zip (170259034 bytes)
    Dataset updated
    Nov 14, 2025
    Authors
    Yeshang Upadhyay
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    📌 Overview

    This dataset contains a cleaned and transformed version of the public Divvy Bicycle Sharing Trip Data covering the period November 2024 to October 2025.

    The original raw data is publicly released by the Chicago Open Data Portal and has been cleaned using pandas (Python) and DuckDB SQL for faster analysis.
    This dataset is now ready for direct use in:
    - Exploratory Data Analysis (EDA)
    - SQL analytics
    - Machine learning
    - Time-series/trend analysis
    - Dashboard creation (Power BI / Tableau)

    📂 Source

    Original Data Provider:
    Chicago Open Data Portal – Divvy Trips
    License: Open Data Commons Public Domain Dedication (PDDL)
    This cleaned dataset only contains transformations; no proprietary or restricted data is included.

    🔧 Cleaning & Transformations Performed

    • Combined monthly CSVs (Nov 2024 → Oct 2025)
    • Removed duplicates
    • Standardized datetime formats
    • Created new fields:
      • ride_length
      • day_of_week
      • hour_of_day
    • Handled missing or null values
    • Cleaned inconsistent station names
    • Filtered invalid ride durations (negative or zero-length rides)
    • Exported as a compressed .csv for optimized performance
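
    A minimal pandas sketch of the derived-field and filtering steps above (the input file name is illustrative, and the DuckDB SQL portion of the pipeline is not shown):

    import pandas as pd

    # Load the combined monthly trips (path is illustrative)
    df = pd.read_csv("divvy_trips_nov2024_oct2025.csv",
                     parse_dates=["started_at", "ended_at"])

    # New fields described above
    df["ride_length"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
    df["day_of_week"] = df["started_at"].dt.day_name()
    df["hour_of_day"] = df["started_at"].dt.hour

    # Remove duplicates and invalid (negative or zero-length) rides
    df = df.drop_duplicates(subset="ride_id")
    df = df[df["ride_length"] > 0]

    # Export as a compressed CSV
    df.to_csv("divvy_trips_clean.csv.gz", index=False, compression="gzip")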

    📊 Columns in the Dataset

    • ride_id
    • rideable_type
    • started_at
    • ended_at
    • start_station_name
    • end_station_name
    • start_lat
    • start_lng
    • end_lat
    • end_lng
    • member_casual
    • ride_length (minutes)
    • day_of_week
    • hour_of_day

    💡 Use Cases

    This dataset is suitable for:
    - DuckDB + SQL analytics
    - Pandas EDA
    - Visualization in Power BI, Tableau, Looker
    - Statistical analysis
    - Member vs. casual rider behavioral analysis
    - Peak usage prediction

    📝 Notes

    This dataset is not the official Divvy dataset, but a cleaned, transformed, and analysis-ready version created for educational and analytical use.

  10. divvy's Trip (Cyclist bike share analysis)

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    katabathina jyoshnavi (2024). divvy's Trip (Cyclist bike share analysis) [Dataset]. https://www.kaggle.com/datasets/katabathinajyoshnavi/divvys-trip-cyclist-bike-share-analysis
    Explore at:
    zip (194213174 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    katabathina jyoshnavi
    Description

    Introduction:

    About the Company:

    Cyclistic is a bike-sharing company in Chicago with a fleet of 5,824 geotracked bicycles stationed at 692 locations across the city. The bikes can be unlocked at one station and returned to any other station in the network at any time. Individuals buying single-ride or full-day passes fall into the category of casual riders, while those acquiring annual memberships become Cyclistic members.
    
    Tools and Technologies:
    ⦁ Tableau/Power BI for dashboard development
    ⦁ Python for data analysis

    Phase 1: About the Dataset
    
    The data is publicly available on an AWS server. We were tasked to work with an entire year of data, so I downloaded zipped files (CSV format) containing data from January 2023 to December 2023, one file for each month.
    
    Data Structure: Each .csv file contains a table with 13 columns of varying data types, as shown below. Each column is a field that describes how people use Cyclistic's bike-sharing service; each row is an observation with the details of one ride.
    ⦁ ride_id: a unique identifier assigned to each bike ride, like a reference number for the trip.
    ⦁ rideable_type: the type of bike used in the ride; it can be "electric_bike" or "classic_bike".
    ⦁ started_at: the date and time when the ride began, in the format YYYY-MM-DD HH:MM:SS.
    ⦁ ended_at: the date and time when the ride ended, in the same format as started_at.
    ⦁ start_station_name: the name of the docking station where the ride started.
    ⦁ start_station_id: a unique identifier for the starting docking station; it complements start_station_name.
    ⦁ start_lat: the latitude coordinate of the starting docking station.
    ⦁ start_lng: the longitude coordinate of the starting docking station. These coordinates are useful for mapping the station's location.
    ⦁ end_station_name: the name of the docking station where the ride ended.
    ⦁ end_station_id: a unique identifier for the ending docking station; it complements end_station_name.
    ⦁ end_lat: the latitude coordinate of the ending docking station.
    ⦁ end_lng: the longitude coordinate of the ending docking station.
    ⦁ member_casual: whether the rider was a member (member) or a casual user (casual) of the bike-sharing service.
    
    Phase 2: Process
    
    I used Python for data cleaning. You can view the Jupyter Notebook for the Process phase here. Here are the steps I took during this phase:
    ⦁ Check for nulls and duplicates
    ⦁ Add columns and transform data (change data types, remove trailing or leading spaces, etc.)
    ⦁ Extract data for analysis
    
    Data cleaning result:
    Total row count before data cleaning: 5745324
    Total row count after data cleaning: 4268747

    Phase 3: Analyze
    
    I used Python in my Jupyter notebook to look at the large dataset we cleaned earlier. I came up with questions to figure out how casual riders differ from annual members, then wrote queries to get the answers, helping us understand more and make decisions based on the data. Here are the questions we will answer in this phase:
    ⦁ What is the percentage of each user type out of total users?
    ⦁ Is there a bike type preferred by different user types?
    ⦁ Which bike type has the longest trip duration between users?
    ⦁ What is the average trip duration per user type?
    ⦁ What is the average distance traveled per user type?
    ⦁ On what days are most users active?
    ⦁ In what months or seasons of the year do users tend to use the bike-sharing service?
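
    A rough pandas sketch of the kind of aggregations behind these questions (the file and derived-column names are assumptions; the author's actual notebook is linked above):

    import pandas as pd

    # Cleaned 2023 trips from the Process phase (file name is assumed)
    df = pd.read_csv("cyclistic_2023_clean.csv",
                     parse_dates=["started_at", "ended_at"])
    df["ride_minutes"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60

    # Percentage of each user type
    user_share = df["member_casual"].value_counts(normalize=True) * 100

    # Bike type preference per user type
    bike_pref = df.groupby(["member_casual", "rideable_type"]).size()

    # Average trip duration per user type
    avg_duration = df.groupby("member_casual")["ride_minutes"].mean()

    print(user_share, bike_pref, avg_duration, sep="\n\n")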

    I used Tableau Public to make the visualization. You can view the data visualization for the Share phase here: https://public.tableau.com/app/profile/katabathina.jyoshnavi/viz/divvytripvisualisation/Dashboard7

    Findings
    ⦁ 63% of total Cyclistic users are annual members, while 36% are casual riders.
    ⦁ Both annual members and casual riders prefer classic bikes. Only casual riders use docked bikes.
    ⦁ Generally, casual riders have the longest average ride duration (23 minutes) compared with annual members (18 minutes).
    ⦁ Both annual members and casual riders travel almost the same average distance.
    ⦁ Docked bikes, which only casual riders use, have the longest average ride duration. Classic bikes have the longest average ride duration for annual members.
    ⦁ Most trips are recorded on Saturday.
    ⦁ There are more trips during spring and the least during winter.

  11. Top 100 TV Shows

    • kaggle.com
    zip
    Updated Jun 27, 2021
    Cite
    Jack Jae Hwan Kim (2021). Top 100 TV Shows [Dataset]. https://www.kaggle.com/jackjaehwankim/top-100-tv-shows
    Explore at:
    zip (2581 bytes)
    Dataset updated
    Jun 27, 2021
    Authors
    Jack Jae Hwan Kim
    Description

    Context

    This is a personal project in which I analyzed the main factors that lead me to select a TV show. I used Python for web scraping (also known as crawling) the data from IMDb.com and used a spreadsheet to clean the dataset. Finally, I used Tableau to visualize the data.

    To build up the database, I utilized web crawling. For this project, I gathered the data from the top 100 TV shows listed by the IMDb user 'carlosotsubo'.
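
    As a rough sketch of the crawling step (this is not the author's script; the list URL and the CSS selector are placeholders that depend on IMDb's markup at the time):

    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.imdb.com/list/ls000000000/"  # placeholder list id
    resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # The selector is an assumption about the list page's markup
    titles = [a.get_text(strip=True)
              for a in soup.select("h3.lister-item-header a")]
    print(titles[:10])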

    Content

    1. tv_show: titles
    2. season_years: ranges from the beginning year to the ending year.
      • Note: some TV shows are still ongoing.
    3. first_season_yr: the beginning year of the first season
    4. last_season_yr: the final or ending year of the last season
    5. running_time_min: the running time of the TV show per episode
    6. genre: in this dataset, the main genre
    7. subgenre1: subgenre #1
    8. subgenre2: subgenre #2
    9. imdb_rating: ratings by IMDb members
    10. watched_yn: whether or not I've watched the show

    Acknowledgements

    I sincerely thank the IMDb user named, 'carlosotsubo,' for providing the list of top 100 TV shows.

    Inspiration

    The following questions need to be answered:

    1. How do I choose which TV show to watch?
    2. Does running time also affect my decision to watch the show?
    3. If not, would the genre be the main factor that affects my decision?

    Data Visualization

    After my own analysis, I've created the data visualization:

    https://public.tableau.com/app/profile/jae.hwan.kim/viz/HowdoIchoosewhichTVshowtowatch/Dashboard1

    If you give me feedback, I will be glad to hear it! Thanks!

  12. Ghana Health Facilities

    • kaggle.com
    zip
    Updated Sep 3, 2018
    Cite
    citizen datascience ghana (2018). Ghana Health Facilities [Dataset]. https://www.kaggle.com/citizen-ds-ghana/health-facilities-gh
    Explore at:
    zip (86057 bytes)
    Dataset updated
    Sep 3, 2018
    Dataset authored and provided by
    citizen datascience ghana
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    Ghana
    Description

    Context

    This dataset is provided as part of the Citizen Data Science project, to gather and provide fairly clean data (which is a challenge in these regions) to support data science practice in Ghana and other regions at the beginning of their data science learning curve. So your support is welcome.

    This dataset provides a listing of healthcare facilities in Ghana; by exploring it, we gain a new understanding of the country's health infrastructure.

    Content

    This dataset contains information about health facilities in Ghana organised by Region and District. It also includes the type of health facility and the ownership as well as its geo-location.

    Dataset Use Cases (are you up to the task? Try any of the below)

    1. Learning/familiarisation with cleaning data and resolving issues in a challenging data-acquisition context.

    2. Understanding Ghana's health infrastructure

    3. Complex join of health facilities and tier data
    The health facilities data and the tier data come from different sources, but we would like to join them because they refer to the same facilities. This may not be a simple join, however, because the facility names in the two datasets are not exact string matches (see the fuzzy-join sketch below).

    4. Understanding the level of access to facilities
    Combined with population data, we want to understand whether some regions or areas are deprived.

    Any other creative stuff you can do with this data
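
    For use case 3, here is a minimal sketch of a fuzzy join using Python's standard-library difflib; the file and column names are assumptions about the two sources.

    import difflib

    import pandas as pd

    facilities = pd.read_csv("health_facilities.csv")  # assumed file name
    tiers = pd.read_csv("facility_tiers.csv")          # assumed file name
    tier_names = tiers["FacilityName"].tolist()        # assumed column name

    def best_match(name):
        # Closest tier-data facility name above a similarity cutoff, else None
        hits = difflib.get_close_matches(str(name), tier_names, n=1, cutoff=0.85)
        return hits[0] if hits else None

    facilities["tier_name"] = facilities["FacilityName"].map(best_match)
    merged = facilities.merge(tiers, left_on="tier_name",
                              right_on="FacilityName", suffixes=("", "_tier"))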

    Inspiration

    Acknowledgements

    accessed: http://data.gov.gh/dataset/health-facilities
    source: http://www.moh-ghana.org/

    by: easimadi

  13. Top 100 Bollywood IMDb movies by genres

    • kaggle.com
    zip
    Updated Oct 12, 2022
    Cite
    abhimech_008 (2022). Top 100 Bollywood IMDb movies by genres [Dataset]. https://www.kaggle.com/datasets/abhimech008/top-100-bollywood-imdb-movies-by-genres/versions/1
    Explore at:
    zip (6093 bytes)
    Dataset updated
    Oct 12, 2022
    Authors
    abhimech_008
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains information about the top 100 highest grossing Bollywood films. It is up to date as of 10th January 2022.

    Acknowledgements

    This data has been scraped from IMDb; the link has been added. EDA was performed on the dataset to fill in the missing values and clean the data. The CSV file provided is ready to use for visualizations. I have already visualized this dataset using Tableau, which you can check on my Tableau profile (https://public.tableau.com/app/profile/abhishek.verma6495).

    If you wish to contribute to this dataset, do contact me :)

  14. Smallmouth Bass State Records

    • kaggle.com
    zip
    Updated Oct 24, 2023
    Cite
    Taylor (2023). Smallmouth Bass State Records [Dataset]. https://www.kaggle.com/datasets/treddson/smalllmouth-bass-state-records
    Explore at:
    zip (1570 bytes)
    Dataset updated
    Oct 24, 2023
    Authors
    Taylor
    Description

    I have gathered data on state records for Smallmouth Bass. I am an avid angler and especially love catching Smallmouth Bass in the clean, deep lakes here in the Pacific Northwest.

    Gathering data as accurate as possible on this topic was not an easy task. I became very familiar with the data cleaning process despite this being an incredibly small dataset. Much of the cleaning involved correcting dates and ensuring the information was up to date.

    As for the visualization of this data, it's rather straightforward. The labels include the name of the individual who caught the fish, the state and fishery where the fish was caught, the weight of the fish, and the year the fish was caught.

    The size of the data points for each state is determined by the weight of the fish and the darker the color, the larger the fish.

  15. EPL Player Transfer Data from 2014-15_2018-19

    • kaggle.com
    zip
    Updated Nov 24, 2019
    Cite
    Sandman (2019). EPL Player Transfer Data from 2014-15_2018-19 [Dataset]. https://www.kaggle.com/sandipanchakraborty/epl-player-transfer-data-from-201415-201819
    Explore at:
    zip (247272 bytes)
    Dataset updated
    Nov 24, 2019
    Authors
    Sandman
    Description

    This dataset contains exactly what the heading says and some more! I am a fan of the English Premier League. A few months back I started wondering whether I could find any transfer data so that I could find some patterns in it. To my dismay, there's no free source available. That's when I started digging into Wikipedia. All the data I have attained are from Wikipedia.

    As stated already, all the data has been obtained from Wikipedia, hence there are a few inconsistencies, which can be sorted out easily. The data spans the 2014-15 season to the 2018-19 season (through the summer window). I initially created separate files, then used Alteryx to clean the data, made some adjustments, and finally appended everything into one file.

    My inspiration was to see the patterns in the transfer system of a particular club. Since, this was a big dataset, I tried to make an initial attempt, which can be seen by clicking on this link.

    However, can this data be used to predict the spending of a club?

  16. Divvy Bikeshare Data | April 2020 - May 2021

    • kaggle.com
    Updated Aug 21, 2021
    Cite
    Antoni K Pestka (2021). Divvy Bikeshare Data | April 2020 - May 2021 [Dataset]. https://www.kaggle.com/antonikpestka/divvy-bikeshare-data-april-2020-may-2021/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Antoni K Pestka
    Description

    Original Divvy Bikeshare Data obtained from here

    City of Chicago Zip Code Boundary Data obtained from here

    Tableau Dashboard Viz can be seen here

    R code can be found here

    Context

    This is my first-ever project after recently completing the Google Data Analytics Certificate on Coursera.

    The goals of the project are to answer the following questions:
    1. How do annual riders and casual riders use Divvy bikeshare differently?
    2. Why would casual riders buy annual memberships?
    3. How can Divvy use digital media to influence casual riders to become members?

    Casual riders are defined as those who do not have an annual membership, and instead use the service on a pay-per-ride basis.

    Content

    Original Divvy Bikeshare Data obtained from here

    The original datasets included the following columns:
    Ride ID #
    Rideable Type (electric, docked bike, classic)
    Started At Date/Time
    Ended At Date/Time
    Start Station Address
    Start Station ID
    End Station Address
    End Station ID
    Start Longitude
    Start Latitude
    End Longitude
    End Latitude
    Member Type (member, casual)

    City of Chicago Zip Code Boundary Data obtained from here

    The zip code boundary geospatial files were used to calculate the zip code of trip origin for each trip based on start longitude and start latitude.

    Caveats and Assumptions

    1. Divvy utilizes two types of bicycles: electric bicycles and classic bicycles. For the column labeled "rideable_type", three values existed: docked_bike, electric_bike, and classic. Docked_bike and classic were aggregated into the same category. Therefore, they are labeled as "other" on the visualization.

    2. Negative ride lengths and ride lengths under 90 seconds were not included in the calculation of average ride length.
    - Negative ride lengths exist because the end time and date were recorded as occurring BEFORE the start time and date on certain entries.
    - Ride lengths of 90 seconds or less were ruled out due to the possibility of bikes failing to dock properly or being checked out briefly for maintenance checks.
    - This removed 90,842 records from the calculations for average ride length.

    The process

    R programming language was used for the following:

    1. Create a new column for the zip code of each trip origin based on the start longitude and start latitude
    2. Calculate the ride length in seconds for each trip
    3. Remove unnecessary columns
    4. Rename "electric_bike" to EL and "docked_bike" to DB

    The R code I utilized is found here

    Excel was used for the following:

    1. Deletion of header rows for all dataset files except for the first file (April 2020)
    2. Deletion of the geometry information to save file space

    A .bat file utilizing the DOS command line was used to merge all the cleaned CSV files into a single file, as sketched below.
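
    For illustration, a Python equivalent of that merge step (the author used a .bat script; the folder and output names here are assumptions):

    import glob

    import pandas as pd

    # Stack the cleaned monthly CSVs into a single file
    files = sorted(glob.glob("cleaned_csvs/*.csv"))
    merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    merged.to_csv("divvy_merged.csv", index=False)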

    Finally, the cleaned and merged dataset was connected to Tableau for analysis and visualization. A link to the dashboard can be found here

    Data Analysis Overview

    Zip Code with Highest Quantity of Trips: 60614 (615,010)
    Total Quantity of Zip Codes: 56
    Trip Quantity of Top 9 Zip Codes: 60.35% (2,630,330)
    Trip Quantity of the Remaining 47 Zip Codes: 39.65% (1,728,281)
    
    Total Quantity of Trips: 4,358,611
    Quantity of Trips by Annual Members: 58.15% (2,534,718)
    Quantity of Trips by Casual Members: 41.85% (1,823,893)

    Average Ride Length with Electric Bicycle: Annual Members: 13.8 minutes; Casual Members: 22.3 minutes
    
    Average Ride Length with Classic Bicycle: Annual Members: 16.8 minutes; Casual Members: 49.7 minutes
    
    Average Ride Length Overall: Annual Members: 16.2 minutes; Casual Members: 44.2 minutes
    
    Peak Day of the Week for Overall Trip Quantity: Annual Members: Saturday; Casual Members: Saturday
    
    Slowest Day of the Week for Overall Trip Quantity: Tuesday (Annual Members: Sunday; Casual Members: Tuesday)
    
    Peak Day of the Week for Electric Bikes: Saturday (Annual Members: Saturday; Casual Members: Saturday)
    
    Slowest Day of the Week for Electric Bikes: Tuesday (Annual Members: Sunday; Casual Members: Tuesday)
    
    Peak Day of the Week for Classic Bikes: Saturday Ann...

