16 datasets found
  1. Netflix Data: Cleaning, Analysis and Visualization

    • kaggle.com
    zip
    Updated Aug 26, 2022
    Cite
    Abdulrasaq Ariyo (2022). Netflix Data: Cleaning, Analysis and Visualization [Dataset]. https://www.kaggle.com/datasets/ariyoomotade/netflix-data-cleaning-analysis-and-visualization
    Explore at:
    zip (276607 bytes)
    Dataset updated
    Aug 26, 2022
    Authors
    Abdulrasaq Ariyo
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original dataset, which can be found here. The data consists of content added to Netflix from 2008 to 2021; the oldest title dates back to 1925 and the newest to 2021. This dataset will be cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.

    Data Cleaning

    We are going to:
    1. Treat the nulls
    2. Treat the duplicates
    3. Populate missing rows
    4. Drop unneeded columns
    5. Split columns
    Extra steps and further explanation of the process are provided in the code comments.

    --View dataset
    
    SELECT * 
    FROM netflix;
    
    
    --The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
    
    SELECT show_id, COUNT(*)
    FROM netflix
    GROUP BY show_id
    HAVING COUNT(*) > 1;
    
    --No duplicates
    
    --Check null values across columns
    
    SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
        COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
        COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
        COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
        COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
        COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
        COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
        COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
        COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
        COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
        COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
        COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
    FROM netflix;
    
    We can see that there are NULLs:
    director_nulls = 2634
    movie_cast_nulls = 825
    country_nulls = 831
    date_added_nulls = 10
    rating_nulls = 4
    duration_nulls = 3 
    

    Nulls make up about 30% of the director column, so I will not delete them; instead, I will use another column to populate them. To populate the director column, we want to find out whether there is a relationship between the movie_cast column and the director column.

    -- Below, we find out if some directors are likely to work with particular cast
    
    WITH cte AS
    (
    SELECT title, CONCAT(director, '---', movie_cast) AS director_cast 
    FROM netflix
    )
    
    SELECT director_cast, COUNT(*) AS count
    FROM cte
    GROUP BY director_cast
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC;
    
    With this, we can now populate the NULL rows in director using their matching movie_cast records:
    
    UPDATE netflix 
    SET director = 'Alastair Fothergill'
    WHERE movie_cast = 'David Attenborough'
    AND director IS NULL;
    
    --Repeat this step to populate the rest of the director nulls
    --Populate the rest of the NULL in director as "Not Given"
    
    UPDATE netflix 
    SET director = 'Not Given'
    WHERE director IS NULL;
    
    --When I was doing this, I found a less complex and faster way to populate a column, which I will use next
    

    Just like the director column, I will not delete the nulls in country. Since the country column is related to director and movie, we are going to populate the country column using the director column.

    --Populate the country using the director column
    
    SELECT COALESCE(nt.country,nt2.country) 
    FROM netflix AS nt
    JOIN netflix AS nt2 
    ON nt.director = nt2.director 
    AND nt.show_id <> nt2.show_id
    WHERE nt.country IS NULL;
    
    UPDATE netflix
    SET country = nt2.country
    FROM netflix AS nt2
    WHERE netflix.director = nt2.director and netflix.show_id <> nt2.show_id 
    AND netflix.country IS NULL;
    
    
    --Confirm whether any rows still have a NULL country after the update
    
    SELECT director, country, date_added
    FROM netflix
    WHERE country IS NULL;
    
    --Populate the rest of the NULLs in country as "Not Given"
    
    UPDATE netflix 
    SET country = 'Not Given'
    WHERE country IS NULL;
    

    The date_added column has only 10 nulls out of over 8,000 rows, so deleting them will not affect our analysis or visualization.

    --Show date_added nulls
    
    SELECT show_id, date_added
    FROM netflix
    WHERE date_added IS NULL;
    
    --DELETE nulls
    
    DELETE F...
    
  2. Stock Market Dashboard Build (Python + Tableau)

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    jackmnob (2025). Stock Market Dashboard Build (Python + Tableau) [Dataset]. https://www.kaggle.com/datasets/jackmnob/stock-market-dashboard-build-python-tableau
    Explore at:
    zip (549379249 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    jackmnob
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Original Credit goes to: Oleh Onyshchak

    Original Owner: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?resource=download

    rawData (.CSVs) Information:

    "This dataset contains historical data of daily prices for each ticker (minus a few incompatible tickers, such as CARR# and UTX#) - currently trading on NASDAQ. The up to date list is available from nasdaqtrader.com.

    The historic data was retrieved from Yahoo finance via yfinance python package."

    Each file contains data from 01/04/2016 to 04/01/2020.
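
    For context, here is a minimal sketch of how such data can be pulled with the yfinance package. This is an illustration under assumptions, not the author's actual script; the ticker symbol and output file name are examples only.

    import yfinance as yf

    # Download daily price history for one ticker over the dataset's window.
    # "AAPL" is illustrative; the raw data covers all compatible NASDAQ tickers.
    history = yf.download("AAPL", start="2016-01-04", end="2020-04-01")

    # Save in the one-CSV-per-ticker layout described above
    history.to_csv("AAPL.csv")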

    cleanData (.CSVs) & .ipynb (Python code) Information:

    This edition contains my .ipynb notebook for user replication within JupyterLab and code transparency via Kaggle. The dataset is cleaned with Python and pandas, then used to create the final Tableau dashboard linked below:

    My Tableau Dashboard: https://public.tableau.com/app/profile/jack3951/viz/TopStocksAnalysisPythonpandas/Dashboard1

    Enjoy!

  3. Visualizing Chicago Crime Data

    • kaggle.com
    zip
    Updated Jul 1, 2022
    Cite
    Elijah Toumoua (2022). Visualizing Chicago Crime Data [Dataset]. https://www.kaggle.com/datasets/elijahtoumoua/chicago-analysis-of-crime-data-dashboard
    Explore at:
    zip (94861784 bytes)
    Dataset updated
    Jul 1, 2022
    Authors
    Elijah Toumoua
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Area covered
    Chicago
    Description

    Prelude

    This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualizations and creating dashboards. To be specific, I will attempt to create a dashboard that will allow users to see metrics for a specific crime within a given year using filters and metrics. Due to this, there will not be much of a focus on the analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the Query (which utilized BigQuery) can be found here and the Tableau dashboard can be found here.

    About the Dataset

    Important Facts

    The dataset comes directly from the City of Chicago's website, under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) system and is updated daily to present the information accurately. This means that a crime on a specific date may be changed to better reflect the case. The dataset represents crimes from 2001 up to seven days prior to today's date.

    Reliability

    Using the ROCCC method, we can see that:
    * The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not know how big the sample size is, I do believe the dataset has high reliability since it geographically covers the entirety of Chicago.
    * The data has high originality: The dataset was obtained directly from the Chicago Police Dept. using their database, so we can say this dataset is original.
    * The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic locations, I do not think this gives us proper insight into why these crimes take place. We can pinpoint the location of the crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a low-income neighborhood? These missing factors prevent us from getting proper insight into why these crimes take place, so I would say this dataset is subpar in how comprehensive it is.
    * The data is current: The dataset is updated frequently to display crimes that took place up to seven days prior to today's date, and past crimes may even be updated as more information comes to light. Due to the frequent updates, I do believe the data is current.
    * The data is cited: As mentioned prior, the data is collected directly from the police's CLEAR system, so we can say the data is cited.

    Processing the Data

    Cleaning the Dataset

    The purpose of this step is to clean the dataset such that there are no outliers in the dashboard. To do this, we are going to do the following:
    * Check for any null values and determine whether we should remove them.
    * Update any values where there may be typos.
    * Check for outliers and determine if we should remove them.

    The following steps are explained in the code segments below. (I used BigQuery for this, so the code follows BigQuery's syntax.)
    
    -- Examining the dataset
    -- There are over 7.5 million rows of data,
    -- so put a limit on exploratory queries so they do not take a long time to run
    
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    LIMIT 1000;
    
    -- Seeing which points are null
    -- There are 85,000 null points, so we can exclude them; it's not a significant
    -- amount, since it is only ~1.3% of the dataset.
    -- Most of the null points are in the lat and long, which we will need later.
    -- Because we don't have the full address, we can't estimate the lat and long
    -- in SQL, so we will have to delete the rows with null data.
    
    SELECT *
    FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;
    
    -- Deleting all null rows
    
    DELETE FROM `portfolioproject-350601.ChicagoCrime.Crime`
    WHERE unique_key IS NULL
       OR case_number IS NULL
       OR date IS NULL
       OR primary_type IS NULL
       OR location_description IS NULL
       OR arrest IS NULL
       OR longitude IS NULL
       OR latitude IS NULL;
    
    -- Checking for any duplicates in the unique keys
    -- None to be found
    
    SELECT unique_key, COUNT(unique_key) FROM `portfolioproject-350601.ChicagoCrime....

  4. IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +2more
    Updated Jan 24, 2020
    Cite
    Cains, Mariana; Anand, Srini (2020). IVMOOC 2017 - GloBI Data for Interactive Tableau Map of Spatial and Temporal Distribution of Interactions [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_814911
    Explore at:
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Indiana University
    Authors
    Cains, Mariana; Anand, Srini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.

    Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.

    The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm

    The companion paper can be found here: doi.org/10.5281/zenodo.814979

    Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922

    Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)

  5. Steam Games from 2013 to 2023

    • kaggle.com
    zip
    Updated Jan 7, 2024
    Cite
    Terenci Claramunt (2024). Steam Games from 2013 to 2023 [Dataset]. https://www.kaggle.com/terencicp/steam-games-december-2023
    Explore at:
    zip (6442898 bytes)
    Dataset updated
    Jan 7, 2024
    Authors
    Terenci Claramunt
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This is a derivative dataset created for my Tableau visualisation project. It's derived from two other datasets on Kaggle:

    Steam Games Dataset by Martin Bustos

    Video Games on Steam [in JSON] by Sujay Kapadnis

    From the Martin Bustos dataset, I removed the games without reviews and selected the most relevant features to create the following dashboard:

    https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2Fce81900b3761554ce9acfc7ef25189b6%2Fsteam-dashboard.png?generation=1704630691045231&alt=media

    From the Sujay Kapadnis dataset, I added the data on game duration from HowLongToBeat.com.

    The following notebooks contain exploratory data analysis and the transformations I used to generate this dataset from the two original datasets:

    Steam Games - Exploratory Data Analysis

    Steam Games - Data Transformation

    View the live dashboard on Tableau Public:

    Steam tag explorer

  6. Rural Route Nomad Photo and Video Collection Dataset

    • zenodo.org
    csv
    Updated Jul 12, 2022
    Cite
    Alan Webber (2022). Rural Route Nomad Photo and Video Collection Dataset [Dataset]. http://doi.org/10.5281/zenodo.6818292
    Explore at:
    csv
    Dataset updated
    Jul 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Alan Webber
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset encompasses the metadata drawn from preserving and visualizing the Rural Route Nomad Photo and Video Collection. The collection consists of 14,058 born-digital objects shot on over a dozen digital cameras in over 30 countries on seven continents, from the end of 2008 through 2009. Metadata was generated using ExifTool along with manual means, then parsed and cleaned with OpenRefine and Excel.
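
    The project itself used ExifTool; purely as an illustration of the same idea, here is a small Python sketch (using Pillow, with an assumed folder name and field selection) that dumps a few EXIF fields to a CSV for cleaning in OpenRefine or Excel.

    import csv
    from pathlib import Path

    from PIL import Image
    from PIL.ExifTags import TAGS

    rows = []
    for path in Path("photos").glob("*.jpg"):  # assumed folder of images
        exif = Image.open(path).getexif()
        named = {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}
        rows.append({"file": path.name,
                     "camera": named.get("Model"),
                     "taken": named.get("DateTime")})

    with open("metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["file", "camera", "taken"])
        writer.writeheader()
        writer.writerows(rows)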

    The dataset was a result of an overriding project to preserve the digital content of the Rural Route Nomad Collection, and then visualize photographic specs and geographic details with charts, graphs and maps in Tableau. A description of the project as a whole is publicly forthcoming. Visualizations can be found at https://public.tableau.com/app/profile/alan.webber5364.

  7. Artstation

    • kaggle.com
    zip
    Updated May 28, 2021
    Cite
    Dmitriy Zub (2021). Artstation [Dataset]. https://www.kaggle.com/dimitryzub/artstation
    Explore at:
    zip (4067138 bytes)
    Dataset updated
    May 28, 2021
    Authors
    Dmitriy Zub
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Contains links only, as the script used to extract the data was written for a freelance project.

    Content

    100,000 artwork links (just links). 50,000 artworks were scraped and contain data; ~40,000+ of them are unique (the rest are artworks from the same artists).

    Context

    While transitioning from 3D modeling to Data Analytics and Python Programming I decided to create a personal project to analyze something I have a close connection with.

    The dataset includes the following columns:
    - Role
    - Company the artist works at (if mentioned or extracted)
    - Date the artwork was posted
    - Number of views
    - Number of likes
    - Number of comments
    - Which software was used
    - Which tags were used
    - Artwork title
    - Artwork URL

    As you can see from the disclaimer, this is the first time I'm doing this. I want anyone who will be using this dataset to respect artists' privacy by not using artists' email addresses in any way, even though it is publicly available data published by them. Correct me if I said something wrong here.

    Code

    The code used to extract data from ArtStation can be found here, in the GitHub repository.

    Inspiration

    While transitioning from 3D modeling to Data Analytics and Python Programming, I decided to create a personal project to analyze something I have a close connection with. I really enjoyed seeing the progression in the 3D world (games, feature films, etc.).

    Goals

    The goal of this project was to better understand the process of gathering, processing, cleaning, analyzing, and visualizing data. Besides that, I wanted to understand which software, tags, and affiliations are most popular among artists.

    Tools used

    To scrape the data, these Python libraries/packages were used:
    - requests
    - json
    - Google Sheets API
    - selenium
    - regex

    To clean, analyze and visualize the data:
    - Google Sheets
    - Tableau

    Visualization

    Note: the following visualizations contain data bias. Not every tag and affiliation has been taken into account, due to the difficulties of data extraction and the mistakes I made.

    Tableau public dashboard

    https://user-images.githubusercontent.com/78694043/119978304-23cb0380-bfc2-11eb-8b70-e84100fa7630.png
    
    https://user-images.githubusercontent.com/78694043/119978269-1ada3200-bfc2-11eb-981f-b8ad2c2c0ff1.png
    
    https://user-images.githubusercontent.com/78694043/119978237-101f9d00-bfc2-11eb-9285-e0d9bcf688ee.png

  8. DA Analyst Capstone Project

    • kaggle.com
    zip
    Updated May 18, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tara Jacobs (2024). DA Analyst Capstone Project [Dataset]. https://www.kaggle.com/datasets/tarajacobs/mock-user-profiles-from-social-networks
    Explore at:
    zip (8714 bytes)
    Dataset updated
    May 18, 2024
    Authors
    Tara Jacobs
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    [Project screenshots]

    BigQuery | Data cleaning
    
    Tableau | Creating visuals with Tableau
    
    Sheets | Cleaning NULL values, creating data tables
    
    RStudio | Organizing and cleaning data to create visuals
    
    SQL (SSMS) | Transforming, cleaning and manipulating data
    
    LinkedIn | Survey poll


    Source for mock dating site: pH7-Social-Dating-CMS. Source for mock social site: tailwhip99 / social_media_site.


  9. Divvy Trips Clean Dataset (Nov 2024 – Oct 2025)

    • kaggle.com
    zip
    Updated Nov 14, 2025
    Cite
    Yeshang Upadhyay (2025). Divvy Trips Clean Dataset (Nov 2024 – Oct 2025) [Dataset]. https://www.kaggle.com/datasets/yeshangupadhyay/divvy-trips-clean-dataset-nov-2024-oct-2025
    Explore at:
    zip (170259034 bytes)
    Dataset updated
    Nov 14, 2025
    Authors
    Yeshang Upadhyay
    License

    ODC Public Domain Dedication and Licence (PDDL) v1.0: http://www.opendatacommons.org/licenses/pddl/1.0/
    License information was derived automatically

    Description

    📌 Overview

    This dataset contains a cleaned and transformed version of the public Divvy Bicycle Sharing Trip Data covering the period November 2024 to October 2025.

    The original raw data is publicly released by the Chicago Open Data Portal and has been cleaned using pandas (Python) and DuckDB SQL for faster analysis.
    This dataset is now ready for direct use in:
    - Exploratory Data Analysis (EDA)
    - SQL analytics
    - Machine learning
    - Time-series/trend analysis
    - Dashboard creation (Power BI / Tableau)

    📂 Source

    Original Data Provider:
    Chicago Open Data Portal – Divvy Trips
    License: Open Data Commons Public Domain Dedication (PDDL)
    This cleaned dataset only contains transformations; no proprietary or restricted data is included.

    🔧 Cleaning & Transformations Performed

    • Combined monthly CSVs (Nov 2024 → Oct 2025)
    • Removed duplicates
    • Standardized datetime formats
    • Created new fields:
      • ride_length
      • day_of_week
      • hour_of_day
    • Handled missing or null values
    • Cleaned inconsistent station names
    • Filtered invalid ride durations (negative or zero-length rides)
    • Exported as a compressed .csv for optimized performance
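
    A minimal pandas sketch of the derived-field and filtering steps above (the input file name is illustrative, and the DuckDB SQL portion of the pipeline is not shown):

    import pandas as pd

    # Load the combined monthly trips (path is illustrative)
    df = pd.read_csv("divvy_trips_nov2024_oct2025.csv",
                     parse_dates=["started_at", "ended_at"])

    # New fields described above
    df["ride_length"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60
    df["day_of_week"] = df["started_at"].dt.day_name()
    df["hour_of_day"] = df["started_at"].dt.hour

    # Remove duplicates and invalid (negative or zero-length) rides
    df = df.drop_duplicates(subset="ride_id")
    df = df[df["ride_length"] > 0]

    # Export as a compressed CSV
    df.to_csv("divvy_trips_clean.csv.gz", index=False, compression="gzip")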

    📊 Columns in the Dataset

    • ride_id
    • rideable_type
    • started_at
    • ended_at
    • start_station_name
    • end_station_name
    • start_lat
    • start_lng
    • end_lat
    • end_lng
    • member_casual
    • ride_length (minutes)
    • day_of_week
    • hour_of_day

    💡 Use Cases

    This dataset is suitable for:
    - DuckDB + SQL analytics
    - Pandas EDA
    - Visualization in Power BI, Tableau, Looker
    - Statistical analysis
    - Member vs. casual rider behavioral analysis
    - Peak usage prediction

    📝 Notes

    This dataset is not the official Divvy dataset, but a cleaned, transformed, and analysis-ready version created for educational and analytical use.

  10. divvy's Trip (Cyclist bike share analysis)

    • kaggle.com
    zip
    Updated Apr 10, 2024
    Cite
    katabathina jyoshnavi (2024). divvy's Trip (Cyclist bike share analysis) [Dataset]. https://www.kaggle.com/datasets/katabathinajyoshnavi/divvys-trip-cyclist-bike-share-analysis
    Explore at:
    zip (194213174 bytes)
    Dataset updated
    Apr 10, 2024
    Authors
    katabathina jyoshnavi
    Description

    Introduction:

    About the Company:

    Cyclistic is a bike-sharing company in Chicago with a fleet of 5,824 geotracked bicycles stationed at 692 locations across the city. The bikes can be unlocked at one station and returned to any other station in the network at any time. Individuals buying single-ride or full-day passes fall into the category of casual riders, while those acquiring annual memberships become Cyclistic members.
    
    Tools and Technologies:
    ⦁ Tableau/Power BI for dashboard development
    ⦁ Python for data analysis

    Phase 1: About the Dataset
    
    The data is publicly available on an AWS server. We were tasked to work with an entire year of data, so I downloaded zipped files (CSV format) containing data from January 2023 to December 2023, one file for each month.
    
    Data Structure: Each .csv file contains a table with 13 columns of varying data types, as shown below. Each column is a field that describes how people use Cyclistic's bike-sharing service; each row is an observation with the details of one ride.
    ⦁ ride_id: a unique identifier assigned to each bike ride, like a reference number for the trip.
    ⦁ rideable_type: the type of bike used in the ride; it can be "electric_bike" or "classic_bike".
    ⦁ started_at: the date and time when the ride began, in the format YYYY-MM-DD HH:MM:SS.
    ⦁ ended_at: the date and time when the ride ended, in the same format as started_at.
    ⦁ start_station_name: the name of the docking station where the ride started.
    ⦁ start_station_id: a unique identifier for the starting docking station; it complements start_station_name.
    ⦁ start_lat: the latitude coordinate of the starting docking station.
    ⦁ start_lng: the longitude coordinate of the starting docking station. These coordinates are useful for mapping the station's location.
    ⦁ end_station_name: the name of the docking station where the ride ended.
    ⦁ end_station_id: a unique identifier for the ending docking station; it complements end_station_name.
    ⦁ end_lat: the latitude coordinate of the ending docking station.
    ⦁ end_lng: the longitude coordinate of the ending docking station.
    ⦁ member_casual: whether the rider was a member (member) or a casual user (casual) of the bike-sharing service.
    
    Phase 2: Process
    
    I used Python for data cleaning. You can view the Jupyter Notebook for the Process phase here. Here are the steps I took during this phase:
    ⦁ Check for nulls and duplicates
    ⦁ Add columns and transform data (change data types, remove trailing or leading spaces, etc.)
    ⦁ Extract data for analysis
    
    Data cleaning result:
    Total row count before data cleaning: 5745324
    Total row count after data cleaning: 4268747

    Phase 3: Analyze
    
    I used Python in my Jupyter notebook to look at the large dataset we cleaned earlier. I came up with questions to figure out how casual riders differ from annual members, then wrote queries to get the answers, helping us understand more and make decisions based on the data. Here are the questions we will answer in this phase:
    ⦁ What is the percentage of each user type out of total users?
    ⦁ Is there a bike type preferred by different user types?
    ⦁ Which bike type has the longest trip duration between users?
    ⦁ What is the average trip duration per user type?
    ⦁ What is the average distance traveled per user type?
    ⦁ On what days are most users active?
    ⦁ In what months or seasons of the year do users tend to use the bike-sharing service?
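
    A rough pandas sketch of the kind of aggregations behind these questions (the file and derived-column names are assumptions; the author's actual notebook is linked above):

    import pandas as pd

    # Cleaned 2023 trips from the Process phase (file name is assumed)
    df = pd.read_csv("cyclistic_2023_clean.csv",
                     parse_dates=["started_at", "ended_at"])
    df["ride_minutes"] = (df["ended_at"] - df["started_at"]).dt.total_seconds() / 60

    # Percentage of each user type
    user_share = df["member_casual"].value_counts(normalize=True) * 100

    # Bike type preference per user type
    bike_pref = df.groupby(["member_casual", "rideable_type"]).size()

    # Average trip duration per user type
    avg_duration = df.groupby("member_casual")["ride_minutes"].mean()

    print(user_share, bike_pref, avg_duration, sep="\n\n")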

    I used Tableau Public to make the visualization. You can view the data visualization for the Share phase here: https://public.tableau.com/app/profile/katabathina.jyoshnavi/viz/divvytripvisualisation/Dashboard7

    Findings
    ⦁ 63% of total Cyclistic users are annual members, while 36% are casual riders.
    ⦁ Both annual members and casual riders prefer classic bikes. Only casual riders use docked bikes.
    ⦁ Generally, casual riders have the longest average ride duration (23 minutes) compared with annual members (18 minutes).
    ⦁ Both annual members and casual riders travel almost the same average distance.
    ⦁ Docked bikes, which only casual riders use, have the longest average ride duration. Classic bikes have the longest average ride duration for annual members.
    ⦁ Most trips are recorded on Saturday.
    ⦁ There are more trips during spring and the least during winter.

  11. Top 100 TV Shows

    • kaggle.com
    zip
    Updated Jun 27, 2021
    Cite
    Jack Jae Hwan Kim (2021). Top 100 TV Shows [Dataset]. https://www.kaggle.com/jackjaehwankim/top-100-tv-shows
    Explore at:
    zip (2581 bytes)
    Dataset updated
    Jun 27, 2021
    Authors
    Jack Jae Hwan Kim
    Description

    Context

    This is a personal project in which I analyzed the main factors that lead me to select a TV show. I used Python for web scraping (also known as crawling) the data from IMDb.com and used a spreadsheet to clean the dataset. Finally, I used Tableau to visualize the data.

    To build up the database, I utilized web crawling. For this project, I gathered the data from the top 100 TV shows listed by the IMDb user 'carlosotsubo'.
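
    As a rough sketch of the crawling step (this is not the author's script; the list URL and the CSS selector are placeholders that depend on IMDb's markup at the time):

    import requests
    from bs4 import BeautifulSoup

    URL = "https://www.imdb.com/list/ls000000000/"  # placeholder list id
    resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # The selector is an assumption about the list page's markup
    titles = [a.get_text(strip=True)
              for a in soup.select("h3.lister-item-header a")]
    print(titles[:10])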

    Content

    1. tv_show: titles
    2. season_years: ranges from the beginning year to the ending year.
      • Note: some TV shows are still ongoing.
    3. first_season_yr: the beginning year of the first season
    4. last_season_yr: the final or ending year of the last season
    5. running_time_min: the running time of the TV show per episode
    6. genre: in this dataset, the main genre
    7. subgenre1: subgenre #1
    8. subgenre2: subgenre #2
    9. imdb_rating: ratings by IMDb members
    10. watched_yn: whether or not I've watched the show

    Acknowledgements

    I sincerely thank the IMDb user named, 'carlosotsubo,' for providing the list of top 100 TV shows.

    Inspiration

    The following questions need to be answered:

    1. How do I choose which TV show to watch?
    2. Does running time also affect my decision to watch the show?
    3. If not, would the genre be the main factor that affects my decision?

    Data Visualization

    After my own analysis, I've created the data visualization:

    https://public.tableau.com/app/profile/jae.hwan.kim/viz/HowdoIchoosewhichTVshowtowatch/Dashboard1

    If you give me feedback, I will be glad to hear it! Thanks!

  12. Ghana Health Facilities

    • kaggle.com
    zip
    Updated Sep 3, 2018
    Cite
    citizen datascience ghana (2018). Ghana Health Facilities [Dataset]. https://www.kaggle.com/citizen-ds-ghana/health-facilities-gh
    Explore at:
    zip (86057 bytes)
    Dataset updated
    Sep 3, 2018
    Dataset authored and provided by
    citizen datascience ghana
    License

    http://opendatacommons.org/licenses/dbcl/1.0/

    Area covered
    Ghana
    Description

    Context

    This dataset is provided as part of the Citizen Data Science project, to gather and provide fairly clean data (which is a challenge in these regions) to support data science practice in Ghana and other regions at the beginning of their data science learning curve. So your support is welcome.

    This dataset provides a listing of healthcare facilities in Ghana; by exploring it, we gain a new understanding of the country's health infrastructure.

    Content

    This dataset contains information about health facilities in Ghana organised by Region and District. It also includes the type of health facility and the ownership as well as its geo-location.

    Dataset Use Cases (are you up to the task? Try any of the below)

    1. Learning/familiarisation with cleaning data and resolving issues in a challenging data-acquisition context.

    2. Understanding Ghana's health infrastructure

    3. Complex join of health facilities and tier data
    The health facilities data and the tier data come from different sources, but we would like to join them because they refer to the same facilities. This may not be a simple join, however, because the facility names in the two datasets are not exact string matches (see the fuzzy-join sketch below).

    4. Understanding the level of access to facilities
    Combined with population data, we want to understand whether some regions or areas are deprived.

    Any other creative stuff you can do with this data
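
    For use case 3, here is a minimal sketch of a fuzzy join using Python's standard-library difflib; the file and column names are assumptions about the two sources.

    import difflib

    import pandas as pd

    facilities = pd.read_csv("health_facilities.csv")  # assumed file name
    tiers = pd.read_csv("facility_tiers.csv")          # assumed file name
    tier_names = tiers["FacilityName"].tolist()        # assumed column name

    def best_match(name):
        # Closest tier-data facility name above a similarity cutoff, else None
        hits = difflib.get_close_matches(str(name), tier_names, n=1, cutoff=0.85)
        return hits[0] if hits else None

    facilities["tier_name"] = facilities["FacilityName"].map(best_match)
    merged = facilities.merge(tiers, left_on="tier_name",
                              right_on="FacilityName", suffixes=("", "_tier"))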

    Inspiration

    Acknowledgements

    accessed: http://data.gov.gh/dataset/health-facilities
    source: http://www.moh-ghana.org/

    by: easimadi

  13. Top 100 Bollywood IMDb movies by genres

    • kaggle.com
    zip
    Updated Oct 12, 2022
    Cite
    abhimech_008 (2022). Top 100 Bollywood IMDb movies by genres [Dataset]. https://www.kaggle.com/datasets/abhimech008/top-100-bollywood-imdb-movies-by-genres/versions/1
    Explore at:
    zip (6093 bytes)
    Dataset updated
    Oct 12, 2022
    Authors
    abhimech_008
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    This dataset contains information about the top 100 highest grossing Bollywood films. It is up to date as of 10th January 2022.

    Acknowledgements

    This data has been scraped from IMDb; the link has been added. EDA was performed on the dataset to fill in the missing values and clean the data. The CSV file provided is ready to use for visualizations. I have already visualized this dataset using Tableau, which you can check on my Tableau profile (https://public.tableau.com/app/profile/abhishek.verma6495).

    If you wish to contribute to this dataset, do contact me :)

  14. Smallmouth Bass State Records

    • kaggle.com
    zip
    Updated Oct 24, 2023
    Cite
    Taylor (2023). Smallmouth Bass State Records [Dataset]. https://www.kaggle.com/datasets/treddson/smalllmouth-bass-state-records
    Explore at:
    zip (1570 bytes)
    Dataset updated
    Oct 24, 2023
    Authors
    Taylor
    Description

    I have gathered data on state records for Smallmouth Bass. I am an avid angler and especially love catching Smallmouth Bass in the clean, deep lakes here in the Pacific Northwest.

    Gathering data as accurate as possible on this topic was not an easy task. I became very familiar with the data cleaning process despite this being an incredibly small dataset. Much of the cleaning involved correcting dates and ensuring the information was up to date.

    As for the visualization of this data, it's rather straightforward. The labels include the name of the individual who caught the fish, the state and fishery where the fish was caught, the weight of the fish, and the year the fish was caught.

    The size of the data points for each state is determined by the weight of the fish and the darker the color, the larger the fish.

  15. EPL Player Transfer Data from 2014-15_2018-19

    • kaggle.com
    zip
    Updated Nov 24, 2019
    Cite
    Sandman (2019). EPL Player Transfer Data from 2014-15_2018-19 [Dataset]. https://www.kaggle.com/sandipanchakraborty/epl-player-transfer-data-from-201415-201819
    Explore at:
    zip (247272 bytes)
    Dataset updated
    Nov 24, 2019
    Authors
    Sandman
    Description

    This dataset contains exactly what the heading says and some more! I am a fan of the English Premier League. A few months back I started wondering whether I could find any transfer data so that I could find some patterns in it. To my dismay, there's no free source available. That's when I started digging into Wikipedia. All the data I have attained are from Wikipedia.

    As stated already, all the data has been obtained from Wikipedia, hence there are a few inconsistencies, which can be sorted out easily. The data spans the 2014-15 season to the 2018-19 season (through the summer window). I initially created separate files, then used Alteryx to clean the data, made some adjustments, and finally appended everything into one file.

    My inspiration was to see the patterns in the transfer system of a particular club. Since, this was a big dataset, I tried to make an initial attempt, which can be seen by clicking on this link.

    However, can this data be used to predict the spending of a club?

  16. Divvy Bikeshare Data | April 2020 - May 2021

    • kaggle.com
    Updated Aug 21, 2021
    Cite
    Antoni K Pestka (2021). Divvy Bikeshare Data | April 2020 - May 2021 [Dataset]. https://www.kaggle.com/antonikpestka/divvy-bikeshare-data-april-2020-may-2021/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 21, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Antoni K Pestka
    Description

    Original Divvy Bikeshare Data obtained from here

    City of Chicago Zip Code Boundary Data obtained from here

    Tableau Dashboard Viz can be seen here

    R code can be found here

    Context

    This is my first-ever project after recently completing the Google Data Analytics Certificate on Coursera.

    The goals of the project are to answer the following questions:
    1. How do annual riders and casual riders use Divvy bikeshare differently?
    2. Why would casual riders buy annual memberships?
    3. How can Divvy use digital media to influence casual riders to become members?

    Casual riders are defined as those who do not have an annual membership, and instead use the service on a pay-per-ride basis.

    Content

    Original Divvy Bikeshare Data obtained from here

    The original datasets included the following columns:
    Ride ID #
    Rideable Type (electric, docked bike, classic)
    Started At Date/Time
    Ended At Date/Time
    Start Station Address
    Start Station ID
    End Station Address
    End Station ID
    Start Longitude
    Start Latitude
    End Longitude
    End Latitude
    Member Type (member, casual)

    City of Chicago Zip Code Boundary Data obtained from here

    The zip code boundary geospatial files were used to calculate the zip code of trip origin for each trip based on start longitude and start latitude.

    Caveats and Assumptions

    1. Divvy utilizes two types of bicycles: electric bicycles and classic bicycles. For the column labeled "rideable_type", three values existed: docked_bike, electric_bike, and classic. Docked_bike and classic were aggregated into the same category. Therefore, they are labeled as "other" on the visualization.

    2. Negative ride lengths and ride lengths under 90 seconds were not included in the calculation of average ride length.
    - Negative ride lengths exist because the end time and date were recorded as occurring BEFORE the start time and date on certain entries.
    - Ride lengths of 90 seconds or less were ruled out due to the possibility of bikes failing to dock properly or being checked out briefly for maintenance checks.
    - This removed 90,842 records from the calculations for average ride length.

    The process

    R programming language was used for the following:

    1. Create a new column for the zip code of each trip origin based on the start longitude and start latitude
    2. Calculate the ride length in seconds for each trip
    3. Remove unnecessary columns
    4. Rename "electric_bike" to EL and "docked_bike" to DB

    The R code I utilized is found here

    Excel was used for the following:

    1. Deletion of header rows for all dataset files except for the first file (April 2020)
    2. Deletion of the geometry information to save file space

    A .bat file utilizing the DOS command line was used to merge all the cleaned CSV files into a single file, as sketched below.
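
    For illustration, a Python equivalent of that merge step (the author used a .bat script; the folder and output names here are assumptions):

    import glob

    import pandas as pd

    # Stack the cleaned monthly CSVs into a single file
    files = sorted(glob.glob("cleaned_csvs/*.csv"))
    merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    merged.to_csv("divvy_merged.csv", index=False)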

    Finally, the cleaned and merged dataset was connected to Tableau for analysis and visualization. A link to the dashboard can be found here

    Data Analysis Overview

    Zip Code with Highest Quantity of Trips: 60614 (615,010)
    Total Quantity of Zip Codes: 56
    Trip Quantity of Top 9 Zip Codes: 60.35% (2,630,330)
    Trip Quantity of the Remaining 47 Zip Codes: 39.65% (1,728,281)
    
    Total Quantity of Trips: 4,358,611
    Quantity of Trips by Annual Members: 58.15% (2,534,718)
    Quantity of Trips by Casual Members: 41.85% (1,823,893)

    Average Ride Length with Electric Bicycle: Annual Members: 13.8 minutes; Casual Members: 22.3 minutes
    
    Average Ride Length with Classic Bicycle: Annual Members: 16.8 minutes; Casual Members: 49.7 minutes
    
    Average Ride Length Overall: Annual Members: 16.2 minutes; Casual Members: 44.2 minutes
    
    Peak Day of the Week for Overall Trip Quantity: Annual Members: Saturday; Casual Members: Saturday
    
    Slowest Day of the Week for Overall Trip Quantity: Tuesday (Annual Members: Sunday; Casual Members: Tuesday)
    
    Peak Day of the Week for Electric Bikes: Saturday (Annual Members: Saturday; Casual Members: Saturday)
    
    Slowest Day of the Week for Electric Bikes: Tuesday (Annual Members: Sunday; Casual Members: Tuesday)
    
    Peak Day of the Week for Classic Bikes: Saturday Ann...

