License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Netflix is a popular streaming service that offers a vast catalog of movies, TV shows, and original content. This dataset is a cleaned version of the original, which can be found here. The data consists of titles added to Netflix from 2008 to 2021, with release years ranging from 1925 to 2021. The dataset is cleaned with PostgreSQL and visualized with Tableau. The purpose of this dataset is to test my data cleaning and visualization skills. The cleaned data can be found below, and the Tableau dashboard can be found here.
We are going to: 1. treat the nulls, 2. treat the duplicates, 3. populate missing rows, 4. drop unneeded columns, and 5. split columns. Extra steps and further explanation of the process are given in the code comments.
--View dataset
SELECT *
FROM netflix;
--The show_id column is the unique id for the dataset, therefore we are going to check for duplicates
SELECT show_id, COUNT(*)
FROM netflix
GROUP BY show_id
ORDER BY show_id DESC;
--No duplicates
--Check null values across columns
SELECT COUNT(*) FILTER (WHERE show_id IS NULL) AS showid_nulls,
COUNT(*) FILTER (WHERE type IS NULL) AS type_nulls,
COUNT(*) FILTER (WHERE title IS NULL) AS title_nulls,
COUNT(*) FILTER (WHERE director IS NULL) AS director_nulls,
COUNT(*) FILTER (WHERE movie_cast IS NULL) AS movie_cast_nulls,
COUNT(*) FILTER (WHERE country IS NULL) AS country_nulls,
COUNT(*) FILTER (WHERE date_added IS NULL) AS date_added_nulls,
COUNT(*) FILTER (WHERE release_year IS NULL) AS release_year_nulls,
COUNT(*) FILTER (WHERE rating IS NULL) AS rating_nulls,
COUNT(*) FILTER (WHERE duration IS NULL) AS duration_nulls,
COUNT(*) FILTER (WHERE listed_in IS NULL) AS listed_in_nulls,
COUNT(*) FILTER (WHERE description IS NULL) AS description_nulls
FROM netflix;
We can see that there are NULLs:
director_nulls = 2634
movie_cast_nulls = 825
country_nulls = 831
date_added_nulls = 10
rating_nulls = 4
duration_nulls = 3
The director column's nulls are about 30% of the whole column, so I will not delete them; instead, I will use another column to populate them. To populate the director column, we first check whether there is a relationship between the movie_cast and director columns.
-- Below, we find out if some directors are likely to work with particular cast
WITH cte AS
(
SELECT title, CONCAT(director, '---', movie_cast) AS director_cast
FROM netflix
)
SELECT director_cast, COUNT(*) AS count
FROM cte
GROUP BY director_cast
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;
With this, we can now populate the NULL director rows using their associated movie_cast records.
UPDATE netflix
SET director = 'Alastair Fothergill'
WHERE movie_cast = 'David Attenborough'
AND director IS NULL;
--Repeat this step to populate the rest of the director nulls
--Populate the rest of the NULL in director as "Not Given"
UPDATE netflix
SET director = 'Not Given'
WHERE director IS NULL;
--When I was doing this, I found a less complex and faster way to populate a column which I will use next
Just like the director column, I will not delete the nulls in country. Since country is related to director, we are going to populate the country column using the director column.
--Populate the country using the director column
SELECT COALESCE(nt.country,nt2.country)
FROM netflix AS nt
JOIN netflix AS nt2
ON nt.director = nt2.director
AND nt.show_id <> nt2.show_id
WHERE nt.country IS NULL;
UPDATE netflix
SET country = nt2.country
FROM netflix AS nt2
WHERE netflix.director = nt2.director
AND netflix.show_id <> nt2.show_id
AND nt2.country IS NOT NULL   --avoid copying a NULL country from the matched row
AND netflix.country IS NULL;
--Confirm whether any rows still have a NULL country that could not be updated
SELECT director, country, date_added
FROM netflix
WHERE country IS NULL;
--Populate the rest of the NULL in country as "Not Given"
UPDATE netflix
SET country = 'Not Given'
WHERE country IS NULL;
There are only 10 date_added nulls out of over 8,000 rows, so deleting them will not affect our analysis or visualization.
--Show date_added nulls
SELECT show_id, date_added
FROM netflix
WHERE date_added IS NULL;
--Delete the date_added nulls
DELETE FROM netflix
WHERE date_added IS NULL;
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Original Credit goes to: Oleh Onyshchak
Original Owner: https://www.kaggle.com/datasets/jacksoncrow/stock-market-dataset?resource=download
rawData (.CSVs) Information:
"This dataset contains historical data of daily prices for each ticker (minus a few incompatible tickers, such as CARR# and UTX#) - currently trading on NASDAQ. The up to date list is available from nasdaqtrader.com.
The historic data was retrieved from Yahoo finance via yfinance python package."
Each file contains data from 01/04/2016 to 04/01/2020.
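Since the raw files were pulled with the yfinance package, here is a minimal sketch of how a single ticker's daily history could be re-downloaded; the ticker symbol and the interpretation of the date window as 2016-01-04 to 2020-04-01 are my assumptions, not part of the original dataset description.
```
# Hypothetical re-download of one ticker's daily prices with yfinance;
# "AAPL" and the exact date window are placeholder assumptions.
import yfinance as yf

df = yf.download("AAPL", start="2016-01-04", end="2020-04-01")  # daily OHLCV history
df.to_csv("AAPL.csv")  # the raw dataset stores one CSV per ticker
print(df.head())
```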
cleanData (.CSVs) & .ipynb (Python code) Information:
This edition contains my .ipynb notebook for replication within JupyterLab and for code transparency on Kaggle. The dataset is cleaned with Python and pandas and used to create the final Tableau dashboard linked below:
My Tableau Dashboard: https://public.tableau.com/app/profile/jack3951/viz/TopStocksAnalysisPythonpandas/Dashboard1
Enjoy!
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Global Biotic Interactions (GloBI, www.globalbioticinteractions.org) provides an infrastructure and data service that aggregates and archives known biotic interaction databases to provide easy access to species interaction data. This project explores the coverage of GloBI data against known taxonomic catalogues in order to identify 'gaps' in knowledge of species interactions. We examine the richness of GloBI's datasets using itself as a frame of reference for comparison and explore interaction networks according to geographic regions over time. The resulting analysis and visualizations intend to provide insights that may help to enhance GloBI as a resource for research and education.
Spatial and temporal biotic interactions data were used in the construction of an interactive Tableau map. The raw data (IVMOOC 2017 GloBI Kingdom Data Extracted 2017 04 17.csv) was extracted from the project-specific SQL database server. The raw data was cleaned and preprocessed (IVMOOC 2017 GloBI Cleaned Tableau Data.csv) for use in the Tableau map. Data cleaning and preprocessing steps are detailed in the companion paper.
The interactive Tableau map can be found here: https://public.tableau.com/profile/publish/IVMOOC2017-GloBISpatialDistributionofInteractions/InteractionsMapTimeSeries#!/publish-confirm
The companion paper can be found here: doi.org/10.5281/zenodo.814979
Complementary high resolution visualizations can be found here: doi.org/10.5281/zenodo.814922
Project-specific data can be found here: doi.org/10.5281/zenodo.804103 (SQL server database)
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset is a cleaned version of the Chicago Crime Dataset, which can be found here. All rights for the dataset go to the original owners. The purpose of this dataset is to display my skills in visualization and creating dashboards. Specifically, I will attempt to create a dashboard that allows users to see metrics for a specific crime within a given year using filters. Because of this, there will not be much focus on analysis of the data, but there will be portions discussing the validity of the dataset, the steps I took to clean the data, and how I organized it. The cleaned datasets can be found below, the query (which utilized BigQuery) can be found here, and the Tableau dashboard can be found here.
The dataset comes directly from the City of Chicago's website under the page "City Data Catalog." The data is gathered directly from the Chicago Police's CLEAR (Citizen Law Enforcement Analysis and Reporting) and is updated daily to present the information accurately. This means that a crime on a specific date may be changed to better display the case. The dataset represents crimes starting all the way from 2001 to seven days prior to today's date.
Using the ROCCC method, we can see that:
* The data has high reliability: The data covers the entirety of Chicago over a little more than two decades. It covers all the wards within Chicago and even gives the street names. While we may not know exactly how big the sample size is, I believe the dataset has high reliability since it geographically covers the entirety of Chicago.
* The data has high originality: The dataset was obtained directly from the Chicago Police Department's database, so we can say this dataset is original.
* The data is somewhat comprehensive: While we do have important information such as the types of crimes committed and their geographic locations, I do not think this gives us proper insight into why these crimes take place. We can pinpoint the location of a crime, but we are limited by the information we have. How hot was the day of the crime? Did the crime take place in a low-income neighborhood? These missing factors prevent us from getting proper insight into why these crimes take place, so I would say the dataset is subpar in how comprehensive it is.
* The data is current: The dataset is updated frequently to display crimes that took place up to seven days prior to today's date, and past crimes may even be updated as more information comes to light. Due to the frequent updates, I believe the data is current.
* The data is cited: As mentioned prior, the data is collected directly from the police's CLEAR system, so we can say that the data is cited.
The purpose of this step is to clean the dataset so that there are no outliers in the dashboard. To do this, we are going to do the following:
* Check for any null values and determine whether we should remove them.
* Update any values where there may be typos.
* Check for outliers and determine if we should remove them.
The following steps are explained in the code segments below. (I used BigQuery, so the code follows BigQuery's syntax.)
```
-- Preview the dataset
SELECT
*
FROM
`portfolioproject-350601.ChicagoCrime.Crime`
LIMIT 1000;

-- Find rows with NULLs in any key column
SELECT
*
FROM
`portfolioproject-350601.ChicagoCrime.Crime`
WHERE
unique_key IS NULL OR
case_number IS NULL OR
date IS NULL OR
primary_type IS NULL OR
location_description IS NULL OR
arrest IS NULL OR
longitude IS NULL OR
latitude IS NULL;

-- Remove those rows
DELETE FROM
`portfolioproject-350601.ChicagoCrime.Crime`
WHERE
unique_key IS NULL OR
case_number IS NULL OR
date IS NULL OR
primary_type IS NULL OR
location_description IS NULL OR
arrest IS NULL OR
longitude IS NULL OR
latitude IS NULL;
-- Check for duplicate unique_key values
SELECT unique_key, COUNT(unique_key)
FROM `portfolioproject-350601.ChicagoCrime.Crime`
GROUP BY unique_key
HAVING COUNT(unique_key) > 1;
```
Terms: https://www.datainsightsmarket.com/privacy-policy
The global Data Preparation Platform market is poised for substantial growth, estimated to reach $15,600 million by the study's end in 2033, up from $6,000 million in the base year of 2025. This trajectory is fueled by a Compound Annual Growth Rate (CAGR) of approximately 12.5% over the forecast period. The proliferation of big data and the increasing need for clean, usable data across all business functions are primary drivers. Organizations are recognizing that effective data preparation is foundational to accurate analytics, informed decision-making, and successful AI/ML initiatives. This has led to a surge in demand for platforms that can automate and streamline the complex, time-consuming process of data cleansing, transformation, and enrichment. The market's expansion is further propelled by the growing adoption of cloud-based solutions, offering scalability, flexibility, and cost-efficiency, particularly for Small & Medium Enterprises (SMEs).
Key trends shaping the Data Preparation Platform market include the integration of AI and machine learning for automated data profiling and anomaly detection, enhanced collaboration features to facilitate teamwork among data professionals, and a growing focus on data governance and compliance. While the market exhibits robust growth, certain restraints may temper its pace, including the complexity of integrating data preparation tools with existing IT infrastructures, the shortage of skilled data professionals capable of leveraging advanced platform features, and concerns around data security and privacy. Despite these challenges, the market is expected to witness continuous innovation and strategic partnerships among leading companies like Microsoft, Tableau, and Alteryx, aiming to provide more comprehensive and user-friendly solutions to meet the evolving demands of a data-driven world.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The population of Metro Vancouver (20110729 Regional Growth Strategy Projections Population, Housing and Employment 2006 – 2041 file) will have increased greatly by 2040, and finding a new source of reservoirs for drinking water (2015_ Water Consumption_ Statistics file) will be essential. Drinking water supply needs to be optimized and estimated (Data Mining file) with the aim of developing the region. The three current water reservoirs for Metro Vancouver are Capilano, Seymour, and Coquitlam, from which treated water is supplied to customers. The linear optimization (LP) model (Optimization, Sensitivity Report file) determines the amount of drinking water for each reservoir and region. The B.C. government has a specific strategy for the growing population up to 2040 that guides it toward this goal. In addition, a new source of drinking water (wells) needs to be estimated and monitored to anticipate a feasible water supply until 2040, so the government will have to decide how much groundwater is used. The project has two goals: (1) an optimization model for the three water reservoirs, and (2) an estimate of the new source of water to 2040. The data analysis process uses six tools: Trifacta Wrangler, AMPL, Excel Solver, ArcGIS, SQL, and Tableau.
1. Trifacta Wrangler: clean the data (Data Mining file).
2. AMPL and Excel Solver: optimize drinking water consumption for Metro Vancouver (data in the Optimization and Sensitivity Report file).
3. ArcMap (ArcGIS): combine the raw data with the results of the reservoir optimization and the population estimate to 2040 (GIS Map for Tableau file).
4. Tableau with SQL: visualize, estimate, and optimize the source of drinking water for Metro Vancouver until 2040 (export tableau data file).
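The description above refers to an AMPL/Excel Solver LP model that is not reproduced here; purely as an illustration of the kind of allocation problem involved, the sketch below sets up a tiny linear program in Python with scipy. All costs, capacities, and the demand figure are made-up placeholders, not values from the project files.
```
# Illustrative LP: allocate daily supply from the three reservoirs at minimum cost.
# Every number below is a placeholder, not project data.
from scipy.optimize import linprog

cost = [1.0, 1.1, 0.9]                     # hypothetical cost per ML for Capilano, Seymour, Coquitlam
capacity = [(0, 500), (0, 450), (0, 600)]  # hypothetical per-reservoir capacity bounds (ML/day)
demand = 1200                              # hypothetical total regional demand (ML/day)

# Meet demand: x1 + x2 + x3 >= demand, written as -(x1 + x2 + x3) <= -demand
result = linprog(c=cost, A_ub=[[-1, -1, -1]], b_ub=[-demand], bounds=capacity, method="highs")
print(result.x, result.fun)  # optimal draw per reservoir and total cost
```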
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Raw data, clean data, and SQL query output tables as spreadsheets to support the Tableau story; GitHub repository available at https://github.com/brittabeta/Bellabeat-Case-Study-SQL-Excel-Tableau
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
An interactive Tableau dashboard analyzing key HR metrics—attrition, recruitment, performance, and diversity—to empower data-driven workforce decisions. Includes clean datasets, Tableau workbook (.twb/.twbx), and step-by-step insights.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This dataset encompasses the metadata drawn from preserving and visualizing the Rural Route Nomad Photo and Video Collection. The collection consists of 14,058 born-digital objects shot on over a dozen digital cameras in over 30 countries, on seven continents from the end of 2008 through 2009. Metadata was generated using ExifTool, along with manual means, utilizing OpenRefine and Excel to parse and clean.
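As a rough sketch of how such metadata can be pulled with ExifTool before hand-cleaning in OpenRefine or Excel, the snippet below shells out to exiftool's JSON output; the folder name and the choice to load everything into pandas are assumptions, not the project's actual workflow.
```
# Sketch: dump EXIF metadata for a folder of images to a CSV for later cleaning.
# The folder name is a placeholder.
import json
import subprocess

import pandas as pd

# -json emits one JSON object per file; -r recurses into subfolders
raw = subprocess.run(
    ["exiftool", "-json", "-r", "rural_route_nomad/"],
    capture_output=True, text=True, check=True,
).stdout

df = pd.DataFrame(json.loads(raw))
df.to_csv("photo_metadata.csv", index=False)  # hand off to OpenRefine/Excel
```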
The dataset was the result of an overarching project to preserve the digital content of the Rural Route Nomad Collection, and then visualize photographic specs and geographic details with charts, graphs, and maps in Tableau. A description of the project as a whole is publicly forthcoming. Visualizations can be found at https://public.tableau.com/app/profile/alan.webber5364.
Terms: https://www.datainsightsmarket.com/privacy-policy
The Embedded Analytics Solutions market is experiencing robust growth, projected to reach $68.88 million in 2025 and exhibiting a Compound Annual Growth Rate (CAGR) of 13.90%. This expansion is fueled by several key drivers. The increasing need for data-driven decision-making across various industries, coupled with the rising adoption of cloud-based solutions and the proliferation of big data, are significantly contributing to market growth. Furthermore, the growing demand for real-time business intelligence and the ease of integrating analytics directly into applications are fostering wider adoption. The market is segmented by solution (software and services), organization size (SMEs and large enterprises), deployment (cloud and on-premise), and end-user vertical (BFSI, IT & Telecommunications, Healthcare, Retail, Energy & Utilities, Manufacturing, and others). The competitive landscape is populated by established players like SAS, IBM, and Microsoft, alongside emerging innovative companies. Growth is expected to be particularly strong in North America and Europe initially, followed by increasing penetration in the Asia-Pacific region driven by technological advancements and rising digital adoption rates. The on-premise deployment model, while still significant, is gradually yielding to the cloud, driven by scalability, cost-effectiveness, and accessibility benefits.
The continued growth trajectory is expected to be influenced by advancements in artificial intelligence (AI) and machine learning (ML), which will further enhance the capabilities of embedded analytics solutions. However, challenges such as data security concerns, the complexity of implementation, and the need for skilled professionals to manage and interpret data could act as potential restraints. Nevertheless, the overall market outlook remains positive, with significant opportunities for growth across all segments. The increasing emphasis on data visualization and user-friendly dashboards is also expected to further fuel market adoption, particularly amongst smaller organizations that traditionally lacked access to sophisticated analytical tools. The competitive landscape will likely witness mergers, acquisitions, and strategic partnerships as players strive to enhance their product offerings and expand their market share.
Recent developments include: August 2022 - SAS and SingleStore announced a collaboration to help organizations remove barriers to data access, maximize performance and scalability, and uncover key data-driven insights. SAS Viya with SingleStore enables the use of SAS analytics and AI technology on data stored in SingleStore's cloud-native real-time database. The integration provides flexible, open access to curated data to help accelerate value for cloud, hybrid, and on-premises deployments. July 2022 - TIBCO announced the launch of TIBCO ModelOps, which helps customers simplify and scale cloud-based analytic model management, deployment, monitoring, and governance; TIBCO ModelOps addresses the requirement for speed in deploying AI and draws from TIBCO's leadership in data science, data visualization, and business intelligence. This aids AI teams in confronting critical deployment hurdles such as ease of applying analytics to applications, identification and mitigation of bias, and transparency and manageability of an algorithm's behavior within business-critical applications.
Key drivers for this market are: Increasing Demand for Advanced Analytical Techniques for Business Data, Increasing number of Data Driven Organizations; Increasing Adoption of Mobile BI and Big Data Analytics; Increasing Use of Mobile Devices and Cloud Computing Technologies. Potential restraints include: Licensing Challenges and Higher Associated Costs. Notable trends are: Increasing Use of Mobile Devices and Cloud Computing Technologies to Witness Significant Growth.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Cleaning this data took some time due to many NULL values, typos, and unorganized collection. My first step was to put the dataset into R and work my magic there. After analyzing and cleaning the data, I moved the data to Tableau to create easily understandable and helpful graphs. This step was a learning curve because there are so many potential options inside Tableau. Finding the correct graph to share my findings while keeping the stakeholders' tasks in mind was my biggest obstacle.
First, I needed to combine the 4 datasets into 1; I did this using the rbind() function.
Step two was to fix typos and rename poorly named columns.
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "tripduration"] <- "trip_duration"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "bikeid"] <- "bike_id"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "usertype"] <- "user_type"
colnames(Cyclistic_Data_2019)[colnames(Cyclistic_Data_2019) == "birthyear"] <- "birth_year"
The next step was to remove all NULL values and exaggerated numbers, such as trip durations of more than 10 hours.
library(dplyr)
# Keep rows where no character column holds the literal string "NULL", then convert column types
# (if_all() replaces the deprecated across()-inside-filter pattern)
Cyclistic_Clean_v2 <- Cyclistic_Data_2019 %>%
filter(if_all(where(is.character), ~ . != "NULL")) %>%
type.convert(as.is = TRUE)
After removing the NULL data, it was time to remove potential typos and poorly collected data. I could only identify exaggerated values in the "trip_duration" column, finding multiple cases of trips of 2,000,000+ seconds. To find these large values, I used the count() function.
# Note: trip_duration is still a character column at this point, so the comparison is on strings
Cyclistic_Clean_v2 %>% count(trip_duration > "30000")
After finding multiple instances of this, I ran into a hard spot: the trip_duration column was classed as character when it needed to be numeric to be cleaned further. It took me quite a while to find out that this was the issue, and then I remembered the class() function. With it, I was easily able to identify that the classification was wrong.
class(Cyclistic_Clean_v2$trip_duration)
After identifying the classification, I still had some work to do before converting it to a number, as the values contained quotation marks and a trailing ".0". To remove these I used the gsub() function.
# Remove the trailing ".0" (escape the dot and anchor to the end so only that suffix is stripped)
Cyclistic_Clean_v2$trip_duration <- gsub("\\.0$", "", Cyclistic_Clean_v2$trip_duration)
# Remove stray double quotes
Cyclistic_Clean_v2$trip_duration <- gsub('"', '', Cyclistic_Clean_v2$trip_duration)
Now that unwanted characters are gone, we can convert the column into numeric.
Cyclistic_Clean_v2$trip_duration <- as.numeric(Cyclistic_Clean_v2$trip_duration)
Doing this allows Tableau and R to read the data properly to create graphs without error.
Next I created a backup dataset in case there was any issue while exporting.
Cyclistic_Clean_v3 <- Cyclistic_Clean_v2
# Use forward slashes (or doubled backslashes) in R file paths
write.csv(Cyclistic_Clean_v2, "Folder.Path/Cyclistic_Data_Cleaned_2019.csv", row.names = FALSE)
After exporting I came to the conclusion that I should have put together a more accurate change log rather than brief notes. That is one major learning lesson I will take away from this project.
All around, I had a lot of fun using R to transform and analyze the data. I learned many different ways to efficiently clean data.
Now onto the fun part! Tableau is a very good tool to learn. There are so many different ways to bring your data to life and show your creativity inside your work. After a few guides and errors, I could finally start building graphs to bring the stakeholders' tasks to fruition.
Please note these are all made in Tableau and meant to be interactive.
Here you can find the relation between male and female riders.
Male vs. female trip duration with user type.
Busiest stations filtered by months. (This is meant to be interactive.)
Most popular starting stations.
Most popular ending stations.
My main goal was to help find out how Cyclistic can convert casual riders into subscribers. Here are my findings.
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
This is a derivative dataset created for my Tableau visualisation project. It's derived from two other datasets on Kaggle:
Steam Games Dataset by Martin Bustos
Video Games on Steam [in JSON] by Sujay Kapadnis
From the Martin Bustos dataset, I removed the games without reviews and selected the most relevant features to create the following dashboard:
Dashboard preview image: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F2473556%2Fce81900b3761554ce9acfc7ef25189b6%2Fsteam-dashboard.png?generation=1704630691045231&alt=media
From the Sujay Kapadnis dataset I added the data on game duration from HowLongToBeat.com
The following notebooks contain exploratory data analysis and the transformations I used to generate this dataset from the two original datasets:
Steam Games - Exploratory Data Analysis
Steam Games - Data Transformation
View the live dashboard on Tableau Public:
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset showcases my Google Data Analytics Capstone project, using Excel to clean the data, R to analyze the data for insights, and Tableau to create visualizations of the data.
License: CDLA Sharing 1.0 (https://cdla.io/sharing-1-0/)
The Superstore Sales Data dataset, available in an Excel format as "Superstore.xlsx," is a comprehensive collection of sales and customer-related information from a retail superstore. This dataset comprises three distinct tables, each providing specific insights into the store's operations and customer interactions.
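Assuming the three tables are stored as separate worksheets inside Superstore.xlsx (the description does not say so explicitly), a minimal way to load them with pandas might look like this:
```
# Sketch: load every worksheet of the workbook into its own DataFrame.
import pandas as pd

sheets = pd.read_excel("Superstore.xlsx", sheet_name=None)  # dict of {sheet name: DataFrame}
for name, table in sheets.items():
    print(name, table.shape)
```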
License: Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
License information was derived automatically
https://github.com/Tara10523/couresera.github.io/assets/54953888/7dd9c8ea-ee24-49cf-8bf4-dc921d19bcd8
https://github.com/Tara10523/couresera.github.io/assets/54953888/5fc3a63b-2142-49a9-a020-f4eded582618
https://github.com/Tara10523/couresera.github.io/assets/54953888/86f2ee28-8b9e-49fd-88c3-4064159c60da
https://github.com/Tara10523/couresera.github.io/assets/54953888/773a416f-5abe-4aa3-8ee0-fd5bd1366e37
https://github.com/Tara10523/couresera.github.io/assets/54953888/027e6041-0717-4d69-843f-76a93c6160ef
BigQuery | BigQuery data cleaning
Tableau | Creating visuals with Tableau
Sheets | Cleaning NULL values, creating data tables
RStudio | Organizing and cleaning data in code to create visuals
SQL (SSMS) | Transforming, cleaning and manipulating data
LinkedIn | Survey poll
https://github.com/Tara10523/couresera.github.io/assets/54953888/41ffca7f-5c3e-42b2-bbf0-9c857ac81c16
https://github.com/Tara10523/couresera.github.io/assets/54953888/6d476522-6300-4b34-9f76-31459a3d866e
https://github.com/Tara10523/couresera.github.io/assets/54953888/2cae2c1c-6e85-43f2-9cab-a77a75d4b641
https://github.com/Tara10523/couresera.github.io/assets/54953888/a6a0d731-e6e1-4793-8819-c7a2c867bc86
Source for mock dating site: pH7-Social-Dating-CMS. Source for mock social site: tailwhip99/social_media_site.
https://github.com/Tara10523/couresera.github.io/assets/54953888/3d963ad2-7897-4a05-9c90-0395a3efc54d
https://github.com/Tara10523/couresera.github.io/assets/54953888/62726f29-3cbc-4b1d-9136-cca4ddacb087
https://github.com/Tara10523/couresera.github.io/assets/54953888/8d68e5c5-b9ea-48dc-bef0-d003f18bf270
https://github.com/Tara10523/couresera.github.io/assets/54953888/80af72a5-7ed8-46f1-b56a-268cd623bd1e
https://github.com/Tara10523/couresera.github.io/assets/54953888/6b9dfb44-cf2b-49ca-9d07-4fbe756e2985
https://github.com/Tara10523/couresera.github.io/assets/54953888/10d3fcd9-84be-43b9-a907-807ada2e6497
https://github.com/Tara10523/couresera.github.io/assets/54953888/f86217cd-1aff-498c-8eb1-6f08afc1d4c2
https://github.com/Tara10523/couresera.github.io/assets/54953888/b9d607ad-930a-4829-b574-c427b82c7305
https://github.com/Tara10523/couresera.github.io/assets/54953888/b0e53006-b0fa-436b-8c2b-752cdc31c448
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
These Kaggle datasets offer a comprehensive analysis of the US real estate market, leveraging data sourced from Redfin via an unofficial API. They contain weekly snapshots stored in CSV files, reflecting the dynamic nature of property listings, prices, and market trends across various states and cities, except for Wyoming, Montana, and North Dakota, and with specific data generation for Texas cities. Notably, the dataset includes a prepared version, USA_clean_unique, which has undergone the initial cleaning steps outlined in the thesis. These datasets were part of my thesis; the other two countries were France and the UK.
These steps include:
- Removal of irrelevant features for statistical analysis.
- Renaming variables for consistency across international datasets.
- Adjustment of variable value ranges for a more refined analysis.
Unique aspects such as Redfin’s “hot” label algorithm, property search status, and detailed categorizations of property types (e.g., single-family residences, condominiums/co-ops, multi-family homes, townhouses) provide deep insights into the market. Additionally, external factors like interest rates, stock market volatility, unemployment rates, and crime rates have been integrated to enrich the dataset and offer a multifaceted view of the real estate market's drivers.
The USA_clean_unique dataset represents a key step before data normalization/trimming, containing variables both in their raw form and categorized based on predefined criteria, such as property size, year of construction, and number of bathrooms/bedrooms. This structured approach aims to capture the non-linear relationships between various features and property prices, enhancing the dataset's utility for predictive modeling and market analysis.
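As a hedged illustration of that kind of categorization (raw values kept alongside binned versions), the pandas sketch below uses pd.cut; the column names and bin edges are placeholders, not the thesis's actual criteria from Table 2.8.
```
# Sketch: bin raw variables into categories while keeping the raw columns.
# Column names and bin edges are hypothetical.
import pandas as pd

usa = pd.read_csv("USA_clean_unique.csv")

usa["size_category"] = pd.cut(usa["sqft"], bins=[0, 1000, 2000, 3000, float("inf")],
                              labels=["small", "medium", "large", "very large"])
usa["build_era"] = pd.cut(usa["year_built"], bins=[1900, 1950, 1980, 2000, 2025],
                          labels=["pre-1950", "1950-1980", "1980-2000", "post-2000"])
print(usa[["sqft", "size_category", "year_built", "build_era"]].head())
```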
See columns from USA_clean_unique.csv and my Thesis (Table 2.8) for exact column descriptions.
Table 2.4 and Section 2.2.3, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my Github Repository for code inspiration.
Let me know if you want to see how I got from raw data to USA_clean_unique.csv. Multiple steps include cleaning in Tableau Prep and R, downloading and merging external variables to the dataset, removing duplicates, and renaming columns for consistency.
This is my personal project, in which I analyzed the main factors that lead me to select a TV show. This time, I used Python for web scraping (also known as crawling) to get the data from IMDb.com, and used Spreadsheets to clean the dataset. Finally, I used Tableau to visualize the data.
This time, I've utilized web-crawling to build up the database. For this project, I gathered the data from the top 100 TV shows listed by the user named 'carlosotsubo' from IMDB.com.
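The actual scraping script is not shown here; as a simplified sketch of the approach, the snippet below fetches a list page and pulls out titles with requests and BeautifulSoup. The list URL and the CSS selector are placeholders that would need to match the real IMDb list markup.
```
# Sketch: fetch an IMDb list page and extract show titles.
# The list URL and the selector are placeholders.
import requests
from bs4 import BeautifulSoup

LIST_URL = "https://www.imdb.com/list/ls000000000/"  # placeholder list id
headers = {"User-Agent": "Mozilla/5.0"}               # IMDb expects a browser-like user agent

html = requests.get(LIST_URL, headers=headers, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

titles = [a.get_text(strip=True) for a in soup.select("h3.lister-item-header a")]  # hypothetical selector
print(titles[:10])
```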
I sincerely thank the IMDb user named, 'carlosotsubo,' for providing the list of top 100 TV shows.
The following questions need to be answered:
After my own analysis, I've created the data visualization:
https://public.tableau.com/app/profile/jae.hwan.kim/viz/HowdoIchoosewhichTVshowtowatch/Dashboard1
If you guys give me feedback, I will be glad to hear! Thanks!
License: ODC Public Domain Dedication and Licence (PDDL) v1.0 (http://www.opendatacommons.org/licenses/pddl/1.0/)
License information was derived automatically
This dataset contains a cleaned and transformed version of the public Divvy Bicycle Sharing Trip Data covering the period November 2024 to October 2025.
The original raw data is publicly released by the Chicago Open Data Portal,
and has been cleaned using Pandas (Python) and DuckDB SQL for faster analysis.
This dataset is now ready for direct use in:
- Exploratory Data Analysis (EDA)
- SQL analytics
- Machine learning
- Time-series/trend analysis
- Dashboard creation (Power BI / Tableau)
Original Data Provider:
Chicago Open Data Portal – Divvy Trips
License: Open Data Commons Public Domain Dedication (PDDL)
This cleaned dataset only contains transformations; no proprietary or restricted data is included.
Added columns: ride_length (minutes), day_of_week, hour_of_day; the data is stored as .csv for optimized performance (a sketch of this derivation follows the list below).
Columns: ride_id, rideable_type, started_at, ended_at, start_station_name, end_station_name, start_lat, start_lng, end_lat, end_lng, member_casual, ride_length (minutes), day_of_week, hour_of_day.
This dataset is suitable for:
- DuckDB + SQL analytics
- Pandas EDA
- Visualization in Power BI, Tableau, Looker
- Statistical analysis
- Member vs. Casual rider behavioral analysis
- Peak usage prediction
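A hedged sketch of how those derived columns could be computed with pandas and then queried with DuckDB; the input file name is a placeholder, and this is not the exact pipeline used to build the dataset.
```
# Sketch: derive ride_length, day_of_week and hour_of_day, then aggregate with DuckDB.
# The input file name is a placeholder.
import duckdb
import pandas as pd

trips = pd.read_csv("divvy_trips_raw.csv", parse_dates=["started_at", "ended_at"])

trips["ride_length"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60  # minutes
trips["day_of_week"] = trips["started_at"].dt.day_name()
trips["hour_of_day"] = trips["started_at"].dt.hour

# DuckDB can query the DataFrame directly for SQL-style aggregation
print(duckdb.sql("""
    SELECT member_casual, AVG(ride_length) AS avg_ride_minutes
    FROM trips
    GROUP BY member_casual
""").df())
```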
This dataset is not the official Divvy dataset, but a cleaned, transformed, and analysis-ready version created for educational and analytical use.
License: CC0 1.0 Public Domain (https://creativecommons.org/publicdomain/zero/1.0/)
Contains links only, as the script used to extract the data was part of a freelance project.
100,000 artwork links (just links). 50,000 artworks were scraped and contain data; ~40,000+ are unique (some artworks are from the same artist).
While transitioning from 3D modeling to Data Analytics and Python Programming I decided to create a personal project to analyze something I have a close connection with.
The dataset includes the following columns:
- Role
- Company worked at (if mentioned or extracted)
- Date artwork was posted
- Number of views
- Number of likes
- Number of comments
- Which software was used
- Which tags were used
- Artwork title
- Artwork URL
As you can see from the disclaimer, it's the first time I'm doing this. I want anyone who uses this dataset to respect artists' privacy by not using artists' email addresses in any way, even though this is publicly available data published by them. Correct me if I said something wrong here.
The code used to extract data from ArtStation can be found here, in the GitHub repository.
I really enjoyed seeing the progression in the 3D world (games, feature films, etc.).
The goal of this project was to better understand the process of gathering, processing, cleaning, analyzing, and visualizing data. Besides that, I wanted to understand which software, tags, and affiliations are most popular among artists.
To scrape the data, these Python libraries/packages were used (a minimal sketch follows the two lists below):
- requests
- json
- googlesheets api
- selenium
- regex
To clean, analyze and visualize data:
- googlesheets
- tableau
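As noted above, here is a bare-bones illustration of the requests + json part of such a pipeline; the endpoint URL and the response fields are placeholders and do not reflect ArtStation's real API.
```
# Sketch: pull a page of artwork records from a JSON endpoint and keep a few fields.
# The endpoint and field names are placeholders.
import json
import requests

ENDPOINT = "https://example.com/api/artworks?page=1"  # placeholder URL

resp = requests.get(ENDPOINT, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
artworks = resp.json()  # assume the response is a JSON list of artwork records

rows = [
    {
        "title": a.get("title"),
        "views": a.get("views_count"),
        "likes": a.get("likes_count"),
        "posted_at": a.get("published_at"),
    }
    for a in artworks
]
print(json.dumps(rows[:3], indent=2))
```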
Note: the following visualizations contain data bias. Not every tag and affiliation has been taken into account, due to the difficulties of data extraction and the mistakes I made.
https://user-images.githubusercontent.com/78694043/119978304-23cb0380-bfc2-11eb-8b70-e84100fa7630.png
https://user-images.githubusercontent.com/78694043/119978269-1ada3200-bfc2-11eb-981f-b8ad2c2c0ff1.png
https://user-images.githubusercontent.com/78694043/119978237-101f9d00-bfc2-11eb-9285-e0d9bcf688ee.png
License: MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
These Kaggle datasets provide downloaded real estate listings from the French real estate market, capturing data from a leading platform in France (Seloger), reminiscent of the approach taken for the US dataset from Redfin and the UK dataset from Zoopla. They encompass detailed property listings, pricing, and market trends across France, stored in weekly CSV snapshots. The cleaned and merged version of all the snapshots is named France_clean_unique.csv.
The cleaning process mirrored that of the US dataset, involving removing irrelevant features, normalizing variable names for dataset consistency with USA and UK, and adjusting variable value ranges to get rid of extreme outliers. To augment the dataset's depth, external factors like inflation rates, stock market volatility, and macroeconomic indicators have been integrated, offering a multifaceted perspective on France's real estate market drivers.
For exact column descriptions, see columns for France_clean_unique.csv and my thesis.
Table 2.5 and Section 2.2.1, which I refer to in the column descriptions, can be found in my thesis; see University Library. Click on Online Access->Hlavni prace.
If you want to continue generating datasets yourself, see my Github Repository for code inspiration.
Let me know if you want to see how I got from raw data to France_clean_unique.csv. There are multiple steps, including cleaning in Tableau Prep and R, downloading and merging external variables into the dataset, removing duplicates, and renaming some columns.