License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
I created these files and this analysis as part of working on a case study for the Google Data Analytics certificate.
Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know? Knowing bike usage and behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.
I used the script noted below to clean the files and then added some additional steps to create the visualizations needed to complete my analysis. The additional steps are noted in the corresponding R Markdown file for this data set.
Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html
Data cleaning script: I followed this script to clean and merge the files: https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy
Note: the combined data set has 3,876,042 rows, so you will likely need to run the R analysis on your own computer (e.g., in the R console) rather than in the cloud (e.g., RStudio Cloud).
This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.
One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.
I am also interested to see what others did with this analysis - what were the findings and insights you found?
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
BigQuery | Data cleaning in BigQuery
Tableau | Creating visuals with Tableau
Sheets | Cleaning NULL values, creating data tables
RStudio | Organizing and cleaning data to create visualization code
SQL (SSMS) | Transforming, cleaning, and manipulating data
LinkedIn | Survey poll
Source for the mock dating site: pH7-Social-Dating-CMS. Source for the mock social site: tailwhip99/social_media_site.
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogeneous, rich ecosystems.

Methods
See the eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.
Data Acquisition
We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If the same-named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.
Data Cleaning
All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported - some within the greater metropolitan area, others not).

First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including "excellent", "good", "fair", "poor", "dead", and "dead/dying". Some cities included only three points on this scale (e.g., "good", "poor", "dead/dying") while others included five (e.g., "excellent", "good", "fair", "poor", "dead").

Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date format, units used (we converted all units to metric), address issues, and common name format. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called "location_type" to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.

Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or a partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia -- which is not a species name of a known tree -- was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common to scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9).

Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or that we did not have enough information to determine nativity (for cases where only the genus was known).

Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia).

Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
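As an illustration of the kind of per-city harmonization described above, here is a minimal pandas sketch; the column names, the numeric-to-descriptive condition mapping, and the sample rows are hypothetical, and the actual code lives in Data S5:

import pandas as pd

# Hypothetical raw inventory excerpt; real column names vary by city.
raw = pd.DataFrame({
    "common_name": ["red maple", "Red Maple ", "honeylocust"],
    "condition_score": [5, 4, 2],          # numeric scale used by this city
    "dbh_in": [12.0, 8.5, 20.0],           # diameter at breast height, inches
})

# Map this city's numeric condition scores onto the shared descriptive scale.
condition_map = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "dead"}
clean = pd.DataFrame()
clean["common_name"] = raw["common_name"].str.strip().str.lower()
clean["condition"] = raw["condition_score"].map(condition_map)

# Convert imperial units to metric (inches to centimeters).
clean["dbh_cm"] = raw["dbh_in"] * 2.54

print(clean)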
This is bike-sharing data for a fictitious company, Cyclistic. The actual data is based on Divvy, Chicago's bike-share system. The original data was cleaned using Postgres and Google Sheets. The data has been cleaned to exclude records missing station IDs and trips with a duration over 24 hours. Several columns were created to calculate trip duration and day of the week.
--- Finding duplicates; assume a duplicated ride_id means duplicated data. Result: no duplicates found ---
SELECT ride_id, COUNT(ride_id) AS ride_id_count
FROM "Cyclistic"
GROUP BY ride_id
HAVING COUNT(ride_id) > 1;
--- Extract station table for data cleaning ----
SELECT DISTINCT start_station_name, start_station_id, end_station_id, end_station_name
FROM "Cyclistic"
ORDER BY start_station_name;
Using Google Sheets: clean the start_station_id codes, clean missing station names, clean station IDs with an extra .0, and assign IDs to NULL station data.
---- Update the main table with the cleaned station names, IDs, and coordinates (the end-longitude update is shown; analogous statements apply to the other name/ID/coordinate columns) ----
UPDATE "Cyclistic"
SET end_lng = lng
FROM "cleaned_station_info"
WHERE end_station_id = id;
---- The original latitude and longitude values vary by a small number of decimal places. To make the data more uniform, the latitude and longitude were averaged per station ID, using 8 decimal places for location accuracy. The data was then checked using Google Maps to make sure it is accurate to the nearest Divvy location in Chicago. ----
SELECT DISTINCT start_station_id, start_station_name,
       ROUND(AVG(start_lat)::DECIMAL, 8) AS lat,
       ROUND(AVG(start_lng)::DECIMAL, 8) AS lng
FROM "Cyclistic"
GROUP BY start_station_id, start_station_name
ORDER BY start_station_id;
--- Create a cleaned table for export, excluding rides shorter than 2 minutes and longer than 24 hours. For rides shorter than 2 minutes, the ride always ends up at the same station; it is assumed that the rider canceled the ride or had trouble using the bike, so these records are excluded. For rides longer than 24 hours, it is assumed that there was an error docking the bicycle or some other problem logging out of the ride. The new table also excludes rows where start_station_name or end_station_name is NULL. ---
SELECT *
FROM (
    SELECT
        ride_id, member_casual, rideable_type,
        start_station_id, start_station_name,
        end_station_id, end_station_name,
        started_at, ended_at,
        ended_at - started_at AS duration,
        start_lat, start_lng, end_lat, end_lng
    FROM "Cyclistic"
    WHERE start_station_name IS NOT NULL AND end_station_name IS NOT NULL
) AS duration_tbl
WHERE duration >= INTERVAL '2 minutes' AND duration <= INTERVAL '24 hours';
Follow these instructions to use the Google Spreadsheet in your own activity.

1. Begin by copying the Google Spreadsheet into your own Google Drive account.
2. Prefill the username column for your students/participants. This will help keep the students from overwriting their peers' work.
3. Change the editing permissions for the spreadsheet and share it with your students/participants.
4. Demonstrate what data goes into each column from the Wikipedia page. Be sure to demonstrate how to find the latitude and longitude from Wikipedia. For the images, make sure the students copy the URL that ends in the appropriate file type (jpg, png, etc.).
5. Be prepared for lots of mistakes. This is a great learning opportunity to talk about data quality. When the students are done completing the spreadsheet, check it for obvious errors. Pay special attention to the sign of the longitude: all of those values should be negative (a quick scripted check is sketched after these steps).
6. Download the spreadsheet as a CSV.
7. Log into your ArcGIS Online (AGO) organization account.
8. Click on the Content tab -> Add item -> From my computer.
9. Upload the CSV and save it as a feature layer. Be sure to include a few tags (Mesoamerica, pyramid, Aztec, Maya would be good ones).
10. Once the layer has been uploaded and converted into a feature layer, click the Settings button, check Delete Protection, and save.
11. From the feature layer Overview tab, change the share settings to share with your students. I usually set up a group (something like Mesoamerica), add the students to the group, then share the feature layer with that group.
12. From here, explore the data. Symbolize the data by culture to see if there are spatial patterns to their distribution. Symbolize the data by height to see if some cultures built taller pyramids or if taller pyramids were confined to certain regions. Students can also set up the pop-ups to use the image URL in the data.
13. From here, students can save their maps, add additional data from ArcGIS Online, create story maps, etc. If you are looking for more great data, from your ArcGIS Online map, choose Add -> Add Layer from Web and paste the following into the URL: https://services1.arcgis.com/TQSFiGYN0xveoERF/arcgis/rest/services/MesoAmerican_civs/FeatureServer

Image thumbnail is from Wikipedia.
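For instructors who prefer a scripted check, here is a minimal pandas sketch of the longitude sanity check mentioned in step 5; the CSV file name and column name are assumptions, not part of the original activity:

import pandas as pd

# Hypothetical file and column names; adjust to match the activity spreadsheet.
df = pd.read_csv("mesoamerican_pyramids.csv")

# Every site is in the Western Hemisphere, so all longitudes should be negative.
bad_longitude = df[df["longitude"] >= 0]
if bad_longitude.empty:
    print("All longitude values look correct (negative).")
else:
    print("Rows with a suspicious longitude sign:")
    print(bad_longitude)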
This compensation data originated from a large Google Sheets document that went viral after it was posted on LinkedIn by Christen Nino De Guzman, a Program Manager at Google. The LinkedIn post stated the following:
"Let's talk #SalaryTransparency! A few weeks ago, I encouraged others to share their salaries in an anonymous Google form and now more than 58,000 people have come together to share details of their offers in a google sheet. Everything from sign on bonus, annual salary, diverse identity and age. Many fields that traditional sites like Glassdoor don’t include.
All of the responses were anonymous and publicly viewable. Huge shoutout to Brennan Pankiw for creating the survey! You can view responses here: https://lnkd.in/gPkYFQsN "
The Google Sheet became extremely laggy and crashed often as a result of its size and the number of people accessing the document. Therefore, I downloaded the raw data and took it upon myself to clean the data and create a user-friendly visualization of the compensation data. Using the skills I recently acquired during my Harvard Business Analytics Program, I identified the gaps, typos, variations of company names, and false data using tools such as RStudio and Tableau.
DATA
To anyone interested in both the raw data and the cleaned data presented here, please reach out to Ricardo Ugas.
CONTACT
Linkedin: https://www.linkedin.com/in/ugas/ Resume/CV: https://bit.ly/3dIUmCo Email: ricardo.ugas.analytics@gmail.com ricardo.ugasgonzalez@postgrad.manchester.ac.uk ricardo.ugas@mail.analytics.hbs.edu ugasra@miamioh.edu Phone: +1 513 526 6598
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains non-confidential trip details for users of an imaginary bike company called Cyclistic. It contains approximately 5.9 million records in total. This is a large amount of data to handle in spreadsheet applications such as MS Excel or Google Sheets. Although it is possible to complete the entire analysis in MS Excel, I would recommend using BigQuery or R to clean and analyze the data effectively.
The dataset contains the following files:
202107-divvy-tripdata.csv 202108-divvy-tripdata.csv 202109-divvy-tripdata.csv 202110-divvy-tripdata.csv 202111-divvy-tripdata.csv 202112-divvy-tripdata.csv 202201-divvy-tripdata.csv 202202-divvy-tripdata.csv 202203-divvy-tripdata.csv 202204-divvy-tripdata.csv 202205-divvy-tripdata.csv 202206-divvy-tripdata.csv
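The description above recommends BigQuery or R; purely as an illustration, a minimal pandas sketch for stacking the monthly files listed above into one frame might look like this (it assumes the CSVs sit in the working directory):

from pathlib import Path

import pandas as pd

# The twelve monthly Divvy trip files listed above (July 2021 to June 2022).
files = sorted(Path(".").glob("202*-divvy-tripdata.csv"))

# Read and stack them; the monthly files share the same column layout.
trips = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(f"Combined {len(files)} files into {len(trips):,} rows")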
Cyclistic, a bike-sharing company, wants to analyze their user data to find the main differences in behavior between their two types of users: casual riders, who pay for each ride, and annual members, who pay a yearly subscription to the service.
Key objectives: 1. Identify The Business Task - Cyclistic wants to analyze the data to find the key differences between Casual Riders and Annual Members. The goal of this project is to reach out to casual riders and incentivize them to pay for the annual subscription.
Key objectives: 1. Download Data And Store It Appropriately - Downloaded the data as .csv files, which were saved in their own folder to keep everything organized. I then uploaded those files into BigQuery for cleaning and analysis. For this project I downloaded all of 2022 and up to May of 2023, as this is the most recent data that I have access to.
Identify How It's Organized
Sort and Filter The Data and Determine The Credibility of The Data
Key objectives: 1. Clean The Data and Prepare The Data For Analysis - I used some simple SQL to determine that no member records were missing, that no information was repeated, and that there were no misspellings in the data.
-- No misspellings in either 'member' or 'casual'; this ensures that the results will not have missing information.
SELECT
DISTINCT member_casual
FROM
table
-- This shows how many casual riders and members used the service; the totals should add up to the number of rows in the dataset.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM table
GROUP BY member_type

-- Shows that every ride has a distinct ID.
SELECT DISTINCT ride_id FROM table

-- Shows that there are no typos in the types of bikes, so no data will be missing from the results.
SELECT DISTINCT rideable_type FROM table
Key objectives: 1. Aggregate Your Data So It's Useful and Accessible - I had to write some SQL so that I could combine all the data from the different files I had uploaded to BigQuery:
SELECT rideable_type, started_at, ended_at, member_casual FROM table_1 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_2 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_3 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_4 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_5 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_6 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_7 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_8 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_9 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_10 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_11 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_12 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_13 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_14 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_15 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_16 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_17
-- This shows how many casual and annual members used bikes.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM aggregate_data_table
GROUP BY member_type
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Source - All data was collected from NBA.com and Basketball-Reference.com. To view the raw data and the steps I took to clean and format it, you can click the link below: https://docs.google.com/spreadsheets/d/1bJnc1n-pXVjtqKul1NnjOq0mYl9-7FZy_CbM2gTmTLA/edit?usp=sharing
Context - All data is from the 2022-2023, 82-game regular season.
Inspiration - I gathered this data to perform an analysis with the goal of answering the questions: - From where did the Boston Celtics shoot the highest field goal percentage? - When did the Boston Celtics shoot the highest field goal percentage? - Under what conditions did the Boston Celtics shoot the highest field goal percentage?
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
This data is a transformed version of the following dataset.
Mobile Legends Match Results Provided by MUHAMMAD RIZQI NUR
Huge shoutout to Rizqi for his wonderful work in providing this dataset. To see how he obtained this dataset in the first place, please see the original dataset (provided by the link above).
Aim:
- This dataset is for analyzing the draft picks using Tableau to answer these questions:
1. "If I have to play in a party of 2, which draft has the highest chance of winning?"
2. "If I have to play in a party of 3, which draft has the highest chance of winning?"
3. "What are some hero combinations that I need to avoid?"
Additional aims: 1. I uploaded this dataset as a way to get some public review of how I cleaned the data. 2. To share my step-by-step process for getting the final output.
Notes & caution: 1. I tried to clean the dataset using Python but got stuck midway on a sorting problem (link); please excuse my lack of competence. 2. Hence, I started cleaning the data manually using a spreadsheet. 3. Due to Google Sheets' maximum cell limitation, I had to split the work across 2 different files. 4. A lot of value copy-pasting went into producing the final output file 'MLBB Draft Sorted Cleaned.xlsx', so please use the formulas with caution.
Transformation steps:
1. Remove duplicates by the 'battleId' column.
2. Find and replace "Chang'e" with 'Change' (changing the double quote to a single quote, and removing the apostrophe from the original string) because it interferes with the Regex formula.
3. Split the data into winning drafts and losing drafts instead of 'left', 'right', and 'win status' for each match (Win = left side is the winning draft, Lose = right side is the winning draft).
4. Use Regex to split each pick list into an individual cell per hero pick.
5. Create a sheet listing the names of individual heroes.
6. Change each name into a numeric ID (to speed up the calculation and ensure better sorting).
7. Transpose into wide data (battleId as the column name, hero picks as rows).
8. Sort each column in ascending order.
9. Transpose back to long data.
10. Change the numeric IDs back into the respective hero names.
11. Repeat the process for the losing drafts.
12. Combine both winning drafts and losing drafts into a single sheet.
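For readers curious what these steps look like outside a spreadsheet, here is a rough pandas sketch of the same idea: splitting matches into winning and losing drafts and sorting the picks within each draft. The column names (battleId, left_picks, right_picks, win_status) and the tiny sample data are assumptions for illustration, not the original schema:

import pandas as pd

# Hypothetical sample in the spirit of the original data; real columns differ.
matches = pd.DataFrame({
    "battleId": [1, 1, 2],                                 # includes one duplicate
    "left_picks": ["Layla/Tigreal", "Layla/Tigreal", "Chou/Angela"],
    "right_picks": ["Zilong/Eudora", "Zilong/Eudora", "Miya/Franco"],
    "win_status": ["Win", "Win", "Lose"],                  # Win = left side won
})

# Step 1: remove duplicates by battleId.
matches = matches.drop_duplicates(subset="battleId")

# Step 3: pick the winning and losing drafts based on win_status.
left_won = matches["win_status"] == "Win"
matches["winning_draft"] = matches["left_picks"].where(left_won, matches["right_picks"])
matches["losing_draft"] = matches["right_picks"].where(left_won, matches["left_picks"])

# Steps 4 and 8: split each draft into individual picks and sort them,
# so identical hero combinations always appear in the same order.
def sorted_picks(draft: str) -> str:
    return "/".join(sorted(draft.split("/")))

matches["winning_draft"] = matches["winning_draft"].map(sorted_picks)
matches["losing_draft"] = matches["losing_draft"].map(sorted_picks)

print(matches[["battleId", "winning_draft", "losing_draft"]])

Sorting the picks within each row collapses the spreadsheet's transpose-sort-transpose steps (7 to 9) into a single operation.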
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
🇺🇸 Alphabet Inc. (GOOGL) Comprehensive Financial Dataset
Welcome to the GOOGL Financial Dataset! This dataset provides clear and easy-to-use quarterly financial statements (income statement, balance sheet, and cash flow) along with daily historical stock prices.
As a data engineer with a double major in economics, I'll personally analyze and provide constructive feedback on all your work using this dataset. Let's dive in and explore Google's financial journey together!
This dataset offers a unique blend of long-term market performance and detailed financial metrics:
Whether you're building predictive models, performing deep-dive financial analysis, or exploring the evolution of one of the world's most innovative tech giants, this dataset is your go-to resource for clean, well-organized, and rich financial data.
For a more comprehensive financial analysis, pair this dataset with my other Kaggle dataset:
👉 Google (Alphabet Inc.) Daily News — 2000 to 2025
That dataset includes:
Combining both datasets unlocks powerful analysis such as:
Together, they give you everything you need for news + financial signal modeling.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
It all started during the #StayAtHome period of the 2020 pandemic: some neighbors were worried about trash around Montevideo's garbage containers.
The goal is to automatically distinguish clean containers from dirty ones so maintenance can be requested.
Want to know more about the entire process? Check out this thread on how it began, and this other one about the version 6 update process.
The data is split into independent training and testing sets. However, each split contains several near-duplicate images (typically, the same container from different perspectives or days). Image sizes differ a lot among them.
There are four major sources:
* Images taken from Google Street View; they are 600x600 pixels, automatically collected through its API.
* Images contributed by individuals, most of which I took myself.
* Images taken from social networks (Twitter & Facebook) and news.
* Images contributed by pormibarrio.uy - 17-11-2020
Images were taken of green containers, the most popular type in Montevideo, which is also widely used in some other cities.
The current version (clean-dirty-garbage-containers-V6) is also available here, or you can download it as follows:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1mdfJoOrO6MeTc3eMEjIDkAKlwK9bUFg6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1mdfJoOrO6MeTc3eMEjIDkAKlwK9bUFg6" -O clean-dirty-garbage-containers-V6.zip && rm -rf /tmp/cookies.txt
This is especially useful if you want to download it in Google Colab.
This repo contains the code used during its building and documentation process, including the baselines for the proposed tasks.
Since this is a hot topic in Montevideo, especially nowadays with elections next week, it caught some attention from the local press:
Thanks to every single person who gave me images of their containers. Special thanks to my friend Diego, whose idea of using Google Street View as a data source really helped grow the dataset. And finally to my wife, who supported me during this project and contributed a lot to this dataset.
If you use these data in a publication, presentation, or other research project or product, please use the following citation:
Laguna, Rodrigo. 2021. Clean dirty containers in Montevideo - Version 6.1. url: https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo
@dataset{RLaguna-clean-dirty:2021,
author = {Rodrigo Laguna},
title = {Clean dirty containers in Montevideo},
year = {2021},
url = {https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo},
version = {6.1}
}
I'm on Twitter (@ro_laguna_), or write to me at r.laguna.queirolo at outlook.com.
12-09-2020: V3 - Include more training (+676) & testing (+64) samples:
21-12-2020: V4 - Include more training (+367) & testing (+794) samples, including ~400...
Syria is one of the third-world countries, and one of the countries that has only recently entered the world of technology (interest in the Internet, social media platforms, scientific research, etc.). I obtained this data through a survey I conducted on Facebook with Syrian citizens, whether residing inside or outside Syrian territory. The number of respondents is small; soon the survey will be more comprehensive, and communication will be done through all available social media platforms.
The survey ran for 10 days using a Google Sheet, and 60 samples were obtained. Some answers were incomplete, which will force the researcher to clean the data.
I am still a beginner, so I would like to gain some experience here and get help improving my data analysis skills.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This project is a data analytics assignment to analyze the data of a KPMG customer, Sprocket Central Pty Ltd, which sells various brands of bicycles across four states of Australia. They had issues with stagnating sales and needed help with the following queries:
• What are the trends in the underlying data?
• Which customer segment has the highest customer value?
• What do you propose should be Sprocket Central Pty Ltd's marketing and growth strategy?
• What additional external datasets may be useful to obtain greater insights into customer preferences and propensity to purchase the products?
The customer dataset consisted of the following data:
• Transactions: data on transactions in the year 2017, including transaction ID, product ID, brand, product class, product size, transaction date, product cost, etc.
• New customer list and customer demographics: consisting of addresses, job industry, customer names, job title, gender, wealth segment, etc.
The dataset was thoroughly cleaned and formatted in Spreadsheets to address the following data inconsistencies (a quick programmatic check of the same issues is sketched below):
• Transactions sheet - columns with issues: online order (empty), brand (empty), product size (empty), product class (empty), product line (empty), standard cost (empty), product first sold (empty)
• Customer Demographic sheet - columns with issues: gender (empty), DOB (inconsistent data), job industry category (empty)
• Customer Address sheet - columns with issues: states (abbreviations of states in place of state names)
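Purely as an illustration of checking for these gaps programmatically (the described cleaning was done in Spreadsheets), a pandas sketch might look like this; the file name and exact column labels are assumptions:

import pandas as pd

# Hypothetical export of the Transactions sheet; adjust the path and column names.
transactions = pd.read_csv("sprocket_central_transactions.csv")

# Count missing values in the columns the cleaning notes flag as problematic.
suspect_columns = ["online_order", "brand", "product_size",
                   "product_class", "product_line", "standard_cost",
                   "product_first_sold_date"]
missing_counts = transactions[suspect_columns].isna().sum()
print(missing_counts.sort_values(ascending=False))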
After thoroughly analyzing the cleaned data, the following major points were examined to derive insights and improve the business strategy:
• State-wise analysis to bring out the states with max and min sales
• Most-sold bikes by type (i.e., mountain bikes, road bikes, etc.)
• Customers in different job industries
• Customers in different age groups
• Customers from different wealth segments.
Insights of the analysis are presented in the presentation below. https://docs.google.com/presentation/d/1ECUmK4rGncjPVrRexL4kWPPIOFjdoXIqJkegYtC_wrk/edit?usp=share_link
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Johns Hopkins University has made an excellent dashboard using the affected-cases data. The data is extracted from the associated Google Sheets and made available here.
This data is available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. I am uploading it here for use in Kaggle kernels and to get insights from the broader DS community.
2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC
This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. Please note that this is time-series data, so the number of cases on any given day is the cumulative number.
The data is available from 22 Jan 2020 to 28 May 2020.
The main file in this dataset is covid_19_data_cleaned.csv; the detailed descriptions are below.
From the World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.
So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community.
Johns Hopkins University has made an excellent dashboard using the affected-cases data. The data is extracted from the associated Google Sheets and made available here.
Edited: The data is now available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. I am uploading it here for use in Kaggle kernels and to get insights from the broader DS community.
2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC
This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. Please note that this is time-series data, so the number of cases on any given day is the cumulative number.
The data is available from 22 Jan, 2020.
Province/State - Province or state of the observation (could be empty when missing)
Country/Region - Country of the observation
Last Update - Time in UTC at which the row was updated for the given province or country (not standardised, so please clean before using it)
Confirmed - Cumulative number of confirmed cases to that date
Deaths - Cumulative number of deaths to that date
Recovered - Cumulative number of recovered cases to that date
Lat - Latitude of the observation
Lon - Longitude of the observation
week - Week number (1 to 52)
Weeks Per Year
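Since the Confirmed, Deaths, and Recovered columns are cumulative, daily new counts have to be derived by differencing within each location. A minimal pandas sketch of that step (the CSV file name is an assumption; the column names follow the description above):

import pandas as pd

# Hypothetical file name; the columns follow the description above.
df = pd.read_csv("covid_19_data.csv", parse_dates=["Last Update"])

# The case columns are cumulative, so difference them per location to get daily counts.
df = df.sort_values(["Country/Region", "Province/State", "Last Update"])
group_cols = ["Country/Region", "Province/State"]
for col in ["Confirmed", "Deaths", "Recovered"]:
    df[f"New {col}"] = df.groupby(group_cols, dropna=False)[col].diff().fillna(df[col])

print(df[["Last Update", "Country/Region", "Confirmed", "New Confirmed"]].head())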
Thanks to Johns Hopkins University for making the data available for educational and academic research purposes.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a total of 16,737 unique anime. I created this dataset because I needed a clean anime dataset. I found a few existing anime datasets; most of them covered the major anime, but some 1) don't include the 'Genre' or 'Synopsis' of each anime (for content-based recommendation, it helps to have more information about each anime; a small sketch follows the column list below), 2) contain duplicate data, or 3) represent missing data with different notations.
Anime_id : anime ID (as per myanimelist.net)
Title : name of the anime
Genre : main genre
Synopsis : brief description
Type
Producer
Studio
Rating : rating of the anime as per myanimelist.net
ScoredBy : total number of users who scored the anime
Popularity : rank of the anime based on popularity
Members : number of members who added the anime to their list
Episodes : number of episodes
Source
Aired
Link
This dataset is a combination of 2 datasets
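As a quick illustration of the content-based recommendation use case mentioned above, here is a minimal sketch using TF-IDF over the Genre and Synopsis columns with scikit-learn; the CSV file name and the example title are assumptions:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical file name; columns follow the list above (default RangeIndex assumed).
anime = pd.read_csv("anime.csv")

# Build a single text field from genre and synopsis for each title.
text = anime["Genre"].fillna("") + " " + anime["Synopsis"].fillna("")
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(text)

def similar_titles(title: str, top_n: int = 5) -> pd.Series:
    """Return the top_n anime most similar to the given title."""
    idx = anime.index[anime["Title"] == title][0]
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    best = scores.argsort()[::-1][1:top_n + 1]  # skip the title itself
    return anime["Title"].iloc[best]

# Example (assumes the title exists in the file):
# print(similar_titles("Cowboy Bebop"))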
License: MIT License (https://opensource.org/licenses/MIT)
Chicken Republic Lagos Sales Dataset – Fast Food Sales Analysis (NG)
📝 Dataset Overview: This dataset captures real-world retail transaction data from Chicken Republic outlets in Lagos, Nigeria. It provides detailed insights into fast food sales performance across different product categories, with columns that track revenue, quantity sold, and profit.
Ideal for anyone looking to:
Practice sales analysis
Build business intelligence dashboards
Forecast product performance
Analyze profit margins and pricing
🔍 Dataset Features:
Date: date of each transaction
Location: outlet or branch where the sale occurred
Product Category: category of the product sold (e.g., Meals, Drinks, Snacks)
Product: name of the specific product
Quantity Sold: number of units sold
Unit Price (NGN): price per unit in Nigerian Naira
Total Sales (NGN): Quantity Sold × Unit Price
Profit (NGN): estimated profit from the sale
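To show the kind of profitability analysis this column layout supports, here is a minimal pandas sketch; the CSV file name is an assumption, while the column names come from the feature list above:

import pandas as pd

# Hypothetical file name; column names follow the dataset features above.
sales = pd.read_csv("chicken_republic_lagos_sales.csv", parse_dates=["Date"])

# Revenue and profit by product category, plus the profit margin per category.
by_category = sales.groupby("Product Category")[["Total Sales (NGN)", "Profit (NGN)"]].sum()
by_category["Profit Margin"] = by_category["Profit (NGN)"] / by_category["Total Sales (NGN)"]

print(by_category.sort_values("Profit (NGN)", ascending=False))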
🎯 Use Cases: Build Power BI dashboards with slicers and filters by product category
Perform profitability analysis per outlet
Create forecast models to predict sales
Analyze customer preferences based on high-selling items
Create data storytelling visuals for retail presentations
🛠 Tools You Can Use: Excel / Google Sheets
Power BI / Tableau
Python (Pandas, Matplotlib, Seaborn)
SQL for querying sales trends
👤 Creator: Fatolu Peter (Emperor Analytics) Working actively on real-world retail, healthcare, and social media analytics. This dataset is part of my ongoing data project series (#Project 9 and counting!) 🚀
✅ LinkedIn Post: 🚨 New Dataset Drop for Analysts & BI Enthusiasts 📊 Chicken Republic Lagos Sales Dataset – Now on Kaggle! 🔗 Access here
Whether you’re a student, analyst, or business developer—this dataset gives you a clean structure for performing end-to-end sales analysis:
✅ Track daily sales ✅ Visualize profit by product category ✅ Create Power BI dashboards ✅ Forecast best-selling items
Columns include: Date | Location | Product | Quantity Sold | Unit Price | Total Sales | Profit
Built with love from Lagos 🧡 Let’s drive real insights with real data. Tag me if you build something amazing—I’d love to see it!