License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
I created these files and this analysis as part of working on a case study for the Google Data Analytics certificate.
Question investigated: Do annual members and casual riders use Cyclistic bikes differently? Why do we want to know? Knowing bike usage and behavior by rider type will allow the Marketing, Analytics, and Executive team stakeholders to design, assess, and approve appropriate strategies that drive profitability.
I used the script noted below to clean the files and then added some additional steps to create the visualizations needed to complete my analysis. The additional steps are noted in the corresponding R Markdown file for this data set.
Files: most recent 1 year of data available, Divvy_Trips_2019_Q2.csv, Divvy_Trips_2019_Q3.csv, Divvy_Trips_2019_Q4.csv, Divvy_Trips_2020_Q1.csv Source: Downloaded from https://divvy-tripdata.s3.amazonaws.com/index.html
Data cleaning script: I followed this script to clean and merge the files: https://docs.google.com/document/d/1gUs7-pu4iCHH3PTtkC1pMvHfmyQGu0hQBG5wvZOzZkA/copy
Note: the combined data set has 3,876,042 rows, so you will likely need to run the R analysis on your own computer (e.g., in the R console) rather than in the cloud (e.g., RStudio Cloud).
This was my first attempt to conduct an analysis in R and create the R Markdown file. As you might guess, it was an eye-opening experience, with both exciting discoveries and aggravating moments.
One thing I have not yet been able to figure out is how to add a legend to the map. I was able to get a legend to appear on a separate (empty) map, but not on the map you will see here.
I am also interested to see what others did with this analysis - what were the findings and insights you found?
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
BigQuery | Data cleaning in BigQuery
Tableau | Creating visuals with Tableau
Sheets | Cleaning NULL values, creating data tables
RStudio | Organizing and cleaning data to create visualization code
SQL (SSMS) | Transforming, cleaning, and manipulating data
LinkedIn | Survey poll
Source for the mock dating site: pH7-Social-Dating-CMS. Source for the mock social site: tailwhip99/social_media_site.
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogeneous, rich ecosystems.

Methods
See the eLife manuscript for full details. Below, we provide a summary of how the dataset was collected and processed.
Data Acquisition
We limited our search to the 150 largest cities in the USA (by census population). To acquire raw data on street tree communities, we used a search protocol on both Google and Google Datasets Search (https://datasetsearch.research.google.com/). We first searched the city name plus each of the following: street trees, city trees, tree inventory, urban forest, and urban canopy (all combinations totaled 20 searches per city, 10 each in Google and Google Datasets Search). We then read the first page of Google results and the top 20 results from Google Datasets Search. If the same-named city in the wrong state appeared in the results, we redid the 20 searches adding the state name. If no data were found, we contacted a relevant state official via email or phone with an inquiry about their street tree inventory. Datasheets were received and transformed to .csv format (if they were not already in that format). We received data on street trees from 64 cities. One city, El Paso, had data only in summary format and was therefore excluded from analyses.
Data Cleaning
All code used is in the zipped folder Data S5 in the eLife publication. Before cleaning the data, we ensured that all reported trees for each city were located within the greater metropolitan area of the city (for certain inventories, many suburbs were reported - some within the greater metropolitan area, others not).

First, we renamed all columns in the received .csv sheets, referring to the metadata and according to our standardized definitions (Table S4). To harmonize tree health and condition data across different cities, we inspected metadata from the tree inventories and converted all numeric scores to a descriptive scale including "excellent", "good", "fair", "poor", "dead", and "dead/dying". Some cities included only three points on this scale (e.g., "good", "poor", "dead/dying") while others included five (e.g., "excellent", "good", "fair", "poor", "dead").

Second, we used pandas in Python (W. McKinney & Others, 2011) to correct typos, non-ASCII characters, variable spellings, date format, units used (we converted all units to metric), address issues, and common name format. In some cases, units were not specified for tree diameter at breast height (DBH) and tree height; we determined the units based on typical sizes for trees of a particular species. Wherever diameter was reported, we assumed it was DBH. We standardized health and condition data across cities, preserving the highest granularity available for each city. For our analysis, we converted this variable to a binary (see section Condition and Health). We created a column called "location_type" to label whether a given tree was growing in the built environment or in green space. All of the changes we made, and decision points, are preserved in Data S9.

Third, we checked the scientific names reported using gnr_resolve in the R library taxize (Chamberlain & Szöcs, 2013), with the option Best_match_only set to TRUE (Data S9). Through an iterative process, we manually checked the results and corrected typos in the scientific names until all names were either a perfect match (n=1771 species) or a partial match with threshold greater than 0.75 (n=453 species). BGS manually reviewed all partial matches to ensure that they were the correct species name, and then we programmatically corrected these partial matches (for example, Magnolia grandifolia -- which is not a species name of a known tree -- was corrected to Magnolia grandiflora, and Pheonix canariensus was corrected to its proper spelling of Phoenix canariensis). Because many of these tree inventories were crowd-sourced or generated in part through citizen science, such typos and misspellings are to be expected.

Some tree inventories reported species by common names only. Therefore, our fourth step in data cleaning was to convert common names to scientific names. We generated a lookup table by summarizing all pairings of common and scientific names in the inventories for which both were reported. We manually reviewed the common to scientific name pairings, confirming that all were correct. Then we programmatically assigned scientific names to all common names (Data S9).

Fifth, we assigned native status to each tree through reference to the Biota of North America Project (Kartesz, 2018), which has collected data on all native and non-native species occurrences throughout the US states. Specifically, we determined whether each tree species in a given city was native to that state, not native to that state, or that we did not have enough information to determine nativity (for cases where only the genus was known).

Sixth, some cities reported only the street address but not latitude and longitude. For these cities, we used the OpenCageGeocoder (https://opencagedata.com/) to convert addresses to latitude and longitude coordinates (Data S9). OpenCageGeocoder leverages open data and is used by many academic institutions (see https://opencagedata.com/solutions/academia).

Seventh, we trimmed each city dataset to include only the standardized columns we identified in Table S4. After each stage of data cleaning, we performed manual spot checking to identify any issues.
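As an illustration of the kind of per-city harmonization described above, here is a minimal pandas sketch; the column names, the numeric-to-descriptive condition mapping, and the sample rows are hypothetical, and the actual code lives in Data S5:

import pandas as pd

# Hypothetical raw inventory excerpt; real column names vary by city.
raw = pd.DataFrame({
    "common_name": ["red maple", "Red Maple ", "honeylocust"],
    "condition_score": [5, 4, 2],          # numeric scale used by this city
    "dbh_in": [12.0, 8.5, 20.0],           # diameter at breast height, inches
})

# Map this city's numeric condition scores onto the shared descriptive scale.
condition_map = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "dead"}
clean = pd.DataFrame()
clean["common_name"] = raw["common_name"].str.strip().str.lower()
clean["condition"] = raw["condition_score"].map(condition_map)

# Convert imperial units to metric (inches to centimeters).
clean["dbh_cm"] = raw["dbh_in"] * 2.54

print(clean)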
This is bike-sharing data for a fictitious company, Cyclistic. The actual data is based on Divvy, Chicago's bike-share system. The original data was cleaned using Postgres and Google Sheets. The data has been cleaned to exclude records missing station IDs and trips with a duration over 24 hours. Several columns were created to calculate trip duration and day of the week.
--- Finding duplicates; assume a duplicated ride_id means duplicated data. Result: no duplicates found ---
SELECT ride_id, COUNT(ride_id) AS ride_id_count
FROM "Cyclistic"
GROUP BY ride_id
HAVING COUNT(ride_id) > 1;
--- Extract station table for data cleaning ----
SELECT DISTINCT start_station_name, start_station_id, end_station_id, end_station_name
FROM "Cyclistic"
ORDER BY start_station_name;
Using Google Sheets: clean the start_station_id codes, clean missing station names, clean station IDs with an extra .0, and assign IDs to NULL station data.
---- Update the main table with the cleaned station names, IDs, and coordinates (the end-longitude update is shown; analogous statements apply to the other name/ID/coordinate columns) ----
UPDATE "Cyclistic"
SET end_lng = lng
FROM "cleaned_station_info"
WHERE end_station_id = id;
---- The original latitude and longitude values vary by a small number of decimal places. To make the data more uniform, the latitude and longitude were averaged per station ID, using 8 decimal places for location accuracy. The data was then checked using Google Maps to make sure it is accurate to the nearest Divvy location in Chicago. ----
SELECT DISTINCT start_station_id, start_station_name,
       ROUND(AVG(start_lat)::DECIMAL, 8) AS lat,
       ROUND(AVG(start_lng)::DECIMAL, 8) AS lng
FROM "Cyclistic"
GROUP BY start_station_id, start_station_name
ORDER BY start_station_id;
--- Create a cleaned table for export, excluding rides shorter than 2 minutes and longer than 24 hours. For rides shorter than 2 minutes, the ride always ends up at the same station; it is assumed that the rider canceled the ride or had trouble using the bike, so these records are excluded. For rides longer than 24 hours, it is assumed that there was an error docking the bicycle or some other problem logging out of the ride. The new table also excludes rows where start_station_name or end_station_name is NULL. ---
SELECT *
FROM (
    SELECT
        ride_id, member_casual, rideable_type,
        start_station_id, start_station_name,
        end_station_id, end_station_name,
        started_at, ended_at,
        ended_at - started_at AS duration,
        start_lat, start_lng, end_lat, end_lng
    FROM "Cyclistic"
    WHERE start_station_name IS NOT NULL AND end_station_name IS NOT NULL
) AS duration_tbl
WHERE duration >= INTERVAL '2 minutes' AND duration <= INTERVAL '24 hours';
Follow these instructions to use the Google Spreadsheet in your own activity.

1. Begin by copying the Google Spreadsheet into your own Google Drive account.
2. Prefill the username column for your students/participants. This will help keep the students from overwriting their peers' work.
3. Change the editing permissions for the spreadsheet and share it with your students/participants.
4. Demonstrate what data goes into each column from the Wikipedia page. Be sure to demonstrate how to find the latitude and longitude from Wikipedia. For the images, make sure the students copy the URL that ends in the appropriate file type (jpg, png, etc.).
5. Be prepared for lots of mistakes. This is a great learning opportunity to talk about data quality. When the students are done completing the spreadsheet, check it for obvious errors. Pay special attention to the sign of the longitude: all of those values should be negative (a quick scripted check is sketched after these steps).
6. Download the spreadsheet as a CSV.
7. Log into your ArcGIS Online (AGO) organization account.
8. Click on the Content tab -> Add item -> From my computer.
9. Upload the CSV and save it as a feature layer. Be sure to include a few tags (Mesoamerica, pyramid, Aztec, Maya would be good ones).
10. Once the layer has been uploaded and converted into a feature layer, click the Settings button, check Delete Protection, and save.
11. From the feature layer Overview tab, change the share settings to share with your students. I usually set up a group (something like Mesoamerica), add the students to the group, then share the feature layer with that group.
12. From here, explore the data. Symbolize the data by culture to see if there are spatial patterns to their distribution. Symbolize the data by height to see if some cultures built taller pyramids or if taller pyramids were confined to certain regions. Students can also set up the pop-ups to use the image URL in the data.
13. From here, students can save their maps, add additional data from ArcGIS Online, create story maps, etc. If you are looking for more great data, from your ArcGIS Online map, choose Add -> Add Layer from Web and paste the following into the URL: https://services1.arcgis.com/TQSFiGYN0xveoERF/arcgis/rest/services/MesoAmerican_civs/FeatureServer

Image thumbnail is from Wikipedia.
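For instructors who prefer a scripted check, here is a minimal pandas sketch of the longitude sanity check mentioned in step 5; the CSV file name and column name are assumptions, not part of the original activity:

import pandas as pd

# Hypothetical file and column names; adjust to match the activity spreadsheet.
df = pd.read_csv("mesoamerican_pyramids.csv")

# Every site is in the Western Hemisphere, so all longitudes should be negative.
bad_longitude = df[df["longitude"] >= 0]
if bad_longitude.empty:
    print("All longitude values look correct (negative).")
else:
    print("Rows with a suspicious longitude sign:")
    print(bad_longitude)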
This compensation data originated from a large Google Sheets document that went viral after it was posted on LinkedIn by Christen Nino De Guzman, a Program Manager at Google. The LinkedIn post stated the following:
"Let's talk #SalaryTransparency! A few weeks ago, I encouraged others to share their salaries in an anonymous Google form and now more than 58,000 people have come together to share details of their offers in a google sheet. Everything from sign on bonus, annual salary, diverse identity and age. Many fields that traditional sites like Glassdoor don’t include.
All of the responses were anonymous and publicly viewable. Huge shoutout to Brennan Pankiw for creating the survey! You can view responses here: https://lnkd.in/gPkYFQsN "
The Google Sheet became extremely laggy and crashed often as a result of its size and the number of people accessing the document. Therefore, I downloaded the raw data and took it upon myself to clean the data and create a user-friendly visualization of the compensation data. Using the skills I recently acquired during my Harvard Business Analytics Program, I identified the gaps, typos, variations of company names, and false data using tools such as RStudio and Tableau.
DATA
To anyone interested in both the raw data and the cleaned data presented here, please reach out to Ricardo Ugas.
CONTACT
Linkedin: https://www.linkedin.com/in/ugas/ Resume/CV: https://bit.ly/3dIUmCo Email: ricardo.ugas.analytics@gmail.com ricardo.ugasgonzalez@postgrad.manchester.ac.uk ricardo.ugas@mail.analytics.hbs.edu ugasra@miamioh.edu Phone: +1 513 526 6598
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains non-confidential trip details for users of an imaginary bike company called Cyclistic. It contains approximately 5.9 million records in total. This is a large amount of data to handle in spreadsheet applications such as MS Excel or Google Sheets. Although it is possible to complete the entire analysis in MS Excel, I would recommend using BigQuery or R to clean and analyze the data effectively.
The dataset contains the following files:
202107-divvy-tripdata.csv 202108-divvy-tripdata.csv 202109-divvy-tripdata.csv 202110-divvy-tripdata.csv 202111-divvy-tripdata.csv 202112-divvy-tripdata.csv 202201-divvy-tripdata.csv 202202-divvy-tripdata.csv 202203-divvy-tripdata.csv 202204-divvy-tripdata.csv 202205-divvy-tripdata.csv 202206-divvy-tripdata.csv
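The description above recommends BigQuery or R; purely as an illustration, a minimal pandas sketch for stacking the monthly files listed above into one frame might look like this (it assumes the CSVs sit in the working directory):

from pathlib import Path

import pandas as pd

# The twelve monthly Divvy trip files listed above (July 2021 to June 2022).
files = sorted(Path(".").glob("202*-divvy-tripdata.csv"))

# Read and stack them; the monthly files share the same column layout.
trips = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(f"Combined {len(files)} files into {len(trips):,} rows")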
Cyclistic, a bike-sharing company, wants to analyze their user data to find the main differences in behavior between their two types of users: casual riders, who pay for each ride, and annual members, who pay a yearly subscription to the service.
Key objectives: 1. Identify The Business Task - Cyclistic wants to analyze the data to find the key differences between Casual Riders and Annual Members. The goal of this project is to reach out to casual riders and incentivize them to pay for the annual subscription.
Key objectives: 1. Download Data And Store It Appropriately - Downloaded the data as .csv files, which were saved in their own folder to keep everything organized. I then uploaded those files into BigQuery for cleaning and analysis. For this project I downloaded all of 2022 and up to May of 2023, as this is the most recent data that I have access to.
Identify How It's Organized
Sort and Filter The Data and Determine The Credibility of The Data
Key objectives: 1. Clean The Data and Prepare The Data For Analysis - I used some simple SQL to determine that no member records were missing, that no information was repeated, and that there were no misspellings in the data.
-- No misspellings in either 'member' or 'casual'; this ensures that the results will not have missing information.
SELECT
DISTINCT member_casual
FROM
table
-- This shows how many casual riders and members used the service; the totals should add up to the number of rows in the dataset.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM table
GROUP BY member_type

-- Shows that every ride has a distinct ID.
SELECT DISTINCT ride_id FROM table

-- Shows that there are no typos in the types of bikes, so no data will be missing from the results.
SELECT DISTINCT rideable_type FROM table
Key objectives: 1. Aggregate Your Data So It's Useful and Accessible - I had to write some SQL so that I could combine all the data from the different files I had uploaded to BigQuery:
SELECT rideable_type, started_at, ended_at, member_casual FROM table_1 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_2 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_3 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_4 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_5 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_6 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_7 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_8 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_9 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_10 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_11 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_12 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_13 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_14 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_15 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_16 UNION ALL
SELECT rideable_type, started_at, ended_at, member_casual FROM table_17
-- This shows how many casual and annual members used bikes.
SELECT member_casual AS member_type, COUNT(*) AS total_riders
FROM aggregate_data_table
GROUP BY member_type
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Source - All data was collected from NBA.com and Basketball-Reference.com. To view the raw data and the steps I took to clean and format it, you can click the link below: https://docs.google.com/spreadsheets/d/1bJnc1n-pXVjtqKul1NnjOq0mYl9-7FZy_CbM2gTmTLA/edit?usp=sharing
Context - All data is from the 2022-2023, 82-game regular season.
Inspiration - I gathered this data to perform an analysis with the goal of answering the questions: - From where did the Boston Celtics shoot the highest field goal percentage? - When did the Boston Celtics shoot the highest field goal percentage? - Under what conditions did the Boston Celtics shoot the highest field goal percentage?
License: Database Contents License (DbCL) v1.0 (http://opendatacommons.org/licenses/dbcl/1.0/)
This data is a transformed version of the following dataset.
Mobile Legends Match Results Provided by MUHAMMAD RIZQI NUR
Huge shoutout to Rizqi for his wonderful work in providing this dataset. To see how he obtained this dataset in the first place, please see the original dataset (provided by the link above).
Aim:
- This dataset is for analyzing the draft picks using Tableau to answer these questions:
1. "If I have to play in a party of 2, which draft has the highest chance of winning?"
2. "If I have to play in a party of 3, which draft has the highest chance of winning?"
3. "What are some hero combinations that I need to avoid?"
Additional aims: 1. I uploaded this dataset as a way to get some public review of how I cleaned the data. 2. To share my step-by-step process for getting the final output.
Notes & caution: 1. I tried to clean the dataset using Python but got stuck midway on a sorting problem (link); please excuse my lack of competence. 2. Hence, I started cleaning the data manually using a spreadsheet. 3. Due to Google Sheets' maximum cell limitation, I had to split the work across 2 different files. 4. A lot of value copy-pasting went into producing the final output file 'MLBB Draft Sorted Cleaned.xlsx', so please use the formulas with caution.
Transformation steps:
1. Remove duplicates by the 'battleId' column.
2. Find and replace "Chang'e" with 'Change' (changing the double quote to a single quote, and removing the apostrophe from the original string) because it interferes with the Regex formula.
3. Split the data into winning drafts and losing drafts instead of 'left', 'right', and 'win status' for each match (Win = left side is the winning draft, Lose = right side is the winning draft).
4. Use Regex to split each pick list into an individual cell per hero pick.
5. Create a sheet listing the names of individual heroes.
6. Change each name into a numeric ID (to speed up the calculation and ensure better sorting).
7. Transpose into wide data (battleId as the column name, hero picks as rows).
8. Sort each column in ascending order.
9. Transpose back to long data.
10. Change the numeric IDs back into the respective hero names.
11. Repeat the process for the losing drafts.
12. Combine both winning drafts and losing drafts into a single sheet.
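For readers curious what these steps look like outside a spreadsheet, here is a rough pandas sketch of the same idea: splitting matches into winning and losing drafts and sorting the picks within each draft. The column names (battleId, left_picks, right_picks, win_status) and the tiny sample data are assumptions for illustration, not the original schema:

import pandas as pd

# Hypothetical sample in the spirit of the original data; real columns differ.
matches = pd.DataFrame({
    "battleId": [1, 1, 2],                                 # includes one duplicate
    "left_picks": ["Layla/Tigreal", "Layla/Tigreal", "Chou/Angela"],
    "right_picks": ["Zilong/Eudora", "Zilong/Eudora", "Miya/Franco"],
    "win_status": ["Win", "Win", "Lose"],                  # Win = left side won
})

# Step 1: remove duplicates by battleId.
matches = matches.drop_duplicates(subset="battleId")

# Step 3: pick the winning and losing drafts based on win_status.
left_won = matches["win_status"] == "Win"
matches["winning_draft"] = matches["left_picks"].where(left_won, matches["right_picks"])
matches["losing_draft"] = matches["right_picks"].where(left_won, matches["left_picks"])

# Steps 4 and 8: split each draft into individual picks and sort them,
# so identical hero combinations always appear in the same order.
def sorted_picks(draft: str) -> str:
    return "/".join(sorted(draft.split("/")))

matches["winning_draft"] = matches["winning_draft"].map(sorted_picks)
matches["losing_draft"] = matches["losing_draft"].map(sorted_picks)

print(matches[["battleId", "winning_draft", "losing_draft"]])

Sorting the picks within each row collapses the spreadsheet's transpose-sort-transpose steps (7 to 9) into a single operation.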
License: Apache License 2.0 (https://www.apache.org/licenses/LICENSE-2.0)
🇺🇸 Alphabet Inc. (GOOGL) Comprehensive Financial Dataset
Welcome to the GOOGL Financial Dataset! This dataset provides clear and easy-to-use quarterly financial statements (income statement, balance sheet, and cash flow) along with daily historical stock prices.
As a data engineer with a double major in economics, I'll personally analyze and provide constructive feedback on all your work using this dataset. Let's dive in and explore Google's financial journey together!
This dataset offers a unique blend of long-term market performance and detailed financial metrics:
Whether you're building predictive models, performing deep-dive financial analysis, or exploring the evolution of one of the world's most innovative tech giants, this dataset is your go-to resource for clean, well-organized, and rich financial data.
For a more comprehensive financial analysis, pair this dataset with my other Kaggle dataset:
👉 Google (Alphabet Inc.) Daily News — 2000 to 2025
That dataset includes:
Combining both datasets unlocks powerful analysis such as:
Together, they give you everything you need for news + financial signal modeling.
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
It all started during the #StayAtHome period of the 2020 pandemic: some neighbors were worried about trash around Montevideo's garbage containers.
The goal is to automatically distinguish clean containers from dirty ones so maintenance can be requested.
Want to know more about the entire process? Check out this thread on how it began, and this other one about the version 6 update process.
The data is split into independent training and testing sets. However, each split contains several near-duplicate images (typically, the same container from different perspectives or days). Image sizes differ a lot among them.
There are four major sources:
* Images taken from Google Street View; they are 600x600 pixels, automatically collected through its API.
* Images contributed by individuals, most of which I took myself.
* Images taken from social networks (Twitter & Facebook) and news.
* Images contributed by pormibarrio.uy - 17-11-2020
Images were taken of green containers, the most popular type in Montevideo, which is also widely used in some other cities.
The current version (clean-dirty-garbage-containers-V6) is also available here, or you can download it as follows:
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1mdfJoOrO6MeTc3eMEjIDkAKlwK9bUFg6' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1mdfJoOrO6MeTc3eMEjIDkAKlwK9bUFg6" -O clean-dirty-garbage-containers-V6.zip && rm -rf /tmp/cookies.txt
This is especially useful if you want to download it in Google Colab.
This repo contains the code used during its building and documentation process, including the baselines for the proposed tasks.
Since this is a hot topic in Montevideo, especially nowadays with elections next week, it caught some attention from the local press:
Thanks to every single person who gave me images of their containers. Special thanks to my friend Diego, whose idea of using Google Street View as a data source really helped grow the dataset. And finally to my wife, who supported me during this project and contributed a lot to this dataset.
If you use these data in a publication, presentation, or other research project or product, please use the following citation:
Laguna, Rodrigo. 2021. Clean dirty containers in Montevideo - Version 6.1. url: https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo
@dataset{RLaguna-clean-dirty:2021,
author = {Rodrigo Laguna},
title = {Clean dirty containers in Montevideo},
year = {2021},
url = {https://www.kaggle.com/rodrigolaguna/clean-dirty-containers-in-montevideo},
version = {6.1}
}
I'm on Twitter (@ro_laguna_), or write to me at r.laguna.queirolo at outlook.com.
12-09-2020: V3 - Include more training (+676) & testing (+64) samples:
21-12-2020: V4 - Include more training (+367) & testing (+794) samples, including ~400...
Syria is one of the third-world countries, and one of the countries that has only recently entered the world of technology (interest in the Internet, social media platforms, scientific research, etc.). I obtained this data through a survey I conducted on Facebook with Syrian citizens, whether residing inside or outside Syrian territory. The number of respondents is small; soon the survey will be more comprehensive, and communication will be done through all available social media platforms.
The survey ran for 10 days using a Google Sheet, and 60 samples were obtained. Some answers were incomplete, which will force the researcher to clean the data.
I am still a beginner, so I would like to gain some experience here and get help improving my data analysis skills.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This project is a data analytics assignment to analyze the data of a KPMG customer, Sprocket Central Pty Ltd, which sells various brands of bicycles across four states of Australia. They had issues with stagnating sales and needed help with the following queries:
• What are the trends in the underlying data?
• Which customer segment has the highest customer value?
• What do you propose should be Sprocket Central Pty Ltd's marketing and growth strategy?
• What additional external datasets may be useful to obtain greater insights into customer preferences and propensity to purchase the products?
The customer dataset consisted of the following data:
• Transactions: data on transactions in the year 2017, including transaction ID, product ID, brand, product class, product size, transaction date, product cost, etc.
• New customer list and customer demographics: consisting of addresses, job industry, customer names, job title, gender, wealth segment, etc.
The dataset was thoroughly cleaned and formatted in Spreadsheets to address the following data inconsistencies (a quick programmatic check of the same issues is sketched below):
• Transactions sheet - columns with issues: online order (empty), brand (empty), product size (empty), product class (empty), product line (empty), standard cost (empty), product first sold (empty)
• Customer Demographic sheet - columns with issues: gender (empty), DOB (inconsistent data), job industry category (empty)
• Customer Address sheet - columns with issues: states (abbreviations of states in place of state names)
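Purely as an illustration of checking for these gaps programmatically (the described cleaning was done in Spreadsheets), a pandas sketch might look like this; the file name and exact column labels are assumptions:

import pandas as pd

# Hypothetical export of the Transactions sheet; adjust the path and column names.
transactions = pd.read_csv("sprocket_central_transactions.csv")

# Count missing values in the columns the cleaning notes flag as problematic.
suspect_columns = ["online_order", "brand", "product_size",
                   "product_class", "product_line", "standard_cost",
                   "product_first_sold_date"]
missing_counts = transactions[suspect_columns].isna().sum()
print(missing_counts.sort_values(ascending=False))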
After thoroughly analyzing the cleaned data, the following major points were examined to derive insights and improve the business strategy:
• State-wise analysis to bring out the states with max and min sales
• Most-sold bikes by type (i.e., mountain bikes, road bikes, etc.)
• Customers in different job industries
• Customers in different age groups
• Customers from different wealth segments.
Insights of the analysis are presented in the presentation below. https://docs.google.com/presentation/d/1ECUmK4rGncjPVrRexL4kWPPIOFjdoXIqJkegYtC_wrk/edit?usp=share_link
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
Johns Hopkins University has made an excellent dashboard using the affected-cases data. The data is extracted from the associated Google Sheets and made available here.
This data is available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. I am uploading it here for use in Kaggle kernels and to get insights from the broader DS community.
2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC
This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. Please note that this is time-series data, so the number of cases on any given day is the cumulative number.
The data is available from 22 Jan 2020 to 28 May 2020.
The main file in this dataset is covid_19_data_cleaned.csv; the detailed descriptions are below.
From the World Health Organization - On 31 December 2019, WHO was alerted to several cases of pneumonia in Wuhan City, Hubei Province of China. The virus did not match any other known virus. This raised concern because when a virus is new, we do not know how it affects people.
So daily level information on the affected people can give some interesting insights when it is made available to the broader data science community.
Johns Hopkins University has made an excellent dashboard using the affected-cases data. The data is extracted from the associated Google Sheets and made available here.
Edited: The data is now available as CSV files in the Johns Hopkins GitHub repository. Please refer to the GitHub repository for the Terms of Use details. I am uploading it here for use in Kaggle kernels and to get insights from the broader DS community.
2019 Novel Coronavirus (2019-nCoV) is a virus (more specifically, a coronavirus) identified as the cause of an outbreak of respiratory illness first detected in Wuhan, China. Early on, many of the patients in the outbreak in Wuhan, China reportedly had some link to a large seafood and animal market, suggesting animal-to-person spread. However, a growing number of patients reportedly have not had exposure to animal markets, indicating person-to-person spread is occurring. At this time, it’s unclear how easily or sustainably this virus is spreading between people - CDC
This dataset has daily-level information on the number of affected cases, deaths, and recoveries from the 2019 novel coronavirus. Please note that this is time-series data, so the number of cases on any given day is the cumulative number.
The data is available from 22 Jan, 2020.
Province/State - Province or state of the observation (could be empty when missing)
Country/Region - Country of the observation
Last Update - Time in UTC at which the row was updated for the given province or country (not standardised, so please clean before using it)
Confirmed - Cumulative number of confirmed cases to that date
Deaths - Cumulative number of deaths to that date
Recovered - Cumulative number of recovered cases to that date
Lat - Latitude of the observation
Lon - Longitude of the observation
week - Week number (1 to 52)
Weeks Per Year
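Since the Confirmed, Deaths, and Recovered columns are cumulative, daily new counts have to be derived by differencing within each location. A minimal pandas sketch of that step (the CSV file name is an assumption; the column names follow the description above):

import pandas as pd

# Hypothetical file name; the columns follow the description above.
df = pd.read_csv("covid_19_data.csv", parse_dates=["Last Update"])

# The case columns are cumulative, so difference them per location to get daily counts.
df = df.sort_values(["Country/Region", "Province/State", "Last Update"])
group_cols = ["Country/Region", "Province/State"]
for col in ["Confirmed", "Deaths", "Recovered"]:
    df[f"New {col}"] = df.groupby(group_cols, dropna=False)[col].diff().fillna(df[col])

print(df[["Last Update", "Country/Region", "Confirmed", "New Confirmed"]].head())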
Thanks to Johns Hopkins University for making the data available for educational and academic research purposes.
License: CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a total of 16,737 unique anime. I created this dataset because I needed a clean anime dataset. I found a few existing anime datasets; most of them covered the major anime, but some 1) don't include the 'Genre' or 'Synopsis' of each anime (for content-based recommendation, it helps to have more information about each anime; a small sketch follows the column list below), 2) contain duplicate data, or 3) represent missing data with different notations.
Anime_id : anime ID (as per myanimelist.net)
Title : name of the anime
Genre : main genre
Synopsis : brief description
Type
Producer
Studio
Rating : rating of the anime as per myanimelist.net
ScoredBy : total number of users who scored the anime
Popularity : rank of the anime based on popularity
Members : number of members who added the anime to their list
Episodes : number of episodes
Source
Aired
Link
This dataset is a combination of 2 datasets
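As a quick illustration of the content-based recommendation use case mentioned above, here is a minimal sketch using TF-IDF over the Genre and Synopsis columns with scikit-learn; the CSV file name and the example title are assumptions:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical file name; columns follow the list above (default RangeIndex assumed).
anime = pd.read_csv("anime.csv")

# Build a single text field from genre and synopsis for each title.
text = anime["Genre"].fillna("") + " " + anime["Synopsis"].fillna("")
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(text)

def similar_titles(title: str, top_n: int = 5) -> pd.Series:
    """Return the top_n anime most similar to the given title."""
    idx = anime.index[anime["Title"] == title][0]
    scores = cosine_similarity(matrix[idx], matrix).ravel()
    best = scores.argsort()[::-1][1:top_n + 1]  # skip the title itself
    return anime["Title"].iloc[best]

# Example (assumes the title exists in the file):
# print(similar_titles("Cowboy Bebop"))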
License: MIT License (https://opensource.org/licenses/MIT)
Chicken Republic Lagos Sales Dataset – Fast Food Sales Analysis (NG)
📝 Dataset Overview: This dataset captures real-world retail transaction data from Chicken Republic outlets in Lagos, Nigeria. It provides detailed insights into fast food sales performance across different product categories, with columns that track revenue, quantity sold, and profit.
Ideal for anyone looking to:
Practice sales analysis
Build business intelligence dashboards
Forecast product performance
Analyze profit margins and pricing
🔍 Dataset Features:
Date: date of each transaction
Location: outlet or branch where the sale occurred
Product Category: category of the product sold (e.g., Meals, Drinks, Snacks)
Product: name of the specific product
Quantity Sold: number of units sold
Unit Price (NGN): price per unit in Nigerian Naira
Total Sales (NGN): Quantity Sold × Unit Price
Profit (NGN): estimated profit from the sale
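To show the kind of profitability analysis this column layout supports, here is a minimal pandas sketch; the CSV file name is an assumption, while the column names come from the feature list above:

import pandas as pd

# Hypothetical file name; column names follow the dataset features above.
sales = pd.read_csv("chicken_republic_lagos_sales.csv", parse_dates=["Date"])

# Revenue and profit by product category, plus the profit margin per category.
by_category = sales.groupby("Product Category")[["Total Sales (NGN)", "Profit (NGN)"]].sum()
by_category["Profit Margin"] = by_category["Profit (NGN)"] / by_category["Total Sales (NGN)"]

print(by_category.sort_values("Profit (NGN)", ascending=False))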
🎯 Use Cases: Build Power BI dashboards with slicers and filters by product category
Perform profitability analysis per outlet
Create forecast models to predict sales
Analyze customer preferences based on high-selling items
Create data storytelling visuals for retail presentations
🛠 Tools You Can Use: Excel / Google Sheets
Power BI / Tableau
Python (Pandas, Matplotlib, Seaborn)
SQL for querying sales trends
👤 Creator: Fatolu Peter (Emperor Analytics) Working actively on real-world retail, healthcare, and social media analytics. This dataset is part of my ongoing data project series (#Project 9 and counting!) 🚀
✅ LinkedIn Post: 🚨 New Dataset Drop for Analysts & BI Enthusiasts 📊 Chicken Republic Lagos Sales Dataset – Now on Kaggle! 🔗 Access here
Whether you’re a student, analyst, or business developer—this dataset gives you a clean structure for performing end-to-end sales analysis:
✅ Track daily sales ✅ Visualize profit by product category ✅ Create Power BI dashboards ✅ Forecast best-selling items
Columns include: Date | Location | Product | Quantity Sold | Unit Price | Total Sales | Profit
Built with love from Lagos 🧡 Let’s drive real insights with real data. Tag me if you build something amazing—I’d love to see it!