CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.
Ahoy, data enthusiasts! Join us for a hands-on workshop where you will hoist your sails and navigate through the Statistics Canada website, uncovering hidden treasures in the form of data tables. With the wind at your back, you’ll master the art of downloading these invaluable Stats Can datasets while braving the occasional squall of data cleaning challenges using Excel with your trusty captains Vivek and Lucia at the helm.
This dataset was created by Luis Lira
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
9130 Global exporters importers export import shipment records of Clean excel with prices, volume & current Buyer's suppliers relationships based on actual Global export trade database.
About this course Do you have messy data from multiple inconsistent sources, or open-responses to questionnaires? Do you want to improve the quality of your data by refining it and using the power of the internet? Open Refine is the perfect partner to Excel. It is a powerful, free tool for exploring, normalising and cleaning datasets, and extending data by accessing the internet through APIs. In this course we’ll work through the various features of Refine, including importing data, faceting, clustering, and calling remote APIs, by working on a fictional but plausible humanities research project. Learning Outcomes Download, install and run Open Refine Import data from csv, text or online sources and create projects Navigate data using the Open Refine interface Explore data by using facets Clean data using clustering Parse data using GREL syntax Extend data using Application Programming Interfaces (APIs) Export project for use in other applications Prerequisites The course has no prerequisites. Licence Copyright © 2021 Intersect Australia Ltd. All rights reserved.
It completely data clean excel file to attain accurate data analysis with proper visualization
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This project focuses on data mapping, integration, and analysis to support the development and enhancement of six UNCDF operational applications: OrgTraveler, Comms Central, Internal Support Hub, Partnership 360, SmartHR, and TimeTrack. These apps streamline workflows for travel claims, internal support, partnership management, and time tracking within UNCDF.Key Features and Tools:Data Mapping for Salesforce CRM Migration: Structured and mapped data flows to ensure compatibility and seamless migration to Salesforce CRM.Python for Data Cleaning and Transformation: Utilized pandas, numpy, and APIs to clean, preprocess, and transform raw datasets into standardized formats.Power BI Dashboards: Designed interactive dashboards to visualize workflows and monitor performance metrics for decision-making.Collaboration Across Platforms: Integrated Google Collab for code collaboration and Microsoft Excel for data validation and analysis.
This dataset was created by Mohammed Khaled Boyka
Released under Other (specified in description)
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum,'' the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.
Methods Surveys from Carpentries style workshops the results of which are presented in the accompanying manuscript.
Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Form.
The surveys administered for the fall 2018, spring 2019 academic year are included as pre_workshop_survey and post_workshop_assessment PDF files.
The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
The data files whose name includes survey contain raw data from pre-workshop surveys and the data files whose name includes assessment contain raw data from the post-workshop assessment survey.
The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively.
The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
The summaries and visualizations presented in the manuscript are included in the analysis annotated RMarkdown file.
A Knowledge, Attitudes and Practices (KAP) survey was conducted in Ajuong Thok and Pamir Refugee Camps in October 2019 to determine the current Water, Sanitation and Hygiene (WASH) conditions as well as hygiene attitudes and practices within the households (HHs) surveyed. The assessment utilized a systematic random sampling method, and a total of 1,474 HHs (735 HHs in Ajuong Thok and 739 HHs in Pamir) were surveyed using mobile data collection (MDC) within a period of 21 days. Data was cleaned and analyzed in Excel. The summary of the results is presented in this report.
The findings show that the overall average number of liters of water per person per day was 23.4, in both Ajuong Thok and Pamir Camps, which was slightly higher than the recommended United Nations High Commissioner for Refugees (UNHCR) minimum standard of at least 20 liters of water available per person per day. This is a slight improvement from the 21 liters reported the previous year. The average HH size was six people. Women comprised 83% of the surveyed respondents and males 17%. Almost all the respondents were refugees, constituting 99.5% (n=1,466). The refugees were aware of the key health and hygiene practices, possibly as a result of routine health and hygiene messages delivered to them by Samaritan´s Purse (SP) and other health partners. Most refugees had knowledge about keeping the water containers clean, washing hands during critical times, safe excreta disposal and disease prevention.
Ajuong Thok and Pamir Refugee Camps
Households
All households in Ajuong Thok and Pamir Refugee Camps
Sample survey data [ssd]
Households were selected using systematic random sampling. Enumerators systematically walked through the camp block by block, row by row, in such a way as to pass each HH. Within blocks, enumerators started at one corner, then systematically used the sampling interval as they walked up and down each of the rows throughout the block, covering every block in Ajuong Thok and Pamir.
In each location, the first HH sampled in a block was generated using an Excel tool customized by UNHCR which generated a Random Start and Sampling Interval.
Face-to-face [f2f]
The survey questionnaire used to collect the data consists of the following sections: - Demographics - Water collection and storage - Drinking water hygiene - Hygiene - Sanitation - Messaging - Distribution (NFI) - Diarrhea prevalence, knowledge and health seeking behaviour - Menstrual hygiene
The data collected was uploaded to a server at the end of each day. IFormBuilder generated a Microsoft (MS) Excel spreadsheet dataset which was then cleaned and analyzed using MS Excel.
Given that SP is currently implementing a WASH program in Ajuong Thok and Pamir, the assessment data collected in these camps will not only serve as the endline for UNHCR 2018 programming but also as the baseline for 2019 programming.
Data was anonymized through decoding and local suppression.
The documentation covers Enterprise Survey panel datasets that were collected in Slovenia in 2009, 2013 and 2019.
The Slovenia ES 2009 was conducted between 2008 and 2009. The Slovenia ES 2013 was conducted between March 2013 and September 2013. Finally, the Slovenia ES 2019 was conducted between December 2018 and November 2019. The objective of the Enterprise Survey is to gain an understanding of what firms experience in the private sector.
As part of its strategic goal of building a climate for investment, job creation, and sustainable growth, the World Bank has promoted improving the business environment as a key strategy for development, which has led to a systematic effort in collecting enterprise data across countries. The Enterprise Surveys (ES) are an ongoing World Bank project in collecting both objective data based on firms' experiences and enterprises' perception of the environment in which they operate.
National
The primary sampling unit of the study is the establishment. An establishment is a physical location where business is carried out and where industrial operations take place or services are provided. A firm may be composed of one or more establishments. For example, a brewery may have several bottling plants and several establishments for distribution. For the purposes of this survey an establishment must take its own financial decisions and have its own financial statements separate from those of the firm. An establishment must also have its own management and control over its payroll.
As it is standard for the ES, the Slovenia ES was based on the following size stratification: small (5 to 19 employees), medium (20 to 99 employees), and large (100 or more employees).
Sample survey data [ssd]
The sample for Slovenia ES 2009, 2013, 2019 were selected using stratified random sampling, following the methodology explained in the Sampling Manual for Slovenia 2009 ES and for Slovenia 2013 ES, and in the Sampling Note for 2019 Slovenia ES.
Three levels of stratification were used in this country: industry, establishment size, and oblast (region). The original sample designs with specific information of the industries and regions chosen are included in the attached Excel file (Sampling Report.xls.) for Slovenia 2009 ES. For Slovenia 2013 and 2019 ES, specific information of the industries and regions chosen is described in the "The Slovenia 2013 Enterprise Surveys Data Set" and "The Slovenia 2019 Enterprise Surveys Data Set" reports respectively, Appendix E.
For the Slovenia 2009 ES, industry stratification was designed in the way that follows: the universe was stratified into manufacturing industries, services industries, and one residual (core) sector as defined in the sampling manual. Each industry had a target of 90 interviews. For the manufacturing industries sample sizes were inflated by about 17% to account for potential non-response cases when requesting sensitive financial data and also because of likely attrition in future surveys that would affect the construction of a panel. For the other industries (residuals) sample sizes were inflated by about 12% to account for under sampling in firms in service industries.
For Slovenia 2013 ES, industry stratification was designed in the way that follows: the universe was stratified into one manufacturing industry, and two service industries (retail, and other services).
Finally, for Slovenia 2019 ES, three levels of stratification were used in this country: industry, establishment size, and region. The original sample design with specific information of the industries and regions chosen is described in "The Slovenia 2019 Enterprise Surveys Data Set" report, Appendix C. Industry stratification was done as follows: Manufacturing – combining all the relevant activities (ISIC Rev. 4.0 codes 10-33), Retail (ISIC 47), and Other Services (ISIC 41-43, 45, 46, 49-53, 55, 56, 58, 61, 62, 79, 95).
For Slovenia 2009 and 2013 ES, size stratification was defined following the standardized definition for the rollout: small (5 to 19 employees), medium (20 to 99 employees), and large (more than 99 employees). For stratification purposes, the number of employees was defined on the basis of reported permanent full-time workers. This seems to be an appropriate definition of the labor force since seasonal/casual/part-time employment is not a common practice, except in the sectors of construction and agriculture.
For Slovenia 2009 ES, regional stratification was defined in 2 regions. These regions are Vzhodna Slovenija and Zahodna Slovenija. The Slovenia sample contains panel data. The wave 1 panel “Investment Climate Private Enterprise Survey implemented in Slovenia” consisted of 223 establishments interviewed in 2005. A total of 57 establishments have been re-interviewed in the 2008 Business Environment and Enterprise Performance Survey.
For Slovenia 2013 ES, regional stratification was defined in 2 regions (city and the surrounding business area) throughout Slovenia.
Finally, for Slovenia 2019 ES, regional stratification was done across two regions: Eastern Slovenia (NUTS code SI03) and Western Slovenia (SI04).
Computer Assisted Personal Interview [capi]
Questionnaires have common questions (core module) and respectfully additional manufacturing- and services-specific questions. The eligible manufacturing industries have been surveyed using the Manufacturing questionnaire (includes the core module, plus manufacturing specific questions). Retail firms have been interviewed using the Services questionnaire (includes the core module plus retail specific questions) and the residual eligible services have been covered using the Services questionnaire (includes the core module). Each variation of the questionnaire is identified by the index variable, a0.
Survey non-response must be differentiated from item non-response. The former refers to refusals to participate in the survey altogether whereas the latter refers to the refusals to answer some specific questions. Enterprise Surveys suffer from both problems and different strategies were used to address these issues.
Item non-response was addressed by two strategies: a- For sensitive questions that may generate negative reactions from the respondent, such as corruption or tax evasion, enumerators were instructed to collect the refusal to respond as (-8). b- Establishments with incomplete information were re-contacted in order to complete this information, whenever necessary. However, there were clear cases of low response.
For 2009 and 2013 Slovenia ES, the survey non-response was addressed by maximizing efforts to contact establishments that were initially selected for interview. Up to 4 attempts were made to contact the establishment for interview at different times/days of the week before a replacement establishment (with similar strata characteristics) was suggested for interview. Survey non-response did occur but substitutions were made in order to potentially achieve strata-specific goals. Further research is needed on survey non-response in the Enterprise Surveys regarding potential introduction of bias.
For 2009, the number of contacted establishments per realized interview was 6.18. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The relatively low ratio of contacted establishments per realized interview (6.18) suggests that the main source of error in estimates in the Slovenia may be selection bias and not frame inaccuracy.
For 2013, the number of realized interviews per contacted establishment was 25%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The number of rejections per contact was 44%.
Finally, for 2019, the number of interviews per contacted establishments was 9.7%. This number is the result of two factors: explicit refusals to participate in the survey, as reflected by the rate of rejection (which includes rejections of the screener and the main survey) and the quality of the sample frame, as represented by the presence of ineligible units. The share of rejections per contact was 75.2%.
https://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
Vrinda Store Data Analysis using Advance Excel, In this Dataset Cleaning the dataset and data mining remove the null value and using the Hlookup & Vlookup,Match,Index Pivot Tables and using the Chats to crated a beautiful DashBoard.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
ABOUT THIS TOOL:
The Better Building’s Clean Energy for Low Income Communities Accelerator (CELICA) was launched in 2016 to help state and local partners across the nation meet their goals for increasing uptake of energy efficiency and renewable energy technologies in low and moderate income communities. As a part of the Accelerator, DOE created this Low-Income Energy Affordability Data (LEAD) Tool to assist partners with understanding their LMI community characteristics. This can be utilized for low income and moderate income energy policy and program planning, as it provides interactive state, county and city level worksheets with graphs and data including number of households at different income levels and numbers of homeowners versus renters. It provides a breakdown based on fuel type, building type, and construction year. It also provides average monthly energy expenditures and energy burden (percentage of income spent on energy).
HOW TO USE:
The LEAD tool can be used to support program design and goal setting, and they can be paired with other data to improve LMI community energy benchmarking and program evaluation. Datasets are available for all 50 states, census divisions, and tract levels. You will have to enable macros in MS Excel to interact with the data. A description of each of the files and what states are included in each U.S. Census Division can be found in the file "DESCRIPTION OF FILES".
For more information, visit: https://betterbuildingsinitiative.energy.gov/accelerators/clean-energy-low-income-communities
This page provides data for the 3rd Grade Reading Level Proficiency performance measure.The dataset includes the student performance results on the English/Language Arts section of the AzMERIT from the Fall 2017 and Spring 2018. Data is representive of students in third grade in public elementary schools in Tempe. This includes schools from both Tempe Elementary and Kyrene districts. Results are by school and provide the total number of students tested, total percentage passing and percentage of students scoring at each of the four levels of proficiency. The performance measure dashboard is available at 3.07 3rd Grade Reading Level Proficiency.Additional InformationSource: Arizona Department of EducationContact: Ann Lynn DiDomenicoContact E-Mail: Ann_DiDomenico@tempe.govData Source Type: Excel/ CSVPreparation Method: Filters on original dataset: within "Schools" Tab School District [select Tempe School District and Kyrene School District]; School Name [deselect Kyrene SD not in Tempe city limits]; Content Area [select English Language Arts]; Test Level [select Grade 3]; Subgroup/Ethnicity [select All Students] Remove irrelevant fields; Add Fiscal YearPublish Frequency: Annually as data becomes availablePublish Method: ManualData Dictionary
For any questions about this data please email me at jacob@crimedatatool.com. If you use this data, please cite it.Version 3 release notes:Adds data in the following formats: Excel.Changes project name to avoid confusing this data for the ones done by NACJD.Version 2 release notes:Adds data for 2017.Adds a "number_of_months_reported" variable which says how many months of the year the agency reported data.Property Stolen and Recovered is a Uniform Crime Reporting (UCR) Program data set with information on the number of offenses (crimes included are murder, rape, robbery, burglary, theft/larceny, and motor vehicle theft), the value of the offense, and subcategories of the offense (e.g. for robbery it is broken down into subcategories including highway robbery, bank robbery, gas station robbery). The majority of the data relates to theft. Theft is divided into subcategories of theft such as shoplifting, theft of bicycle, theft from building, and purse snatching. For a number of items stolen (e.g. money, jewelry and previous metals, guns), the value of property stolen and and the value for property recovered is provided. This data set is also referred to as the Supplement to Return A (Offenses Known and Reported). All the data was received directly from the FBI as text or .DTA files. I created a setup file based on the documentation provided by the FBI and read the data into R using the package asciiSetupReader. All work to clean the data and save it in various file formats was also done in R. For the R code used to clean this data, see here: https://github.com/jacobkap/crime_data. The Word document file available for download is the guidebook the FBI provided with the raw data which I used to create the setup file to read in data.There may be inaccuracies in the data, particularly in the group of columns starting with "auto." To reduce (but certainly not eliminate) data errors, I replaced the following values with NA for the group of columns beginning with "offenses" or "auto" as they are common data entry error values (e.g. are larger than the agency's population, are much larger than other crimes or months in same agency): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 99942. This cleaning was NOT done on the columns starting with "value."For every numeric column I replaced negative indicator values (e.g. "j" for -1) with the negative number they are supposed to be. These negative number indicators are not included in the FBI's codebook for this data but are present in the data. I used the values in the FBI's codebook for the Offenses Known and Clearances by Arrest data.To make it easier to merge with other data, I merged this data with the Law Enforcement Agency Identifiers Crosswalk (LEAIC) data. The data from the LEAIC add FIPS (state, county, and place) and agency type/subtype. If an agency has used a different FIPS code in the past, check to make sure the FIPS code is the same as in this data.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Cardiovascular diseases dataset (clean)’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from https://www.kaggle.com/aiaiaidavid/cardio-data-dv13032020 on 13 February 2022.
--- Dataset description provided by original source is as follows ---
This data set is a cleaned up copy of cardio_train.csv which can be found at:
https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
The original data set has been analyzed with Excel, correcting negative values, and removing outliers.
A number of features in the dataset are used to predict the presence or absence of a cardiovascular disease.
Below is a description of the features:
AGE: integer (years of age)
HEIGHT: integer (cm)
WEIGHT: integer (kg)
GENDER: categorical (1: female, 2: male)
AP_HIGH: systolic blood pressure, integer
AP_LOW: diastolic blood pressure, integer
CHOLESTEROL: categorical (1: normal, 2: above normal, 3: well above normal)
GLUCOSE: categorical (1: normal, 2: above normal, 3: well above normal)
SMOKE: categorical (0: no, 1: yes)
ALCOHOL: categorical (0: no, 1: yes)
PHYSICAL_ACTIVITY: categorical (0: no, 1: yes)
And the target variable:
CARDIO_DISEASE: categorical (0: no, 1: yes)
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
PECD Hydro modelling
This repository contains a more user-friendly version of the Hydro modelling data
released by ENTSO-E with their latest Seasonal Outlook.
The original URLs:
The original ENTSO-E hydropower dataset integrates the PECD (Pan-European Climate Database) released for the MAF 2019
As I did for the wind & solar data, the datasets released in this repository are only a more user- and machine-readable version of the original Excel files. As avid user of ENTSO-E data, with this repository I want to share my data wrangling efforts to make this dataset more accessible.
Data description
The zipped file contains 86 Excel files, two different files for each ENTSO-E zone.
In this repository you can find 5 CSV files:
PECD-hydro-capacities.csv
: installed capacitiesPECD-hydro-weekly-inflows.csv
: weekly inflows for reservoir and open-loop pumpingPECD-hydro-daily-ror-generation.csv
: daily run-of-river generationPECD-hydro-weekly-reservoir-min-max-generation.csv
: minimum and maximum weekly reservoir generationPECD-hydro-weekly-reservoir-min-max-levels.csv
: weekly minimum and maximum reservoir levelsCapacities
The file PECD-hydro-capacities.csv
contains: run of river capacity (MW) and storage capacity (GWh), reservoir plants capacity (MW) and storage capacity (GWh), closed-loop pumping/turbining (MW) and storage capacity and open-loop pumping/turbining (MW) and storage capacity. The data is extracted from the Excel files with the name starting with PEMM
from the following sections:
Run-of-River and pondage
, rows from 5 to 7, columns from 2 to 5Reservoir
, rows from 5 to 7, columns from 1 to 3Pump storage - Open Loop
, rows from 5 to 7, columns from 1 to 3Pump storage - Closed Loop
, rows from 5 to 7, columns from 1 to 3Inflows
The file PECD-hydro-weekly-inflows.csv
contains the weekly inflow (GWh) for the climatic years 1982-2017 for reservoir plants and open-loop pumping. The data is extracted from the Excel files with the name starting with PEMM
from the following sections:
Reservoir
, rows from 13 to 66, columns from 16 to 51Pump storage - Open Loop
, rows from 13 to 66, columns from 16 to 51Daily run-of-river
The file PECD-hydro-daily-ror-generation.csv
contains the daily run-of-river generation (GWh). The data is extracted from the Excel files with the name starting with PEMM
from the following sections:
Run-of-River and pondage
, rows from 13 to 378, columns from 15 to 51Miminum and maximum reservoir generation
The file PECD-hydro-weekly-reservoir-min-max-generation.csv
contains the minimum and maximum generation (MW, weekly) for reservoir-based plants for the climatic years 1982-2017. The data is extracted from the Excel files with the name starting with PEMM
from the following sections:
Reservoir
, rows from 13 to 66, columns from 196 to 231Reservoir
, rows from 13 to 66, columns from 232 to 267Minimum/Maximum reservoir levels
The file PECD-hydro-weekly-reservoir-min-max-levels.csv
contains the minimum/maximum reservoir levels at beginning of each week (scaled coefficient from 0 to 1). The data is extracted from the Excel files with the name starting with PEMM
from the following sections:
Reservoir
, rows from 14 to 66, column 12Reservoir
, rows from 14 to 66, column 13CHANGELOG
[2020/07/17] Added maximum generation for the reservoir
These data include the individual responses for the City of Tempe Annual Business Survey conducted by ETC Institute. These data help determine priorities for the community as part of the City's on-going strategic planning process. Averaged Business Survey results are used as indicators for city performance measures. The performance measures with indicators from the Business Survey include the following (as of 2023):1. Financial Stability and Vitality5.01 Quality of Business ServicesThe location data in this dataset is generalized to the block level to protect privacy. This means that only the first two digits of an address are used to map the location. When they data are shared with the city only the latitude/longitude of the block level address points are provided. This results in points that overlap. In order to better visualize the data, overlapping points were randomly dispersed to remove overlap. The result of these two adjustments ensure that they are not related to a specific address, but are still close enough to allow insights about service delivery in different areas of the city.Additional InformationSource: Business SurveyContact (author): Adam SamuelsContact E-Mail (author): Adam_Samuels@tempe.govContact (maintainer): Contact E-Mail (maintainer): Data Source Type: Excel tablePreparation Method: Data received from vendor after report is completedPublish Frequency: AnnualPublish Method: ManualData DictionaryMethods:The survey is mailed to a random sample of businesses in the City of Tempe. Follow up emails and texts are also sent to encourage participation. A link to the survey is provided with each communication. To prevent people who do not live in Tempe or who were not selected as part of the random sample from completing the survey, everyone who completed the survey was required to provide their address. These addresses were then matched to those used for the random representative sample. If the respondent’s address did not match, the response was not used.To better understand how services are being delivered across the city, individual results were mapped to determine overall distribution across the city.Processing and Limitations:The location data in this dataset is generalized to the block level to protect privacy. This means that only the first two digits of an address are used to map the location. When they data are shared with the city only the latitude/longitude of the block level address points are provided. This results in points that overlap. In order to better visualize the data, overlapping points were randomly dispersed to remove overlap. The result of these two adjustments ensure that they are not related to a specific address, but are still close enough to allow insights about service delivery in different areas of the city.The data are used by the ETC Institute in the final published PDF report.
https://www.skyquestt.com/privacy/https://www.skyquestt.com/privacy/
Global Clean label ingredients Market size was valued at USD 47.10 Billion in 2022 and is poised to grow from USD 50.17 Billion in 2023 to USD 88.03 Billion by 2031, at a CAGR of 6.5% during the forecast period (2024-2031).
Report Metric | Details |
Market size value in 2022 | USD 47.10 Billion |
Market size value in 2023 | USD 50.17 Billion |
Market size value in 2031 | USD 88.03 Billion |
Forecast Year | 2024-2031 |
Growth Rate (CAGR) | 6.5% |
Segments Covered |
|
Largest Market | North America |
Fastest Growing Market | Asia Pacific |
http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/
Case Study 1- Bike Sharing Introduction: In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. There are two types of members are sharing bike differently! 1.) Annual members- who bought annual membership. 2.) Casual members- who bought or buying single-ride passes, full-day passes.
Phase_1- Ask- 1. Identify the business task- • How do annual members and casual riders use Cyclistic bikes differently? • Why would casual riders buy Cyclistic annual memberships? • How can Cyclistic use digital media to influence casual riders to become members? 2. Consider key stakeholders- Lily Moreno: The director of marketing and manager, Cyclistic marketing analytics team, Cyclistic executive team.
Phase_2- Prepare--
I downloaded and store it in my excel sheet, I am using only one month (April_2020) data, and using excel for solving task, I am also sorting and filtering my data according to requirement.
I downloaded data from public source and it’s fully reliable, unbiased. Data is also, complete, consistent and accurate.
Phase_3- Process—
• I downloaded 202004-divvy-tripdata.cvs data and I unzip the file and converted into .xls file, here I am using only April data because this case study is my first case study and only for my learning, so I want to keep it simple. I am using excel this time because I am more comfortable with excel then other tools. I also want to perform good analysis and don’t want to lost in multiple sheets & large dataset, in initial stage.
• I Checked the data errors, and corrected some errors, I also did some calculation in my sheet, and try to clean data, so I can use sheet appropriately, Phase_4- analyze— I organize my data, performed sorting and filtering multiple time as I needed, did some calculation, add few pivots table and try to analyze data properly, also try to Identify trends and relationships.
Phase_5- Share— • After completing my analysis, I used some charts to present my findings. First, I found Total count of ride is 16383 and annual members took 11552 count of ride what is 71% of total ride, and casual riders took only 29% of ride which is 4831.
• I also found that casual riders using ride for some times but members are taking ride anytime no matter if they need bike for long time or short time, they are taking ride without any second thought, because after buying annual pass they no need to pay (any extra money or) every time.
• Clark St & Elm St is a most bike rented point, people took 180 bikes from this station, and 132 are the annual member from that. Also, I found other station where we need more bikes. Likewise, we also can find station name where most people end their ride, so they have plenty space for bikes. Phase_6- Act— Feeling happy to share my finding with you, feeling little confident after completing my first case study.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Sample data for exercises in Further Adventures in Data Cleaning.