Introduction
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Characters and teams
Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
The trip data contains the following columns:
- ride_id: A distinct identifier assigned to each individual ride.
- rideable_type: The type of bike used for the ride.
- started_at: The timestamp when the ride began.
- ended_at: The timestamp when the ride concluded.
- start_station_name: The name of the station where the ride originated.
- start_station_id: The unique identifier of the station where the ride originated.
- end_station_name: The name of the station where the ride concluded.
- end_station_id: The unique identifier of the station where the ride concluded.
- start_lat: The latitude coordinate of the ride's starting point.
- start_lng: The longitude coordinate of the ride's starting point.
- end_lat: The latitude coordinate of the ride's ending point.
- end_lng: The longitude coordinate of the ride's ending point.
- member_casual: Whether the rider is an annual member or a casual user.
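To make the schema concrete, here is a minimal, hypothetical pandas sketch (not part of the original case study) that loads one month of trip data with these columns, derives a ride-length field, and compares members with casual riders; the file name is an assumption.

```python
# Hypothetical sketch (not from the original case study): load one month of
# Cyclistic-style trip data and compare ride lengths by rider type.
# The file name below is an assumption; point it at your own trip-data export.
import pandas as pd

trips = pd.read_csv("cyclistic_trips_2019_q1.csv", parse_dates=["started_at", "ended_at"])

# Derive ride length in minutes and the weekday on which each ride started.
trips["ride_length_min"] = (trips["ended_at"] - trips["started_at"]).dt.total_seconds() / 60
trips["day_of_week"] = trips["started_at"].dt.day_name()

# Drop obviously invalid rows (zero or negative durations) before summarizing.
trips = trips[trips["ride_length_min"] > 0]

# Mean ride length and ride count for annual members versus casual riders.
print(trips.groupby("member_casual")["ride_length_min"].agg(["mean", "count"]))
```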
Data Science Platform Market Size 2025-2029
The data science platform market size is forecast to increase by USD 763.9 million at a CAGR of 40.2% between 2024 and 2029.
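As a rough illustration of how a CAGR relates to an absolute increase, the sketch below treats the quoted USD 763.9 million as the total increase over a 2024 base across five years; the implied base is purely arithmetic, not a figure from the report.

```python
# Back-of-the-envelope CAGR arithmetic (illustrative only): assumes the quoted
# USD 763.9 million is the absolute increase over a 2024 base across five years.
cagr = 0.402        # 40.2% compound annual growth rate
years = 5           # 2024 -> 2029
increase = 763.9    # USD million, as quoted above

growth_factor = (1 + cagr) ** years          # total multiplicative growth over the period
implied_base = increase / (growth_factor - 1)
implied_2029 = implied_base * growth_factor

print(f"Implied 2024 base: USD {implied_base:.1f} million")
print(f"Implied 2029 size: USD {implied_2029:.1f} million")
```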
The market is experiencing significant growth, driven by the integration of artificial intelligence (AI) and machine learning (ML). This enhancement enables more advanced data analysis and prediction capabilities, making data science platforms an essential tool for businesses seeking to gain insights from their data. Another trend shaping the market is the emergence of containerization and microservices in platforms. This development offers increased flexibility and scalability, allowing organizations to efficiently manage their projects.
However, the use of platforms also presents challenges, particularly in the area of data privacy and security. Ensuring the protection of sensitive data is crucial for businesses, and platforms must provide strong security measures to mitigate risks. In summary, the market is witnessing substantial growth due to the integration of AI and ML technologies, containerization, and microservices, while data privacy and security remain key challenges.
What will be the Size of the Data Science Platform Market During the Forecast Period?
The market is experiencing significant growth due to the increasing demand for advanced data analysis capabilities in various industries. Cloud-based solutions are gaining popularity as they offer scalability, flexibility, and cost savings. The market encompasses the entire project life cycle, from data acquisition and preparation to model development, training, and distribution. Big data, IoT, multimedia, machine data, consumer data, and business data are prime sources fueling this market's expansion. Unstructured data, previously challenging to process, is now being effectively managed through tools and software. Relational databases and machine learning models are integral components of platforms, enabling data exploration, preprocessing, and visualization.
Moreover, artificial intelligence (AI) and machine learning (ML) technologies are essential for handling complex workflows, including data cleaning, model development, and model distribution. Data scientists benefit from these platforms by streamlining their tasks, improving productivity, and ensuring accurate and efficient model training. The market is expected to continue its growth trajectory as businesses increasingly recognize the value of data-driven insights.
How is this Data Science Platform Industry segmented and which is the largest segment?
The industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.
Deployment
- On-premises
- Cloud

Component
- Platform
- Services

End-user
- BFSI
- Retail and e-commerce
- Manufacturing
- Media and entertainment
- Others

Sector
- Large enterprises
- SMEs

Geography
- North America (Canada, US)
- Europe (Germany, UK, France)
- APAC (China, India, Japan)
- South America (Brazil)
- Middle East and Africa
By Deployment Insights
The on-premises segment is estimated to witness significant growth during the forecast period.
On-premises deployment is a traditional method for implementing technology solutions within an organization. This approach involves purchasing software with a one-time license fee and a service contract. On-premises solutions offer enhanced security, as they keep user credentials and data within the company's premises. They can be customized to meet specific business requirements, allowing for quick adaptation. On-premises deployment eliminates the need for third-party providers to manage and secure data, ensuring data privacy and confidentiality. Additionally, it enables rapid and easy data access, and keeps IP addresses and data confidential. This deployment model is particularly beneficial for businesses dealing with sensitive data, such as those in manufacturing and large enterprises. While cloud-based solutions offer flexibility and cost savings, on-premises deployment remains a popular choice for organizations prioritizing data security and control.
The on-premises segment was valued at USD 38.70 million in 2019 and is expected to show a gradual increase during the forecast period.
Regional Analysis
North America is estimated to contribute 48% to the growth of the global market during the forecast period.
Technavio's analysts have elaborately explained the regional trends and drivers that shape the market during the forecast period.
https://creativecommons.org/publicdomain/zero/1.0/
Hello! Welcome to the Capstone project I have completed to earn my Data Analytics certificate through Google. I chose to complete this case study in RStudio Desktop, because R is the primary new concept I learned throughout this course and I wanted to embrace my curiosity and learn more about it through this project. At the beginning of this report I will provide the scenario of the case study I was given. After this I will walk you through my data analysis process based on the steps I learned in this course: Ask, Prepare, Process, Analyze, Share, and Act.
The data I used for this analysis comes from this FitBit data set: https://www.kaggle.com/datasets/arashnic/fitbit
" This dataset generated by respondents to a distributed survey via Amazon Mechanical Turk between 03.12.2016-05.12.2016. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. "
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Scientific investigation is of value only insofar as relevant results are obtained and communicated, a task that requires organizing, evaluating, analysing and unambiguously communicating the significance of data. In this context, working with ecological data, reflecting the complexities and interactions of the natural world, can be a challenge. Recent innovations for statistical analysis of multifaceted interrelated data make obtaining more accurate and meaningful results possible, but key decisions of the analyses to use, and which components to present in a scientific paper or report, may be overwhelming. We offer a 10-step protocol to streamline analysis of data that will enhance understanding of the data, the statistical models and the results, and optimize communication with the reader with respect to both the procedure and the outcomes. The protocol takes the investigator from study design and organization of data (formulating relevant questions, visualizing data collection, data exploration, identifying dependency), through conducting analysis (presenting, fitting and validating the model) and presenting output (numerically and visually), to extending the model via simulation. Each step includes procedures to clarify aspects of the data that affect statistical analysis, as well as guidelines for written presentation. Steps are illustrated with examples using data from the literature. Following this protocol will reduce the organization, analysis and presentation of what may be an overwhelming information avalanche into sequential and, more to the point, manageable, steps. It provides guidelines for selecting optimal statistical tools to assess data relevance and significance, for choosing aspects of the analysis to include in a published report and for clearly communicating information.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Prior to statistical analysis of mass spectrometry (MS) data, quality control (QC) of the identified biomolecule peak intensities is imperative for reducing process-based sources of variation and extreme biological outliers. Without this step, statistical results can be biased. Additionally, liquid chromatography–MS proteomics data present inherent challenges due to large amounts of missing data that require special consideration during statistical analysis. While a number of R packages exist to address these challenges individually, there is no single R package that addresses all of them. We present pmartR, an open-source R package, for QC (filtering and normalization), exploratory data analysis (EDA), visualization, and statistical analysis robust to missing data. Example analysis using proteomics data from a mouse study comparing smoke exposure to control demonstrates the core functionality of the package and highlights the capabilities for handling missing data. In particular, using a combined quantitative and qualitative statistical test, 19 proteins whose statistical significance would have been missed by a quantitative test alone were identified. The pmartR package provides a single software tool for QC, EDA, and statistical comparisons of MS data that is robust to missing data and includes numerous visualization capabilities.
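pmartR itself is an R package; the following Python sketch is only a generic illustration of the kind of QC it describes (filtering sparsely observed peptides, then median-normalizing per sample) and does not use pmartR's API. The data and thresholds are invented.

```python
# Generic illustration of QC steps of the kind described above (not pmartR's API):
# filter out peptides observed in too few samples, then median-normalize per sample.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical log-scale intensity matrix: 6 samples x 8 peptides, with missing values.
data = pd.DataFrame(rng.normal(20, 2, size=(6, 8)),
                    columns=[f"pep_{i}" for i in range(8)])
data[data < 18] = np.nan  # crude stand-in for low-abundance values missing not at random

# Filter: keep peptides quantified in at least half of the samples.
keep = data.notna().sum() >= data.shape[0] / 2
filtered = data.loc[:, keep]

# Normalize: subtract each sample's median so sample-level shifts are removed.
normalized = filtered.sub(filtered.median(axis=1, skipna=True), axis=0)
print(normalized.round(2))
```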
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data analysis can be accurate and reliable only if the underlying assumptions of the used statistical method are validated. Any violations of these assumptions can change the outcomes and conclusions of the analysis. In this study, we developed Smart Data Analysis V2 (SDA-V2), an interactive and user-friendly web application, to assist users with limited statistical knowledge in data analysis, and it can be freely accessed at https://jularatchumnaul.shinyapps.io/SDA-V2/. SDA-V2 automatically explores and visualizes data, examines the underlying assumptions associated with the parametric test, and selects an appropriate statistical method for the given data. Furthermore, SDA-V2 can assess the quality of research instruments and determine the minimum sample size required for a meaningful study. However, while SDA-V2 is a valuable tool for simplifying statistical analysis, it does not replace the need for a fundamental understanding of statistical principles. Researchers are encouraged to combine their expertise with the software’s capabilities to achieve the most accurate and credible results.
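SDA-V2 is a Shiny web application, so the sketch below is not its code; it is a SciPy illustration of the general decision logic described (check normality and variance homogeneity, then choose a parametric or non-parametric two-sample test). The 0.05 cut-offs and synthetic data are assumptions.

```python
# Illustrative decision logic of the kind SDA-V2 automates (not its actual code):
# test the assumptions first, then choose a two-sample test accordingly.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.lognormal(2.3, 0.4, size=30)   # deliberately skewed

normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05
equal_var = stats.levene(group_a, group_b).pvalue > 0.05

if normal_a and normal_b:
    # Parametric: Student's or Welch's t-test depending on variance homogeneity.
    result = stats.ttest_ind(group_a, group_b, equal_var=equal_var)
    chosen = "t-test" if equal_var else "Welch's t-test"
else:
    # Non-parametric fallback when normality is doubtful.
    result = stats.mannwhitneyu(group_a, group_b)
    chosen = "Mann-Whitney U"

print(f"{chosen}: statistic={result.statistic:.3f}, p={result.pvalue:.4f}")
```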
https://www.verifiedmarketresearch.com/privacy-policy/
Global Manufacturing Analytics Market size was valued at USD 10.44 Billion in 2024 and is projected to reach USD 44.76 Billion by 2031, growing at a CAGR of 22.01% from 2024 to 2031.
Global Manufacturing Analytics Market Drivers
Growing Adoption of Industrial Internet of Things (IIoT): As more sensors and connected devices are used in manufacturing processes, massive volumes of data are generated. This increases the demand for analytics solutions in order to extract useful insights from the data.
Demand for Operational Efficiency: In order to increase output, cut expenses, and minimize downtime, manufacturers strive to improve their operations. Real-time operational data analysis is made possible by analytics systems, which promote proactive decision-making and process enhancements.
Growing Complexity in Production Processes: With numerous steps, variables, and dependencies, modern production processes are getting more and more complicated. These intricate processes can be analyzed and optimized with the help of analytics technologies to increase productivity and quality.
Emphasis on Predictive Maintenance: To reduce downtime and prevent equipment breakdowns, manufacturers are implementing predictive maintenance procedures. By using machine learning algorithms to evaluate equipment data and forecast maintenance requirements, manufacturing analytics systems can optimize maintenance schedules and minimize unscheduled downtime.
Quality Control and Compliance Requirements: The use of analytics solutions in manufacturing is influenced by strict quality control guidelines and legal compliance obligations. Manufacturers may ensure compliance with quality standards and laws by using these technologies to monitor and evaluate product quality metrics in real-time.
Demand for Supply Chain Optimization: In an effort to increase productivity, save expenses, and boost customer happiness, manufacturers are putting more and more emphasis on supply chain optimization. Analytics tools give manufacturers insight into the workings of their supply chains, allowing them to spot bottlenecks, maximize inventory, and enhance logistical procedures.
Technological Developments in Big Data and Analytics: Advances in machine learning, artificial intelligence, and big data analytics are driving innovation in analytics solutions. Thanks to these developments, manufacturers can now analyze massive amounts of data in real time, derive insights that can be put into practice, and improve their operations continuously.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Analysis of ‘Main steps in the thermal refurbishment process ’ provided by Analyst-2 (analyst-2.ai), based on source dataset retrieved from http://data.europa.eu/88u/dataset/719932f5-e477-4f19-9219-716609a540d7 on 13 January 2022.
--- Dataset description provided by original source is as follows ---
Main steps in the thermal refurbishment process
--- Original source retains full ownership of the source dataset ---
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The folder named “submission” contains the following:

- ijgis.yml: This file lists all the Python libraries and dependencies required to run the code. Use the ijgis.yml file to create a Python project and environment, and ensure you activate the environment before running the code.
- pythonProject: This folder contains several .py files and subfolders, each with specific functionality as described below. Among the outputs are a .png file for each column of the raw gaze and IMU recordings, color-coded with logged events, and .csv files.
- overlapping_sliding_window_loop.py: The function plot_labels_comparison(df, save_path, x_label_freq=10, figsize=(15, 5)) in line 116 visualizes the data preparation results. As this visualization is not used in the paper, the line is commented out; if you want to see visually what has been changed compared to the original data, you can uncomment this line. The outputs are written to .csv files in the results folder.

This part contains three main code blocks:

iii. One for the XGBoost code with correct hyperparameter tuning.

Note: Please read the instructions for each block carefully to ensure that the code works smoothly. Regardless of which block you use, you will get the classification results (in the form of scores) for unseen data. The way we empirically calculated the confidence threshold of the model (explained in the paper in Section 5.2, Part II: Decoding surveillance by sequence analysis) is given in this block in lines 361 to 380.

A .csv file containing inferred labels is also produced. The data is licensed under CC-BY; the code is licensed under MIT.
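For readers unfamiliar with the approach in the third block, here is a generic, hypothetical sketch of XGBoost hyperparameter tuning and thresholded scoring; it is not the authors' code, and the grid values and 0.8/0.2 thresholds are illustrative assumptions.

```python
# Illustrative sketch only: a generic XGBoost hyperparameter search and a fixed
# confidence threshold. This is NOT the repository's code; the data, grid values,
# and thresholds are assumptions for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

param_grid = {
    "max_depth": [3, 5],
    "learning_rate": [0.05, 0.1],
    "n_estimators": [100, 200],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=3, scoring="f1")
search.fit(X_train, y_train)

# Scores for unseen data, analogous to the classification scores described above.
scores = search.best_estimator_.predict_proba(X_test)[:, 1]

# Apply a confidence threshold (hypothetical values; the paper derives its own empirically).
confident_labels = np.where(scores >= 0.8, 1, np.where(scores <= 0.2, 0, -1))
print(search.best_params_, (confident_labels != -1).mean())
```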
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Recent technological advances have made it possible to carry out high-throughput metabonomics studies using gas chromatography coupled with time-of-flight mass spectrometry. Large volumes of data are produced from these studies and there is a pressing need for algorithms that can efficiently process and analyze data in a high-throughput fashion as well. We present an Automated Data Analysis Pipeline (ADAP) that has been developed for this purpose. ADAP consists of peak detection, deconvolution, peak alignment, and library search. It allows data to flow seamlessly through the analysis steps without any human intervention and features two novel algorithms in the analysis. Specifically, clustering is successfully applied in deconvolution to resolve coeluting compounds that are very common in complex samples and a two-phase alignment process has been implemented to enhance alignment accuracy. ADAP is written in standard C++ and R and uses parallel computing via Message Passing Interface for fast peak detection and deconvolution. ADAP has been applied to analyze both mixed standards samples and serum samples and identified and quantified metabolites successfully. ADAP is available at http://www.du-lab.org.
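ADAP is implemented in C++ and R; the snippet below is only a generic SciPy illustration of the peak-detection step on a synthetic chromatogram, not ADAP's algorithm.

```python
# Generic illustration of the peak-detection step in a chromatographic pipeline
# (not ADAP's implementation): find peaks in a synthetic total-ion chromatogram.
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(7)
time = np.linspace(0, 10, 2000)                      # retention time, minutes
signal = (1.0 * np.exp(-((time - 2.0) ** 2) / 0.01)  # three Gaussian "compounds"
          + 0.6 * np.exp(-((time - 5.5) ** 2) / 0.02)
          + 0.8 * np.exp(-((time - 7.3) ** 2) / 0.015)
          + rng.normal(0, 0.02, time.size))          # detector noise

# Require a minimum height and prominence so noise is not reported as a peak.
peaks, props = find_peaks(signal, height=0.3, prominence=0.2)
print("Apex retention times (min):", np.round(time[peaks], 2))
```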
The role of Data Science and AI for predicting the decline of professionals in the recruitment process: augmenting decision-making in human resources management.

Feature descriptions:
- Declined: Variable to be predicted, where value 0 means that the candidate continued in the recruitment process until hiring, and value 1 means the candidate declined during the recruitment process.
- ValueClient: The total amount the customer plans to pay the hired candidate. The value 0 means that the client has not yet defined a value to pay the candidate. Values must be greater than or equal to 0.
- ExtraCost: Extra cost the customer has to pay to hire the candidate. Values must be greater than or equal to 0.
- ValueResources: Value requested by the candidate to work. The value 0 means that the candidate has not yet requested a salary amount and this value will be negotiated later. Values must be greater than or equal to 0.
- Net: The difference between “ValueClient”, yearly taxes and “ValueResources”. Negative values mean that the amount the client plans to pay the candidate has not yet been defined and is still open for negotiation.
- DaysOnContact: Number of days that the candidate is in the “Contact” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnInterview: Number of days that the candidate is in the “Interview” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnSendCV: Number of days that the candidate is in the “Send CV” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnReturn: Number of days that the candidate is in the “Return” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnCSchedule: Number of days that the candidate is in the “C. Schedule” step of the recruitment process. Values must be greater than or equal to 0.
- DaysOnCRealized: Number of days that the candidate is in the “C. Realized” step of the recruitment process. Values must be greater than or equal to 0.
- ProcessDuration: Duration of the entire recruitment process in days. Values must be greater than or equal to 0.
The U.S. Geological Survey (USGS) Water Resources Mission Area (WMA) is working to address a need to understand where the Nation is experiencing water shortages or surpluses relative to the demand for water by delivering routine assessments of water supply and demand and an understanding of the natural and human factors affecting the balance between supply and demand. A key part of these national assessments is identifying long-term trends in water availability, including groundwater and surface water quantity, quality, and use. This data release contains Mann-Kendall monotonic trend analyses for 18 observed annual and monthly streamflow metrics at 6,347 U.S. Geological Survey streamgages located in the conterminous United States, Alaska, Hawaii, and Puerto Rico. Streamflow metrics include annual mean flow, maximum 1-day and 7-day flows, minimum 7-day and 30-day flows, and the date of the center of volume (the date on which 50% of the annual flow has passed by a gage), along with the mean flow for each month of the year. Annual streamflow metrics are computed from mean daily discharge records at U.S. Geological Survey streamgages that are publicly available from the National Water Information System (NWIS). Trend analyses are computed using annual streamflow metrics computed through climate year 2022 (April 2022 - March 2023) for low-flow metrics and water year 2022 (October 2021 - September 2022) for all other metrics. Trends at each site are available for up to four different periods: (i) the longest possible period that meets completeness criteria at each site, (ii) 1980-2020, (iii) 1990-2020, (iv) 2000-2020. Annual metric time series analyzed for trends must have 80 percent complete records during fixed periods. In addition, each of these time series must have 80 percent complete records during their first and last decades. All longest possible period time series must be at least 10 years long and have annual metric values for at least 80% of the years running from 2013 to 2022. This data release provides the following five CSV output files along with a model archive: (1) streamflow_trend_results.csv - contains test results of all trend analyses with each row representing one unique combination of (i) NWIS streamgage identifiers, (ii) metric (computed using Oct 1 - Sep 30 water years except for low-flow metrics computed using climate years (Apr 1 - Mar 31)), (iii) trend periods of interest (longest possible period through 2022, 1980-2020, 1990-2020, 2000-2020) and (iv) records containing either the full trend period or only a portion of the trend period following substantial increases in cumulative upstream reservoir storage capacity. This is an output from the final process step (#5) of the workflow. (2) streamflow_trend_trajectories_with_confidence_bands.csv - contains annual trend trajectories estimated using Theil-Sen regression, which estimates the median of the probability distribution of a metric for a given year, along with 90 percent confidence intervals (5th and 95th percentile values). This is an output from the final process step (#5) of the workflow. (3) streamflow_trend_screening_all_steps.csv - contains the screening results of all 7,873 streamgages initially considered as candidate sites for trend analysis and identifies the screens that prevented some sites from being included in the Mann-Kendall trend analysis. (4) all_site_year_metrics.csv - contains annual time series values of streamflow metrics computed from mean daily discharge data at 7,873 candidate sites.
This is an output of Process Step 1 in the workflow. (5) all_site_year_filters.csv - contains information about the completeness and quality of daily mean discharge at each streamgage during each year (water year, climate year, and calendar year). This is also an output of Process Step 1 in the workflow and is combined with all_site_year_metrics.csv in Process Step 2. In addition, a .zip file contains a model archive for reproducing the trend results using R 4.4.1 statistical software. See the README file contained in the model archive for more information. Caution must be exercised when utilizing monotonic trend analyses conducted over periods of up to several decades (and in some places longer ones) due to the potential for confounding deterministic gradual trends with multi-decadal climatic fluctuations. In addition, trend results are available for post-reservoir construction periods within the four trend periods described above to avoid including abrupt changes arising from the construction of larger reservoirs in periods for which gradual monotonic trends are computed. Other abrupt changes, such as changes to water withdrawals and wastewater return flows, or episodic disturbances with multi-year recovery periods, such as wildfires, are not evaluated. Sites with pronounced abrupt changes or other non-monotonic trajectories of change may require more sophisticated trend analyses than those presented in this data release.
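As a minimal illustration of the trend statistics named above (and not the USGS model-archive code), the sketch below runs a Mann-Kendall-style test via Kendall's tau against time and a Theil-Sen slope on a synthetic annual flow series.

```python
# Illustrative trend test on a synthetic annual streamflow metric (not the USGS
# model-archive code): Mann-Kendall-style monotonic trend via Kendall's tau
# against time, plus a Theil-Sen slope estimate with a 90% confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
years = np.arange(1980, 2021)
annual_mean_flow = 100 + 0.4 * (years - years[0]) + rng.normal(0, 5, years.size)

tau, p_value = stats.kendalltau(years, annual_mean_flow)
slope, intercept, lo, hi = stats.theilslopes(annual_mean_flow, years, alpha=0.90)

print(f"Kendall tau = {tau:.2f}, p = {p_value:.4f}")
print(f"Theil-Sen slope = {slope:.2f} units/yr (90% CI: {lo:.2f} to {hi:.2f})")
```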
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data set was generated in accordance with the semiconductor industry and contains data of certain process flows in Failure Analysis (FA) laboratories focusing on the identification and analysis of anomalies or malfunctions in semiconductor devices. It comprises logistic data about the processing steps for the so-called Internal Physical Inspection (IPI).
A so-called IPI job is given as a sequence of tasks that must be performed to complete the job they belong to. It has an assigned unique ID and timestamps indicating the submission, the end, and the deadline to be met. A job also has an IPI classification assigned to it, providing general guidelines on the operations to be performed.
Every task within a job has its own type and working time, as well as the assigned resources. There are two main resources involved:
- the equipment: the machine used to perform the task;
- the operator: the person who performed the task.
In addition, general information about the type of the device to be analyzed is also available, such as the given (anonymized) package and basictype. Data also include the number of stressed samples within a device and the samples a task is performed on.
The dataset includes data from 4 years, specifically from January 2020 to December 2022.
Finally, the exact column structure is given as follows (python 3.9.5 datatype):
The STEP (Skills Toward Employment and Productivity) Measurement program is the first-ever initiative to generate internationally comparable data on skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills and the demand for skills in the labor markets of low-income countries.
The uniquely designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adult Skills instruments. Modules also gather information about family, health and language.
13 major metropolitan areas: Bogota, Medellin, Cali, Barranquilla, Bucaramanga, Cucuta, Cartagena, Pasto, Ibague, Pereira, Manizales, Monteria, and Villavicencio.
The units of analysis are the individual respondents and households. A household roster is undertaken at the start of the survey and the individual respondent is randomly selected among all household members aged 15 to 64, inclusive. The random selection process was designed by the STEP team and compliance with the procedure is carefully monitored during fieldwork.
The target population for the Colombia STEP survey is all non-institutionalized persons 15 to 64 years old (inclusive) living in private dwellings in urban areas of the country at the time of data collection. This includes all residents except foreign diplomats and non-nationals working for international organizations.
The following groups are excluded from the sample: - residents of institutions (prisons, hospitals, etc.) - residents of senior homes and hospices - residents of other group dwellings such as college dormitories, halfway homes, workers' quarters, etc. - persons living outside the country at the time of data collection.
Sample survey data [ssd]
A stratified 7-stage sample design was used in Colombia. The stratification variable is city-size category.
First Stage Sample The primary sample unit (PSU) is a metropolitan area. A sample of 9 metropolitan areas was selected from the 13 metropolitan areas on the sample frame. The metropolitan areas were grouped according to city-size; the five largest metropolitan areas are included in Stratum 1 and the remaining 8 metropolitan areas are included in Stratum 2. The five metropolitan areas in Stratum 1 were selected with certainty; in Stratum 2, four metropolitan areas were selected with probability proportional to size (PPS), where the measure of size was the number of persons aged 15 to 64 in a metropolitan area.
Second Stage Sample The second stage sample unit is a Section. At the second stage of sample selection, a PPS sample of 267 Sections was selected from the sampled metropolitan areas; the measure of size was the number of persons aged 15 to 64 in a Section. The sample of 267 Sections consisted of 243 initial Sections and 24 reserve Sections to be used in the event of complete non-response at the Section level.
Third Stage Sample The third stage sample unit is a Block. Within each selected Section, a PPS sample of 4 blocks was selected; the measure of size was the number of persons aged 15 to 64 in a Block. Two sample Blocks were initially activated while the remaining two sample Blocks were reserved for use in cases where there was a refusal to cooperate at the Block level or cases where the block did not belong to the target population (e.g., parks, and commercial and industrial areas).
Fourth Stage Sample The fourth stage sample unit is a Block Segment. Regarding the Block segmentation strategy, the Colombia document 'FINAL SAMPLING PLAN (ARD-397)' states "According to the 2005 population and housing census conducted by DANE, the average number of dwellings per block in the 13 large cities or metropolitan areas was approximately 42 dwellings. Based on this finding, the defined protocol was to report those cases in which 80 or more dwellings were present in a given block in order to partition block using a random selection algorithm." At the fourth stage of sample selection, 1 Block Segment was selected in each selected Block using a simple random sample (SRS) method.
Fifth Stage Sample The fifth stage sample unit is a dwelling. At the fifth stage of sample selection, 5582 dwellings were selected from the sampled Blocks/Block Segments using a simple random sample (SRS) method. According to the Colombia document 'FINAL SAMPLING PLAN (ARD-397)', the selection of dwellings within a participant Block "was performed differentially amongst the different socioeconomic strata that the Colombian government uses for the generation of cross-subsidies for public utilities (in this case, the socioeconomic stratum used for the electricity bill was used). Given that it is known from previous survey implementations that refusal rates are highest amongst households of higher socioeconomic status, the number of dwellings to be selected increased with the socioeconomic stratum (1 being the poorest and 6 being the richest) that was most prevalent in a given block".
Sixth Stage Sample The sixth stage sample unit is a household. At the sixth stage of sample selection, one household was selected in each selected dwelling using an SRS method.
Seventh Stage Sample The seventh stage sample unit was an individual aged 15-64 (inclusive). The sampling objective was to select one individual with equal probability from each selected household.
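A toy sketch of PPS (probability proportional to size) selection, as used in the early stages above, is given below; the section names and counts are invented, and real STEP sampling uses systematic PPS with reserve units, which this simplification omits.

```python
# Toy illustration of PPS (probability proportional to size) selection, as used in
# the early sampling stages above. Section names and population counts are made up;
# the actual survey uses systematic PPS and reserve units, omitted here.
import numpy as np

rng = np.random.default_rng(11)
sections = [f"Section_{i:03d}" for i in range(1, 41)]
pop_15_64 = rng.integers(200, 5000, size=len(sections))   # measure of size per Section

probs = pop_15_64 / pop_15_64.sum()                        # selection probability ∝ size
sampled = rng.choice(sections, size=8, replace=False, p=probs)
print(sorted(sampled))
```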
Sampling methodologies are described for each country in two documents and are provided as external resources: (i) the National Survey Design Planning Report (NSDPR) (ii) the weighting documentation (available for all countries)
Face-to-face [f2f]
The STEP survey instruments include: (i) a Background Questionnaire developed by the WB STEP team and (ii) a Reading Literacy Assessment developed by Educational Testing Services (ETS).
All countries adapted and translated both instruments following the STEP technical standards: two independent translators adapted and translated the STEP background questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator.
The survey instruments were piloted as part of the survey pre-test.
The background questionnaire covers such topics as respondents' demographic characteristics, dwelling characteristics, education and training, health, employment, job skill requirements, personality, behavior and preferences, language and family background.
The background questionnaire, the structure of the Reading Literacy Assessment and Reading Literacy Data Codebook are provided in the document "Colombia STEP Skills Measurement Survey Instruments", available in external resources.
STEP data management process:
1) Raw data is sent by the survey firm.
2) The World Bank (WB) STEP team runs data checks on the background questionnaire data. Educational Testing Services (ETS) runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
3) The survey firm reviews comments and questions. When a data entry error is identified, the survey firm corrects the data.
4) The WB STEP team and ETS check if the data files are clean. This might require additional iterations with the survey firm.
5) Once the data has been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
6) ETS scales the Reading Literacy Assessment data.
7) The WB STEP team merges the background questionnaire data with the Reading Literacy Assessment data and computes derived variables.
Detailed information on data processing in STEP surveys is provided in "STEP Guidelines for Data Processing", available in external resources. The template do-file used by the STEP team to check raw background questionnaire data is provided as an external resource, too.
An overall response rate of 48% was achieved in the Colombia STEP Survey.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
The datasets contain the previous 12 months of Cyclistic trip data. The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer business questions.
This data has been made available by Motivate International Inc. under this license. This is public data that you can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that you won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.
Research question: How do annual members and casual riders use Cyclistic bikes differently?
The objective of the fourth Technical Meeting on Fusion Data Processing, Validation and Analysis was to provide a platform during which a set of topics relevant to fusion data processing, validation and analysis are discussed with the view of extrapolating needs to next-step fusion devices such as ITER. The validation and analysis of experimental data obtained from diagnostics used to characterize fusion plasmas are crucial for a knowledge-based understanding of the physical processes governing the dynamics of these plasmas. This paper presents the recent progress and achievements in the domain of plasma diagnostics and synthetic diagnostics data analysis (including image processing, regression analysis, inverse problems, deep learning, machine learning, big data and physics-based models for control) reported at the meeting. The progress in these areas highlights trends observed in current major fusion confinement devices. A special focus is dedicated to data analysis requirements for ITER and DEMO, with particular attention paid to Artificial Intelligence for automatization and improving the reliability of control processes.
The STEP (Skills Toward Employment and Productivity) Measurement program is the first-ever initiative to generate internationally comparable data on skills available in developing countries. The program implements standardized surveys to gather information on the supply and distribution of skills and the demand for skills in the labor markets of low-income countries.
The uniquely designed Household Survey includes modules that measure the cognitive skills (reading, writing and numeracy), socio-emotional skills (personality, behavior and preferences) and job-specific skills (subset of transversal skills with direct job relevance) of a representative sample of adults aged 15 to 64 living in urban areas, whether they work or not. The cognitive skills module also incorporates a direct assessment of reading literacy based on the Survey of Adult Skills instruments. Modules also gather information about family, health and language.
The survey covers the urban areas of the two largest cities of Vietnam, Ha Noi and Ho Chi Minh City (HCMC).
The units of analysis are the individual respondents and households. A household roster is undertaken at the start of the survey and the individual respondent is randomly selected among all household members aged 15 to 64, inclusive. The random selection process was designed by the STEP team and compliance with the procedure is carefully monitored during fieldwork.
The STEP target population is the population aged 15 to 64, inclusive, living in urban areas, as defined by each country's statistical office. In Vietnam, the target population comprised all people aged 15 to 64 living in urban areas in Ha Noi and Ho Chi Minh City (HCMC).
The reasons for selecting these two cities include:
(i) they are the two biggest cities of Vietnam, so they have all the urban characteristics needed for the STEP study, and (ii) it is less costly to conduct the STEP survey in these two cities, compared to all urban areas of Vietnam, given the limitations of the survey budget.
The following are excluded from the sample:
Sample survey data [ssd]
The sample frame includes the list of urban EAs and the count of households for each EA. Changes to the EA list and household list would impact the coverage of the sample frame. In a recent review of Ha Noi, only 3 EAs out of 140 randomly selected EAs (2%) were either new or destroyed. GSO would increase the coverage of the sample frame (>95% as standard) by updating the household list of the selected EAs before selecting households for STEP.
A detailed description of the sample design is available in section 4 of the NSDPR provided with the metadata. On completion of the household listing operation, GSO will deliver to the World Bank a copy of the lists, and an Excel spreadsheet with the total number of households listed in each of the 227 visited PSUs.
Face-to-face [f2f]
The STEP survey instruments include: (i) a Background Questionnaire developed by the WB STEP team (ii) a Reading Literacy Assessment developed by Educational Testing Services (ETS).
All countries adapted and translated both instruments following the STEP Technical Standards: two independent translators adapted and translated the Background Questionnaire and Reading Literacy Assessment, while reconciliation was carried out by a third translator. The WB STEP team and ETS collaborated closely with the survey firms during the process and reviewed the adaptation and translation to Vietnamese (using a back translation).
- The survey instruments were both piloted as part of the survey pretest.
- The adapted Background Questionnaires are provided in English as external resources. The Reading Literacy Assessment is protected by copyright and will not be published.
STEP Data Management Process:
1. Raw data is sent by the survey firm.
2. The WB STEP team runs data checks on the Background Questionnaire data. ETS runs data checks on the Reading Literacy Assessment data. Comments and questions are sent back to the survey firm.
3. The survey firm reviews comments and questions. When a data entry error is identified, the survey firm corrects the data.
4. The WB STEP team and ETS check the data files are clean. This might require additional iterations with the survey firm.
5. Once the data has been checked and cleaned, the WB STEP team computes the weights. Weights are computed by the STEP team to ensure consistency across sampling methodologies.
6. ETS scales the Reading Literacy Assessment data.
7. The WB STEP team merges the Background Questionnaire data with the Reading Literacy Assessment data and computes derived variables.
Detailed information on data processing in STEP surveys is provided in the 'Guidelines for STEP Data Entry Programs' document provided as an external resource. The template do-file used by the STEP team to check the raw background questionnaire data is provided as an external resource.
The response rate for Vietnam (urban) was 62%. (See STEP Methodology Note Table 4).
Weighting documentation was prepared for each participating country and provides some information on sampling errors. The weighting documentation for each country is provided as an external resource.
The National Dynamic Land Cover Dataset (DLCD) classifies Australian land cover into 34 categories, which conform to the 2007 International Standards Organisation (ISO) Land Cover Standard (19144-2). The DLCD has been developed by Geoscience Australia and the Australian Bureau of Agricultural and Resource Economics and Sciences (ABARES), aiming to provide nationally consistent land cover information to federal and state governments and the general public. This paper describes the machine learning techniques and statistical modeling methods developed to generate the DLCD from earth observation data. MODIS (Moderate Resolution Imaging Spectroradiometer) 250 m EVI (Enhanced Vegetation Index) time series data from year 2000 to year 2008 is the main data source for the modeling process, which consists of three steps. In the first step, noisy and invalid data points are removed from the time series. Secondly, a feature extraction algorithm converts each time series into a set of 12 time series coefficients related to ground phenomena such as average greenness and plant phenology. At the last step, clustering processes based on a tailored support vector clustering algorithm are applied to subsets of the coefficients. The resultant clusters then form the basis of a further modeling process incorporating auxiliary data to generate the final DLCD.
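As a simplified, hypothetical illustration of the three-step flow described above (not Geoscience Australia's code), the sketch below cleans a synthetic EVI series, extracts a few phenology-like coefficients, and clusters them; KMeans stands in for the tailored support vector clustering used in the actual product.

```python
# Simplified illustration of the 3-step DLCD modelling flow described above
# (not Geoscience Australia's code): clean an EVI time series, extract a few
# phenology-like coefficients, then cluster pixels. KMeans is only a stand-in
# for the tailored support vector clustering algorithm.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
n_pixels, n_obs = 500, 207          # roughly 9 years of 16-day MODIS EVI composites
t = np.arange(n_obs)

# Synthetic EVI series: seasonal cycle + per-pixel greenness + noise and dropouts.
amplitude = rng.uniform(0.05, 0.35, (n_pixels, 1))
baseline = rng.uniform(0.15, 0.55, (n_pixels, 1))
evi = baseline + amplitude * np.sin(2 * np.pi * t / 23) + rng.normal(0, 0.02, (n_pixels, n_obs))
evi[rng.random(evi.shape) < 0.05] = np.nan          # invalid or noisy observations

# Step 1: remove invalid points (here: a simple gap-fill with the per-pixel median).
evi_clean = np.where(np.isnan(evi), np.nanmedian(evi, axis=1, keepdims=True), evi)

# Step 2: extract simple coefficients (stand-ins for the 12 used by the DLCD).
features = np.column_stack([
    evi_clean.mean(axis=1),                          # average greenness
    evi_clean.max(axis=1) - evi_clean.min(axis=1),   # seasonal amplitude
    evi_clean.std(axis=1),                           # variability
])

# Step 3: cluster the coefficient space into candidate land-cover groups.
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))
```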
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
WWF developed a global analysis of the world's most important deforestation areas or deforestation fronts in 2015. This assessment was revised in 2020 as part of the WWF Deforestation Fronts Report.

Emerging Hotspots analysis
The goal of this analysis was to assess the presence of deforestation fronts: areas where deforestation is significantly increasing and is threatening remaining forests. We selected the emerging hotspots analysis to assess spatio-temporal trends of deforestation in the pan-tropics.

Spatial Unit
We selected hexagons as the spatial unit for the hotspots analysis for several reasons. They have a low perimeter-to-area ratio, straightforward neighbor relationships, and reduced distortion due to the curvature of the earth. For the hexagon size we decided on a unit of 1,000 ha; given the resolution of the deforestation data (250 m), this meant that we could aggregate several deforestation events inside units over time. Hexagons that are close to or equal to the size of a deforestation event mean there could be only one event before the forest is gone, which limits statistical analysis. We processed over 13 million hexagons for this analysis and limited the emerging hotspots analysis to hexagons with at least 15% forest cover remaining (from the all-evidence forest map). This prevented including hotspots in agricultural areas or areas where all forest has been converted.

Outputs
This analysis uses the Getis-Ord and Mann-Kendall statistics to identify spatial clusters of deforestation which have a non-parametric significant trend across a time series. The spatial clusters are defined by the spatial unit and a temporal neighborhood parameter. We use a neighborhood parameter of 5 km to include spatial neighbors in the hotspots assessment and time slices for each country described below. Deforestation events are summarized by the spatial unit (hexagons described above) and the results comprise a trends assessment, which defines increasing or decreasing deforestation in the units at three different confidence intervals (90%, 95% and 99%), and the spatio-temporal analysis classifying areas into eight unique hot or cold spot categories. Our analysis identified seven hotspot categories:
- New: A location with a statistically significant increasing hotspot only in the final time step.
- Consecutive: An uninterrupted run of statistically significant hotspots in the final time steps.
- Intensifying: A statistically significant hotspot for >90% of the bins, including the final time step.
- Persistent: A statistically significant hotspot for >90% of the bins with no upward or downward trend in clustering intensity.
- Diminishing: A statistically significant hotspot for >90% of the time steps, where the clustering is decreasing or the most recent time step is not hot.
- Sporadic: An on-again, off-again hotspot where <90% of the time-step intervals have been statistically significant hot spots and none have been statistically significant cold spots.
- Historical: At least ninety percent of the time-step intervals have been statistically significant hot spots, with the exception of the final time steps.

For the evaluation of spatio-temporal trends of tropical deforestation we selected the Terra-i deforestation dataset to define the temporal deforestation patterns. Terra-i is a freely available monitoring system derived from the analysis of MODIS (NDVI) and TRMM (rainfall) data, which are used to assess forest cover changes due to anthropic interventions at a 250 m resolution [ref]. It was first developed for Latin American countries in 2012, and then expanded to pan-tropical countries around the world. Terra-i has generated maps of vegetation loss every 16 days since January 2004. This relatively high temporal resolution of twice-monthly observations allows for a more detailed emerging hotspots analysis, increasing the number of time steps or bins available for assessing spatio-temporal patterns relative to annual datasets. Next, the spatial resolution of 250 m is more relevant for detecting forest loss than changes in individual tree cover or canopies and is better adapted to processing trends on large scales. Finally, the added value of the Terra-i algorithm is that it employs an additional neural-network machine learning step to identify vegetation loss that is due to anthropic causes as opposed to natural events or other causes. Our dataset comprised all Terra-i deforestation events observed between 2004 and 2017.

Temporal unit
The temporal unit or time slice was selected for each country according to the distribution of data. The deforestation data comprised 16-day periods between 2004 and 2017, for a total of 312 potential observation time periods. These were aggregated to time bins to overcome any seasonality in the detection of deforestation events (due to clouds). The temporal unit is combined with the spatial parameter (i.e. 5 km) to create the space-time bins for hotspot analysis. For dense time series or countries with a lot of deforestation events (i.e. Brazil), a smaller time slice was used (i.e. 3 months, n=54) with a neighborhood interval of 8 months, meaning that the previous year and next year together were combined to assess statistical trends. The rule we employed was that the time slice multiplied by the neighborhood interval was equal to 24 months, or 2 years, in order to look at general trends over the entire time period and prevent the hotspots analysis from being biased toward short time intervals of a few months.

Deforestation Fronts
Finally, using trends and hotspots we identify 24 major deforestation fronts: areas of significantly increasing deforestation and the focus of WWF's call for action to slow deforestation.
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically