Database Contents License (DbCL) v1.0: http://opendatacommons.org/licenses/dbcl/1.0/
As part of my journey to earn the Google Data Analytics certificate, I will practice on a real-world example by following the steps of the data analysis process: ask, prepare, process, analyze, share, and act. I have chosen the Bellabeat example.
Supply chain analytics is a valuable part of data-driven decision-making in various industries such as manufacturing, retail, healthcare, and logistics. It is the process of collecting, analyzing and interpreting data related to the movement of products and services from suppliers to customers.
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path. By the end of this lesson, you will have a portfolio-ready case study. Download the packet and reference the details of this case study anytime. Then, when you begin your job hunt, your case study will be a tangible way to demonstrate your knowledge and skills to potential employers.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
Characters and teams
● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime. Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members. Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs. Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Three questions will guide the future marketing program:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
Moreno has assigned you the first question to answer: How do annual members and casual riders use Cyclistic bikes differently?
Introduction: I have chosen to complete a data analysis project for the second course option, Bellabeats, Inc., using a locally hosted spreadsheet program, Excel, for both my data analysis and visualizations. This choice was made primarily because I live in a remote area and have limited bandwidth and inconsistent internet access. Therefore, completing a capstone project using web-based programs such as RStudio, SQL Workbench, or Google Sheets was not a feasible choice. I was further limited in which option to choose, as the datasets for the ride-share project option were larger than my version of Excel would accept. In the scenario provided, I will be acting as a Junior Data Analyst in support of the Bellabeats, Inc. executive team and data analytics team. This combined team has decided to use an existing public dataset in the hope that the findings from that dataset might reveal insights which will assist in Bellabeat's marketing strategies for future growth. My task is to provide data-driven insights into the business tasks provided by the Bellabeats, Inc. executive and data analytics team. In order to accomplish this task, I will complete all parts of the Data Analysis Process (Ask, Prepare, Process, Analyze, Share, Act). In addition, I will break each part of the Data Analysis Process down into three sections to provide clarity and accountability. Those three sections are: Guiding Questions, Key Tasks, and Deliverables. For the sake of space and to avoid repetition, I will record the deliverables for each Key Task directly under the numbered Key Task, using an asterisk (*) as an identifier.
Section 1 - Ask:
A. Guiding Questions:
1. Who are the key stakeholders and what are their goals for the data analysis project?
2. What is the business task that this data analysis project is attempting to solve?
B. Key Tasks:
1. Identify key stakeholders and their goals for the data analysis project.
*The key stakeholders for this project are as follows:
- Urška Sršen and Sando Mur, co-founders of Bellabeats, Inc.
- The Bellabeats marketing analytics team. I am a member of this team.
Section 2 - Prepare:
A. Guiding Questions:
1. Where is the data stored and organized?
2. Are there any problems with the data?
3. How does the data help answer the business question?
B. Key Tasks:
1. Research and communicate to stakeholders the source of the data and how it is stored and organized.
*The data source used for our case study is the FitBit Fitness Tracker Data. This dataset is hosted on Kaggle and was made available by the user Mobius in an open-source format. Therefore, the data is public and can be copied, modified, and distributed without asking the user for permission. These datasets were reportedly (see the credibility section directly below) generated by respondents to a distributed survey via Amazon Mechanical Turk between 03/12/2016 and 05/12/2016.
*Reportedly (see the credibility section directly below), thirty eligible Fitbit users consented to the submission of personal tracker data, including output related to steps taken, calories burned, time spent sleeping, heart rate, and distance traveled. This data was broken down into minute-, hour-, and day-level totals and is stored in 18 CSV files. I downloaded all 18 files to my local laptop and decided to use 2 of them for the purposes of this project, as they merge the activity and sleep data from the other files. All unused files were permanently deleted from the laptop. The 2 files used were:
-sleepDay_merged.csv
-dailyActivity_merged.csv
2. Identify and communicate to stakeholders any problems found with the data related to credibility and bias.
*As will be presented more specifically in the Process section, the data appears to have credibility issues related to the reported time frame of the data collected. The metadata indicates that the collected data covers roughly 2 months of FitBit tracking; however, my initial data processing found that only 1 month of data was reported.
*As will be presented more specifically in the Process section, the data also has credibility issues related to the number of individuals who reported FitBit data. Specifically, the metadata states that 30 individual users agreed to report their tracking data, but my initial data processing uncovered 33 individual users.
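A minimal Pandas sketch of this kind of credibility check, counting distinct users and the covered date span in dailyActivity_merged.csv; the column names Id and ActivityDate are assumptions based on the public FitBit dataset and may need adjusting:

```python
# Sketch: verify how many distinct users and how many days of data
# the merged activity file actually contains (column names are assumed).
import pandas as pd

activity = pd.read_csv("dailyActivity_merged.csv")

n_users = activity["Id"].nunique()            # metadata claims 30 participants
dates = pd.to_datetime(activity["ActivityDate"])
span = dates.max() - dates.min()              # metadata claims roughly 2 months

print(f"Distinct users: {n_users}")
print(f"Date range: {dates.min().date()} to {dates.max().date()} ({span.days} days)")
```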
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The analysis of small data sets in longitudinal studies can lead to power issues and often suffers from biased parameter estimates. These issues can be solved by using Bayesian estimation in conjunction with informative prior distributions. By means of a simulation study and an empirical example concerning posttraumatic stress symptoms (PTSS) following mechanical ventilation in burn survivors, we demonstrate the advantages and potential pitfalls of using Bayesian estimation. First, we show how to specify prior distributions, and by means of a sensitivity analysis we demonstrate how to check the exact influence of prior (mis-)specification. Thereafter, we show by means of a simulation the situations in which the Bayesian approach outperforms the default maximum likelihood approach. Finally, we re-analyze empirical data on burn survivors which provided preliminary evidence of an adverse influence of a period of mechanical ventilation on the course of PTSS following burns. Not surprisingly, maximum likelihood estimation showed insufficient coverage as well as power with very small samples. Only when Bayesian analysis was used in conjunction with informative priors did power increase to acceptable levels. As expected, we showed that the smaller the sample size, the more the results rely on the prior specification. We show that two issues often encountered during the analysis of small samples, power and biased parameters, can be solved by including prior information in Bayesian analysis. We argue that the use of informative priors should always be reported together with a sensitivity analysis.
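As an illustration only (not the growth models used in the paper), the following sketch shows conjugate normal-mean updating on a tiny synthetic sample and a simple prior sensitivity analysis over increasingly diffuse prior standard deviations; all numbers are assumed:

```python
# Minimal sketch: conjugate normal-mean updating, showing how informative
# vs. diffuse priors behave with a very small sample, and how a simple
# prior sensitivity analysis can be laid out.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=4.0, size=10)   # tiny synthetic sample
sigma = 4.0                                       # assume known data SD

def posterior(prior_mean, prior_sd, y, sigma):
    """Posterior mean/SD for a normal mean with known data SD."""
    n = len(y)
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / sigma**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * y.mean())
    return post_mean, np.sqrt(post_var)

# Sensitivity analysis: same prior mean, increasingly diffuse prior SDs.
for prior_sd in (0.5, 1.0, 5.0, 50.0):
    m, s = posterior(prior_mean=2.0, prior_sd=prior_sd, y=data, sigma=sigma)
    print(f"prior SD {prior_sd:5.1f} -> posterior mean {m:5.2f}, posterior SD {s:4.2f}")
```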
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
To analyze the salaries of company employees using Pandas, NumPy, and other tools, you can structure the analysis process into several steps:
Case Study: Employee Salary Analysis In this case study, we aim to analyze the salaries of employees across different departments and levels within a company. Our goal is to uncover key patterns, identify outliers, and provide insights that can support decisions related to compensation and workforce management.
Step 1: Data Collection and Preparation
Data Sources: The dataset typically includes employee ID, name, department, position, years of experience, salary, and additional compensation (bonuses, stock options, etc.).
Data Cleaning: We use Pandas to handle missing or incomplete data, remove duplicates, and standardize formats. Example: df.dropna() to handle missing salary information, and df.drop_duplicates() to eliminate duplicate entries.
Step 2: Data Exploration and Descriptive Statistics
Exploratory Data Analysis (EDA): Using Pandas to calculate basic statistics such as mean, median, mode, and standard deviation for employee salaries. Example: df['salary'].describe() provides an overview of the distribution of salaries.
Data Visualization: Leveraging tools like Matplotlib or Seaborn for visualizing salary distributions, box plots to detect outliers, and bar charts for department-wise salary breakdowns. Example: sns.boxplot(x='department', y='salary', data=df) provides a visual representation of salary variations by department.
Step 3: Analysis Using NumPy
Calculating Salary Ranges: NumPy can be used to calculate the range, variance, and percentiles of salary data to identify the spread and skewness of the salary distribution. Example: np.percentile(df['salary'], [25, 50, 75]) helps identify salary quartiles.
Correlation Analysis: Identify the relationship between variables such as experience and salary using NumPy to compute correlation coefficients. Example: np.corrcoef(df['years_of_experience'], df['salary']) reveals if experience is a significant factor in salary determination.
Step 4: Grouping and Aggregation
Salary by Department and Position: Using Pandas' groupby function, we can summarize salary information for different departments and job titles to identify trends or inequalities. Example: df.groupby('department')['salary'].mean() calculates the average salary per department.
Step 5: Salary Forecasting (Optional)
Predictive Analysis: Using tools such as Scikit-learn, we could build a regression model to predict future salary increases based on factors like experience, education level, and performance ratings.
Step 6: Insights and Recommendations
Outlier Identification: Detect any employees earning significantly more or less than the average, which could signal inequities or high performers.
Salary Discrepancies: Highlight any salary discrepancies between departments or gender that may require further investigation.
Compensation Planning: Based on the analysis, suggest potential changes to the salary structure or bonus allocations to ensure fair compensation across the organization.
Tools Used:
Pandas: For data manipulation, grouping, and descriptive analysis.
NumPy: For numerical operations such as percentiles and correlations.
Matplotlib/Seaborn: For data visualization to highlight key patterns and trends.
Scikit-learn (Optional): For building predictive models if salary forecasting is included in the analysis.
This approach ensures a comprehensive analysis of employee salaries, providing actionable insights for human resource planning and compensation strategy.
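The steps above can be strung together in one short script. The sketch below uses a synthetic dataset and the column names quoted in the examples (salary, department, years_of_experience); it is illustrative rather than a definitive implementation:

```python
# Sketch of the salary-analysis workflow on a small synthetic dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "employee_id": range(1, 101),
    "department": rng.choice(["Engineering", "Sales", "HR"], size=100),
    "years_of_experience": rng.integers(0, 20, size=100),
})
df["salary"] = 40_000 + 2_500 * df["years_of_experience"] + rng.normal(0, 5_000, 100)

# Step 1: cleaning
df = df.dropna(subset=["salary"]).drop_duplicates()

# Step 2: descriptive statistics
print(df["salary"].describe())

# Step 3: spread and correlation with NumPy
print("Quartiles:", np.percentile(df["salary"], [25, 50, 75]))
print("Experience/salary correlation:",
      np.corrcoef(df["years_of_experience"], df["salary"])[0, 1])

# Step 4: grouping and aggregation
print(df.groupby("department")["salary"].mean().round(0))
```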
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset contains the text of the documents that are sources of evidence used in [1] and [2] to distill our reference scenarios according to the methodology suggested by Yin in [3].
The dataset is composed of 95 unique document texts spanning the period 2005-2022. It makes available a corpus of documentary sources useful for outlining case studies related to scenarios in which a Data Protection Officer (DPO) operates in the course of daily activities.
The language used in the corpus is mainly Italian, but some documents are in English and French. For the reader's benefit, we provide an English translation of the title of each document.
The documentary sources are of many types (for example, court decisions, supervisory authorities' decisions, job advertisements, and newspaper articles), provided by different bodies (such as supervisory authorities, data controllers, European Union institutions, private companies, courts, public authorities, research organizations, newspapers, and public administrations), and drafted by people in distinct professional roles (for example, data protection officers, general managers, university rectors, collegiate bodies, judges, and journalists).
The documentary sources were collected from 31 different bodies. Most of the documents in the corpus (a total of 83 documents) have been transformed into Rich Text Format (RTF), while the other documents (a total of 12) are in PDF format. All the documents have been manually read and verified. The dataset is helpful as a starting point for a case-study analysis of the daily issues a data protection officer faces. Details on the methodology can be found in the accompanying papers.
The available files are as follows:
documents-texts.zip --> contains a directory of .rtf files (in some cases .pdf files) with the text of documents used as sources for the case studies. Each file has been renamed with its SHA1 hash so that it can be easily recognized (a renaming sketch follows this list).
documents-metadata.csv --> Contains a CSV file with the metadata for each document used as a source for the case studies.
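For readers who want to reproduce the naming convention described above, a minimal sketch of renaming each file to its SHA1 hash might look as follows; the directory name is an assumption:

```python
# Sketch: rename each document to its SHA1 hash (keeping its extension),
# mirroring the naming convention described for documents-texts.zip.
import hashlib
from pathlib import Path

def sha1_of_file(path: Path) -> str:
    h = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

docs_dir = Path("documents-texts")          # assumed extraction directory
for doc in docs_dir.glob("*.*"):
    doc.rename(doc.with_name(sha1_of_file(doc) + doc.suffix))
```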
This dataset is the original one used in the publication [1] and the preprint containing the additional material [2].
[1] F. Ciclosi and F. Massacci, "The Data Protection Officer: A Ubiquitous Role That No One Really Knows" in IEEE Security & Privacy, vol. 21, no. 01, pp. 66-77, 2023, doi: 10.1109/MSEC.2022.3222115, url: https://doi.ieeecomputersociety.org/10.1109/MSEC.2022.3222115.
[2] F. Ciclosi and F. Massacci, "The Data Protection Officer, an ubiquitous role nobody really knows." arXiv preprint arXiv:2212.07712, 2022.
[3] R. K. Yin, Case study research and applications. Sage, 2018.
The KILVOLC_FlowerKahn2021_1 dataset is the MISR Derived Case Study Data for Kilauea Volcanic Eruptions Including Geometric Plume Height and Qualitative Radiometric Particle Property Information version 1 dataset. It comprises MISR-derived output from a comprehensive analysis of Kilauea volcanic eruptions (2000-2018). Data collection for this dataset is complete. The data presented here are analyzed and discussed in the following paper: Flower, V.J.B., and R.A. Kahn, 2021. Twenty years of NASA-EOS multi-sensor satellite observations at Kīlauea volcano (2000-2019). J. Volc. Geo. Res. (in press). The data is subdivided by date and MISR orbit number. Within each case folder, there are up to 11 files relating to an individual MISR overpass. Files include plume height records (from both the red and blue spectral bands) derived from the MISR INteractive eXplorer (MINX) program, displayed in: map view, downwind profile plot (along with the associated wind vectors retrieved at plume elevation), a histogram of retrieved plume heights and a text file containing the digital plume height values. An additional JPG is included delineating the plume analysis region, start point for assessing downwind distance, and input wind direction used to initialize the MINX retrieval. A final two files are generated from the MISR Research Aerosol (RA) retrieval algorithm (Limbacher, J.A., and R.A. Kahn, 2014. MISR Research-Aerosol-Algorithm: Refinements For Dark Water Retrievals. Atm. Meas. Tech. 7, 1-19, doi:10.5194/amt-7-1-2014). These files include the RA model output in HDF5, and an associated JPG of key derived variables (e.g. Aerosol Optical Depth, Angstrom Exponent, Single Scattering Albedo, Fraction of Non-Spherical components, model uncertainty classifications and example camera views). File numbers per folder vary depending on the retrieval conditions of specific observations. RA plume retrievals are limited when cloud cover was widespread or the solar radiance was insufficient to run the RA. In these cases the RA files are not included in the individual folders. In cases where activity was observed from multiple volcanic zones in a single overpass, individual folders containing data relating to a single region, are included, and defined by a qualifier (e.g. '_1').
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The theoretical foundations of Big Data Science are not fully developed, yet. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times, yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm reliability and accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete and multi-source data and analytics challenges. Albeit not fully developed yet, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as via the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics of Alzheimer’s disease case-study. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.
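A conceptual sketch of the subsampling-and-ensemble idea (not the authors' R or pipeline implementation) could look like the following, using synthetic data and scikit-learn's logistic regression as the base learner:

```python
# Conceptual sketch of the CBDA idea: repeatedly subsample cases and
# features, fit a simple model on each subsample, and aggregate how
# often each feature appears important across iterations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, p = 300, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] - 2 * X[:, 3] + rng.normal(size=n) > 0).astype(int)  # features 0 and 3 matter

n_iter, case_frac, feat_frac = 200, 0.6, 0.2
selection_counts = np.zeros(p)

for _ in range(n_iter):
    rows = rng.choice(n, size=int(case_frac * n), replace=True)    # case-level subsample
    cols = rng.choice(p, size=int(feat_frac * p), replace=False)   # feature-level subsample
    model = LogisticRegression(max_iter=1000).fit(X[np.ix_(rows, cols)], y[rows])
    # Count a feature as "selected" when its coefficient is relatively large.
    top = cols[np.argsort(np.abs(model.coef_[0]))[-2:]]
    selection_counts[top] += 1

print("Most frequently selected features:", np.argsort(selection_counts)[-5:][::-1])
```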
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract: During the analysis of scientific research data, it is customary to encounter anomalous values or missing data. Anomalous values can be the result of errors of recording, typing, or measurement by instruments, or may be true outliers. This review discusses concepts, examples and methods for identifying and dealing with such contingencies. In the case of missing data, techniques for imputation of the values are discussed, in order to avoid exclusion of the research subject if it is not possible to retrieve information from registration forms or to re-address the participant.
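As a minimal illustration of the two contingencies discussed, the following Pandas sketch flags potential outliers with an IQR rule and imputes missing values with the median; the data and thresholds are assumptions:

```python
# Sketch: flag potential outliers with a simple IQR rule and impute
# missing values with the median.
import numpy as np
import pandas as pd

values = pd.Series([5.1, 4.8, 5.3, np.nan, 4.9, 12.7, 5.0, np.nan, 5.2])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print("Flagged as potential outliers:", values[outliers].tolist())

imputed = values.fillna(values.median())   # simple single imputation
print(imputed.tolist())
```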
In Copyright (InC 1.0): https://rightsstatements.org/page/InC/1.0/?language=en
This proposed research aims to explore how the institutionalization of Business Intelligence (BI) can enhance organizational agility in developing countries. Business performance is increasingly affected by unanticipated emerging opportunities or threats from constant, diverse changes occurring within the environment. Decision-making to cope with the changing environment becomes a challenge for taking opportunities or managing threats. Organizational agility is the ability to sense and take opportunities by responding to those changes with speed. As BI provides data-driven decision-making support, BI institutionalization is vital for enhancing organizational agility to make decisions in response to the dynamic environment. However, there has been little prior research in this area focussed on developing countries. Therefore, this research addresses the research gap in how BI institutionalization in developing countries can enhance organizational agility. Bangladesh is used as an example of a developing country. A multiple case study approach was employed for collecting qualitative data using open-ended interviews. The studied data were analysed to generate new understanding of how BI institutionalization impacts organizational agility for decision-making in the context of developing countries.
Case studies of selected streams from the EPA Ash Dataset. Includes OLI Studio and Geochemist's Workbench example files. Original data from: Nguyen, Dan-Tam, Eastern Research Group. Sep 29, 2015. Analytical Database for the Steam Electric Rulemaking - DCN SE05359. https://www.regulations.gov/document/EPA-HQ-OW-2009-0819-5640
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains replication materials for the Journal of Educational and Behavioral Statistics paper entitled: "Combining Human and Automated Scoring Methods in Experimental Assessments of Writing: A Case Study Tutorial." Materials include the dataset. The replication code is available at: github.com/reaganmozer/reads-replication
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MCCN project aims to deliver tools that assist the agricultural sector in understanding crop-environment relationships, specifically by facilitating the generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks to demonstrate the functionality of the MCCN data cube components.
The dataset contains input files for the case study (source_data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (results), and Jupyter Notebook (MCCN-CASE 4.ipynb)
RAiD: https://doi.org/10.26292/8679d473
This repository contains code and sample data for the following case studies. Note that the analyses here are intended to demonstrate the software, and the results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with the generation of an RO-Crate data package including the source data, the notebook, and the generated outputs, including NetCDF exports of the data cubes themselves.
Compare Bureau of Meteorology gridded daily maximum and minimum temperature data with data from weather stations across Western Australia.
This is an example of comparing high-quality ground-based data from multiple sites with a data product from satellite imagery or data modelling, so that I can assess its precision and accuracy for estimating the same variables at other sites.
The case study uses national weather data products from the Bureau of Meteorology for daily mean maximum/minimum temperature, accessible from http://www.bom.gov.au/jsp/awap/temp/index.jsp. Seven daily maximum and minimum temperature grids were downloaded for the dates 7 to 13 April 2025 inclusive. These data can be accessed in the source_data folder in the downloaded ASCII grid format (*.grid). These data will be loaded into the data cube as WGS84 Geotiff files. To avoid extra dependencies in this notebook, the data have already been converted using QGIS Desktop and are also included in the source_data folder (*.tiff).
Comparison data for maximum and minimum air temperature were downloaded for all public weather stations in Western Australia from https://weather.agric.wa.gov.au/ for the 10 day period 4 to 13 April 2025. These are included in source_data as CSV files. These downloads do not include the coordinates for the weather stations. These were downloaded via the https://api.agric.wa.gov.au/v2/weather/openapi/#/Stations/getStations API method and are included in source_data as DPIRD_weather_stations.json.
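A rough sketch of this grid-versus-station comparison might look like the following; the file names, JSON structure, and CSV columns are assumptions and would need to match the actual source_data contents:

```python
# Sketch (file and column names are assumptions): sample a gridded daily
# maximum-temperature GeoTIFF at weather-station coordinates and compare
# with the station observations for the same day.
import json
import pandas as pd
import rioxarray

grid = rioxarray.open_rasterio("source_data/tmax_20250407.tiff").squeeze()

stations = json.load(open("source_data/DPIRD_weather_stations.json"))
obs = pd.read_csv("source_data/station_tmax_20250407.csv")   # assumed per-day station CSV

rows = []
for st in stations["collection"]:                            # assumed JSON layout
    sampled = float(grid.sel(x=st["longitude"], y=st["latitude"], method="nearest"))
    rows.append({"station": st["stationCode"], "grid_tmax": sampled})

merged = pd.DataFrame(rows).merge(obs, on="station")
merged["difference"] = merged["grid_tmax"] - merged["tmax"]
print(merged[["station", "grid_tmax", "tmax", "difference"]].describe())
```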
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sheet 1 (Raw-Data): The raw data of the study is provided, presenting the tagging results for the measures described in the paper. For each subject, it includes multiple columns:
A. a sequential student ID
B. an ID that defines a random group label and the notation
C. the used notation: user story or use cases
D. the case they were assigned to: IFA, Sim, or Hos
E. the subject's exam grade (total points out of 100); empty cells mean that the subject did not take the first exam
F. a categorical representation of the grade (L/M/H), where H is greater than or equal to 80, M is between 65 (inclusive) and 80 (exclusive), and L otherwise
G. the total number of classes in the student's conceptual model
H. the total number of relationships in the student's conceptual model
I. the total number of classes in the expert's conceptual model
J. the total number of relationships in the expert's conceptual model
K-O. the total number of encountered situations of alignment, wrong representation, system-oriented, omitted, and missing (see tagging scheme below)
P. the researchers' judgement on how well the derivation process was explained by the student: well explained (a systematic mapping that can be easily reproduced), partially explained (vague indication of the mapping), or not present.
Tagging scheme:
Aligned (AL) - A concept is represented as a class in both models, either with the same name or using synonyms or clearly linkable names;
Wrongly represented (WR) - A class in the domain expert model is incorrectly represented in the student model, either (i) via an attribute, method, or relationship rather than a class, or (ii) using a generic term (e.g., "user" instead of "urban planner");
System-oriented (SO) - A class in CM-Stud that denotes a technical implementation aspect, e.g., access control. Classes that represent the legacy system or the system under design (portal, simulator) are legitimate;
Omitted (OM) - A class in CM-Expert that does not appear in any way in CM-Stud;
Missing (MI) - A class in CM-Stud that does not appear in any way in CM-Expert.
All the calculations and information provided in the following sheets originate from that raw data.
Sheet 2 (Descriptive-Stats): Shows a summary of statistics from the data collection, including the number of subjects per case, per notation, per process derivation rigor category, and per exam grade category.
Sheet 3 (Size-Ratio):
The number of classes within the student model divided by the number of classes within the expert model is calculated (describing the size ratio). We provide box plots to allow a visual comparison of the shape of the distribution, its central value, and its variability for each group (by case, notation, process, and exam grade). The primary focus in this study is on the number of classes; however, we also provide the size ratio for the number of relationships between the student and expert models.
Sheet 4 (Overall):
Provides an overview of all subjects regarding the encountered situations, completeness, and correctness. Correctness is defined as the ratio of classes in a student model that are fully aligned with the classes in the corresponding expert model. It is calculated by dividing the number of aligned concepts (AL) by the sum of the number of aligned concepts (AL), omitted concepts (OM), system-oriented concepts (SO), and wrong representations (WR). Completeness, on the other hand, is defined as the ratio of classes in a student model that are correctly or incorrectly represented over the number of classes in the expert model. It is calculated by dividing the sum of aligned concepts (AL) and wrong representations (WR) by the sum of the number of aligned concepts (AL), wrong representations (WR), and omitted concepts (OM). The overview is complemented with general diverging stacked bar charts that illustrate correctness and completeness.
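Expressed as code, the two definitions reduce to simple ratios; the counts below are made up for illustration:

```python
# The definitions above as code: correctness = AL / (AL + OM + SO + WR),
# completeness = (AL + WR) / (AL + WR + OM), computed per subject.
def correctness(al: int, wr: int, so: int, om: int) -> float:
    return al / (al + om + so + wr)

def completeness(al: int, wr: int, om: int) -> float:
    return (al + wr) / (al + wr + om)

# Example with made-up counts for one subject:
al, wr, so, om = 12, 3, 2, 5
print(round(correctness(al, wr, so, om), 3))   # 12 / 22 ≈ 0.545
print(round(completeness(al, wr, om), 3))      # 15 / 20 = 0.75
```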
For Sheet 4, as well as for the following four sheets, diverging stacked bar charts are provided to visualize the effect of each of the independent and mediated variables. The charts are based on the relative numbers of encountered situations for each student. In addition, a "Buffer" is calculated which solely serves the purpose of constructing the diverging stacked bar charts in Excel. Finally, at the bottom of each sheet, the significance (t-test) and effect size (Hedges' g) for both completeness and correctness are provided. Hedges' g was calculated with an online tool: https://www.psychometrica.de/effect_size.html. The independent and moderating variables can be found as follows:
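Although the sheets rely on an online tool, the same quantities can be computed directly; the sketch below uses synthetic group scores and the standard small-sample correction for Hedges' g:

```python
# Sketch: t-test and Hedges' g for two groups of correctness scores
# (synthetic numbers; the sheets use an online tool instead).
import numpy as np
from scipy import stats

group_uc = np.array([0.55, 0.62, 0.48, 0.70, 0.58, 0.66])
group_us = np.array([0.45, 0.52, 0.40, 0.61, 0.49, 0.50])

t_stat, p_value = stats.ttest_ind(group_uc, group_us)

n1, n2 = len(group_uc), len(group_us)
pooled_sd = np.sqrt(((n1 - 1) * group_uc.var(ddof=1) + (n2 - 1) * group_us.var(ddof=1))
                    / (n1 + n2 - 2))
cohens_d = (group_uc.mean() - group_us.mean()) / pooled_sd
hedges_g = cohens_d * (1 - 3 / (4 * (n1 + n2) - 9))   # small-sample correction

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, Hedges' g = {hedges_g:.2f}")
```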
Sheet 5 (By-Notation):
Model correctness and model completeness are compared by notation - UC, US.
Sheet 6 (By-Case):
Model correctness and model completeness are compared by case - SIM, HOS, IFA.
Sheet 7 (By-Process):
Model correctness and model completeness are compared by how well the derivation process is explained - well explained, partially explained, not present.
Sheet 8 (By-Grade):
Model correctness and model completeness are compared by the exam grades, converted to the categorical values High, Low, and Medium.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MCCN project aims to deliver tools that assist the agricultural sector in understanding crop-environment relationships, specifically by facilitating the generation of data cubes for spatiotemporal data. This repository contains Jupyter notebooks to demonstrate the functionality of the MCCN data cube components.
The dataset contains input files for the case study (source_data), RO-Crate metadata (ro-crate-metadata.json), results from the case study (results), and Jupyter Notebook (MCCN-CASE 6.ipynb)
RAiD: https://doi.org/10.26292/8679d473
This repository contains code and sample data for the following case studies. Note that the analyses here are intended to demonstrate the software, and the results should not be considered scientifically or statistically meaningful. No effort has been made to address bias in samples, and sample data may not be available at sufficient density to warrant analysis. All case studies end with the generation of an RO-Crate data package including the source data, the notebook, and the generated outputs, including NetCDF exports of the data cubes themselves.
Analyse the relationship between different environmental drivers and plant yield. This study demonstrates: 1) loading heterogeneous data sources into a cube, and 2) analysis and visualisation of drivers. It combines a suite of spatial variables at different scales across multiple sites to analyse the factors correlated with a variable of interest.
The dataset includes the Gilbert site in Queensland, which has multiple standard-sized plots for three years. We are using data from 2022. The source files are part of the larger collection: Chapman, Scott and Smith, Daniel (2023). INVITA Core site UAV dataset. The University of Queensland. Data Collection. https://doi.org/10.48610/951f13c
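A simplified sketch of this kind of driver-versus-yield analysis on synthetic plot-level data (not the MCCN cube API) might look like:

```python
# Sketch: correlate candidate environmental drivers with yield across
# plots, using a small synthetic plot-level table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_plots = 60
plots = pd.DataFrame({
    "soil_moisture": rng.uniform(0.1, 0.4, n_plots),
    "mean_ndvi": rng.uniform(0.3, 0.8, n_plots),
    "elevation": rng.uniform(480, 520, n_plots),
})
plots["yield_t_ha"] = (2.0 + 8.0 * plots["soil_moisture"] + 3.0 * plots["mean_ndvi"]
                       + rng.normal(0, 0.3, n_plots))

# Rank drivers by absolute correlation with yield.
corr = plots.corr(numeric_only=True)["yield_t_ha"].drop("yield_t_ha")
print(corr.abs().sort_values(ascending=False))
```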
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
With the increasing popularity of artificial intelligence methods and structural bioinformatics tools, such as AlphaFold, cultivating students' ability to conduct quantitative analysis using big data on protein structures has become a new challenge in biochemistry education. This paper constructs a teaching case based on a serine hydrolase, integrating the principles of statistical mechanics, the Dunbrack backbone-dependent rotamer library, the Boltzmann probability-energy relationship, and transition state theory. Students are guided through the entire process, from structural dataset acquisition and conformational probability distribution analysis to catalytic rate enhancement prediction, all within a Jupyter Notebook environment. Students analyze the χ₁ dihedral angle distribution of the catalytic serine (Ser195) in different binding states (apo, substrate analog GSA, and transition state analog TSA), calculate pseudo-energies corresponding to conformational preferences, and derive catalytic acceleration from ΔE₀. This case not only strengthens students' database searching, statistical analysis, and scripting skills, but also deepens their understanding of the enzyme catalytic mechanism of "ground state destabilization," laying a methodological foundation for future research in protein design and computational enzyme engineering.
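A minimal sketch of the central calculation, with illustrative probabilities rather than values from the actual structural dataset, assuming the pseudo-energy relation E = -RT ln p and a Boltzmann-style estimate of the rate enhancement:

```python
# Sketch of the calculation the case walks through (illustrative numbers):
# convert chi1 rotamer probabilities into Boltzmann pseudo-energies and
# translate an energy difference into an approximate rate enhancement.
import math

R = 8.314e-3   # gas constant, kJ/(mol*K)
T = 298.15     # temperature, K

def pseudo_energy(p: float) -> float:
    """Pseudo-energy (kJ/mol) from a conformational probability, E = -RT ln p."""
    return -R * T * math.log(p)

# Illustrative chi1 probabilities of the catalytic Ser conformer in each state.
p_apo, p_tsa = 0.30, 0.85
delta_e0 = pseudo_energy(p_apo) - pseudo_energy(p_tsa)   # ground-state destabilization

# Boltzmann-style estimate of the rate enhancement implied by delta_e0.
rate_enhancement = math.exp(delta_e0 / (R * T))
print(f"delta_E0 = {delta_e0:.2f} kJ/mol, rate enhancement ≈ {rate_enhancement:.1f}x")
```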
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This data set is made available through the Google Analytics Coursera course. This data set is a part of a case study example, meant to showcase skills learned throughout the course.
This study includes five data files and corresponding exercise instructions.
Four of the five data files and instructions were produced from the National Child Development Study datasets for an ESRC-funded workshop on Multilevel Event History Analysis, held in February 2005. The workshop data includes three files in ASCII DAT format and one in SPSS SAV format. Further information and documentation beyond that included in this study, and MLwiN software downloads are available from the Centre for Multilevel Modelling web site.
In addition, for the second edition of the study, example data and documentation for fitting multilevel multiprocess event history models using aML software were added to the dataset (the data file 'amlex.raw'). The aML syntax file that accompanies these data can also be found at the Centre for Multilevel Modelling web site noted above.
The project from which these data were produced was conducted under the ESRC Research Methods programme. It involved the development of multilevel simultaneous equations models for the analysis of correlated event histories. The research was motivated by a study of the interrelationships between partnership (marriage or cohabitation) durations and decisions about childbearing, using event history data from the 1958 and 1970 British Birth Cohort studies (in the case of this dataset, NCDS).
Additional aims and objectives of the project were to develop methodology for the analysis of complex event history data; provide means for implementing methodology in existing software; and provide social scientists with practical training in advanced event history analysis.