Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and R-script for a tutorial that explains how to convert spreadsheet data to tidy data. The tutorial is published in a blog for The Node (https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/)
This file contains the Fourier Transform Infrared (FTIR) spectroscopy data from the NOAA R/V Ronald H. Brown during VOCALS-REx 2008.
This data package contains pumping data (.txt), parameter matrices, and R code (.R, .RData) to perform bootstrapping for parameter selection for the bioclogging model development. The pumping data were collected from the Russian River Riverbank Filtration site located in Sonoma County, California from 2010-2017 from three riverbank collection wells located alongside the study site. The pumping data are directly correlated with water table oscillations, so the code performs these correlations and simulates stochastic versions of water table oscillations. See Metadata Description.pdf for full details on dataset production. This dataset must be used with the R programming language. This dataset and R code are associated with the publication "Influence of Hydrological Perturbations and Riverbed Sediment Characteristics on Hyporheic Zone Respiration of CO2 and N2". This research was supported by the Jane Lewis Fellowship from the University of California, Berkeley, the Sonoma County Water Agency (SCWA), the Roy G. Post Foundation Scholarship, the U.S. Department of Energy, Office of Science Graduate Student Research (SCGSR) Program, the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research under award DE-AC02-05CH11231, and the UFZ-Helmholtz Centre for Environmental Research, Leipzig, Germany.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
To get the consumption model from Section 3.1, one needs to execute the file consumption_data.R. It loads the data for the three phases (./data/CONSUMPTION/PL1.csv, PL2.csv, PL3.csv), transforms the data, and builds the model (starting at line 225). The final consumption data can be found in one file per year in ./data/CONSUMPTION/MEGA_CONS_list.Rdata
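For orientation, the following minimal R sketch shows how the phase files and the final per-year consumption data could be loaded for inspection; it assumes the working directory is the repository root, and the object names are hypothetical (the real workflow is in consumption_data.R).
# Minimal sketch (hypothetical object names); the full processing lives in consumption_data.R
phase_files <- c("./data/CONSUMPTION/PL1.csv",
                 "./data/CONSUMPTION/PL2.csv",
                 "./data/CONSUMPTION/PL3.csv")
phases <- lapply(phase_files, read.csv)
names(phases) <- c("PL1", "PL2", "PL3")
str(phases$PL1)

# Load the final consumption data produced by consumption_data.R (one file per year)
load("./data/CONSUMPTION/MEGA_CONS_list.Rdata")   # restores the stored list object(s) into the workspace
ls()                                              # inspect which objects the .Rdata file contains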
To get the results for the optimization problem, one needs to execute the file analyze_data.R. It provides the functions to compare production and consumption data, and to optimize for the different values (PV, MBC, ...).
To reproduce the figures one needs to execute the file visualize_results.R. It provides the functions to reproduce the figures.
To calculate the solar radiation that is needed in the section Production Data, follow the file calculate_total_radiation.R.
To reproduce the radiation data from ERA5 that can be found in data.zip, follow these steps: 1. Download the ERA5 reanalysis datasets as GRIB files. For FDIR select "Total sky direct solar radiation at surface", for GHI select "Surface solar radiation downwards", and for ALBEDO select "Forecast albedo". 2. Convert GRIB to csv with the file era5toGRID.sh. 3. Convert the csv file to the data that is used in this paper with the file convert_year_to_grid.R.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article introduces a graphical goodness-of-fit test for copulas in more than two dimensions. The test is based on pairs of variables and can thus be interpreted as a first-order approximation of the underlying dependence structure. The idea is to first transform pairs of data columns with the Rosenblatt transform to bivariate standard uniform distributions under the null hypothesis. This hypothesis can be graphically tested with a matrix of bivariate scatterplots, Q-Q plots, or other transformations. Furthermore, additional information can be encoded as background color, such as measures of association or (approximate) p-values of tests of independence. The proposed goodness-of-fit test is designed as a basic graphical tool for detecting deviations from a postulated, possibly high-dimensional, dependence model. Various examples are given and the methodology is applied to a financial dataset. An implementation is provided by the R package copula. Supplementary material for this article is available online, which provides the R package copula and reproduces all the graphical results of this article.
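To illustrate the idea, here is a minimal R sketch, not the article's supplementary code: it assumes the copula package's cCopula() performs the Rosenblatt transform under the fitted null-hypothesis copula, and uses a simple exchangeable Gaussian copula as H0.
# Minimal sketch of the pairwise Rosenblatt idea, assuming the 'copula' package API
library(copula)

set.seed(1)
d <- 4
u <- pobs(rCopula(500, gumbelCopula(2, dim = d)))          # pseudo-observations from a Gumbel copula

fit <- fitCopula(normalCopula(dim = d), u, method = "mpl")  # H0: (exchangeable) Gaussian copula
r   <- cCopula(u, copula = fit@copula)                      # Rosenblatt transform under H0

# Under H0 the columns of r are iid U(0,1); deviations show up in the pairwise scatterplots
pairs(r, pch = ".", main = "Rosenblatt-transformed data under H0")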
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data are successfully transformed and bulk-loaded/imported into a REDCap-based national registry, enabling real-world data analysis and interoperability.
Methods: eLAB Development and Source Code (R statistical software)
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources, including Clarity/Crystal reports or an institutional EDW such as the Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R Markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the ‘ehr_format(dt)’ single-line command is non-tabular data assigned as R object ‘dt’ with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this ‘untidy-format’ is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as “Potassium,” “Potassium-External,” “Potassium(POC),” “Potassium,whole-bld,” “Potassium-Level-External,” “Potassium,venous,” and “Potassium-whole-bld/plasma.” eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
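As a simplified illustration of this key-value remapping step, the R sketch below uses a hypothetical mini lookup table (the real eLAB table has roughly 300 lab subtypes) and a hypothetical two-row bulk pull:
# Hypothetical mini lookup table mapping EHR lab subtypes to a single Data Dictionary code/unit
lab_lookup <- data.frame(
  ehr_name = c("Potassium", "Potassium-External", "Potassium(POC)",
               "Potassium,whole-bld", "Potassium,venous"),
  dd_code  = "potassium",
  dd_unit  = "mmol/L",
  stringsAsFactors = FALSE
)

# Hypothetical bulk pull from the EHR; merge to remap subtypes to the DD code
labs <- data.frame(ehr_name = c("Potassium(POC)", "Potassium-External"),
                   value    = c(4.1, 3.8),
                   unit     = c("mmol/L", "mmol/L"))

remapped <- merge(labs, lab_lookup, by = "ehr_name")
# Keep only rows whose units match the unit pre-defined by the Data Dictionary
remapped <- remapped[remapped$unit == remapped$dd_unit, ]
remapped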
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the ‘Labs’ repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab-type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as a string or numerics. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
Study Cohort
This study was approved by the MGB IRB. Search of the EHR was performed to identify patients diagnosed with MCC between 1975-2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016-2019 (N= 176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from date of MCC diagnosis to date of death. Data was censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazard modeling was performed among all lab predictors. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
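For orientation, a univariable Cox model of this kind can be fit with the survival package roughly as follows; the data frame and column names below are synthetic placeholders, not the registry's field names.
# Minimal sketch of univariable Cox proportional hazards modeling on synthetic data
library(survival)

set.seed(1)
df <- data.frame(
  os_months    = rexp(50, rate = 0.05),   # follow-up time from MCC diagnosis (months)
  death_event  = rbinom(50, 1, 0.6),      # 1 = death observed, 0 = censored at last follow-up
  baseline_lab = rnorm(50, mean = 140, sd = 5)
)

fit <- coxph(Surv(os_months, death_event) ~ baseline_lab, data = df)
summary(fit)   # hazard ratio and exploratory (uncorrected) p-value for the lab predictor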
Welcome to the Cyclistic bike-share analysis case study! In this case study, you will perform many real-world tasks of a junior data analyst. You will work for a fictional company, Cyclistic, and meet different characters and team members. In order to answer the key business questions, you will follow the steps of the data analysis process: ask, prepare, process, analyze, share, and act. Along the way, the Case Study Roadmap tables — including guiding questions and key tasks — will help you stay on the right path.
You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.
How do annual members and casual riders use Cyclistic bikes differently?
What is the problem you are trying to solve?
How do annual members and casual riders use Cyclistic bikes differently?
How can your insights drive business decisions?
These insights will help the marketing team design a strategy for converting casual riders into annual members.
Where is your data located?
The data is located in Cyclistic's own organizational data.
How is data organized?
The datasets are in CSV format, one file per month, covering financial year 2022.
Are there issues with bias or credibility in this data? Does your data ROCCC?
The data is credible and satisfies ROCCC (reliable, original, comprehensive, current, cited) because it is collected directly by the Cyclistic organization.
How are you addressing licensing, privacy, security, and accessibility?
The company has its own license over the dataset, and the dataset does not contain any personal information about the riders.
How did you verify the data’s integrity?
All the files have consistent columns and each column has the correct type of data.
How does it help you answer your questions?
Insights are hidden in the data; we have to interpret the data to find them.
Are there any problems with the data?
Yes, the starting and ending station name columns contain null values.
What tools are you choosing and why?
I used RStudio for cleaning and transforming the data for the analysis phase, because of the large dataset size and to gain experience with the language.
Have you ensured the data’s integrity?
Yes, the data is consistent throughout the columns.
What steps have you taken to ensure that your data is clean?
First, duplicates and null values were removed; then new columns were added for analysis.
How can you verify that your data is clean and ready to analyze?
Make sure the column names are consistent throughout all datasets by using the "bind_rows" function.
Make sure the column data types are consistent throughout all datasets by using "compare_df_cols" from the "janitor" package.
Combine all datasets into a single data frame to keep the analysis consistent.
Remove the columns start_lat, start_lng, end_lat, end_lng from the data frame because they are not required for the analysis.
Create new columns day, date, month, and year from the started_at column; this provides additional opportunities to aggregate the data.
Create the "ride_length" column from the started_at and ended_at columns to find the average ride duration by rider type (see the R sketch below).
Remove the null rows from the dataset by using the "na.omit" function.
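A minimal R sketch of the cleaning steps above: the two tiny synthetic "monthly files" and their values are placeholders standing in for the real Cyclistic CSVs so the example is self-contained; in the real workflow the monthly files are read from disk first.
# Minimal sketch of the cleaning steps (synthetic stand-ins for two monthly files)
library(dplyr)
library(janitor)
library(lubridate)

jan <- data.frame(ride_id = c("a1", "a2"),
                  started_at = c("2022-01-03 08:00:00", "2022-01-04 17:30:00"),
                  ended_at   = c("2022-01-03 08:15:00", "2022-01-04 17:55:00"),
                  start_lat = 41.9, start_lng = -87.6, end_lat = 41.9, end_lng = -87.6,
                  member_casual = c("member", "casual"))
feb <- jan
feb$started_at <- sub("-01-", "-02-", feb$started_at)
feb$ended_at   <- sub("-01-", "-02-", feb$ended_at)

compare_df_cols(jan, feb)                                  # check column names/types match across files

all_trips <- bind_rows(jan, feb) %>%                       # combine into one data frame
  select(-start_lat, -start_lng, -end_lat, -end_lng) %>%   # drop columns not needed for analysis
  mutate(started_at = ymd_hms(started_at),
         ended_at   = ymd_hms(ended_at),
         date  = as.Date(started_at),
         day   = wday(started_at, label = TRUE),
         month = month(started_at, label = TRUE),
         year  = year(started_at),
         ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))) %>%
  na.omit()

aggregate(ride_length ~ member_casual, data = all_trips, FUN = mean)  # average ride length by rider type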
Have you documented your cleaning process so you can review and share those results?
Yes, the cleaning process is documented clearly.
How should you organize your data to perform analysis on it?
The data has been organized into one single dataframe by using the read csv function in R.
Has your data been properly formatted?
Yes, all the columns have their correct data type.
What surprises did you discover in the data?
Casual riders' ride durations are longer than annual members'.
Casual riders use docked bikes far more than annual members.
What trends or relationships did you find in the data?
Annual members ride mainly for commuting.
Casual riders prefer docked bikes.
Annual members prefer electric or classic bikes.
How will these insights help answer your business questions?
These insights help to build a profile for each member type.
Were you able to answer the question of how ...
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
NaiveBayes_R.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given recidivism (P(x_ij|R)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
- P(Xi|R): This tab contains probabilities of feature attributes among recidivated offenders.
- NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
- Recidivated_Train: This tab contains re-coded features of recidivated offenders.
- Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|R) tab.
NaiveBayes_NR.xlsx: This Excel file includes information as to how probabilities of observed features are calculated given non-recidivism (P(x_ij|N)) in the training data. Each cell is embedded with an Excel function to render appropriate figures.
- P(Xi|N): This tab contains probabilities of feature attributes among non-recidivated offenders.
- NIJ_Recoded: This tab contains re-coded NIJ recidivism challenge data following our coding schema described in Table 1.
- NonRecidivated_Train: This tab contains re-coded features of non-recidivated offenders.
- Tabs from [Gender] through [Condition_Other]: Each tab contains probabilities of feature attributes given non-recidivism. We use these conditional probabilities to replace the raw values of each feature in the P(Xi|N) tab.
Training_LnTransformed.xlsx: Figures in each cell are log-transformed ratios of probabilities in NaiveBayes_R.xlsx (P(Xi|R)) to the probabilities in NaiveBayes_NR.xlsx (P(Xi|N)).
TestData.xlsx: This Excel file includes the following tabs based on the test data: P(Xi|R), P(Xi|N), NIJ_Recoded, and Test_LnTransformed (log-transformed P(Xi|R)/P(Xi|N)).
Training_LnTransformed.dta: We transform Training_LnTransformed.xlsx to a Stata data set. We use the Stat/Transfer 13 software package to transfer the file format.
StataLog.smcl: This file includes the results of the logistic regression analysis. Both the estimated intercept and coefficient estimates in this Stata log correspond to the raw weights and standardized weights in Figure 1.
Brier Score_Re-Check.xlsx: This Excel file recalculates Brier scores of the Relaxed Naïve Bayes Classifier in Table 3, showing evidence that results displayed in Table 3 are correct.
*****Full List*****
NaiveBayes_R.xlsx
NaiveBayes_NR.xlsx
Training_LnTransformed.xlsx
TestData.xlsx
Training_LnTransformed.dta
StataLog.smcl
Brier Score_Re-Check.xlsx
Data for Weka (Training Set): Bayes_2022_NoID
Data for Weka (Test Set): BayesTest_2022_NoID
Weka output for machine learning models (Conventional naïve Bayes, AdaBoost, Multilayer Perceptron, Logistic Regression, and Random Forest)
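As a small illustration of the quantities described above, the R sketch below computes the log-transformed probability ratio ln(P(x|R)/P(x|N)) and a Brier score on toy numbers; it is not the authors' Excel/Stata workflow, and all values are made up.
# Toy illustration of the log-ratio features and the Brier score
p_given_R <- c(gender_male = 0.80, priors_high = 0.55)   # toy P(x|R) values
p_given_N <- c(gender_male = 0.70, priors_high = 0.30)   # toy P(x|N) values
ln_ratio  <- log(p_given_R / p_given_N)                  # log-transformed ratios, as in Training_LnTransformed
ln_ratio

# Brier score: mean squared difference between predicted probability and outcome (1 = recidivated)
predicted <- c(0.82, 0.35, 0.10, 0.67)
observed  <- c(1, 0, 0, 1)
mean((predicted - observed)^2)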
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The experiment file (.xpm) stores the settings and values of the advanced parameters (e.g. resolution, sample scan time, background scan time, and the spectral range to be used), while the phase resolution is stored in the FT. The optical parameters for this experimental condition are shown as well.
https://spdx.org/licenses/etalab-2.0.html
Raw data and preparation R scripts of the firmness analysis on apple sticks (raw and cooked) studied in the Pomcuite project (INRAE TRANSFORM internal funding ANS). This dataset contains:
- the R Markdown file giving details on the raw dataset and the analysis scripts
- the PDF processed from the R Markdown
- the data set in CSV format
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This article proposes a graphical model that handles mixed-type, multi-group data. The motivation for such a model originates from real-world observational data, which often contain groups of samples obtained under heterogeneous conditions in space and time, potentially resulting in differences in network structure among groups. Therefore, the i.i.d. assumption is unrealistic, and fitting a single graphical model on all data results in a network that does not accurately represent the between-group differences. In addition, real-world observational data are typically of mixed discrete-and-continuous type, violating the Gaussian assumption that is typical of graphical models, which leads to the model being unable to adequately recover the underlying graph structure. Both these problems are solved by fitting a different graph for each group, applying the fused group penalty to fuse similar graphs together, and by treating the observed data as transformed latent Gaussian data, respectively. The proposed model outperforms related models on learning partial correlations in a simulation study. Finally, the proposed model is applied to real on-farm maize yield data, showcasing the added value of the proposed method in generating new production-ecological hypotheses. An R package containing the proposed methodology can be found on https://CRAN.R-project.org/package=heteromixgm. Supplementary materials for this article are available online.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fisheries management is generally based on age structure models. Thus, fish ageing data are collected by experts who analyze and interpret calcified structures (scales, vertebrae, fin rays, otoliths, etc.) according to a visual process. The otolith, in the inner ear of the fish, is the most commonly used calcified structure because it is metabolically inert and historically one of the first proxies developed. It contains information throughout the whole life of the fish and provides age structure data for stock assessments of all commercial species. The traditional human reading method to determine age is very time-consuming. Automated image analysis can be a low-cost alternative method; however, the first step is the transformation of routinely taken otolith images into standardized images within a database so that machine learning techniques can be applied to the ageing data. Otolith shape, resulting from the synthesis of genetic heritage and environmental effects, is a useful tool to identify stock units, so a database of standardized images could also be used for this aim. Using the routinely measured otolith data of plaice (Pleuronectes platessa; Linnaeus, 1758) and striped red mullet (Mullus surmuletus; Linnaeus, 1758) in the eastern English Channel and north-east Arctic cod (Gadus morhua; Linnaeus, 1758), a greyscale image matrix was generated from the raw images in different formats. Contour detection was then applied to identify broken otoliths, the orientation of each otolith, and the number of otoliths per image. To finalize this standardization process, all images were resized and binarized. Several mathematical morphology tools were developed from these new images to align and orient the images, placing the otoliths in the same layout for each image. For this study, we used three databases from two different laboratories covering three species (cod, plaice, and striped red mullet). This method was validated on these three species and could be applied to other species for age determination and stock identification.
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Data is archived here: https://doi.org/10.5281/zenodo.4818011
This data and code archive provides all the files that are necessary to replicate the empirical analyses that are presented in the paper "Climate impacts and adaptation in US dairy systems 1981-2018" authored by Maria Gisbert-Queral, Arne Henningsen, Bo Markussen, Meredith T. Niles, Ermias Kebreab, Angela J. Rigden, and Nathaniel D. Mueller and published in 'Nature Food' (2021, DOI: 10.1038/s43016-021-00372-z). The empirical analyses are entirely conducted with the "R" statistical software using the add-on packages "car", "data.table", "dplyr", "ggplot2", "grid", "gridExtra", "lmtest", "lubridate", "magrittr", "nlme", "OneR", "plyr", "pracma", "quadprog", "readxl", "sandwich", "tidyr", "usfertilizer", and "usmap". The R code was written by Maria Gisbert-Queral and Arne Henningsen with assistance from Bo Markussen. Some parts of the data preparation and the analyses require substantial amounts of memory (RAM) and computational power (CPU). Running the entire analysis (all R scripts consecutively) on a laptop computer with 32 GB physical memory (RAM), 16 GB swap memory, an 8-core Intel Xeon CPU E3-1505M @ 3.00 GHz, and a GNU/Linux/Ubuntu operating system takes around 11 hours. Running some parts in parallel can speed up the computations but bears the risk that the computations terminate when two or more memory-demanding computations are executed at the same time.
This data and code archive contains the following files and folders:
* README: text file with this description
* flowchart.pdf: a PDF file with a flow chart that illustrates how R scripts transform the raw data files to files that contain generated data sets and intermediate results and, finally, to the tables and figures that are presented in the paper.
* runAll.sh: a (bash) shell script that runs all R scripts in this data and code archive sequentially and in a suitable order (on computers with a "bash" shell such as most computers with MacOS, GNU/Linux, or Unix operating systems)
* Folder "DataRaw": folder for raw data files. This folder contains the following files:
- DataRaw/COWS.xlsx: MS-Excel file with the number of cows per county (Source: USDA NASS Quickstats; Observations: all available counties and years from 2002 to 2012)
- DataRaw/milk_state.xlsx: MS-Excel file with average monthly milk yields per cow (Source: USDA NASS Quickstats; Observations: all available states from 1981 to 2018)
- DataRaw/TMAX.csv: CSV file with daily maximum temperatures (Source: PRISM Climate Group, spatially averaged; Observations: all counties from 1981 to 2018)
- DataRaw/VPD.csv: CSV file with daily maximum vapor pressure deficits (Source: PRISM Climate Group, spatially averaged; Observations: all counties from 1981 to 2018)
- DataRaw/countynamesandID.csv: CSV file with county names, state FIPS codes, and county FIPS codes (Source: US Census Bureau; Observations: all counties)
- DataRaw/statecentroids.csv: CSV file with latitudes and longitudes of state centroids (Source: generated by Nathan Mueller from Matlab state shapefiles using the Matlab "centroid" function; Observations: all states)
* Folder "DataGenerated": folder for data sets that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these generated data files so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
* Folder "Results": folder for intermediate results that are generated by the R scripts in this data and code archive. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these intermediate results so that parts of the analysis can be replicated (e.g., on computers with insufficient memory to run all parts of the analysis).
* Folder "Figures": folder for the figures that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these figures so that people who replicate our analysis can more easily compare the figures that they get with the figures that are presented in our paper. Additionally, this folder contains CSV files with the data that are required to reproduce the figures.
* Folder "Tables": folder for the tables that are generated by the R scripts in this data and code archive and that are presented in our paper. In order to reproduce our entire analysis 'from scratch', the files in this folder should be deleted. We provide these tables so that people who replicate our analysis can more easily compare the tables that they get with the tables that are presented in our paper.
* Folder "logFiles": the shell script runAll.sh writes the output of each R script that it runs into this folder. We provide these log files so that people who replicate our analysis can more easily compare the R output that they get with the R output that we got.
* PrepareCowsData.R: R script that imports the raw data set COWS.xlsx and prepares it for the further analyses
* PrepareWeatherData.R: R script that imports the raw data sets TMAX.csv, VPD.csv, and countynamesandID.csv, merges these three data sets, and prepares the data for the further analyses
* PrepareMilkData.R: R script that imports the raw data set milk_state.xlsx and prepares it for the further analyses
* CalcFrequenciesTHI_Temp.R: R script that calculates the frequencies of days with the different THI bins and the different temperature bins in each month for each state
* CalcAvgTHI.R: R script that calculates the average THI in each state
* PreparePanelTHI.R: R script that creates a state-month panel/longitudinal data set with exposure to the different THI bins
* PreparePanelTemp.R: R script that creates a state-month panel/longitudinal data set with exposure to the different temperature bins
* PreparePanelFinal.R: R script that creates the state-month panel/longitudinal data set with all variables (e.g., THI bins, temperature bins, milk yield) that are used in our statistical analyses
* EstimateTrendsTHI.R: R script that estimates the trends of the frequencies of the different THI bins within our sampling period for each state in our data set
* EstimateModels.R: R script that estimates all model specifications that are used for generating results that are presented in the paper or for comparing or testing different model specifications
* CalcCoefStateYear.R: R script that calculates the effects of each THI bin on the milk yield for all combinations of states and years based on our 'final' model specification
* SearchWeightMonths.R: R script that estimates our 'final' model specification with different values of the weight of the temporal component relative to the weight of the spatial component in the temporally and spatially correlated error term
* TestModelSpec.R: R script that applies Wald tests and Likelihood-Ratio tests to compare different model specifications and creates Table S10
* CreateFigure1a.R: R script that creates subfigure a of Figure 1
* CreateFigure1b.R: R script that creates subfigure b of Figure 1
* CreateFigure2a.R: R script that creates subfigure a of Figure 2
* CreateFigure2b.R: R script that creates subfigure b of Figure 2
* CreateFigure2c.R: R script that creates subfigure c of Figure 2
* CreateFigure3.R: R script that creates the subfigures of Figure 3
* CreateFigure4.R: R script that creates the subfigures of Figure 4
* CreateFigure5_TableS6.R: R script that creates the subfigures of Figure 5 and Table S6
* CreateFigureS1.R: R script that creates Figure S1
* CreateFigureS2.R: R script that creates Figure S2
* CreateTableS2_S3_S7.R: R script that creates Tables S2, S3, and S7
* CreateTableS4_S5.R: R script that creates Tables S4 and S5
* CreateTableS8.R: R script that creates Table S8
* CreateTableS9.R: R script that creates Table S9
Scenario
The director of marketing at Cyclistic, a bike-share company in Chicago, believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently and design a new marketing strategy to convert casual riders into annual members.
Objective/Purpose
Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. The manager and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
Business Task
Identify trends to better understand user purchase behavior and recommend marketing strategies to convert casual riders into annual members.
Data sources used
We have used Cyclistic’s historical trip data to analyze and identify trends. We will use 12 months of Cyclistic trip data from January 2021 to December 2021. This is public data that we will use to explore how different customer types are using Cyclistic's bikes.
Documentation of any cleaning or manipulation of data
The dataset from January 2021 to December 2021 is more than 1 GB in size, so it is somewhat difficult to perform data manipulation and transformation in spreadsheets. We can instead use SQL or R, as they are more capable of handling larger files; we will be using R to perform the above-mentioned actions. I have prepared the analysis using only the first quarter, i.e. January to March 2021.
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0) https://creativecommons.org/licenses/by-nc-nd/4.0/
License information was derived automatically
The Department of Planning and Environment – Water is working to make models and data publicly available. These can be grouped into three high-level categories:
1) Climate Data: The fundamental input for water models is climate data in the form of daily rainfall and potential evapotranspiration. This data is input to water models of varying types, purposes, and complexity. The water models transform this input data to produce a range of water-related modelled data. There are three sub-categories of climate data used in our water models:
- observed data: The observed data is downloaded from the SILO database (https://www.longpaddock.qld.gov.au/silo/), which has data from 1889 to present based on recorded rainfall at thousands of locations, and derived data for various evapotranspiration data sets. We use patched-point rainfall, Morton’s wet area potential evapotranspiration, and Morton’s lake evaporation from SILO.
- stochastic data: The stochastic data are 10,000-year daily data sets of rainfall and potential evapotranspiration generated using observed data sets combined with palaeoclimate data. This work has been undertaken by researchers at the University of Adelaide and the University of Newcastle and used in Regional Water Strategies.
- stochastic data perturbed by results from climate models for projected greenhouse gas emission scenarios: The climate change perturbed data are 10,000-year daily data sets of rainfall and potential evapotranspiration developed by combining the stochastic data with changes in climate based on results of the NARCliM climate models that show the greatest reductions in rainfall. Note: The Department does not own the IP of NARCliM products to release any climate data (such as perturbed stochastic data) with NARCliM climate projections. NARCliM data is available in the public domain for users to download directly, such as https://climatedata-beta.environment.nsw.gov.au/
2) Water Models: (Not yet released) There are three subcategories of water models that we develop and maintain, with catchment models the fundamental unit. These can be linked to form pre-development models of river systems, which are further developed by adding water infrastructure, demands, and management arrangements to form a full unregulated or regulated river system model.
3) Modelled Data: (Partially released) The dataset comprises the outcomes generated by water models, encompassing a comprehensive array of findings pertaining to various aspects of the water balance. These findings encompass, but are not restricted to, factors such as flow, diversions, water storage, and allocations, with an initial emphasis on flow.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A Perfectly Accurate, Synthetic dataset featuring a virtual railway EnVironment for Multi-View Stereopsis (RailEnV-PASMVS) is presented, consisting of 40 scenes and 79,800 renderings together with ground truth depth maps, extrinsic and intrinsic camera parameters and binary segmentation masks of all the track components and surrounding environment. Every scene is rendered from a set of 3 cameras, each positioned relative to the track for optimal 3D reconstruction of the rail profile. The set of cameras is translated across the 100-meter length of tangent (straight) track to yield a total of 1,995 camera views. Photorealistic lighting of each of the 40 scenes is achieved with the implementation of high-definition, high dynamic range (HDR) environmental textures. Additional variation is introduced in the form of camera focal lengths, random noise for the camera location and rotation parameters and shader modifications of the rail profile. Representative track geometry data is used to generate random and unique vertical alignment data for the rail profile for every scene. This primary, synthetic dataset is augmented by a smaller image collection consisting of 320 manually annotated photographs for improved segmentation performance. The specular rail profile represents the most challenging component for MVS reconstruction algorithms, pipelines and neural network architectures, increasing the ambiguity and complexity of the data distribution. RailEnV-PASMVS represents an application specific dataset for railway engineering, against the backdrop of existing datasets available in the field of computer vision, providing the precision required for novel research applications in the field of transportation engineering.
File descriptions
Steps to reproduce
The open source Blender software suite (https://www.blender.org/) was used to generate the dataset, with the entire pipeline developed using the exposed Python API interface. The camera trajectory is kept fixed for all 40 scenes, except for small perturbations introduced in the form of random noise to increase the camera variation. The camera intrinsic information was initially exported as a single CSV file (scene.csv) for every scene, from which the camera information files were generated; this includes the focal length (focalLengthmm), image sensor dimensions (pixelDimensionX, pixelDimensionY), position, coordinate vector (vectC) and rotation vector (vectR). The STL model files, as provided in this data repository, were exported directly from Blender, such that the geometry/scenes can be reproduced. The data processing below is written for a Python implementation, transforming the information from Blender's coordinate system into universal rotation (R_world2cv) and translation (T_world2cv) matrices.
import numpy as np
from scipy.spatial.transform import Rotation as R

# The values below are read from scene.csv for each camera; placeholder values are shown here
# so the snippet can be run stand-alone.
focalLengthmm   = 35.0                      # focal length in mm (focalLengthmm)
sensorWidthmm   = 32.0                      # sensor width in mm
pixelDimensionX = 1920                      # image width in pixels (pixelDimensionX)
pixelDimensionY = 1080                      # image height in pixels (pixelDimensionY)
vectR = [60.0, 0.0, 45.0]                   # camera rotation vector (Euler angles, degrees)
vectC = [10.0, -5.0, 2.0]                   # camera coordinate (position) vector

# The intrinsic matrix K is constructed using the following formulation:
focalLengthPixel = focalLengthmm * pixelDimensionX / sensorWidthmm
K = np.array([[focalLengthPixel, 0, pixelDimensionX / 2],
              [0, focalLengthPixel, pixelDimensionY / 2],
              [0, 0, 1]])

# The rotation vector as provided by Blender is first transformed to a rotation matrix:
r = R.from_euler('xyz', vectR, degrees=True)
matR = r.as_matrix()

# Transpose the rotation matrix to obtain the matrix from the WORLD to the BLENDER coordinate system:
R_world2bcam = np.transpose(matR)

# The matrix describing the transformation from BLENDER to CV/STANDARD coordinates is:
R_bcam2cv = np.array([[1, 0, 0],
                      [0, -1, 0],
                      [0, 0, -1]])

# Thus the rotation from WORLD to CV/STANDARD coordinates is:
R_world2cv = R_bcam2cv.dot(R_world2bcam)

# The camera coordinate vector requires a similar transformation from BLENDER to WORLD coordinates:
T_world2bcam = -1 * R_world2bcam.dot(vectC)
T_world2cv = R_bcam2cv.dot(T_world2bcam)
The resulting R_world2cv and T_world2cv matrices are written to the camera information file using exactly the same format as that of BlendedMVS developed by Dr. Yao. The original rotation and translation information can be found by following the process in reverse. Note that additional steps were required to convert from Blender's unique coordinate system to that of OpenCV; this ensures universal compatibility in the way that the camera intrinsic and extrinsic information is provided.
Equivalent GPS information is provided (gps.csv), whereby the local coordinate frame is transformed into equivalent GPS information, centered around the Engineering 4.0 campus, University of Pretoria, South Africa. This information is embedded within the JPG files as EXIF data.
https://cdla.io/permissive-1-0/
Case study: How does a bike-share navigate speedy success?
Scenario:
As a data analyst on Cyclistic's marketing team, our focus is on enhancing annual memberships to drive the company's success. We aim to analyze the differing usage patterns between casual riders and annual members to craft a marketing strategy aimed at converting casual riders. Our recommendations, supported by data insights and professional visualizations, await Cyclistic executives' approval to proceed.
About the company
In 2016, Cyclistic launched a bike-share program in Chicago, growing to 5,824 bikes and 692 stations. Initially, their marketing aimed at broad segments with flexible pricing plans attracting both casual riders (single-ride or full-day passes) and annual members. However, recognizing that annual members are more profitable, Cyclistic is shifting focus to convert casual riders into annual members. To achieve this, they plan to analyze historical bike trip data to understand the differences and preferences between the two user groups, aiming to tailor marketing strategies that encourage casual riders to purchase annual memberships.
Project Overview:
This capstone project is a culmination of the skills and knowledge acquired through the Google Professional Data Analytics Certification. It focuses on Track 1, which is centered around Cyclistic, a fictional bike-share company modeled to reflect real-world data analytics scenarios in the transportation and service industry.
Dataset Acknowledgment:
We are grateful to Motivate Inc. for providing the dataset that serves as the foundation of this capstone project. Their contribution has enabled us to apply practical data analytics techniques to a real-world dataset, mirroring the challenges and opportunities present in the bike-sharing sector.
Objective:
The primary goal of this project is to analyze the Cyclistic dataset to uncover actionable insights that could help the company optimize its operations, improve customer satisfaction, and increase its market share. Through comprehensive data exploration, cleaning, analysis, and visualization, we aim to identify patterns and trends that inform strategic business decisions.
Methodology:
Data Collection: Utilizing the dataset provided by Motivate Inc., which includes detailed information on bike usage, customer behavior, and operational metrics.
Data Cleaning and Preparation: Ensuring the dataset is accurate, complete, and ready for analysis by addressing any inconsistencies, missing values, or anomalies.
Data Analysis: Applying statistical methods and data analytics techniques to extract meaningful insights from the dataset.
Visualization and Reporting:
Creating intuitive and compelling visualizations to present the findings clearly and effectively, facilitating data-driven decision-making.
Findings and Recommendations:
Conclusion:
The Cyclistic Capstone Project not only demonstrates the practical application of data analytics skills in a real-world scenario but also provides valuable insights that can drive strategic improvements for Cyclistic. This project showcases the power of data analytics in transforming data into actionable knowledge, underscoring the importance of data-driven decision-making in today's competitive business landscape.
Acknowledgments:
Special thanks to Motivate Inc. for their support and for providing the dataset that made this project possible. Their contribution is immensely appreciated and has significantly enhanced the learning experience.
STRATEGIES USED
Case Study Roadmap - ASK
●What is the problem you are trying to solve? ●How can your insights drive business decisions?
Key Tasks ● Identify the business task ● Consider key stakeholders
Deliverable ● A clear statement of the business task
Case Study Roadmap - PREPARE
● Where is your data located? ● Are there any problems with the data?
Key tasks ● Download data and store it appropriately. ● Identify how it’s organized.
Deliverable ● A description of all data sources used
Case Study Roadmap - PROCESS
● What tools are you choosing and why? ● What steps have you taken to ensure that your data is clean?
Key tasks ● Choose your tools. ● Document the cleaning process.
Deliverable ● Documentation of any cleaning or manipulation of data
Case Study Roadmap - ANALYZE
● Has your data been properly formatted? ● How will these insights help answer your business questions?
Key tasks ● Perform calculations ● Formatting
Deliverable ● A summary of analysis
Case Study Roadmap - SHARE
● Were you able to answer all questions of stakeholders? ● Can Data visualization help you share findings?
Key tasks ● Present your findings ● Create effective data viz.
Deliverable ● Supporting viz and key findings
**Case Study Roadmap - A...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The provided Python code implements the coupled framework between the discrete wavelet transform (DWT) and the active subspace method. Its goal is to perform temporal-scale-dependent model parameter sensitivity analysis. In the provided case, the methodology is coupled to R code containing the LuKARS model.
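To give a sense of the temporal-scale decomposition involved, here is a minimal hand-rolled Haar DWT sketch in R on a synthetic discharge-like series; it is not the repository's code, and the series is made up.
# Minimal single-level Haar DWT in R on a synthetic series (illustration only)
haar_dwt <- function(x) {
  stopifnot(length(x) %% 2 == 0)
  odd  <- x[seq(1, length(x), by = 2)]
  even <- x[seq(2, length(x), by = 2)]
  list(approx = (odd + even) / sqrt(2),   # coarse-scale (low-frequency) component
       detail = (odd - even) / sqrt(2))   # fine-scale (high-frequency) component
}

set.seed(42)
q <- 5 + sin(seq(0, 8 * pi, length.out = 256)) + rnorm(256, sd = 0.2)  # synthetic discharge series
level1 <- haar_dwt(q)               # scale 1: detail captures the fastest fluctuations
level2 <- haar_dwt(level1$approx)   # scale 2: apply again to the approximation for coarser scales
str(level2)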
The folder named 'as_dwt' contains the entire source code of the methodology as well as the required data of the Kerschbaum spring case study.
The subfolder uq_tools contains supplementary python scripts that can be used for analyses that go beyond the methodology proposed in the WRR article.
The subfolder examples contains a folder called 'as_wavelets', in which the relevant python scripts are stored.
The data and the LuKARS model (R file) can be found from this directory in 'scens/scen_main'.
The LuKARS model is given by the file 'main_exe.R'.
The precipitation and discharge data is stored in 'kerschbaum.txt'.
The monthly mean temperatures (needed for Thornthwaite's ET method) are stored in 'monthly_mean_temp.csv'.
The daily temperature values and snow depths are stored in 'snow_waidhofen.csv'.
SWATH-MS is a targeted proteomics acquisition and analysis technique that enables measuring several thousand proteins with high reproducibility and accuracy across many samples. OpenSWATH is popular open-source software for peptide identification and quantification from SWATH-MS data. For downstream statistical and quantitative analysis there exist different tools such as MSstats, mapDIA and aLFQ. However, the transfer of data from OpenSWATH to the downstream statistical tools is currently technically challenging. Here we introduce the R/Bioconductor package SWATH2stats, which allows convenient processing of the data into a format directly readable by the downstream analysis tools. In addition, SWATH2stats allows annotation, analyzing the variation and the reproducibility of the measurements, FDR estimation, and advanced filtering before submitting the processed data to downstream tools. These functionalities are important to quickly analyze the quality of the SWATH-MS data. Hence, SWATH2stats is a new open-source tool that summarizes several practical functionalities for analyzing, processing, and converting SWATH-MS data and thus facilitates the efficient analysis of large-scale SWATH/DIA datasets.
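A rough sketch of such a workflow in R is shown below. The SWATH2stats function names are assumptions recalled from the package vignette and may differ between versions, and the file names are placeholders; consult the package documentation for the exact API.
# Sketch of an OpenSWATH-to-MSstats handover via SWATH2stats
# (function names are assumptions from the vignette; file names are placeholders)
library(SWATH2stats)
library(data.table)

raw   <- data.frame(fread("openswath_feature_alignment_output.tsv"))  # OpenSWATH/pyprophet export
annot <- read.delim("sample_annotation.txt")       # maps MS run file names to condition/replicate

data_annot <- sample_annotation(raw, annot)        # annotate runs with the study design
assess_decoy_rate(data_annot)                      # quick quality check of the decoy fraction
data_filt  <- filter_mscore(data_annot, 0.01)      # filter identifications by m-score

msstats_in <- convert4MSstats(data_filt)           # reshape for downstream analysis in MSstats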
Open Data Commons Attribution License (ODC-By) v1.0 https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
This dataset is a transformation of Greg Kolodziejzyk's remote viewing data (see Related datasets below). Greg used a "rapid-fire" technique whereby several short free-response remote viewing trials were completed in a single session. The trial-level data were transformed by Adrian Ryan into session-level Z-scores via the exact binomial distribution, so that the data could be combined with those from other experiments for the analyses reported here.
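As an illustration of this kind of transformation, the R sketch below converts a session's hit count to a Z-score through the exact binomial distribution. The specific convention (upper-tail p-value mapped through the normal quantile) and the example numbers are assumptions; the archived analysis may use a different variant (e.g. a mid-p correction).
# Minimal sketch: session-level Z-score via the exact binomial distribution (convention assumed)
session_z <- function(hits, trials, p_chance) {
  p_upper <- pbinom(hits - 1, size = trials, prob = p_chance, lower.tail = FALSE)  # P(X >= hits)
  qnorm(1 - p_upper)   # large positive Z when hits exceed chance expectation
}

# Hypothetical example: 12 hits in 40 rapid-fire trials with a 1-in-4 chance per trial
session_z(hits = 12, trials = 40, p_chance = 0.25)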
Three files are included: