This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript:

Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.

Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.

Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time.
Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created.

These files are included:
Drying2022.csv: drying rate data for the 2022 experimental run
Weather2022.csv: weather data for the 2022 experimental run
Drying2023.csv: drying rate data for the 2023 experimental run
Weather2023.csv: weather data for the 2023 experimental run
disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
disinfestant_drying_analysis.html: rendered output of notebook
MS_figures.R: additional R code to create figures formatted for journal requirements
fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
data_dictionary.xlsx: descriptions of each column in the CSV data files
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This dataset was developed by NREL's distributed energy systems integration group as part of a study on high penetrations of distributed solar PV [1]. It consists of hourly load data in CSV format for use with the PNNL taxonomy of distribution feeders [2]. These feeders were developed in the open source GridLAB-D modelling language [3]. In this dataset each of the load points in the taxonomy feeders is populated with hourly averaged load data from a utility in the feeder's geographical region, scaled and randomized to emulate real load profiles. For more information on the scaling and randomization process, see [1].
The taxonomy feeders are statistically representative of the various types of distribution feeders found in five geographical regions of the U.S. Efforts are underway (possibly complete) to translate these feeders into the OpenDSS modelling language.
This data set consists of one large CSV file for each feeder. Within each CSV, each column represents one load bus on the feeder. The header row lists the name of the load bus. The subsequent 8760 rows represent the loads for each hour of the year. The loads were scaled and randomized using a Python script, so each load series represents only one of many possible randomizations. In the header row, "rl" = residential load and "cl" = commercial load. Commercial loads are followed by a phase letter (A, B, or C). For regions 1-3, the data is from 2009. For regions 4-5, the data is from 2000.
For use in GridLAB-D, each column will need to be separated into its own CSV file without a header. The load value goes in the second column, and corresponding datetime values go in the first column, as shown in the sample file, sample_individual_load_file.csv. Only the first value in the time column needs to be written as an absolute time; subsequent times may be written in relative format (e.g. "+1h", as in the sample). The load should be written in P+Qj format, as seen in the sample CSV, in units of Watts (W) and Volt-amps reactive (VAr). This dataset was derived from metered load data and hence includes only real power; reactive power can be generated by assuming an appropriate power factor. These loads were used with GridLAB-D version 2.2.
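The per-load conversion described above can be sketched in Python as follows. This is a minimal illustration, not the script used to build the dataset: the function name `to_gridlabd_rows`, the 0.95 power factor, and the timestamp string are assumptions, and the exact timestamp and number formatting should be checked against sample_individual_load_file.csv.

```python
import math


def to_gridlabd_rows(real_power_w, start_time, power_factor=0.95):
    """Return header-less CSV rows: time in column 1, P+Qj load in column 2.

    real_power_w: hourly real power values in W for one load bus.
    start_time:   absolute timestamp string for the first hour (assumed format).
    power_factor: assumed lagging power factor used to derive reactive power.
    """
    # Q = P * tan(arccos(pf)) for a load at the assumed power factor
    q_per_p = math.tan(math.acos(power_factor))
    rows = []
    for hour, p in enumerate(real_power_w):
        # Only the first row carries an absolute time; later rows are relative
        stamp = start_time if hour == 0 else "+1h"
        q = p * q_per_p
        rows.append(f"{stamp},{p:.1f}{q:+.1f}j")
    return rows


if __name__ == "__main__":
    for row in to_gridlabd_rows([1000.0, 1200.0, 900.0], "2009-01-01 00:00:00"):
        print(row)
```

Writing one such list of rows per column of a feeder CSV gives the individual, header-less files that GridLAB-D expects.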
Browse files in this dataset, accessible as individual files and as a single ZIP file. This dataset is approximately 242MB compressed or 475MB uncompressed.
For questions about this dataset, contact andy.hoke@nrel.gov.
If you find this dataset useful, please mention NREL and cite [1] in your work.
References:
[1] A. Hoke, R. Butler, J. Hambrick, and B. Kroposki, "Steady-State Analysis of Maximum Photovoltaic Penetration Levels on Typical Distribution Feeders," IEEE Transactions on Sustainable Energy, April 2013, available at http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6357275 .
[2] K. Schneider, D. P. Chassin, R. Pratt, D. Engel, and S. Thompson, "Modern Grid Initiative Distribution Taxonomy Final Report", PNNL, Nov. 2008. Accessed April 27, 2012: http://www.gridlabd.org/models/feeders/taxonomy of prototypical feeders.pdf
[3] K. Schneider, D. Chassin, Y. Pratt, and J. C. Fuller, "Distribution power flow for smart grid technologies", IEEE/PES Power Systems Conference and Exposition, Seattle, WA, Mar. 2009, pp. 1-7, 15-18.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Categorical scatterplots with R for biologists: a step-by-step guide
Benjamin Petre1, Aurore Coince2, Sophien Kamoun1
1 The Sainsbury Laboratory, Norwich, UK; 2 Earlham Institute, Norwich, UK
Weissgerber and colleagues (2015) recently stated that "as scientists, we urgently need to change our practices for presenting continuous data in small sample size studies". They called for more scatterplot and boxplot representations in scientific papers, which "allow readers to critically evaluate continuous data" (Weissgerber et al., 2015). In the Kamoun Lab at The Sainsbury Laboratory, we recently implemented a protocol to generate categorical scatterplots (Petre et al., 2016; Dagdas et al., 2016). Here we describe the three steps of this protocol: 1) formatting of the data set in a .csv file, 2) execution of the R script to generate the graph, and 3) export of the graph as a .pdf file.
Protocol
• Step 1: format the data set as a .csv file. Store the data in a three-column Excel file as shown in the Powerpoint slide. The first column "Replicate" indicates the biological replicates. In the example, the month and year during which the replicate was performed are indicated. The second column "Condition" indicates the conditions of the experiment (in the example, a wild type and two mutants called A and B). The third column "Value" contains continuous values. Save the Excel file as a .csv file (File -> Save as -> in "File Format", select .csv). This .csv file is the input file to import in R.
• Step 2: execute the R script (see Notes 1 and 2). Copy the script shown in the Powerpoint slide and paste it in the R console. Execute the script. In the dialog box, select the input .csv file from step 1. The categorical scatterplot will appear in a separate window. Dots represent the values for each sample; colors indicate replicates. Boxplots are superimposed; black dots indicate outliers.
• Step 3: save the graph as a .pdf file. Shape the window at your convenience and save the graph as a .pdf file (File -> Save as). See the Powerpoint slide for an example.
Notes
• Note 1: install the ggplot2 package. The R script requires the package "ggplot2" to be installed. To install it, Packages & Data -> Package Installer -> enter "ggplot2" in the Package Search space and click on "Get List". Select "ggplot2" in the Package column and click on "Install Selected". Install all dependencies as well.
• Note 2: use a log scale for the y-axis. To use a log scale for the y-axis of the graph, use the command line below in place of command line #7 in the script.
replicates
graph + geom_boxplot(outlier.colour='black', colour='black') + geom_jitter(aes(col=Replicate)) + scale_y_log10() + theme_bw()
References
Dagdas YF, Belhaj K, Maqbool A, Chaparro-Garcia A, Pandey P, Petre B, et al. (2016) An effector of the Irish potato famine pathogen antagonizes a host autophagy cargo receptor. eLife 5:e10856.
Petre B, Saunders DGO, Sklenar J, Lorrain C, Krasileva KV, Win J, et al. (2016) Heterologous Expression Screens in Nicotiana benthamiana Identify a Candidate Effector of the Wheat Yellow Rust Pathogen that Associates with Processing Bodies. PLoS ONE 11(2):e0149035
Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond Bar and Line Graphs: Time for a New Data Presentation Paradigm. PLoS Biol 13(4):e1002128
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the csv into R. Here is the code for how you do this:
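# Load the cleaned Uber data
uber_df <- read.csv('uber.csv')
# Keep only the rows for one car type (here, 'Black')
df_black <- subset(uber_df, name == 'Black')
# Save the subset under a file name of your choice
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
# Show the working directory, which is where the file was saved
getwd()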
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
# Annotated 12 lead ECG dataset

Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents and medical students. It is used as the test set in the paper "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network".

It contains annotations about 6 different ECG abnormalities:
- 1st degree AV block (1dAVb);
- right bundle branch block (RBBB);
- left bundle branch block (LBBB);
- sinus bradycardia (SB);
- atrial fibrillation (AF); and,
- sinus tachycardia (ST).

## Folder content:

- `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exam. The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all have the same size (4096 samples) we fill them with zeros on both sides. For instance, for a 7 seconds ECG signal with 2800 samples we include 648 samples at the beginning and 648 samples at the end, yielding 4096 samples that are then saved in the hdf5 dataset. All signals are represented as floating point numbers at the scale 1e-4V: so they should be multiplied by 1000 in order to obtain the signals in V. In Python, one can read this file using the following sequence:
```python
import h5py
import numpy as np

# Path to the tracings file; adjust to where the dataset was downloaded
path_to_hdf5 = "ecg_tracings.hdf5"
with h5py.File(path_to_hdf5, "r") as f:
    x = np.array(f['tracings'])
```
- The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header). The i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
- `annotations/`: folder containing annotations in csv format. Each csv file contains 827 lines (plus the header). The i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5` in all csv files. The csv files all have 6 columns `1dAVb, RBBB, LBBB, SB, AF, ST` corresponding to whether the annotator has detected the abnormality in the ECG (`=1`) or not (`=0`).
  1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
  2. `gold_standard.csv` gold standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agree, the common diagnosis was considered as gold standard. In cases where there was any disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
  3. `dnn.csv` prediction from the deep neural network described in "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set in such a way that it maximizes the F1 score.
  4. `cardiology_residents.csv` annotations from two 4th year cardiology residents (each annotated half of the dataset).
  5. `emergency_residents.csv` annotations from two 3rd year emergency residents (each annotated half of the dataset).
  6. `medical_students.csv` annotations from two 5th year medical students (each annotated half of the dataset).
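The zero-padding arithmetic described for `ecg_tracings.hdf5` can be checked with a short sketch. The helper names `padding_per_side` and `strip_padding` are hypothetical, and the sketch assumes the original duration of each tracing (7 or 10 seconds) is already known, which the dataset description does not say how to determine.

```python
def padding_per_side(n_samples, target=4096):
    """Number of zeros added before and after a recording to reach `target`."""
    total = target - n_samples
    if total < 0 or total % 2:
        raise ValueError("recording cannot be padded symmetrically")
    return total // 2


def strip_padding(signal, duration_s, fs=400):
    """Return the central, unpadded part of one lead (any sliceable sequence)."""
    n = duration_s * fs  # original number of samples at 400 Hz
    pad = padding_per_side(n, len(signal))
    return signal[pad:pad + n]


if __name__ == "__main__":
    print(padding_per_side(2800))  # 7 s recording
    print(padding_per_side(4000))  # 10 s recording
```

For a 7 second recording this gives the 648 samples per side stated above; for a 10 second recording the same rule gives 48 samples per side.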
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has been updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study, including global arrays summarizing five-year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data, available as both an array (.nc) and a data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data" folders primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data" folder also contains annual (2016-2020) MODIS land cover data used in the analysis, with separate folders containing the original data (.hdf) and the final processed (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhorse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the basic stand alone medicare claims public use files (bsapufs) with r and monetdb

the centers for medicare and medicaid services (cms) took the plunge. the famous medicare 5% sample has been released to the public, free of charge. jfyi - medicare is the u.s. government program that provides health insurance to 50 million elderly and disabled americans. the basic stand alone medicare claims public use files (bsapufs) contain either person- or event-level data on inpatient stays, durable medical equipment purchases, prescription drug fills, hospice users, doctor visits, home health provision, outpatient hospital procedures, skilled nursing facility short-term residents, as well as aggregated statistics for medicare beneficiaries with chronic conditions and medicare beneficiaries living in nursing homes. oh sorry, there's one catch: they only provide sas scripts to analyze everything. cue the villain music. that bored old game of monopoly ends today.

the initial release of the 2008 bsapufs was accompanied by some major fanfare in the world of health policy, a big win for government transparency. unfortunately, the final files that cleared the confidentiality hurdles are heavily de-identified and obfuscated. prime examples:
- none of the files can be linked to any other file. not across years, not across expenditure categories
- costs are rounded to the nearest fifth or tenth dollar at lower values, nearest thousandth at higher values
- ages are categorized into five year bands

so these files are baldly inferior to the unsquelched, linkable data only available through an expensive formal application process. any researcher with a budget flush enough to afford a sas license (the only statistical software mentioned in the cms official documentation) can probably also cough up the money to buy the identifiable data through resdac (resdac, btw, rocks).
soapbox: cms released free public data sets that could only be analyzed with a software package costing thousands of dollars. so even though the actual data sets were free, researchers still needed deep pockets to buy sas. meanwhile, the unsquelched and therefore superior data sets are also available for many thousands of dollars. researchers with funding would (reasonably) just buy the better data. researchers without any financial resources - the target audience of free, public data - were left out in the cold. no wonder these bsapufs haven't been used much. that ends now. using r, monetdb, and the personal computer you already own (mine cost $700 in 2009), researchers can, for the first time, seriously analyze these medicare public use files without spending another dime. woah. plus hey guess what all you researcher fat-cats with your federal grant streams and your proprietary software licenses: r + monetdb runs one heckuva lot faster than sas. woah^2. dump your sas license water wings and learn how to swim.

the scripts below require monetdb. click here for step-by-step instructions of how to install it on windows and click here for speed tests. vroom.

since the bsapufs comprise 5% of the medicare population, ya generally need to multiply any counts or sums by twenty. although the individuals represented in these claims are randomly sampled, this data should not be treated like a complex survey sample, meaning that the creation of a survey object is unnecessary. most bsapufs generalize to either the total or fee-for-service medicare population, but each file is different so give the documentation a hard stare before that eureka moment.
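a rough sketch of that multiply-by-twenty rule, in case it helps. this one is python with the built-in sqlite3 module standing in for monetdb (a swap made purely so the example runs anywhere, not what the scripts here use). the table name `inpatient08`, the column names, and the three toy rows are all made up for illustration.

```python
import sqlite3

SCALE = 20  # a 5% sample means each record stands for 1 / 0.05 = 20 records

# toy in-memory table standing in for one imported bsapuf csv (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inpatient08 (age_band TEXT, payment REAL)")
conn.executemany(
    "INSERT INTO inpatient08 VALUES (?, ?)",
    [("65-69", 5000.0), ("65-69", 7000.0), ("70-74", 9000.0)],
)


def national_estimates(conn):
    """Count and sum per age band, scaled from the 5% sample to the population."""
    query = """
        SELECT age_band, COUNT(*) * ?, SUM(payment) * ?
        FROM inpatient08
        GROUP BY age_band
        ORDER BY age_band
    """
    return conn.execute(query, (SCALE, SCALE)).fetchall()


if __name__ == "__main__":
    for row in national_estimates(conn):
        print(row)
```

no survey weights, no design object: just plain sql aggregates times twenty, which is the whole point of the paragraph above.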
this new github repository contains three scripts:

2008 - download all csv files.R
- loop through and download every zip file hosted by cms
- unzip the contents of each zipped file to the working directory

2008 - import all csv files into monetdb.R
- create the batch (.bat) file needed to initiate the monet database in the future
- loop through each csv file in the current working directory and import them into the monet database
- create a well-documented block of code to re-initiate the monetdb server in the future

2008 - replicate cms publications.R
- initiate the same monetdb server instance, using the same well-documented block of code as above
- replicate nine sets of statistics found in data tables provided by cms

click here to view these three scripts: https://github.com/ajdamico/usgsd/tree/master/Basic%20Stand%20Alone%20Medicare%20Claims%20Public%20Use%20Files

for more detail about the basic stand alone medicare claims public use files (bsapufs), visit:
- the centers for medicare and medicaid's bsapuf homepage
- a joint academyhealth webinar given by the organizations that partnered to create these files - cms, impaq, norc

notes: the replication script has oodles of easily-modified syntax and should be viewed for analysis examples. if you know the name of the data table you want to examine, you can quickly modify these general monetdb analysis examples too. just run sql queries - sas users, that's "proc...
This module series covers how to import, manipulate, format and plot time series data stored in .csv format in R. Originally designed to teach researchers to use NEON plant phenology and air temperature data; has been used in undergraduate classrooms.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Explanation/Overview: Corresponding dataset for the analyses and results achieved in the CS Track project in the research line on participation analyses, which is also reported in the publication "Does Volunteer Engagement Pay Off? An Analysis of User Participation in Online Citizen Science Projects", a conference paper for the conference CollabTech 2022: Collaboration Technologies and Social Computing and published as part of the Lecture Notes in Computer Science book series (LNCS, volume 13632). The usernames have been anonymised.

Purpose: The purpose of this dataset is to provide the basis to reproduce the results reported in the associated deliverable, and in the above-mentioned publication. As such, it does not represent raw data, but rather files that already include certain analysis steps (like calculated degrees or other SNA-related measures), ready for analysis, visualisation and interpretation with R.

Relatedness: The data of the different projects was derived from the forums of 7 Zooniverse projects based on similar discussion board features. The projects are: 'Galaxy Zoo', 'Gravity Spy', 'Seabirdwatch', 'Snapshot Wisconsin', 'Wildwatch Kenya', 'Galaxy Nurseries', 'Penguin Watch'.

Content: In this Zenodo entry, several files can be found. The structure is as follows (files and folders and descriptions), listed in the order given in the record:
corresponding_calculations.html: Quarto-notebook to view in browser
corresponding_calculations.qmd: Quarto-notebook to view in RStudio
assets
data
annotations
annotations.csv: List of annotations made per day for each of the analysed projects
comments
comments.csv: Total list of comments with several data fields (i.e., comment id, text, reply_user_id)
rolechanges
478_rolechanges.csv: List of roles per user to determine number of role changes
1104_rolechanges.csv: ...
...
totalnetworkdata
Edges
478_edges.csv: Network data (edge set) for the given projects (without time slices)
1104_edges.csv: ...
...
Nodes
478_nodes.csv: Network data (node set) for the given projects (without time slices)
1104_nodes.csv: ...
...
trajectories: Network data (edge and node sets) for the given projects and all time slices (Q1 2016 - Q4 2021)
478
Edges
edges_4782016_q1.csv
edges_4782016_q2.csv
edges_4782016_q3.csv
edges_4782016_q4.csv
...
Nodes
nodes_4782016_q1.csv
nodes_4782016_q4.csv
nodes_4782016_q3.csv
nodes_4782016_q2.csv
...
1104
Edges
...
Nodes
...
...
scripts
datavizfuncs.R: script for the data visualisation functions, automatically executed from within corresponding_calculations.qmd
import.R: script for the import of data, automatically executed from within corresponding_calculations.qmd
corresponding_calculations_files: files for the html/qmd view in the browser/RStudio

Grouping: The data is grouped according to given criteria (e.g., project_title or time). Accordingly, the respective files can be found in the data structure
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Reddit Diary of a Redditor VAD Dataset
Dataset Creation Process
Scraping Reddit Posts
Posts were scraped from the r/diaryofaredditor subreddit using the Reddit API. The script used for scraping is shown below:
import requests
import csv
import time

access_token = ""
headers = {
    "Authorization": f"bearer {access_token}",
    "User-Agent": "ChangeMeClient/0.1"
}

url = "https://oauth.reddit.com/r/diaryofaredditor/new"
params = {"limit": 100}
after = None

csv_path =… See the full description on the dataset page: https://huggingface.co/datasets/mmarkusmalone/journal-entries-emotion-detection-vad.
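Because the script above is cut off, the following is only a hedged sketch of how the `after` cursor it initialises is typically used to page through a listing; it is not the author's loop. The function `collect_posts`, the `fetch_page` callable, and `max_pages` are hypothetical, and the sketch assumes the usual Reddit listing layout in which posts sit under `data.children` and the next cursor under `data.after`.

```python
def collect_posts(fetch_page, limit=100, max_pages=10):
    """Follow a listing's `after` cursor until it runs out or max_pages is hit.

    fetch_page: callable taking a params dict and returning the parsed JSON
                listing (hypothetical; in practice it would wrap requests.get).
    """
    posts, after = [], None
    for _ in range(max_pages):
        params = {"limit": limit}
        if after:
            params["after"] = after  # ask for the page following the last one
        listing = fetch_page(params)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing.get("after")
        if not after:  # no cursor means there are no further pages
            break
    return posts


if __name__ == "__main__":
    # Two fake pages stand in for API responses so the sketch runs offline
    pages = {
        None: {"data": {"children": [{"data": {"id": "a"}}], "after": "t3_a"}},
        "t3_a": {"data": {"children": [{"data": {"id": "b"}}], "after": None}},
    }
    print(collect_posts(lambda params: pages[params.get("after")]))
```

With the variables from the script, `fetch_page` could be something like `lambda p: requests.get(url, headers=headers, params=p).json()`, with a pause between calls to respect rate limits.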
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains all the spectra used in the paper "Repeated double cross validation applied to the PCA-LDA classification of SERS spectra: a case study with serum samples from hepatocellular carcinoma patients", plus the R code to import the TXT (ASCII) files into a dataset, preprocess data, set-up and cross validate the PCA-LDA model and generate the figures shown in the paper.
Data are available in 2 different formats:
- 1 compressed archive ("dataset.zip") containing all the 144 TXT files (1 file = 1 spectrum)
- 1 single CSV file ("dataset.csv") with all the 144 spectra in the form of a table. The data are structured as follows, with each row being 1 spectrum, preceded by metadata: "acquisition_date", "substrate_batch", "class", "sample_code".
The code for R is available as a single file "Rcode.R".
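For readers who want to inspect "dataset.csv" outside R, the row layout described above (four metadata fields followed by the spectrum) can be read along these lines. This is a sketch under assumptions: it presumes a comma-separated file with one header row whose remaining column labels identify the spectral axis, and the function name `read_spectra` and the demo values are made up.

```python
import csv
import io

# Metadata fields that precede each spectrum, as listed in the description
META = ["acquisition_date", "substrate_batch", "class", "sample_code"]


def read_spectra(handle):
    """Split each row into a metadata dict and a list of float intensities."""
    reader = csv.reader(handle)
    header = next(reader)  # assumed: one header row naming every column
    n_meta = len(META)
    spectra = []
    for row in reader:
        meta = dict(zip(header[:n_meta], row[:n_meta]))
        intensities = [float(v) for v in row[n_meta:]]
        spectra.append((meta, intensities))
    # Return the non-metadata column labels alongside the parsed rows
    return header[n_meta:], spectra


if __name__ == "__main__":
    demo = io.StringIO(
        "acquisition_date,substrate_batch,class,sample_code,400,401\n"
        "2020-01-01,B1,HCC,S01,0.5,0.7\n"
    )
    axis, spectra = read_spectra(demo)
    print(axis, spectra)
```

In practice the handle would come from `open("dataset.csv", newline="")`; the in-memory demo only shows the expected shape.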
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
FosSahul 2.0 database and R code accompanying manuscript "FosSahul 2.0, an updated database for the Late Quaternary fossil records of Sahul" submitted to Scientific Data.

Excel files:
FosSahul2.0.csv: FosSahul database collating non-human vertebrate megafauna fossil records for the Sahul region. Note that location data have been rounded to one degree decimal and might not reflect the exact location of the fossil record. For more information on precise locations, contact the authors.
FosSahul2.0_metadata.xlsx: Column description and further detail on the FosSahul 2.0 database.
CalibratedC14Dates_FosSahul.csv: Calibrated radiocarbon dates for FosSahul 2.0. Needed for the calculation of the biodiversity index.
TimeBins.csv: Time bins needed for the calculation of the biodiversity index.

R-scripts:
FosSahul_Rating.R: Quality-rating algorithm for the FosSahul database.
FosSahul_Data import.R: Data import script necessary for the calculation of the biodiversity index.
FosSahul_Biodiversity_index_calculation.R: Code for the calculation of the biodiversity index.
https://spdx.org/licenses/CC0-1.0.html
Objective: To develop a clinical informatics pipeline designed to capture large-scale structured EHR data for a national patient registry.
Materials and Methods: The EHR-R-REDCap pipeline is implemented using R-statistical software to remap and import structured EHR data into the REDCap-based multi-institutional Merkel Cell Carcinoma (MCC) Patient Registry using an adaptable data dictionary.
Results: Clinical laboratory data were extracted from EPIC Clarity across several participating institutions. Labs were transformed, remapped and imported into the MCC registry using the EHR labs abstraction (eLAB) pipeline. Forty-nine clinical tests encompassing 482,450 results were imported into the registry for 1,109 enrolled MCC patients. Data-quality assessment revealed highly accurate, valid labs. Univariate modeling was performed for labs at baseline on overall survival (N=176) using this clinical informatics pipeline.
Conclusion: We demonstrate feasibility of the facile eLAB workflow. EHR data is successfully transformed, and bulk-loaded/imported into a REDCap-based national registry to execute real-world data analysis and interoperability.
Methods
eLAB Development and Source Code (R statistical software):
eLAB is written in R (version 4.0.3), and utilizes the following packages for processing: DescTools, REDCapR, reshape2, splitstackshape, readxl, survival, survminer, and tidyverse. Source code for eLAB can be downloaded directly (https://github.com/TheMillerLab/eLAB).
eLAB reformats EHR data abstracted for an identified population of patients (e.g. medical record numbers (MRN)/name list) under an Institutional Review Board (IRB)-approved protocol. The MCCPR does not host MRNs/names and eLAB converts these to MCCPR assigned record identification numbers (record_id) before import for de-identification.
Functions were written to remap EHR bulk lab data pulls/queries from several sources including Clarity/Crystal reports or institutional EDW including Research Patient Data Registry (RPDR) at MGB. The input, a csv/delimited file of labs for user-defined patients, may vary. Thus, users may need to adapt the initial data wrangling script based on the data input format. However, the downstream transformation, code-lab lookup tables, outcomes analysis, and LOINC remapping are standard for use with the provided REDCap Data Dictionary, DataDictionary_eLAB.csv. The available R-markdown (https://github.com/TheMillerLab/eLAB) provides suggestions and instructions on where or when upfront script modifications may be necessary to accommodate input variability.
The eLAB pipeline takes several inputs. For example, the input for use with the "ehr_format(dt)" single-line command is non-tabular data assigned as R object "dt" with 4 columns: 1) Patient Name (MRN), 2) Collection Date, 3) Collection Time, and 4) Lab Results wherein several lab panels are in one data frame cell. A mock dataset in this "untidy-format" is provided for demonstration purposes (https://github.com/TheMillerLab/eLAB).
Bulk lab data pulls often result in subtypes of the same lab. For example, potassium labs are reported as "Potassium," "Potassium-External," "Potassium(POC)," "Potassium,whole-bld," "Potassium-Level-External," "Potassium,venous," and "Potassium-whole-bld/plasma." eLAB utilizes a key-value lookup table with ~300 lab subtypes for remapping labs to the Data Dictionary (DD) code. eLAB reformats/accepts only those lab units pre-defined by the registry DD. The lab lookup table is provided for direct use or may be re-configured/updated to meet end-user specifications. eLAB is designed to remap, transform, and filter/adjust value units of semi-structured/structured bulk laboratory values data pulls from the EHR to align with the pre-defined code of the DD.
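The key-value remapping idea can be illustrated with a small sketch. eLAB itself is written in R; Python is used here only for illustration, and this is not eLAB's code. The lookup covers just the potassium subtypes named above, and the DD code "potassium", the function name `remap_labs`, and the example values are hypothetical.

```python
# Lab subtype names as they appear in EHR pulls, mapped to one DD code.
# The code "potassium" is a placeholder, not the registry's actual field name.
LAB_LOOKUP = {
    "Potassium": "potassium",
    "Potassium-External": "potassium",
    "Potassium(POC)": "potassium",
    "Potassium,whole-bld": "potassium",
    "Potassium-Level-External": "potassium",
    "Potassium,venous": "potassium",
    "Potassium-whole-bld/plasma": "potassium",
}


def remap_labs(results, lookup=LAB_LOOKUP):
    """Keep only labs found in the lookup table and rename them to the DD code.

    results: iterable of (lab_name, value) pairs from a bulk data pull.
    """
    remapped = []
    for name, value in results:
        code = lookup.get(name.strip())
        if code is not None:  # labs outside the lookup table are filtered out
            remapped.append((code, value))
    return remapped


if __name__ == "__main__":
    pull = [("Potassium(POC)", 4.1), ("Potassium,venous", 3.9), ("Unmapped lab", 1.0)]
    print(remap_labs(pull))
```

The same dictionary does double duty: it renames subtypes to a single code and, by omission, filters out labs the data dictionary does not define.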
Data Dictionary (DD)
EHR clinical laboratory data is captured in REDCap using the "Labs" repeating instrument (Supplemental Figures 1-2). The DD is provided for use by researchers at REDCap-participating institutions and is optimized to accommodate the same lab type captured more than once on the same day for the same patient. The instrument captures 35 clinical lab types. The DD serves several major purposes in the eLAB pipeline. First, it defines every lab type of interest and associated lab unit of interest with a set field/variable name. It also restricts/defines the type of data allowed for entry for each data field, such as string or numeric. The DD is uploaded into REDCap by every participating site/collaborator and ensures each site collects and codes the data the same way. Automation pipelines, such as eLAB, are designed to remap/clean and reformat data/units utilizing key-value look-up tables that filter and select only the labs/units of interest. eLAB ensures the data pulled from the EHR contains the correct unit and format pre-configured by the DD. The use of the same DD at every participating site ensures that the data field code, format, and relationships in the database are uniform across each site to allow for the simple aggregation of the multi-site data. For example, since every site in the MCCPR uses the same DD, aggregation is efficient and different site csv files are simply combined.
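Because every site shares one DD, combining site exports reduces to stacking files with identical columns. The base R sketch below writes two toy site files to a temporary folder and row-binds them; the file names and the `potassium` column are hypothetical.

```r
# Create two toy site exports with identical columns (illustrative data only).
site_dir <- tempfile("sites")
dir.create(site_dir)
write.csv(data.frame(record_id = 1:2, potassium = c(4.1, 3.8)),
          file.path(site_dir, "site_a.csv"), row.names = FALSE)
write.csv(data.frame(record_id = 3:4, potassium = c(4.4, 5.0)),
          file.path(site_dir, "site_b.csv"), row.names = FALSE)

# Read every site file in the folder.
files <- list.files(site_dir, pattern = "\\.csv$", full.names = TRUE)
site_tables <- lapply(files, read.csv)

# A shared DD implies identical column names; stop early if a site deviates.
same_cols <- vapply(site_tables,
                    function(x) identical(names(x), names(site_tables[[1]])),
                    logical(1))
stopifnot(all(same_cols))

# Stack the site tables into one multi-site table.
combined <- do.call(rbind, site_tables)
print(combined)
```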
Study Cohort
This study was approved by the MGB IRB. A search of the EHR was performed to identify patients diagnosed with MCC between 1975 and 2021 (N=1,109) for inclusion in the MCCPR. Subjects diagnosed with primary cutaneous MCC between 2016 and 2019 (N=176) were included in the test cohort for exploratory studies of lab result associations with overall survival (OS) using eLAB.
Statistical Analysis
OS is defined as the time from the date of MCC diagnosis to the date of death. Data were censored at the date of the last follow-up visit if no death event occurred. Univariable Cox proportional hazards modeling was performed for each lab predictor. Due to the hypothesis-generating nature of the work, p-values were exploratory and Bonferroni corrections were not applied.
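As a rough illustration of univariable Cox modeling repeated over lab predictors, the sketch below fits one model per lab with `coxph()` from the survival package. The data are simulated and the column names (`os_months`, `death`, `ldh`, `albumin`) are hypothetical; nothing here reproduces MCCPR data or results.

```r
library(survival)

set.seed(1)
n <- 100

# Simulated cohort: follow-up time, death indicator (1 = death, 0 = censored),
# and two made-up lab predictors.
cohort <- data.frame(
  os_months = rexp(n, rate = 0.05),
  death = rbinom(n, 1, 0.6),
  ldh = rnorm(n, 200, 40),
  albumin = rnorm(n, 4, 0.5)
)

lab_predictors <- c("ldh", "albumin")

# Fit one Cox model per lab and collect the hazard ratio and unadjusted p-value.
univariable <- do.call(rbind, lapply(lab_predictors, function(lab) {
  model_formula <- as.formula(paste("Surv(os_months, death) ~", lab))
  fit <- coxph(model_formula, data = cohort)
  s <- summary(fit)
  data.frame(
    lab = lab,
    hazard_ratio = unname(s$conf.int[1, "exp(coef)"]),
    p_value = unname(s$coefficients[1, "Pr(>|z|)"])
  )
}))

print(univariable)
```

The p-values are left unadjusted, in line with the exploratory framing above.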
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
analyze the consumer expenditure survey (ce) with r

the consumer expenditure survey (ce) is the primo data source to understand how americans spend money. participating households keep a running diary about every little purchase over the year. those diaries are then summed up into precise expenditure categories. how else are you gonna know that the average american household spent $34 (±2) on bacon, $826 (±17) on cellular phones, and $13 (±2) on digital e-readers in 2011? an integral component of the market basket calculation in the consumer price index, this survey recently became available as public-use microdata and they're slowly releasing historical files back to 1996. hooray!

for a taste of what's possible with ce data, look at the quick tables listed on their main page - these tables contain approximately a bazillion different expenditure categories broken down by demographic groups. guess what? i just learned that americans living in households with $5,000 to $9,999 of annual income spent an average of $283 (±90) on pets, toys, hobbies, and playground equipment (pdf page 3). you can often get close to your statistic of interest from these web tables. but say you wanted to look at domestic pet expenditure among only households with children between 12 and 17 years old. another one of the thirteen web tables - the consumer unit composition table - shows a few different breakouts of households with kids, but none matching that exact population of interest. the bureau of labor statistics (bls) (the survey's designers) and the census bureau (the survey's administrators) have provided plenty of the major statistics and breakouts for you, but they're not psychic. if you want to comb through this data for specific expenditure categories broken out by a you-defined segment of the united states' population, then let a little r into your life. fun starts now.

fair warning: only analyze the consumer expenditure survey if you are nerd to the core. the microdata ship with two different survey types (interview and diary), each containing five or six quarterly table formats that need to be stacked, merged, and manipulated prior to a methodologically-correct analysis. the scripts in this repository contain examples to prepare 'em all, just be advised that magnificent data like this will never be no-assembly-required. the folks at bls have posted an excellent summary of what's available - read it before anything else. after that, read the getting started guide. don't skim. a few of the descriptions below refer to sas programs provided by the bureau of labor statistics. you'll find these in the C:\My Directory\CES\2011\docs directory after you run the download program.

this new github repository contains three scripts:

2010-2011 - download all microdata.R
- loop through every year and download every file hosted on the bls's ce ftp site
- import each of the comma-separated value files into r with read.csv
- depending on user-settings, save each table as an r data file (.rda) or stata-readable file (.dta)

2011 fmly intrvw - analysis examples.R
- load the r data files (.rda) necessary to create the 'fmly' table shown in the ce macros program documentation.doc file
- construct that 'fmly' table, using five quarters of interviews (q1 2011 thru q1 2012)
- initiate a replicate-weighted survey design object
- perform some lovely li'l analysis examples
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using unimputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using unimputed variables
- create an rsqlite database (to minimize ram usage) containing the five imputed variable files, after identifying which variables were imputed based on pdf page 3 of the user's guide to income imputation
- initiate a replicate-weighted, database-backed, multiply-imputed survey design object
- perform a few additional analyses that highlight the modified syntax required for multiply-imputed survey designs
- replicate the %mean_variance() macro found in "ce macros.sas" and provide some examples of calculating descriptive statistics using imputed variables
- replicate the %compare_groups() macro found in "ce macros.sas" and provide some examples of performing t-tests using imputed variables
- replicate the %proc_reg() and %proc_logistic() macros found in "ce macros.sas" and provide some examples of regressions and logistic regressions using both unimputed and imputed variables

replicate integrated mean and se.R
- match each step in the bls-provided sas program "integrated mean and se.sas" but with r instead of sas
- create an rsqlite database when the expenditure table gets too large for older computers to handle in ram
- export a table "2011 integrated mean and se.csv" that exactly matches the contents of the sas-produced "2011 integrated mean and se.lst" text file
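To give a feel for what a replicate-weighted estimate involves, here is a hand-rolled base R sketch. It does not use the survey design objects that the scripts build, and it is not taken from them. It assumes a balanced-repeated-replication style estimator in which the variance is the average squared deviation of the replicate estimates from the full-sample estimate. The toy data, the column names (`pet_spend`, `full_weight`, `rep_weight1` and so on), and the use of only four replicates are all illustrative assumptions.

```r
set.seed(42)
n <- 50

# Toy consumer units with a made-up expenditure and full-sample weight.
cu <- data.frame(
  pet_spend = round(runif(n, 0, 500)),
  full_weight = runif(n, 5000, 15000)
)

# Mimic half-sample replicates: each unit gets either zero or double weight.
n_reps <- 4
for (r in seq_len(n_reps)) {
  cu[[paste0("rep_weight", r)]] <-
    cu$full_weight * sample(c(0, 2), n, replace = TRUE)
}

weighted_mean <- function(x, w) sum(x * w) / sum(w)

# Full-sample estimate, then the same statistic under each replicate weight.
theta_full <- weighted_mean(cu$pet_spend, cu$full_weight)
theta_reps <- vapply(seq_len(n_reps), function(r) {
  weighted_mean(cu$pet_spend, cu[[paste0("rep_weight", r)]])
}, numeric(1))

# Assumed estimator: mean squared deviation of replicates from the full estimate.
se_rep <- sqrt(mean((theta_reps - theta_full)^2))

cat("mean:", round(theta_full, 2), "se:", round(se_rep, 2), "\n")
```

A design object wraps this bookkeeping so that every statistic picks up the replicate weights automatically.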
[Note 2023-08-14 - Supersedes version 1, https://doi.org/10.15482/USDA.ADC/1528086 ]

This dataset contains all code and data necessary to reproduce the analyses in the manuscript: Mengistu, A., Read, Q. D., Sykes, V. R., Kelly, H. M., Kharel, T., & Bellaloui, N. (2023). Cover crop and crop rotation effects on tissue and soil population dynamics of Macrophomina phaseolina and yield under no-till system. Plant Disease. https://doi.org/10.1094/pdis-03-23-0443-re

The .zip archive cropping-systems-1.0.zip contains data and code files.

Data
- stem_soil_CFU_by_plant.csv: Soil disease load (SoilCFUg) and stem tissue disease load (StemCFUg) for individual plants in CFU per gram, with columns indicating year, plot ID, replicate, row, plant ID, previous crop treatment, cover crop treatment, and comments. Missing data are indicated with a period (".").
- yield_CFU_by_plot.csv: Yield data (YldKgHa) at the plot level in units of kg/ha, with columns indicating year, plot ID, replicate, and treatments, as well as means of soil and stem disease load at the plot level.

Code
- cropping_system_analysis_v3.0.Rmd: RMarkdown notebook with all data processing, analysis, and visualization code
- equations.Rmd: RMarkdown notebook with formatted equations
- formatted_figs_revision.R: R script to produce figures formatted exactly as they appear in the manuscript

The Rproject file cropping-systems.Rproj is used to organize the RStudio project. Scripts and notebooks used in older versions of the analysis are found in the testing/ subdirectory. Excel spreadsheets containing raw data from which the cleaned CSV files were created are found in the raw_data subdirectory.
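Since missing values in the plant-level file are coded as a period, they need to be declared when the CSV is read, or the affected columns will be imported as text. The sketch below shows this with a small inline stand-in for the file; `SoilCFUg` and `StemCFUg` are named in the description, while `Year`, `PlotID`, and the values are made up.

```r
# Inline stand-in for a few rows of the plant-level CSV (values are invented).
csv_text <- "Year,PlotID,SoilCFUg,StemCFUg
2019,101,1200,350
2019,102,.,410
2020,101,900,."

# Treat a lone period (and the default "NA") as missing so columns stay numeric.
plants <- read.csv(text = csv_text, na.strings = c(".", "NA"))

str(plants)
colMeans(plants[, c("SoilCFUg", "StemCFUg")], na.rm = TRUE)
```

For the real file, the `text =` argument would be replaced by the path to stem_soil_CFU_by_plant.csv.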
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example code list definition in csv format.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The useNews dataset has been compiled to enable the study of online news engagement. It relies on the MediaCloud and CrowdTangle APIs as well as on data from the Reuters Digital News Report. The entire dataset builds on data from 2019 and 2020 and covers a total of 12 countries. It is free to use (subject to citing/referencing it).
The data originates from both the 2019 and the 2020 Reuters Digital News Report (http://www.digitalnewsreport.org/), media content from MediaCloud (https://mediacloud.org/) for 2019 and 2020 from all news outlets that have been used most frequently in the respective year according to the survey data, and engagement metrics for all available news-article URLs through CrowdTangle (https://www.crowdtangle.com/).
The data comprise a total of eight data objects, namely one each for 2019 and 2020 for the survey, news-article meta information, news-article DFMs, and engagement metrics. To make your life easy, we've provided several packaged download options:
Also, if you are working with R, we have prepared a simple file to automatically download all necessary data (~1.5 GByte) at once: https://osf.io/fxmgq/
Note that all .rds files are .xz-compressed, which shouldn't bother you when you are in R. You can import all the .rds files through variable_name <- readRDS('filename.rds'). The .RData files (also .xz-compressed) can be imported by simply using load('filename.RData'), which will load several already named objects into your R environment. To import data through other programming languages, we also provide all data in respective CSV files. These files are rather large, however, which is why we have also .xz-compressed them. DFMs, unfortunately, are not available as CSVs due to their sparsity and size.
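The round trip below is a small base R sketch of how .xz-compressed files behave in R: `readRDS()` decompresses transparently, and a compressed CSV can be read through an `xzfile()` connection. The toy data frame and temporary file names are placeholders, not useNews files.

```r
# Toy stand-in for an engagement table (invented values).
toy <- data.frame(url = c("a", "b"), shares = c(10L, 25L),
                  stringsAsFactors = FALSE)

tmp_rds <- tempfile(fileext = ".rds")
tmp_csv <- tempfile(fileext = ".csv.xz")

# .rds: compression is chosen on write and handled automatically on read.
saveRDS(toy, tmp_rds, compress = "xz")
from_rds <- readRDS(tmp_rds)

# .csv.xz: write through an xz connection, then read back through one.
con <- xzfile(tmp_csv, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)
from_csv <- read.csv(xzfile(tmp_csv), stringsAsFactors = FALSE)

print(from_rds)
print(from_csv)
```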
Find out more about the data variables and dig into plenty of examples in the useNews-examples workbook: https://osf.io/snuk2/
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
The global wildlife trade network is a massive system that has been shown to threaten biodiversity, introduce non-native species and pathogens, and cause chronic animal welfare concerns. Despite its scale and impact, comprehensive characterization of the global wildlife trade is hampered by data that are limited in their temporal or taxonomic scope and detail. To help fill this gap, we present data on 15 years of the importation of wildlife and their derived products into the United States (2000–2014), originally collected by the United States Fish and Wildlife Service. We curated and cleaned the data and added taxonomic information to improve data usability. These data include >2 million wildlife or wildlife product shipments, representing >60 biological classes and >3.2 billion live organisms. Further, the majority of species in the dataset are not currently reported on by CITES parties. These data will be broadly useful to both scientists and policymakers seeking to better understand the volume, sources, biological composition, and potential risks of the global wildlife trade.

Resources in this dataset:

Resource Title: United States LEMIS wildlife trade data curated by EcoHealth Alliance (Version 1.1.0) - Zenodo.
File Name: Web Page, url: https://doi.org/10.5281/zenodo.3565869

Over 5.5 million USFWS LEMIS wildlife or wildlife product records spanning 15 years and 28 data fields. These records were derived from >2 million unique shipments processed by USFWS during the time period and represent >3.2 billion live organisms. We provide the final cleaned data as a single comma-separated value file. Original raw data as provided by the USFWS are also available. Although relatively large (~1 gigabyte), the cleaned data file can be imported into a software environment of choice for data analysis. Alternatively, the associated R package provides access to a release of the same cleaned dataset but with a data download and manipulation framework that is designed to work well with this large dataset. Both the Zenodo data repository and the R package contain a metadata file describing each of the data fields as well as a lookup table to retrieve full values for the abbreviated codes used throughout the dataset.

Contents:
- lemis_2000_2014_cleaned.csv: This file represents the compiled, cleaned LEMIS data from 2000-2014. This data is identical to the version 1.1.0 dataset available through the lemis R package.
- lemis_codes.csv: Full values for all coded values used in the LEMIS data. Identical to the output from the lemis R package function "lemis_codes()".
- lemis_metadata.csv: Data fields and field descriptions for all variables in the LEMIS data. Identical to the output from the lemis R package function "lemis_metadata()".
- raw_data.zip: This archive contains all of the raw LEMIS data files that are processed and cleaned with the code contained in the 'data-raw' subdirectory of the lemis R package repository.

Resource Software Recommended: R package, url: https://github.com/ecohealthalliance/lemis
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The LSC (Leicester Scientific Corpus)
April 2020 by Neslihan Suzen, PhD student at the University of Leicester (ns433@leicester.ac.uk)
Supervised by Prof Alexander Gorban and Dr Evgeny Mirkes

The data are extracted from the Web of Science [1]. You may not copy or distribute these data in whole or in part without the written consent of Clarivate Analytics.

[Version 2] A further cleaning is applied in Data Processing for LSC Abstracts in Version 1*. Details of the cleaning procedure are explained in Step 6.
* Suzen, Neslihan (2019): LSC (Leicester Scientific Corpus). figshare. Dataset. https://doi.org/10.25392/leicester.data.9449639.v1

Getting Started
This text provides information on the LSC (Leicester Scientific Corpus) and the pre-processing steps applied to abstracts, and describes the structure of the files that organise the corpus. The corpus was created for use in future work on the quantification of the meaning of research texts and is made available for use in Natural Language Processing projects.

LSC is a collection of abstracts of articles and proceeding papers published in 2014 and indexed by the Web of Science (WoS) database [1]. The corpus contains only documents in English. Each document in the corpus contains the following parts:
1. Authors: The list of authors of the paper
2. Title: The title of the paper
3. Abstract: The abstract of the paper
4. Categories: One or more categories from the list of categories [2]. The full list of categories is presented in the file "List_of _Categories.txt".
5. Research Areas: One or more research areas from the list of research areas [3]. The full list of research areas is presented in the file "List_of_Research_Areas.txt".
6. Total Times cited: The number of times the paper was cited by other items from all databases within the Web of Science platform [4]
7. Times cited in Core Collection: The total number of times the paper was cited by other papers within the WoS Core Collection [4]

The corpus was collected online in July 2018 and contains the number of citations from publication date to July 2018. We describe a document as the collection of information (about a paper) listed above. The total number of documents in LSC is 1,673,350.

Data Processing

Step 1: Downloading of the Data Online
The dataset was collected manually by exporting documents online as tab-delimited files. All documents are available online.

Step 2: Importing the Dataset to R
The LSC was collected as TXT files. All documents were imported into R.

Step 3: Cleaning the Data from Documents with Empty Abstract or without Category
As our research is based on the analysis of abstracts and categories, all documents with empty abstracts and documents without categories were removed.

Step 4: Identification and Correction of Concatenated Words in Abstracts
Medicine-related publications in particular use "structured abstracts". Such abstracts are divided into sections with distinct headings such as introduction, aim, objective, method, result, conclusion, etc. The tool used for extracting abstracts concatenates section headings with the first word of the section. For instance, we observe words such as ConclusionHigher and ConclusionsRT. Such words were detected and identified by sampling medicine-related publications with human intervention. Detected concatenated words were split into two words. For instance, the word "ConclusionHigher" was split into "Conclusion" and "Higher".

The section headings in such abstracts are listed below:
Background, Method(s), Design, Theoretical, Measurement(s), Location, Aim(s), Methodology, Process, Abstract, Population, Approach, Objective(s), Purpose(s), Subject(s), Introduction, Implication(s), Patient(s), Procedure(s), Hypothesis, Measure(s), Setting(s), Limitation(s), Discussion, Conclusion(s), Result(s), Finding(s), Material(s), Rationale(s), Implications for health and nursing policy

Step 5: Extracting (Sub-setting) the Data Based on Lengths of Abstracts
After correction, the lengths of abstracts were calculated. "Length" indicates the total number of words in the text, calculated by the same rule as the Microsoft Word "word count" [5]. According to the APA style manual [6], an abstract should contain between 150 and 250 words. In LSC, we decided to limit the length of abstracts to between 30 and 500 words in order to study documents with abstracts of typical length and to avoid effects of length on the analysis.
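Steps 4 and 5 can be approximated in base R as sketched below. Note the substitutions: the corpus authors identified concatenated words by sampling with human intervention, whereas this sketch uses a regular expression over a short, hypothetical subset of the headings, and its whitespace-based word count only approximates the Microsoft Word rule. The example abstracts are invented.

```r
# Small illustrative subset of section headings, longest first so that
# "Conclusions" is tried before "Conclusion".
headings <- c("Background", "Methods", "Results", "Conclusions", "Conclusion")
headings <- headings[order(nchar(headings), decreasing = TRUE)]

# A heading glued to a following capital letter, e.g. "ConclusionHigher".
pattern <- paste0("\\b(", paste(headings, collapse = "|"), ")(?=[A-Z])")

# Insert a space between the heading and the word fused to it.
split_headings <- function(text) gsub(pattern, "\\1 ", text, perl = TRUE)

# Approximate word count: number of whitespace-separated tokens.
count_words <- function(text) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  sum(nzchar(tokens))
}

abstracts <- c(
  "BackgroundSepsis is common. ConclusionHigher doses helped.",
  "ConclusionsRT improved survival."
)

cleaned <- vapply(abstracts, split_headings, character(1), USE.NAMES = FALSE)
lengths_in_words <- vapply(cleaned, count_words, integer(1), USE.NAMES = FALSE)

# Keep only abstracts within the 30 to 500 word window.
keep <- lengths_in_words >= 30 & lengths_in_words <= 500

print(cleaned)
print(lengths_in_words)
```

Both toy abstracts fall below the 30-word minimum, so `keep` is FALSE for each; real abstracts would mostly pass the filter.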
Step 6: [Version 2] Cleaning Copyright Notices, Permission Policies, Journal Names and Conference Names from LSC Abstracts in Version 1
Conferences and journals can add a footer below the text of the abstract containing a copyright notice, permission policy, journal name, licence, author's rights or conference name. The tool used for extracting and processing abstracts in the WoS database attaches such footers to the text. For example, our casual observation shows that copyright notices such as "Published by Elsevier ltd." appear in many texts. To avoid abnormal appearances of words in further analysis, such as bias in frequency calculation, we performed a cleaning procedure on such sentences and phrases in the abstracts of LSC version 1. We removed copyright notices, names of conferences, names of journals, authors' rights, licenses and permission policies identified by sampling of abstracts.

Step 7: [Version 2] Re-extracting (Sub-setting) the Data Based on Lengths of Abstracts
The cleaning procedure described in the previous step left some abstracts shorter than our minimum length criterion (30 words). 474 texts were removed.

Step 8: Saving the Dataset into CSV Format
Documents are saved into 34 CSV files. In the CSV files, the information is organised with one record on each line, and the abstract, title, list of authors, list of categories, list of research areas, and times cited are recorded in fields.

To access the LSC for research purposes, please email ns433@le.ac.uk.

References
[1] Web of Science. (15 July). Available: https://apps.webofknowledge.com/
[2] WoS Subject Categories. Available: https://images.webofknowledge.com/WOKRS56B5/help/WOS/hp_subject_category_terms_tasca.html
[3] Research Areas in WoS. Available: https://images.webofknowledge.com/images/help/WOS/hp_research_areas_easca.html
[4] Times Cited in WoS Core Collection. (15 July). Available: https://support.clarivate.com/ScientificandAcademicResearch/s/article/Web-of-Science-Times-Cited-accessibility-and-variation?language=en_US
[5] Word Count. Available: https://support.office.com/en-us/article/show-word-count-3c9e6a11-a04d-43b4-977c-563a0e0d5da3
[6] A. P. Association, Publication manual. American Psychological Association Washington, DC, 1983.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This zip file contains:
- 3 .zip files = projects to be imported into SmartPLS 3
  - DLOQ-A model with 7 dimensions
  - DLOQ-A model with second-order latent variable
  - ECSI model (Tenenhaus et al., 2005) to exemplify direct, indirect and total effects, as well as importance-performance map and moderation with continuous variables
  - ECSI model (Sanches, 2013) to exemplify MGA (multi-group analysis)

Note:
- DLOQ-A = new dataset (ours)
- ECSI-Tenenhaus et al. [model for mediation and moderation] = available at: http://www.smartpls.com > Resources > SmartPLS Project Examples
- ECSI-Sanches [dataset for MGA] = available in the software R > library(plspm) > data(satisfaction)