https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164: ------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem **Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well. Content description =================== - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below - **settings.py** - settings template for the code archive. - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI) - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB. - **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**) - **Interview protocol.pdf** - approximate protocol used for semistructured interviews. - LICENSE - text of GPL v3, under which this dataset is published - INSTALL.md - replication guide (~2 pages)
Replication guide ================= Step 0 - prerequisites ---------------------- - Unix-compatible OS (Linux or OS X) - Python interpreter (2.7 was used; Python 3 compatibility is highly likely) - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible) Depending on detalization level (see Step 2 for more details): - up to 2Tb of disk space (see Step 2 detalization levels) - at least 16Gb of RAM (64 preferable) - few hours to few month of processing time Step 1 - software ---------------- - unpack **ghd-0.1.0.zip**, or clone from gitlab: git clone https://gitlab.com/user2589/ghd.git git checkout 0.1.0 `cd` into the extracted folder. All commands below assume it as a current directory. - copy `settings.py` into the extracted folder. Edit the file: * set `DATASET_PATH` to some newly created folder path * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` - install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose` - install libarchive and headers: `sudo apt-get install libarchive-dev` - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore. - install Python libraries: `pip install --user -r requirements.txt` . - disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`. Step 2 - obtaining the dataset ----------------------------- The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file: # copy and paste into a Python console from common import utils survival_data = utils.survival_data('pypi', '2008', smoothing=6) survival_data.to_csv('survival_data.csv') Since full replication will take several months, here are some ways to speedup the process: ####Option 2.a, difficulty level: easiest Just use the precomputed data. Step 1 is not necessary under this scenario. - extract **dataset_minimal_Jan_2018.zip** - get `survival_data.csv`, go to the next step ####Option 2.b, difficulty level: easy Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes. - create a folder `
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data, in which the following subfolders are stored are stored as zipped files:
“code” stores the above described 9 code chunks to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolder which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk that refers to the extraction of MODIS data from .hdf file format. The following packages must be installed and the raw data must be downloaded using a simple mass downloader, e.g., from google chrome. Packages: terra. Download MODIS data from after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed, 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif-file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially) time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks from which we merge the first two (stack 1, stack 2) and store them. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif. We now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis from the derived data. The data we load is tricky as it contains 16-days return period across a year for the period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with high confidence level (0.05). Using the ggplot2 package and the melt function from reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03. March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different raster to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv". Each of the single country yield datasets is plotted in a ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R) or run print(nc) from the code or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analysis are conducted and z-scores were calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subset can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g. March-May (MAM), June-July (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analysis are performed as describe above. In the same way, p-values are extracted and 95 % confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, ad different spatial extent across the globe due to the availability of the GLDAS variables.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript:Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production services. PhytoFrontiers, DOI pending.Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage and disinfestants with high surface tension beaded up to provide about 60% coverage when applied to hard smooth surfaces. Disinfestants applied to porous materials were quickly absorbed into the body of the material, such as wood and concrete. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min. Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. These files are included:Drying2022.csv: drying rate data for the 2022 experimental runWeather2022.csv: weather data for the 2022 experimental runDrying2023.csv: drying rate data for the 2023 experimental runWeather2023.csv: weather data for the 2023 experimental rundisinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation codedisinfestant_drying_analysis.html: rendered output of notebookMS_figures.R: additional R code to create figures formatted for journal requirementsfit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing clusterfit2023_discretetime_weather_solar.rds: fitted brms model object for 2023data_dictionary.xlsx: descriptions of each column in the CSV data files
https://spdx.org/licenses/CC0-1.0.htmlhttps://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd
file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd
has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created, that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd
and Figures.Rmd
. Next, 2 fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py which identifies any matched sequences that are different between sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Genious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment and specifically the UMI2 sequences, the site of the discordance and its case were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd
, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd.
Figures.Rmd
used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for eachFigure. These were copied into Prism software to create the final figures for the paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Database of Uniaxial Cyclic and Tensile Coupon Tests for Structural Metallic Materials
Background
This dataset contains data from monotonic and cyclic loading experiments on structural metallic materials. The materials are primarily structural steels and one iron-based shape memory alloy is also included. Summary files are included that provide an overview of the database and data from the individual experiments is also included.
The files included in the database are outlined below and the format of the files is briefly described. Additional information regarding the formatting can be found through the post-processing library (https://github.com/ahartloper/rlmtp/tree/master/protocols).
Usage
Included Files
File Format: Downsampled Data
These are the "LP_
These data files can be easily loaded using the pandas library in Python through:
import pandas
data = pandas.read_csv(data_file, index_col=0)
The data is formatted so it can be used directly in RESSPyLab (https://github.com/AlbanoCastroSousa/RESSPyLab). Note that the column names "e_true" and "Sigma_true" were kept for backwards compatibility reasons with RESSPyLab.
File Format: Unreduced Data
These are the "LP_
The data can be loaded and used similarly to the downsampled data.
File Format: Overall_Summary
The overall summary file provides data on all the test specimens in the database. The columns include:
File Format: Summarized_Mechanical_Props_Campaign
Meant to be loaded in Python as a pandas DataFrame with multi-indexing, e.g.,
tab1 = pd.read_csv('Summarized_Mechanical_Props_Campaign_' + date + version + '.csv',
index_col=[0, 1, 2, 3], skipinitialspace=True, header=[0, 1],
keep_default_na=False, na_values='')
Caveats
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The useNews dataset has been compiled to enable the study of online news engagement. It relies on the MediaCloud and CrowdTangle APIs as well as on data from the Reuters Digital News Report. The entire dataset builds on data from 2019 and 2020 as well as a total of 12 countries. It is free to use (subject to citing/referencing it).
The data originates from both the 2019 and the 2020 Reuters Digital News Report (http://www.digitalnewsreport.org/), media content from MediaCloud (https://mediacloud.org/) for 2019 and 2020 from all news outlets that have been used most frequently in the respective year according to the survey data, and engagement metrics for all available news-article URLs through CrowdTangle (https://www.crowdtangle.com/).
To start using the data, a total of eight data objects exist, namely one each for 2019 and 2020 for the survey, news-article meta information, news-article DFM's, and engagement metrics. To make your life easy, we've provided several packaged download options:
Also, if you are working with R, we have prepared a simple file to automatically download all necessary data (~1.5 GByte) at once: https://osf.io/fxmgq/
Note that all .rds files are .xz-compressed, which shouldn't bother you when you are in R. You can import all the .rds files through variable_name <- readRDS('filename.rds')
, .RData (also .xz-compressed) can be imported by simply using load('filename.RData')
which will load several already named objects into your R environment. To import data through other programming languages, we also provide all data in respective CSV files. These files are rather large, however, which is why we have also .xz-compressed them. DFM's, unfortunately, are not available as CSV's due to their sparsity and size.
Find out more about the data variables and dig into plenty of examples in the useNews-examples workbook: https://osf.io/snuk2/
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826. Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464). Resources in this dataset:Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format. File Name: PBMC7_AllCells.zipResource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows: matrix of gene counts* (matrix.mtx.gx) gene names (features.tsv.gz) cell IDs (barcodes.tsv.gz) *The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation. Data can be read into R using the function Read10X().Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata. File Name: PBMC7_AllCells_meta.csvResource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include: nCount_RNA = the number of transcripts detected in a cell nFeature_RNA = the number of genes detected in a cell Loupe = cell barcodes; correspond to the cell IDs found in the .h5Seurat and 10X formatted objects for all cells prcntMito = percent mitochondrial reads in a cell Scrublet = doublet probability score assigned to a cell seurat_clusters = cluster ID assigned to a cell PaperIDs = sample ID for a cell celltypes = cell type ID assigned to a cellResource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates. File Name: PBMC7_AllCells_PCAcoord.csvResource Description: .csv file containing first 100 PCA coordinates for cells. Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates. File Name: PBMC7_AllCells_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates. File Name: PBMC7_AllCells_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for all cells.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates. File Name: PBMC7_CD4only_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates. File Name: PBMC7_CD4only_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates. File Name: PBMC7_GDonly_UMAPcoord.csvResource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates. File Name: PBMC7_GDonly_tSNEcoord.csvResource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information. File Name: UnfilteredGeneInfo.txtResource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat. File Name: PBMC7.tarResource Description: .h5Seurat object of all cells in PBMC dataset. File needs to be untarred, then read into R using function LoadH5Seurat().
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in python) used for data production, manipulation, visualization, and analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also please note that this R project has been updated multiple times as the analysis has updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
#Code information
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
This child page contains a zipped folder which contains all of the items necessary to run load estimation using R-LOADEST to produce results that are published in U.S. Geological Survey Investigations Report 2021-XXXX [Tatge, W.S., Nustad, R.A., and Galloway, J.M., 2021, Evaluation of Salinity and Nutrient Conditions in the Heart River Basin, North Dakota, 1970-2020: U.S. Geological Survey Scientific Investigations Report 2021-XXXX, XX p]. The folder contains an allsiteinfo.table.csv file, a "datain" folder, and a "scripts" folder. The allsiteinfo.table.csv file can be used to cross reference the sites with the main report (Tatge and others, 2021). The "datain" folder contains all the input data necessary to reproduce the load estimation results. The naming convention in the "datain" folder is site_MI_rloadest or site_NUT_rloadest for either the major ion loads or the nutrient loads. The .Rdata files are used in the scripts to run the estimations and the .csv files can be used to look at the data. The "scripts" folder contains the written R scripts to produce the results of the load estimation from the main report. R-LOADEST is a software package for analyzing loads in streams and an accompanying report (Runkel and others, 2004) serves as the formal documentation for R-LOADEST. The package is a collection of functions written in R (R Development Core Team, 2019), an open source language and a general environment for statistical computing and graphics. The following system requirements are necessary for producing results: Windows 10 operating system R (version 3.4 or later; 64-bit recommended) RStudio (version 1.1.456 or later) R-LOADEST program (available at https://github.com/USGS-R/rloadest). Runkel, R.L., Crawford, C.G., and Cohn, T.A., 2004, Load Estimator (LOADEST): A FORTRAN Program for Estimating Constituent Loads in Streams and Rivers: U.S. Geological Survey Techniques and Methods Book 4, Chapter A5, 69 p., [Also available at https://pubs.usgs.gov/tm/2005/tm4A5/pdf/508final.pdf.] R Development Core Team, 2019, R—A language and environment for statistical computing: Vienna, Austria, R Foundation for Statistical Computing, accessed December 7, 2020, at https://www.r-project.org.
This project is a collection of files to allow users to reproduce the model development and benchmarking in "Dawnn: single-cell differential abundance with neural networks" (Hall and Castellano, under review). Dawnn is a tool for detecting differential abundance in single-cell RNAseq datasets. It is available as an R package here. Please contact us if you are unable to reproduce any of the analysis in our paper. The files in this collection correspond to the benchmarking dataset based on single-cell RNAseq of mouse emrbyo cells. FILES: Input data Dataset from: "A single-cell molecular map of mouse gastrulation and early organogenesis". Nature 566, pp490–495 (2019). The input data is loaded from the MouseGastrulationData R package. We upload here the RDS file generated by loading the dataset in process_mouse_cells.R in case the R package becomes unavailable MouseGastrulationData_loaded_dataset.RDS Dataset loaded from MouseGastrulationData R package in process_mouse_cells.R (in call to EmbryoAtlasData function). Data processing code process_mouse_cells.R Generates benchmarking dataset from input data. (Loads input data; Runs the standard single-cell RNAseq pipeline). Follows Dann et al. Resulting dataset saved as mouse_gastrulation_data_regen.RDS. simulate_mouse_pc1_Rscript.R R code to simulate P(Condition_1)s for benchmarking. simulate_mouse_pc1_bash.sh Bash script to execute simulate_mouse_pc1_Rscript.R. Outputs stored in benchmark_dataset_mouse_pc1s_regen.csv. simulate_mouse_labels_Rscript.R R code to simulate labels for benchmarking. simulate_mouse_labels_bash.sh Bash script to execute simulate_mouse_labels_Rscript.R. Outputs stored in benchmark_dataset_mouse.csv. Resulting datasets mouse_gastrulation_data_regen.RDS Seurat dataset generated by process_mouse_cells.R. benchmark_dataset_mouse.csv Cell labels generated by simulate_mouse_labels_bash.sh. benchmark_dataset_mouse_pc1s_regen.csv P(Condition_1)s generated by simulate_mouse_pc1_bash.sh.
This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).
File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The python and R-scripts are written by the BA modelling team to, as detailed below, read, combine and analyse the source datasets CLM AWRA model, CLM groundwater model V1 and CLM16swg Surface water gauging station data within the Clarence Moreton Basin to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).
R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:
CLM_GaugeNr_Year_all.csv and CLM_GaugeNR_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions
CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)
CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations
Python script CLM_collate_DoE_Predictions.py collates that information into following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):
CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)
CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available
CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)
CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV
R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).
The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.
Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Qld 100K mapsheets - Mount Lindsay
Derived From Qld 100K mapsheets - Helidon
Derived From Qld 100K mapsheets - Ipswich
Derived From CLM - Woogaroo Subgroup extent
Derived From CLM - Interpolated surfaces of Alluvium depth
Derived From CLM - Extent of Logan and Albert river alluvial systems
Derived From CLM - Bore allocations NSW v02
Derived From CLM - Bore allocations NSW
Derived From CLM - Bore assignments NSW and QLD summary tables
Derived From CLM - Geology NSW & Qld combined v02
Derived From CLM - Orara-Bungawalbin bedrock
Derived From CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014
Derived From CLM groundwater model hydraulic property data
Derived From CLM - Koukandowie FM bedrock
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From CLM - Gatton Sandstone extent
Derived From CLM16gwl NSW Office of Water, GW licence extract linked to spatial locations in CLM v2 28022014
Derived From Bioregional Assessment areas v03
Derived From NSW Geological Survey - geological units DRAFT line work.
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From CLM Preliminary Assessment Extent Definition & Report( CLM PAE)
Derived From Qld 100K mapsheets - Caboolture
Derived From CLM - AWRA Calibration Gauges SubCatchments
Derived From CLM - NSW Office of Water Gauge Data for Tweed, Richmond & Clarence rivers. Extract 20140901
Derived From Qld 100k mapsheets - Murwillumbah
Derived From AHGFContractedCatchment - V2.1 - Bremer-Warrill
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From QLD Current Exploration Permits for Minerals (EPM) in Queensland 6/3/2013
Derived From Pilot points for prediction interpolation of layer 1 in CLM groundwater model
Derived From CLM - Bore water level NSW
Derived From Climate model 0.05x0.05 cells and cell centroids
Derived From CLM - New South Wales Department of Trade and Investment 3D geological model layers
Derived From CLM - Metgasco 3D geological model formation top grids
Derived From State Transmissivity Estimates for Hydrogeology Cross-Cutting Project
Derived From CLM - Extent of Bremer river and Warrill creek alluvial systems
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Derived From Qld 100K mapsheets - Esk
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores and NGIS v4 28072014
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From CLM - Qld Surface Geology Mapsheets
Derived From NSW Office of Water Pump Test dataset
Derived From [CLM -
Attribution 2.5 (CC BY 2.5)https://creativecommons.org/licenses/by/2.5/
License information was derived automatically
\r The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.\r \r \r \r Receptor impact models (RIMs) are developed for specific landscape classes. The prediction of Receptor Impact Variables is a multi-stage process. It relies on the runs from surface water and groundwater models at nodes within the analysis extent. These outputs derive directly from the hydrological model. For a given node, there is a value for each combination of hydrological response variable, future, and replicate or run number. Not all variables may be available or appropriate at every node. This differs to the quantile summary information that is otherwise used to summarise the HRV output and is also registered.\r \r
\r There is a key look up table (Excel file) that lists the assessment units (AUIDs) by landscape class (or landscape group if appropriate) and notes that groundwater modelling node and runs, and the surface water modelling node and runs, that should be used for that AUID. In some cases the AUID is only mapped to one set of hydrological modelling output. This look up table represent the AUIDs that require RIV predictions. For NAM and GAL there is a single look up table. For GLO and HUN surface and GW are provided separately. \r \r Receptor impact models (RIMs) are developed for specific landscape classes. The hydrological response variables that a RIM within a landscape class requires are organised by the R script RIM_Prediction_CreateArray.R into an array. The formatted data is available as an R data file format called RDS and can be read directly into R.\r \r The R script IMIA_XXX_RIM_predictions.R applies the receptor model functions (RDS object as part of Data set 1: Ecological expert elicitation and receptor impact models for the XXX subregion) to the HRV array for each landscape class (or landscape group) to make predictions of receptor impact varibles (RIVs). Predictions of a receptor impact from a RIM for a landscape class are summarised at relevant AUIDs by the 5th through to the 95th percentiles (in 5% increments) for baseline and CRDP futures. These are available in the XXX_RIV_quantiles_IMIA.csv data set. RIV predictions are further summarised and compared as boxplots (using the R script boxplotsbyfutureperiod.R) and as (aggregated) spatial risk maps using GIS.\r \r
\r Bioregional Assessment Programme (2018) GAL Predictions of receptor impact variables v01. Bioregional Assessment Derived Dataset. Viewed 10 December 2018, http://data.bioregionalassessments.gov.au/dataset/67e0aec1-be25-46f5-badc-b4d895a934aa.\r \r
\r * Derived From Queensland wetland data version 3 - wetland areas.\r \r * Derived From Geofabric Surface Cartography - V2.1\r \r * Derived From Landscape classification of the Galilee preliminary assessment extent\r \r * Derived From Geofabric Surface Cartography - V2.1.1\r \r * Derived From GAL Landscape Class Reclassification for impact and risk analysis 20170601\r \r * Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)\r \r * Derived From Queensland groundwater dependent ecosystems\r \r * Derived From GEODATA TOPO 250K Series 3\r \r * Derived From Multi-resolution Valley Bottom Flatness MrVBF at three second resolution CSIRO 20000211\r \r * Derived From Landscape classification of the Galilee preliminary assessment extent\r \r * Derived From Biodiversity status of pre-clearing and remnant regional ecosystems - South East Qld\r \r
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
Receptor impact models (RIMs) use inputs from surface water and groundwater models. For a given node, there is a value for each combination of hydrological response variable, future, and replicate or run number. RIMs are developed for specific landscape classes. The hydrological response variables that a RIM within a landscape class requires are organised by the R script RIM_Prediction_CreateArray.R into an array. The formatted data is available as an R data file format called RDS and can be read directly into R. The R script IMIA_XXX_RIM_predictions.R applies the receptor model functions (RDS object as part of Data set 1: Ecological expert elicitation and receptor impact models for the XXX subregion) to the HRV array for each landscape class (or landscape group) to make predictions of receptor impact varibles (RIVs). Predictions of a receptor impact from a RIM for a landscape class are summarised at relevant AUIDs by the 5th through to the 95th percentiles (in 5% increments) for baseline and CRDP futures. These are available in the XXX_RIV_quantiles_IMIA.csv data set. RIV predictions are further summarised and compared as boxplots (using the R script boxplotsbyfutureperiod.R) and as (aggregated) spatial risk maps using GIS.
Receptor impact models (RIMs) are developed for specific landscape classes. The hydrological response variables that a RIM within a landscape class requires are organised by the R script RIM_Prediction_CreateArray.R into an array. The formatted data is available as an R data file format called RDS and can be read directly into R.
The R script IMIA_XXX_RIM_predictions.R applies the receptor model functions (RDS object as part of Data set 1: Ecological expert elicitation and receptor impact models for the XXX subregion) to the HRV array for each landscape class (or landscape group) to make predictions of receptor impact varibles (RIVs). Predictions of a receptor impact from a RIM for a landscape class are summarised at relevant AUIDs by the 5th through to the 95th percentiles (in 5% increments) for baseline and CRDP futures. These are available in the XXX_RIV_quantiles_IMIA.csv data set. RIV predictions are further summarised and compared as boxplots (using the R script boxplotsbyfutureperiod.R) and as (aggregated) spatial risk maps using GIS.
Bioregional Assessment Programme (2018) GAL Ecological expert elicitation and receptor impact models v01. Bioregional Assessment Derived Dataset. Viewed 07 December 2018, http://data.bioregionalassessments.gov.au/dataset/60772948-7354-453c-bffa-37b3f2063083.
Derived From Queensland wetland data version 3 - wetland areas.
Derived From Geofabric Surface Cartography - V2.1
Derived From Landscape classification of the Galilee preliminary assessment extent
Derived From Geofabric Surface Cartography - V2.1.1
Derived From GAL Landscape Class Reclassification for impact and risk analysis 20170601
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From Queensland groundwater dependent ecosystems
Derived From GEODATA TOPO 250K Series 3
Derived From Multi-resolution Valley Bottom Flatness MrVBF at three second resolution CSIRO 20000211
Derived From Landscape classification of the Galilee preliminary assessment extent
Derived From Biodiversity status of pre-clearing and remnant regional ecosystems - South East Qld
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Sloan Digital Sky Survey (SDSS) is a comprehensive survey of the northern sky. This dataset contains a subset of this survey, of 100077 objects classified as galaxies, it includes a CSV file with a collection of information and a set of files for each object, namely JPG image files, FITS and spectra data. This dataset is used to train and explore the astromlp-models collection of deep learning models for galaxies characterisation.
The dataset includes a CSV data file where each row is an object from the SDSS database, and with the following columns (note that some data may not be available for all objects):
objid: unique SDSS object identifier
mjd: MJD of observation
plate: plate identifier
tile: tile identifier
fiberid: fiber identifier
run: run number
rerun: rerun number
camcol: camera column
field: field number
ra: right ascension
dec: declination
class: spectroscopic class (only objetcs with GALAXY are included)
subclass: spectroscopic subclass
modelMag_u: better of DeV/Exp magnitude fit for band u
modelMag_g: better of DeV/Exp magnitude fit for band g
modelMag_r: better of DeV/Exp magnitude fit for band r
modelMag_i: better of DeV/Exp magnitude fit for band i
modelMag_z: better of DeV/Exp magnitude fit for band z
redshift: final redshift from SDSS data z
stellarmass: stellar mass extracted from the eBOSS Firefly catalog
w1mag: WISE W1 "standard" aperture magnitude
w2mag: WISE W2 "standard" aperture magnitude
w3mag: WISE W3 "standard" aperture magnitude
w4mag: WISE W4 "standard" aperture magnitude
gz2c_f: Galaxy Zoo 2 classification from Willett et al 2013
gz2c_s: simplified version of Galaxy Zoo 2 classification (labels set)
Besides the CSV file a set of directories are included in the dataset, in each directory you'll find a list of files named after the objid column from the CSV file, with the corresponding data, the following directories tree is available:
sdss-gs/ ├── data.csv ├── fits ├── img ├── spectra └── ssel
Where, each directory contains:
img: RGB images from the object in JPEG format, 150x150 pixels, generated using the SkyServer DR16 API
fits: FITS data subsets around the object across the u, g, r, i, z bands; cut is done using the ImageCutter library
spectra: full best fit spectra data from SDSS between 4000 and 9000 wavelengths
ssel: best fit spectra data from SDSS for specific selected intervals of wavelengths discussed by Sánchez Almeida 2010
Changelog
v0.0.4 - Increase number of objects to ~100k.
v0.0.3 - Increase number of objects to ~80k.
v0.0.2 - Increase number of objects to ~60k.
v0.0.1 - Initial import.
This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843 Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html file of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process. Notebook files 01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output including marginal means for different varieties and contrasts between crop years. 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output including marginal means by categorical trait and marginal trends by continuous trait. HTML files These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15. 01_boring_analysis.html 02_trait_covariate_analysis.html CSV data files These files contain the raw data. To recreate the notebook output the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below. BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage Columns A-C are the year, date, and location. All location values are the same. Column D identifies which experiment the data point was collected from. Column E, Stubble, indicates the crop year (plant cane or first stubble) Column F indicates the variety Column G indicates the plot (integer ID) Column H indicates the stalk within each plot (integer ID) Column I, # Internodes, indicates how many internodes were on the stalk Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk) Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment Column AO contains notes variety_lookup.csv: summary information for the 16 varieties analyzed in this study Column A is the variety name Column B is the total number of stalks assessed for SCB damage for that variety across all years Column C is the number of years that variety is present in the data Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years) Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant Column F is the literature reference for the SCB resistance value Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study Column A is the variety name Column B is the SCB resistance designation as an integer Column C is the categorical SCB resistance designation (see above) Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent). Columns J-O are the same continuous traits from year 2 (first stubble) Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair. ZIP file of intermediate R objects To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run. intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output including figures and tables without having to refit the computationally intensive statistical models. binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content Resources in this dataset:Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csvResource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csvResource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csvResource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.RmdResource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.htmlResource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.RmdResource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.htmlResource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
This repository contains all data and code required to reproduce the growth curve-fitting analysis from the manuscript: McMillan, E. A., Berrang, M. E., Read, Q. D., Rasamsetti, S., Richards, A. K., Shariat, N. W., & Frye, J. G. (2022). Buffered peptone water formulation does not influence growth of pESI-positive Salmonella enterica serovar Infantis. Journal of Food Protection, 100033. https://doi.org/10.1016/j.jfp.2022.100033 Manuscript abstract Salmonella enterica is a major cause of human foodborne illness and is often attributed to poultry food sources. S. enterica serovar Infantis, specifically those carrying the pESI plasmid, has become a frequently isolated serotype from poultry meat samples at processing and has caused numerous recent human infections. In 2016, the USDA Food Safety and Inspection Service changed the official sampling method for raw poultry products from BPW to using neutralizing BPW (nBPW) as the rinsing agent in order to prevent residual antimicrobial effects from acidifying and oxidizing processing aids. This change was contemporaneous to the emergence of pESI-positive ser. Infantis as a prevalent serovar in poultry, prompting some to question if nBPW could be selecting for this prevalent serovar. We performed two experiments: a comparison of ser. Infantis growth in BPW versus nBPW, and a simulation of regulatory sampling methods. We found that when inoculated into both broths, ser. Infantis initially grows slightly slower in nBPW than in BPW but little difference was seen in abundance after six hours of growth. Additionally, use of nBPW to simulate poultry rinse sample and overnight cold shipping to a regulatory lab did not affect survival or subsequent growth of ser. Infantis in BPW. We concluded that the change in USDA-FSIS methodology to include nBPW in sampling procedures has likely not affected the emergence of S. ser. Infantis as a prevalent serovar in chicken and turkey meat product samples. Contents All necessary data are in a single comma-separated file, Sal_Infantis_growth_curve_data_EAM.csv. All R code is in a single RMarkdown document, salmonella_growth_curve_fitting.Rmd. The RMarkdown contains code to read and process the data, produce exploratory plots, fit the model, do all hoc calculations with the posterior output, and produce figures and tables from the manuscript. Salmonella Infantis growth data: This is a comma-separated file containing data needed to reproduce the growth curve fitting analysis. Columns are: Strain: numerical ID of strain (see Table 1 in manuscript) Colony_Forming_Units_permL(A, B, C): columns 2-4 are three replicate measurements of colony forming units per mL taken from the same sample at the same time. Media: whether nBPW or BPW was used in the growth medium Time_hours: time in hours ranging from 0-6. RMarkdown document with all analysis code: This RMarkdown document contains code to read and process the data, produce exploratory plots, fit the model, do all hoc calculations with the posterior output, and produce figures and tables from the manuscript. Software versions This was run on Windows 10, R version 4.1.2. Models were fit using CmdStan version 2.28.2, with brms version 2.17.0, cmdstanr version 0.4.0, emmeans version 1.7.3, and tidybayes version 3.0.2. Program information National Program: Food Safety (108) Project Plan Number: 6040-32000-085-000-D Resources in this dataset:Resource Title: Salmonella Infantis growth data. File Name: Sal_Infantis_growth_curve_data_EAM.csvResource Title: RMarkdown document with all analysis code. File Name: salmonella_growth_curve_fitting.Rmd
We present a flora and fauna dataset for the Mira-Mataje binational basins. This is an area shared between southwestern Colombia and northwestern Ecuador, where both the Chocó and Tropical Andes biodiversity hotspots converge. Information from 120 sources was systematized in the Darwin Core Archive (DwC-A) standard and geospatial vector data format for geographic information systems (GIS) (shapefiles). Sources included natural history museums, published literature, and citizen science repositories across 18 countries. The resulting database has 33,460 records from 5,281 species, of which 1,083 are endemic and 680 threatened. The diversity represented in the dataset is equivalent to 10\% of the total plant species and 26\% of the total terrestrial vertebrate species in the hotspots and corresponds to 0.07\% of their total area. The dataset can be used to estimate and compare biodiversity patterns with environmental parameters and provide value to ecosystems, ecoregions, and protected areas. The dataset is a baseline for future assessments of biodiversity in the face of environmental degradation, climate change, and accelerated extinction processes. Data format 1: The .rds file extension saves a single object to be read in R and provides better compression, serialization, and integration within the R environment, than simple .csv files. Data format 2: The .csv file has been encoded in UTF-8, and is an ASCII file with text separated by commas. Data format 3: We consolidated a shapefile for the basin that contains layers for vegetation ecosystems and the total number of occurrences, number of species, and number of endemic and threatened species for each ecosystem. A set of 3D shaded-relief map representations of the data in the shapefile can be found at https://doi.org/10.6084/m9.figshare.23499180.v4 We are also including three taxonomic data tables that were used in our technical validation of the presented dataset. These three files are: 1) the_catalog_of_life.tsv (Source: Bánki, O. et al. Catalogue of life checklist (version 2024-03-26). https://doi.org/10.48580/dfz8d (2024)) 2) world_checklist_of_vascular_plants_names.csv (we are also including ancilliary tables "world_checklist_of_vascular_plants_distribution.csv", and "README_world_checklist_of_vascular_plants_.xlsx") (Source: Govaerts, R., Lughadha, E. N., Black, N., Turner, R. & Paton, A. The World Checklist of Vascular Plants, a continuously updated resource for exploring global plant diversity. Sci. Data 8, 215, 10.1038/s41597-021-00997-6 (2021).) 3) world_flora_online.csv (Source: The World Flora Online Consortium et al. World flora online plant list december 2023, 10.5281/zenodo.10425161 (2023).) Source publication and additional context for the dataset: To be announced.
The dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement. Receptor impact models (RIMs) are developed for specific landscape classes. The prediction of Receptor Impact Variables is a multi-stage process. It relies on the runs from surface water and groundwater models at nodes within the analysis extent. These outputs derive directly from the hydrological model. For a given node, there is a value for each combination of hydrological response variable, future, and replicate or run number. Not all variables may be available or appropriate at every node. This differs to the quantile summary information that is otherwise used to summarise the HRV output and is also registered. Dataset History There is a key look up table (Excel file) that lists the assessment units (AUIDs) by landscape class (or landscape group if appropriate) and notes that groundwater modelling node and runs, and the surface water modelling node and runs, that should be used for that AUID. In some cases the AUID is only mapped to one set of hydrological modelling output. This look up table represent the AUIDs that require RIV predictions. For NAM and GAL there is a single look up table. For GLO and HUN surface and GW are provided separately. Receptor impact models (RIMs) are developed for specific landscape classes. The hydrological response variables that a RIM within a landscape class requires are organised by the R script RIM_Prediction_CreateArray.R into an array. The formatted data is available as an R data file format called RDS and can be read directly into R. The R script IMIA_NAM_RIM_predictions.R applies the receptor model functions (RDS object as part of Data set 1: Ecological expert elicitation and receptor impact models for the NAM subregion) to the HRV array for each landscape class (or landscape group) to make predictions of receptor impact varibles (RIVs). Predictions of a receptor impact from a RIM for a landscape class are summarised at relevant AUIDs by the 5th through to the 95th percentiles (in 5% increments) for baseline and CRDP futures. These are available in the NAM_RIV_quantiles_IMIA.csv data set. RIV predictions are further summarised and compared as boxplots (using the R script boxplotsbyfutureperiod.R) and as (aggregated) spatial risk maps using GIS. Dataset Citation Bioregional Assessment Programme (XXXX) NAM_Predictions_of_receptor_impact_variables_v01. Bioregional Assessment Derived Dataset. Viewed 11 December 2018, http://data.bioregionalassessments.gov.au/dataset/b016aaaa-eca8-4d57-b9c2-21809e1e3c60. Dataset Ancestors Derived From Landscape classification of the Namoi preliminary assessment extent Derived From Namoi CMA Groundwater Dependent Ecosystems
This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system” submitted to “Limnology and Oceanography: Methods.” This data is an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914). Other associated data and field metadata can be found at the link provided. The goal of this manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected from within the Yakima River Basin, Washington were analyzed dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiency for the DOC and common optical indices were calculated. In addition, SPE samples were subject to ultra-high resolution mass spectrometry and compared with the ambient and SPE generated optical data. Finally, in addition to this cross-platform inter-comparison, we further performed and intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%) compared to the common practice of assuming the DOC extraction efficiency of freshwater samples at 60%. This data package folder consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) readme; (2) data dictionary (dd); (3) file-level metadata (flmd); (4) final data summary output from processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception of that stated below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting Solid Phase Extraction Volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided. The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) is provided that can be directly read into R with the associated R-markdown file. The Optics subfolder contains all Absorbance and Fluorescence Spectra. Fluorescence spectra have been blank corrected, inner filter corrected, and undergone scatter removal. In addition, this folder contains Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and used for parallel factor analysis (PARAFAC) modeling. Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are directly read into the associated R-markdown file.
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.htmlhttps://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164: ------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem **Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well. Content description =================== - **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below - **settings.py** - settings template for the code archive. - **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI) - **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34Gb unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB. - **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**) - **Interview protocol.pdf** - approximate protocol used for semistructured interviews. - LICENSE - text of GPL v3, under which this dataset is published - INSTALL.md - replication guide (~2 pages)
Replication guide ================= Step 0 - prerequisites ---------------------- - Unix-compatible OS (Linux or OS X) - Python interpreter (2.7 was used; Python 3 compatibility is highly likely) - R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible) Depending on detalization level (see Step 2 for more details): - up to 2Tb of disk space (see Step 2 detalization levels) - at least 16Gb of RAM (64 preferable) - few hours to few month of processing time Step 1 - software ---------------- - unpack **ghd-0.1.0.zip**, or clone from gitlab: git clone https://gitlab.com/user2589/ghd.git git checkout 0.1.0 `cd` into the extracted folder. All commands below assume it as a current directory. - copy `settings.py` into the extracted folder. Edit the file: * set `DATASET_PATH` to some newly created folder path * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS` - install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose` - install libarchive and headers: `sudo apt-get install libarchive-dev` - (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools` Without this dependency, you might get an error on the next step, but it's safe to ignore. - install Python libraries: `pip install --user -r requirements.txt` . - disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`. Step 2 - obtaining the dataset ----------------------------- The ultimate goal of this step is to get output of the Python function `common.utils.survival_data()` and save it into a CSV file: # copy and paste into a Python console from common import utils survival_data = utils.survival_data('pypi', '2008', smoothing=6) survival_data.to_csv('survival_data.csv') Since full replication will take several months, here are some ways to speedup the process: ####Option 2.a, difficulty level: easiest Just use the precomputed data. Step 1 is not necessary under this scenario. - extract **dataset_minimal_Jan_2018.zip** - get `survival_data.csv`, go to the next step ####Option 2.b, difficulty level: easy Use precomputed longitudinal feature values to build the final table. The whole process will take 15..30 minutes. - create a folder `