Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
R scripts contain statistical data analyses for streamflow and sediment data, including Flow Duration Curves, Double Mass Analysis, Nonlinear Regression Analysis for Suspended Sediment Rating Curves, and Stationarity Tests, and include several plots.
Each R script replicates all of the example code from one chapter from the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the nine code chunks described below, which read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders containing the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data, after registration, from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially different time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in numerical order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we first merge two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We use a mask, provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series datasets from MODIS.
The repository provides the already clipped and merged NDVI datasets.
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day composites across each year for a period of 22 years. Growing season sums comprise MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) with span 0.3.
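The slope calculation in this step is an ordinary least-squares fit per pixel; a minimal Python sketch of the idea (the actual workflow runs in R on raster stacks, and the function name and example values here are ours):

```python
def trend_slope(years, values):
    """Ordinary least-squares slope of values regressed on years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    den = sum((x - mean_x) ** 2 for x in years)
    return num / den

# a hypothetical growing-season NDVI sum rising by 2 units per year
print(trend_slope([2001, 2002, 2003], [10.0, 12.0, 14.0]))  # 2.0
```

Applied to each pixel's annual growing-season sums, this yields the trend raster whose p-values are then thresholded at 0.05.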
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
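The z-score normalization described here can be sketched in a few lines (a Python illustration of the standard formula; the original analysis uses R, and the function name is ours):

```python
def z_scores(values):
    """Standardize a series: deviation of each value from the mean,
    expressed in units of the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

# an annual NDVI (or GLDAS) series becomes directly comparable
# across variables after standardization
print(z_scores([1.0, 2.0, 3.0]))  # [-1.0, 0.0, 1.0]
```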
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). One can download the temporal coverage aimed for and, after cropping to the individual study area, reclassify it using the code. Here, I summed up the different rasters to characterize built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country productivity data from the supplement in the repository folder “yield_productivity” (e.g., "Jordan_yield.csv"). Each single-country yield dataset is plotted with ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and the various variables can be extracted from the SpatRaster collection using the [“^a variable name”] command. Each time you run the code, this variable name must be adjusted to the required variable (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; alternatively, run print(nc) after reading a .nc file with the ncdf4 package in R, or use names() on the SpatRaster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For variables such as rainfall, which is measured as annual sums rather than means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and December-February (DJF, including Jan/Feb of the consecutive year).
From the data, mean values over 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95 % confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe thanks to the availability of the GLDAS variables.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the upcoming series of some 40 summary papers, I will present a comprehensive view of Applied Multivariate Statistical Modeling (AMSM). First, I will start with a thorough introduction to AMSM. Then I will explain univariate descriptive statistics, sampling distributions, estimation, and hypothesis testing. After that, I will give a comprehensive review of multivariate descriptive statistics, the multivariate normal distribution, and multivariate inferential statistics. Having accomplished that, it will be time to discuss various models: ANOVA, MANOVA, Multiple Linear Regression, and Multivariate Linear Regression. Furthermore, we will discuss Principal Component Analysis, Factor Analysis, and Cluster Analysis. At the end of this series of summaries, a brief introduction to structural equation modeling (SEM) and correspondence analysis will be given. Prerequisite skills readers must have are basic knowledge of statistics and probability, in addition to some advanced knowledge of linear algebra. I have published summary papers in both disciplines; see the reference page.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We conducted a benchmarking analysis of 16 summary-level data-based MR methods for causal inference with five real-world genetic datasets, focusing on four key aspects: type I error control, the accuracy of causal effect estimates, replicability, and power.
The datasets used in the MR benchmarking study can be downloaded here:
"dataset-GWASATLAS-negativecontrol.zip": the GWASATLAS dataset for evaluation of type I error control in confounding scenario (a): Population stratification
"dataset-NealeLab-negativecontrol.zip": the Neale Lab dataset for evaluation of type I error control in confounding scenario (a): Population stratification;
"dataset-PanUKBB-negativecontrol.zip": the Pan UKBB dataset for evaluation of type I error control in confounding scenario (a): Population stratification;
"dataset-Pleiotropy-negativecontrol": the dataset used for evaluation of type I error control in confounding scenario (b): Pleiotropy;
"dataset-familylevelconf-negativecontrol.zip": the dataset used for evaluation of type I error control in confounding scenario (c): Family-level confounders;
"dataset_ukb-ukb.zip": the dataset used for evaluation of the accuracy of causal effect estimates;
"dataset-LDL-CAD_clumped.zip": the dataset used for evaluation of replicability and power;
Each of the datasets contains the following files:
"Tested Trait pairs": the exposure-outcome trait pairs to be analyzed;
"MRdat" refers to the summary statistics after performing IV selection (p-value < 5e-05) and PLINK LD clumping with a clumping window size of 1000kb and an r^2 threshold of 0.001.
"bg_paras" are the estimated background parameters "Omega" and "C" which will be used for MR estimation in MR-APSS.
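As a minimal illustration of the IV-selection threshold applied when building these files (a Python sketch with invented SNP names; the real pipeline performs the LD-clumping step in PLINK, which is not shown):

```python
# hypothetical (snp, p-value) pairs from GWAS summary statistics
sumstats = [("rs1", 1.2e-8), ("rs2", 3.0e-4), ("rs3", 2.5e-6), ("rs4", 0.04)]

# keep only variants passing the instrument threshold p < 5e-05;
# LD clumping (1000 kb window, r^2 < 0.001) would follow in PLINK
ivs = [snp for snp, p in sumstats if p < 5e-5]
print(ivs)  # ['rs1', 'rs3']
```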
Note:
Supplemental Tables S1-S7.xlsx provide the download links for the original GWAS summary-level data for the traits used as exposures or outcomes.
The formatted datasets after quality control are accessible on our GitHub website (https://github.com/YangLabHKUST/MRbenchmarking).
The details on quality control of GWAS summary statistics, formatting GWASs, and LD clumping for IV selection can be found on the MR-APSS software tutorial on the MR-APSS website (https://github.com/YangLabHKUST/MR-APSS).
R code for running MR methods is also available at https://github.com/YangLabHKUST/MRbenchmarking.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A visual summary of the contents of the 4TU Research Data Repository, in 4 plots.
analyze the current population survey (cps) annual social and economic supplement (asec) with r the annual march cps-asec has been supplying the statistics for the census bureau's report on income, poverty, and health insurance coverage since 1948. wow. the us census bureau and the bureau of labor statistics (bls) tag-team on this one. until the american community survey (acs) hit the scene in the early aughts (2000s), the current population survey had the largest sample size of all the annual general demographic data sets outside of the decennial census - about two hundred thousand respondents. this provides enough sample to conduct state- and a few large metro area-level analyses. your sample size will vanish if you start investigating subgroups by state - consider pooling multiple years. county-level is a no-no. despite the american community survey's larger size, the cps-asec contains many more variables related to employment, sources of income, and insurance - and can be trended back to harry truman's presidency. aside from questions specifically asked about an annual experience (like income), many of the questions in this march data set should be treated as point-in-time statistics. cps-asec generalizes to the united states non-institutional, non-active duty military population. the national bureau of economic research (nber) provides sas, spss, and stata importation scripts to create a rectangular file (rectangular data means only person-level records; household- and family-level information gets attached to each person). to import these files into r, the parse.SAScii function uses nber's sas code to determine how to import the fixed-width file, then RSQLite to put everything into a schnazzy database. you can try reading through the nber march 2012 sas importation code yourself, but it's a bit of a proc freak show.
this new github repository contains three scripts: 2005-2012 asec - download all microdata.R download the fixed-width file containing household, family, and person records import by separating this file into three tables, then merge 'em together at the person-level download the fixed-width file containing the person-level replicate weights merge the rectangular person-level file with the replicate weights, then store it in a sql database create a new variable - one - in the data table 2012 asec - analysis examples.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights perform a boatload of analysis examples replicate census estimates - 2011.R connect to the sql database created by the 'download all microdata' program create the complex sample survey object, using the replicate weights match the sas output shown in the png file below 2011 asec replicate weight sas output.png statistic and standard error generated from the replicate-weighted example sas script contained in this census-provided person replicate weights usage instructions document. click here to view these three scripts for more detail about the current population survey - annual social and economic supplement (cps-asec), visit: the census bureau's current population survey page the bureau of labor statistics' current population survey page the current population survey's wikipedia article notes: interviews are conducted in march about experiences during the previous year. the file labeled 2012 includes information (income, work experience, health insurance) pertaining to 2011. when you use the current population survey to talk about america, subtract a year from the data file name. as of the 2010 file (the interview focusing on america during 2009), the cps-asec contains exciting new medical out-of-pocket spending variables most useful for supplemental (medical spending-adjusted) poverty research.
confidential to sas, spss, stata, sudaan users: why are you still rubbing two sticks together after we've invented the butane lighter? time to transition to r. :D
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset presents the code written for the analysis and modelling for the Jellyfish Forecasting System for NESP TWQ Project 2.2.3. The Jellyfish Forecasting System (JFS) searches for robust statistical relationships between historical sting events (and observations) and local environmental conditions. These relationships are tested using data to quantify the underlying uncertainties. They then form the basis for forecasting risk levels associated with current environmental conditions.
The development of the JFS modelling and analysis is supported by the Venomous Jellyfish Database (sting events and specimen samples – November 2018) (NESP 2.2.3, CSIRO) with corresponding analysis of wind fields and tidal heights along the Queensland coastline. The code has been calibrated and tested for the study focus regions including Cairns (Beach, Island, Reef), Townsville (Beach, Island+Reef) and Whitsundays (Beach, Island+Reef).
The JFS uses the European Centre for Medium-Range Weather Forecasts (ECMWF) wind fields from the ERA-Interim daily product (https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era-interim). This daily product has global coverage at a spatial resolution of approximately 80 km. However, only 11 locations off the Queensland coast were extracted, covering the period 1-Jan-1985 to 31-Dec-2016. For the modelling, the data have been transformed into CSV files containing date, eastward wind (m/s) and northward wind (m/s), for each of the 11 geographical locations.
Hourly tidal height was calculated from tidal harmonics supplied by the Bureau of Meteorology (http://www.bom.gov.au/oceanography/projects/ntc/ntc.shtml) using the XTide software (http://www.flaterco.com/xtide/). Hourly tidal heights have been calculated for 7 sites along the Queensland coast (Albany Island, Cairns, Cardwell, Cooktown, Fife, Grenville, Townsville) for the period 1-Jan-1985 to 31-Dec-2017. Data has been transformed into CSV files, one for each of the 7 sites. Columns correspond to number of days since 1-Jan 1990 and tidal height (m).
Irukandji stings were then modelled using a generalised linear model (GLM). A GLM generalises ordinary linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value (McCullagh & Nelder 1989). For each region, we used a GLM with the number of Irukandji stings per day as the response variable. The GLM had a Poisson error structure and a log link function (Crawley 2005). For the Poisson GLMs, we inferred absences when stings were not recorded in the data for a day. We consider that there was reasonably consistent sampling effort in the database since 1985, but very patchy prior to this date. It should be noted that Irukandji are very patchy in time; for example, there was a single sting record in 2017 despite considerable effort trying to find stings in that year. Although the database might miss small and localised Irukandji sting events, we believe it captures larger infestation events.
We included six predictors in the models: Month, two wind variables, and three tidal variables. Month was a factor and arranged so that the summer was in the middle of the year (i.e., from June to May). The two wind variables were Speed and Direction. For each day within each region (Cairns, Townsville or Whitsundays), hourly wind-speed and direction was used. We derived cumulative wind Speed and Direction, working backwards from each day, with the current day being Day 1. We calculated cumulative winds from the current day (Day 1) to 14 days previously for every day in every Region and Area. To provide greater weighting for winds on more recent days, we used an inverse weighting for each day, where the weighting was given by 1/i for each day i. Thus, the Cumulative Speed for n days is given by:
Cumulative Speed_n=(\sum_(i=1)^n Speed_i/i) / (\sum_(i=1)^n 1/i)
For example, calculations for the 3-day cumulative wind speed are:
(1/1×Wind Day 1 + 1/2 × Wind Day 2 + 1/3 × Wind Day 3) / (1/1+1/2+1/3)
Similarly, we calculated the cumulative weighted wind Direction using the formula:
Cumulative Direction_n=(\sum_(i=1)^n Direction_i/i) / (\sum_(i=1)^n 1/i)
We used circular statistics in the R package circular to calculate the weighted cumulative mean, because a direction of 0º is the same as 360º. We initially used a smoother for this term in the model, but because of its non-linearity and the lack of winds from all directions, we found it better to use wind Direction as a factor with four levels (NW, NE, SE and SW). In some Regions and Areas, not all wind Directions were present.
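The inverse-weighted cumulative speed formula above can be sketched as follows (a Python illustration; the published analysis is in R, and the circular averaging needed for Direction is omitted):

```python
def cumulative_speed(daily_speeds, n):
    """Inverse-weighted cumulative wind speed over the n most recent days.

    daily_speeds[0] is the current day (Day 1), daily_speeds[1] the day
    before, and so on; day i receives weight 1/i, so recent winds dominate.
    """
    num = sum(s / i for i, s in enumerate(daily_speeds[:n], start=1))
    den = sum(1.0 / i for i in range(1, n + 1))
    return num / den

# the 3-day worked example from the text:
# (1/1*w1 + 1/2*w2 + 1/3*w3) / (1/1 + 1/2 + 1/3)
print(cumulative_speed([6.0, 6.0, 6.0], 3))  # ~6.0 (equal winds give the plain value)
```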
To assign each event to the tidal cycle, we used tidal data from the closest of the seven stations to calculate three tidal variables: (i) the tidal range each day (m); (ii) the tidal height (m); and (iii) whether the tide was incoming or outgoing. To estimate the three tidal variables, the time of day of the event was required. However, the Time of Day was only available for 780 observations, and the 291 missing observations were estimated assuming a random Time of Day, which will not influence the relationship but will keep these rows in the analysis. Tidal range was not significant in any models and will not be considered further.
To focus on times when Irukandji were present, months when stings never occurred in an area/region were excluded from the analysis – this is generally the winter months. For model selection, we used Akaike Information Criterion (AIC), which is an estimate of the relative quality of models given the data, to choose the most parsimonious model. We thus do not talk about significant predictors, but important ones, consistent with information theoretic approaches.
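Model comparison here follows the standard AIC definition, 2k - 2 log L, with lower values indicating a more parsimonious model; a minimal sketch (Python; the function name and example likelihoods are ours):

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*logL (lower = more parsimonious)."""
    return 2 * n_params - 2 * log_likelihood

# two hypothetical candidate GLMs: the second buys little extra fit
# with an additional parameter, so the first is preferred
print(aic(-120.0, 5) < aic(-119.8, 6))  # True
```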
Limitations: It is important to note that while the presence of Irukandji is more likely on high risk days, the forecasting system should not be interpreted as predicting the presence of Irukandji or that stings will occur.
Format:
It is a text file with a .r extension, the default code format in R. This code runs on the csv datafile “VJD_records_EXTRACT_20180802_QLD.csv” that has latitude, longitude, date, and time of day for each Irukandji sting on the GBR. A subset of these data have been made publicly available through eAtlas, but not all data could be made publicly available because of permission issues. For more information about data permissions, please contact Dr Lisa Gershwin (lisa.gershwin@stingeradvisor.com).
Data Location:
This dataset is filed in the eAtlas enduring data repository at: data\custodian\2016-18-NESP-TWQ-2\2.2.3_Jellyfish-early-warning\data\ and https://github.com/eatlas/NESP_2.2.3_Jellyfish-early-warning
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data sets and computer code for the book chapter titled "Missing Data in the Analysis of Multilevel and Dependent Data" submitted for publication in the second edition of "Dependent Data in Social Science Research" (Stemmler et al., 2015). This repository includes the computer code (".R") and the data sets from both example analyses (Examples 1 and 2). The data sets are available in two file formats (binary ".rda" for use in R; plain-text ".dat").
The data sets contain simulated data from 23,376 (Example 1) and 23,072 (Example 2) individuals from 2,000 groups on four variables:
ID = group identifier (1-2000)
x = numeric (Level 1)
y = numeric (Level 1)
w = binary (Level 2)
In all data sets, missing values are coded as "NA".
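Reading the plain-text “.dat” files therefore requires treating the string "NA" as missing; a stdlib-only Python sketch (column names from the description above, values invented):

```python
import csv
import io

# tiny mock of the whitespace-delimited ".dat" layout (invented values)
dat = "ID x y w\n1 0.5 NA 1\n1 NA 1.2 1\n2 0.3 0.9 0\n"
rows = list(csv.DictReader(io.StringIO(dat), delimiter=" "))

# convert "NA" to None so missingness can be counted per variable
parsed = [{k: (None if v == "NA" else v) for k, v in r.items()} for r in rows]
missing_x = sum(1 for r in parsed if r["x"] is None)
print(missing_x)  # 1
```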
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset was created within the Bioregional Assessment Programme. Data has not been derived from any source datasets. Metadata has been compiled by the Bioregional Assessment Programme.
This dataset contains a set of generic R scripts that are used in the propagation of uncertainty through numerical models.
The R scripts are loaded as a library and carry out the propagation of uncertainty through numerical models. They contain the functions to create the statistical emulators and to perform the necessary data transformations and back-transformations. The scripts are self-documenting and were created by Dan Pagendam (CSIRO) and Warren Jin (CSIRO).
Bioregional Assessment Programme (2016) R-scripts for uncertainty analysis v01. Bioregional Assessment Source Dataset. Viewed 13 March 2019, http://data.bioregionalassessments.gov.au/dataset/322c38ef-272f-4e77-964c-a14259abe9cf.
https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR and the use of UMI allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on the PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al., "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies".
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005.
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline, or the CCS read files from the datasets which were not indexed (M1567, M004, M005), were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer); the pipeline uses these to further demultiplex the reads by sample and to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz), as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from the sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each contains an .Rmd file with the same name, which is used to collect, summarize, and analyze the data. All of this code was written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd contains instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample that pass various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, the different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed and, for each family with discordant sUMI and dUMI sequences, the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file for each sample was extracted from all of the tagged.tar.gz files, combined, and used as input to create a single dataset containing all UMI information from all samples. This file, dUMI_df.csv, was used as input for Figures.Rmd.
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Naturally occurring radium isotopes (224Ra, 226Ra, 228Ra), reported in dpm/m3, were used to determine lateral mixing processes.
AMLR (Antarctic Marine Living Resources) R/V Yuzhmorgeologiya, January 2006:
The research program was focused in the southern Drake Passage along the Shackleton Shelf, located near the Bransfield Strait. Samples were obtained from the R/V Yuzhmorgeologiya and from inflatables that were taken to island locations.
Lat/Lon bounding box:
-62.2538 Lat, -62.9966 Lon
-63.2335 Lat, -59.0332 Lon
-59.9964 Lat, -55.7612 Lon
-61.4995 Lat, -53.9996 Lon
NBP (Nathaniel B. Palmer) R/V Nathaniel B. Palmer, July 2006:
The research was conducted in the same region of the Drake Passage as the AMLR cruise. Samples were obtained aboard the R/V Nathaniel B. Palmer.
Lat/Lon bounding box:
-60.4991 Lat, -58.5613 Lon
-62.3599 Lat, -58.0392 Lon
-60.2783 Lat, -57.4509 Lon
-61.2683 Lat, -54.2852 Lon
Brzezinski, M.A., Nelson, D.M., Franck, V.M. and Sigmon, D.E. (2001). "Silicon dynamics within an intense open-ocean diatom bloom in the Pacific sector of the Southern Ocean." Deep-Sea Research Part II 48: 3997-4018.
Rutgers van der Loeff, M., Sarin, M.M., Baskaran, M., Benitez-Nelson, C., Buesseler, K.O., Charette, M., Dai, M., Gustafsson, Ö., Masque, P., Morris, P.J., Orlandini, K., Rodriguez y Baena, A., Savoye, N., Schmidt, S., Turnewitsch, R., Vöge, I. and Waples, J.T. (2006). "A review of present techniques and methodological advances in analyzing 234Th in aquatic systems." Marine Chemistry 100(3-4): 190-212.
Pike, S.M., Buesseler, K.O., Andrews, J. and Savoye, N. (2005). "Quantification of 234Th recovery in small volume sea water samples by inductively coupled plasma mass spectrometry." Journal of Radioanalytical and Nuclear Chemistry 263(2): 355-360.
Moore, W.S. and Arnold, R. (1996). "Measurement of 223Ra and 224Ra in coastal waters using a delayed coincidence counter." Journal of Geophysical Research 101(C1): 1321-1329.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is about book subjects and is filtered to the book "An Introduction to Analysis of Financial Data with R", featuring 10 columns including authors, average publication date, book publishers, book subject, and books. The preview is ordered by number of books (descending).
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Script and the necessary data (subtransect sizes, organism counts, sizes, and masses) to compute the abundance of sponges and asteroids and the biomass of sponges on the 2007 (PS69/724-1) and 2011 (PS77/253-1) ROV transects. The Zip archive contains a more detailed description of the script content and of the various steps used to obtain the data.
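The core of such a computation — normalizing counts by surveyed subtransect area and scaling counts by an individual mass — can be sketched in R. All column names and values below are invented for illustration; the script in the Zip archive documents the actual steps and data.

```r
# Hypothetical subtransect data (invented for illustration only)
subtransects <- data.frame(
  id        = c("t1", "t2", "t3"),
  area_m2   = c(25, 30, 20),   # surveyed area of each subtransect
  sponges   = c(12, 9, 15),    # individuals counted
  asteroids = c(3, 5, 2)
)
mean_sponge_mass_g <- 40       # assumed mean wet mass per sponge individual

# Abundance: individuals per square metre, per subtransect
subtransects$sponge_abund_m2   <- subtransects$sponges   / subtransects$area_m2
subtransects$asteroid_abund_m2 <- subtransects$asteroids / subtransects$area_m2

# Biomass: abundance scaled by mean individual mass (g per square metre)
subtransects$sponge_biomass_g_m2 <-
  subtransects$sponge_abund_m2 * mean_sponge_mass_g

subtransects
```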
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset was used to collect and analyze data for the MPhil thesis, "Lithic Technological Change and Behavioral Responses to the Last Glacial Maximum Across Southwestern Europe." It contains the raw data collected from published literature, the R code used to run correspondence analysis on the data and create graphical representations of the results, notes to aid in interpreting the dataset, and a list detailing how variables in the dataset were grouped for use in analysis. The file "Diss Data.xlsx" contains the raw data collected from publications on Upper Paleolithic archaeological sites in France, Spain, and Italy; this data is the basis for all other files included in the repository. The document "Diss Data Notes.docx" contains detailed information about the raw data and is useful for understanding its context. "Revised Variable Groups.docx" lists all of the variables from the raw data considered "tool types" and the major categories into which they were sorted for analysis. "Group Definitions.docx" provides the criteria used to make the groups listed in the "Revised Variable Groups" document. "r_diss_data.xlsx" contains only the variables from the raw data that were considered for the correspondence analysis carried out in RStudio. The document "ca_barplot.R" contains the R code written to perform correspondence analysis and percent composition analysis on the data from "R_Diss_Data.xlsx"; it also contains code for creating scatter plots and bar graphs displaying the results from the CA and percent composition tests. The R packages used to carry out the analysis and to create graphical representations of the results are listed under "Software/Usage Instructions." "climate_curve.R" contains the R code used to create climate curves from the NGRIP and GRIP data available open access from the Niels Bohr Institute Centre for Ice and Climate.
The link to access this data is provided in "Related Resources" below.
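As a generic illustration of the kind of correspondence analysis described, the following minimal R sketch runs MASS::corresp on an invented site-by-tool-type count table. The site and type names and all counts are hypothetical; the thesis's actual analysis lives in "ca_barplot.R" and uses its own data and packages.

```r
library(MASS)  # corresp() performs simple correspondence analysis

# Toy site-by-tool-type count table (all values invented for illustration)
counts <- matrix(c(12,  5,  3,
                    4, 18,  7,
                    6,  2, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(site = c("SiteA", "SiteB", "SiteC"),
                                 type = c("Burins", "Scrapers", "Backed")))

# Two-dimensional correspondence analysis
ca <- corresp(counts, nf = 2)

# Row (site) and column (tool-type) scores on the first two axes
ca$rscore
ca$cscore

# Symmetric CA map, analogous to the scatter plots a CA script might draw
biplot(ca)
```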
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
These files are the dataset for the paper "Classifying hot water chemistry: Application of multivariate statistics". Authors: Prihadi Sumintadireja, Dasapta Erwin Irawan, Yuano Rezky, Prana Ugiana Gio, and Anggita Agustin.
The linked bookdown contains the notes and most exercises for a course on data analysis techniques in hydrology using the programming language R. The material will be updated each time the course is taught. If new topics are added, the topics they replace will remain, in case they are useful to others.
I hope these materials can be a resource to those teaching themselves R for hydrologic analysis and/or for instructors who may want to use a lesson or two, or the entire course. At the top of each chapter there is a link to a GitHub repository. Each repository contains the code that produces the chapter, along with a version in which the code chunks are blank. These repositories are all template repositories, so you can easily copy them to your own GitHub space by clicking Use This Template on the repo page.
In my class, I work through each document, live coding with students following along. Typically I ask students to watch as I code and explain each chunk, and then replicate it on their computers. Depending on the lesson, I will ask students to try some of the chunks before I show them the code, as an in-class activity. Some chunks are explicitly designed for this purpose and are typically labeled a "challenge."
Chapters called ACTIVITY are either homework or class-period-long in-class activities. The code chunks in these are therefore blank. If you would like a key for any of these, please just send me an email.
If you have questions, suggestions, or would like activity answer keys, etc. please email me at jpgannon at vt.edu
Finally, if you use this resource, please fill out the survey on the first page of the bookdown (https://forms.gle/6Zcntzvr1wZZUh6S7). This will help me get an idea of how people are using this resource, how I might improve it, and whether or not I should continue to update it.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sensory data analysis by example with R is a book. It was written by Sébastien Lê and published by Chapman & Hall/CRC in 2014.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R, which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
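The Iris-module exercises described above — summary statistics, a correlation, a histogram, and a scatter plot — can be sketched in a few lines of base R. The column names are those of R's built-in iris dataset; the case study's own worksheets may differ.

```r
# Summary statistics for every measurement column in the built-in iris data
summary(iris)

# Correlation between petal length and petal width across all 150 irises
cor(iris$Petal.Length, iris$Petal.Width)

# Histogram of sepal length
hist(iris$Sepal.Length,
     main = "Sepal length across all irises",
     xlab = "Sepal length (cm)")

# Scatter plot of petal length vs. petal width, colored by species
plot(iris$Petal.Length, iris$Petal.Width,
     col = iris$Species, pch = 19,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
```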