Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
As high-throughput methods become more common, training undergraduates to analyze data must include having them generate informative summaries of large datasets. This flexible case study provides an opportunity for undergraduate students to become familiar with the capabilities of R programming in the context of high-throughput evolutionary data collected using macroarrays. The story line introduces a recent graduate hired at a biotech firm and tasked with analysis and visualization of changes in gene expression from 20,000 generations of the Lenski Lab’s Long-Term Evolution Experiment (LTEE). Our main character is not familiar with R and is guided by a coworker to learn about this platform. Initially this involves a step-by-step analysis of the small Iris dataset built into R which includes sepal and petal length of three species of irises. Practice calculating summary statistics and correlations, and making histograms and scatter plots, prepares the protagonist to perform similar analyses with the LTEE dataset. In the LTEE module, students analyze gene expression data from the long-term evolutionary experiments, developing their skills in manipulating and interpreting large scientific datasets through visualizations and statistical analysis. Prerequisite knowledge is basic statistics, the Central Dogma, and basic evolutionary principles. The Iris module provides hands-on experience using R programming to explore and visualize a simple dataset; it can be used independently as an introduction to R for biological data or skipped if students already have some experience with R. Both modules emphasize understanding the utility of R, rather than creation of original code. Pilot testing showed the case study was well-received by students and faculty, who described it as a clear introduction to R and appreciated the value of R for visualizing and analyzing large datasets.
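For readers new to R, a minimal sketch of the Iris-module exercises described above (summary statistics, a correlation, a histogram, and a scatter plot), using the iris dataset built into R:

# Iris module warm-up: summary statistics, correlation, and basic plots
data(iris)
summary(iris$Sepal.Length)                      # summary statistics
cor(iris$Sepal.Length, iris$Petal.Length)       # correlation between traits
hist(iris$Sepal.Length, main = "Sepal length")  # histogram
plot(iris$Petal.Length, iris$Sepal.Length,      # scatter plot, coloured
     col = iris$Species, pch = 19)              # by species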
MIT Licensehttps://opensource.org/licenses/MIT
License information was derived automatically
R scripts containing statistical data analyses for streamflow and sediment data, including Flow Duration Curves, Double Mass Analysis, Nonlinear Regression Analysis for Suspended Sediment Rating Curves, and Stationarity Tests, as well as several plots.
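As an illustration of one of these analyses, a minimal sketch of a flow duration curve in base R, using simulated discharge values (the actual scripts operate on the streamflow records in the dataset):

# Flow duration curve: discharge against exceedance probability
set.seed(1)
q <- sort(rlnorm(365, meanlog = 2, sdlog = 1), decreasing = TRUE)  # simulated daily flows
exceed <- 100 * seq_along(q) / (length(q) + 1)   # Weibull plotting position
plot(exceed, q, type = "l", log = "y",
     xlab = "Exceedance probability (%)", ylab = "Discharge")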
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We conducted a benchmarking analysis of 16 summary-level data-based MR methods for causal inference with five real-world genetic datasets, focusing on four key aspects: type I error control, the accuracy of causal effect estimates, replicability, and power.
The datasets used in the MR benchmarking study can be downloaded here:
Each of the datasets contains the following files:
Note:
Each R script replicates all of the example code from one chapter of the book. All required data for each script are also uploaded, as are all data used in the practice problems at the end of each chapter. The data are drawn from a wide array of sources, so please cite the original work if you ever use any of these data sets for research purposes.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In some 40 upcoming summary papers, I will present a comprehensive view of Applied Multivariate Statistical Modeling (AMSM). First, I will start with a thorough introduction to AMSM. Then, I will explain univariate descriptive statistics, sampling distributions, estimation, and hypothesis testing. After that, I will give a comprehensive review of multivariate descriptive statistics, the multivariate normal distribution, and multivariate inferential statistics. Having accomplished that, it will be time to discuss various models: ANOVA, MANOVA, Multiple Linear Regression, and Multivariate Linear Regression. Furthermore, we will discuss Principal Component Analysis, Factor Analysis, and Cluster Analysis. At the end of this series of summaries, an introduction to structural equation modeling (SEM) and correspondence analysis will be given. As prerequisites, readers should have basic knowledge of statistics and probability, as well as some advanced knowledge of linear algebra. I have published summary papers in both disciplines; see the reference page.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GWAMA summary statistics of four steroid hormone levels and one steroid hormone ratio using a fixed-effect model.
When using this data, please cite: Pott J, Horn K, Zeidler R, et al. Sex-Specific Causal Relations between Steroid Hormones and Obesity - A Mendelian Randomization Study. Metabolites 2021, 11, 738. https://doi.org/10.3390/metabo11110738
All txt files contain the following columns:
markername
chr
bp_hg19 (base position according to hg19)
ea (effect allele)
oa (other allele)
eaf (effect allele frequency)
info (minimal info score across all used studies)
nSamples (sample size per SNP)
nStudies (number of studies)
beta (effect estimate)
se (standard error)
p (p-value)
I2 (SNP heterogeneity across studies)
phenotype (phenotype setting)
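A minimal sketch of loading one of these files in R and extracting genome-wide significant SNPs; the file name is hypothetical and tab separation is assumed, while the column names follow the list above:

# Read one GWAMA summary file and keep genome-wide significant SNPs
gwama <- read.table("hormone_summary.txt", header = TRUE, sep = "\t")  # hypothetical file name
sig <- subset(gwama, p < 5e-8)    # conventional genome-wide significance threshold
sig[order(sig$p), c("markername", "chr", "bp_hg19", "beta", "se", "p")]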
GNU General Public License v3.0 (GPL-3.0)https://www.gnu.org/licenses/gpl-3.0.html
This dataset is used to visualize the 4TU.ResearchData resources for the plot-a-thon.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Full summary statistics from 41 epigenome-wide association studies (EWAS) conducted by The EWAS Catalog team (www.ewascatalog.org). Meta-data is found in the "studies-full.csv" file and the results are in "full_stats.tar.gz". Unzipping the "full_stats.tar.gz" file will reveal a folder containing 41 csv files, each with the full summary statistics from one EWAS. The results can be linked to the meta-data using the "Results_file" column in "studies-full.csv". These analyses were conducted using data extracted from the Gene Expression Omnibus (GEO). These data were extracted using the geograbi R package. For more information on the EWAS, please consult our paper: Battram, Thomas, et al. "The EWAS Catalog: A Database of Epigenome-wide Association Studies." OSF Preprints, 4 Feb. 2021. https://doi.org/10.31219/osf.io/837wn. Please cite the paper if you use this dataset.
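A minimal sketch of linking the results to the metadata in R, assuming "full_stats.tar.gz" extracts to a folder named "full_stats":

# Join one EWAS results file to its metadata row via "Results_file"
studies <- read.csv("studies-full.csv")
untar("full_stats.tar.gz")        # assumed to create ./full_stats/ with 41 csv files
res <- read.csv(file.path("full_stats", studies$Results_file[1]))
studies[1, ]                      # metadata for the loaded EWAS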
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A visual summary of the contents of the 4TU Research Data Repository, in 4 plots.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We provide the summary statistics of running CONTENT, the context-by-context approach, and UTMOST on over 22 phenotypes. The phenotypes are listed in the manuscript, and their respective studies and sample sizes can be found in a table in the supplementary section of the manuscript. All three methods were trained on GTEx v7 as well as CLUES, a single-cell RNA sequencing dataset of PBMCs. The data include the gene name, model, cross-validated R^2, prediction p-value, TWAS p-value, TWAS Z-score, and a column titled "hFDR" indicating whether the association was statistically significant under hierarchical FDR. The benefits of employing such an approach for all methods can be found in the manuscript.
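A minimal sketch, with hypothetical file and column names inferred from the description above, of filtering one phenotype's results to the associations passing hierarchical FDR:

# Keep TWAS associations flagged as significant under hierarchical FDR
twas <- read.csv("CONTENT_phenotype1.csv")   # hypothetical file name
hits <- subset(twas, hFDR == TRUE)           # hFDR coding is an assumption
hits[order(hits$TWAS_p), ]                   # hypothetical column name TWAS_p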
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian conflict, which has strained neighbouring countries like Jordan through the influx of Syrian refugees and increased the population's vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the 9 code chunks described below, which read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different formats. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and covers the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9 October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three spatially distinct time series and merge them later. Note that the time series are temporally consistent.
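A minimal sketch of this extraction step with terra (the subdataset index and paths are illustrative, and GDAL must be built with HDF4 support):

# Extract the NDVI subdataset from each MOD13Q1 .hdf and save as .tif
library(terra)
hdf_files <- list.files("your_directory_MODIS", pattern = "\\.hdf$", full.names = TRUE)
for (f in hdf_files) {
  s <- sds(f)            # all subdatasets in the HDF container
  ndvi <- s[1]           # NDVI is typically the first MOD13Q1 subdataset
  writeRaster(ndvi, sub("\\.hdf$", "_NDVI.tif", f), overwrite = TRUE)
}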
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three stacks to produce a large, consistent time series of NDVI imagery across the study area. We use the package gtools to load the files in natural order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we first merge two (stack 1, stack 2) and store the result; we then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
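A minimal sketch of the merging step (the folder layout is illustrative; instead of setwd(), the output folder is given explicitly here):

# Merge the three tile stacks date by date into single NDVI images
library(terra)
library(gtools)
t1 <- mixedsort(list.files("h20v05", pattern = "_NDVI\\.tif$", full.names = TRUE))
t2 <- mixedsort(list.files("h21v05", pattern = "_NDVI\\.tif$", full.names = TRUE))
t3 <- mixedsort(list.files("h21v06", pattern = "_NDVI\\.tif$", full.names = TRUE))
for (i in seq_along(t1)) {
  m <- merge(merge(rast(t1[i]), rast(t2[i])), rast(t3[i]))  # merge two, then the third
  writeRaster(m, file.path("merged", sprintf("NDVI_final_%d.tif", i)), overwrite = TRUE)
}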
3_CROP_MODIS_merged_tiles.R
Now we crop the derived MODIS tiles to our study area. We use a mask, provided as a .shp file in the repository named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
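A minimal sketch of this cropping step:

# Crop and mask the merged NDVI files with MERGED_LEVANT.shp
library(terra)
library(gtools)
v <- vect("MERGED_LEVANT.shp")
files <- mixedsort(list.files("merged", pattern = "^NDVI_final_", full.names = TRUE))
for (i in seq_along(files)) {
  clipped <- mask(crop(rast(files[i]), v), v)   # crop to extent, mask to outline
  writeRaster(clipped, file.path("merged", sprintf("NDVI_merged_clip_%d.tif", i)),
              overwrite = TRUE)
}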
4_TREND_analysis_NDVI.R
Now we want to perform trend analysis on the derived data. The data we load are tricky, as they contain a 16-day return period across a year for a period of 22 years. Growing season sums comprise MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values significant at the 0.05 level. Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
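A minimal sketch of the per-pixel trend step, assuming a stack of annual growing-season NDVI sums has already been built (the file name and year range are illustrative):

# Fit a linear trend per pixel; keep slope and p-value
library(terra)
annual_sums <- rast("NDVI_MAM_annual_sums.tif")   # hypothetical: one layer per year
years <- 2001:2022
trend_fun <- function(v) {
  if (sum(!is.na(v)) < 3) return(c(NA, NA))       # too few values to fit
  fit <- lm(v ~ years)
  c(coef(fit)[2], summary(fit)$coefficients[2, 4])  # slope, p-value
}
trend <- app(annual_sums, trend_fun)
slope_sig <- mask(trend[[1]], trend[[2]] < 0.05, maskvalues = FALSE)  # keep p < 0.05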
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
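The z-score step itself is compact; a minimal sketch with illustrative values:

# z-scores: deviation of annual values from the series mean
x <- c(0.41, 0.44, 0.39, 0.47, 0.52)   # illustrative annual NDVI sums
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)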
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We work with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). One can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up the different rasters to characterize the built-up change in continuous values between 1975 and 2022.
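A minimal sketch of this step (the file layout is illustrative):

# Crop each GHSL built-up epoch to the study area and sum them
library(terra)
v <- vect("MERGED_LEVANT.shp")
epochs <- list.files("built_up/raw_data", pattern = "\\.tif$", full.names = TRUE)
cropped <- lapply(epochs, function(f) mask(crop(rast(f), v), v))
built_up_change <- Reduce(`+`, cropped)   # continuous built-up change 1975-2022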
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single country yield datasets is plotted with ggplot and the plots are combined using the patchwork package in R.
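A minimal sketch of the combination with patchwork; the column names (Year, Value) are assumptions about the .csv structure:

# One ggplot per country, combined side by side with patchwork
library(ggplot2)
library(patchwork)
jordan <- read.csv("yield_productivity/Jordan_yield.csv")
syria  <- read.csv("yield_productivity/Syria_yield.csv")
p1 <- ggplot(jordan, aes(Year, Value)) + geom_line() + ggtitle("Jordan")  # assumed columns
p2 <- ggplot(syria,  aes(Year, Value)) + geom_line() + ggtitle("Syria")
p1 + p2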
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9 October 2023). The raw data come in .nc file format, and the various variables can be extracted from the SpatRaster collection using the [“^a variable name”] command. Each time you run the code, this variable name must be adjusted to the variable of interest (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9 October 2023; alternatively, run print(nc) when reading a .nc file with the ncdf4 package in R, or use names() on the SpatRaster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For variables such as rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for the growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and values at the 95 % confidence level are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe, given the availability of the GLDAS variables.
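A minimal sketch of the GLDAS workflow; the variable name "Tair_f_inst" is a placeholder to be adjusted as described above, and complete years of monthly layers are assumed:

# Read one GLDAS variable from the .nc files, crop, and build annual means
library(terra)
nc_files <- list.files("GLDAS", pattern = "\\.nc$", full.names = TRUE)
r <- rast(nc_files, subds = "Tair_f_inst")   # placeholder variable name
v <- vect("MERGED_LEVANT.shp")
r <- mask(crop(r, v), v)
idx <- rep(seq_len(nlyr(r) / 12), each = 12) # monthly layers -> year index
annual_mean <- tapp(r, idx, fun = mean)      # use sum for, e.g., rainfall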
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data and code archive provides all the data and code for replicating the empirical analysis that is presented in the journal article "A Ray-Based Input Distance Function to Model Zero-Valued Output Quantities: Derivation and an Empirical Application" authored by Juan José Price and Arne Henningsen and published in the Journal of Productivity Analysis (DOI: 10.1007/s11123-023-00684-1).
We conducted the empirical analysis with the "R" statistical software (version 4.3.0) using the add-on packages "combinat" (version 0.0.8), "miscTools" (version 0.6.28), "quadprog" (version 1.5.8), "sfaR" (version 1.0.0), "stargazer" (version 5.2.3), and "xtable" (version 1.8.4) that are available at CRAN. We created the R package "micEconDistRay" that provides the functions for empirical analyses with ray-based input distance functions that we developed for the above-mentioned paper. This R package is also available at CRAN (https://cran.r-project.org/package=micEconDistRay).
This replication package contains the following files and folders:
This dataset was derived by the Bioregional Assessment Programme from multiple source datasets. The source datasets are identified in the Lineage field in this metadata statement. The processes undertaken to produce this derived dataset are described in the History field in this metadata statement.
This dataset contains the data and scripts to generate the hydrological response variables for surface water in the Clarence Moreton subregion as reported in CLM261 (Gilfedder et al. 2016).
File CLM_AWRA_HRVs_flowchart.png shows the different files in this dataset and how they interact. The Python and R scripts were written by the BA modelling team to read, combine, and analyse, as detailed below, the source datasets CLM AWRA model, CLM groundwater model V1, and CLM16swg Surface water gauging station data within the Clarence Moreton Basin, to create the hydrological response variables for surface water as reported in CLM2.6.1 (Gilfedder et al. 2016).
R-script HRV_SWGW_CLM.R reads, for each model simulation, the outputs from the surface water model in netcdf format from file Qtot.nc (dataset CLM AWRA model) and the outputs from the groundwater model, flux_change.csv (dataset CLM groundwater model V1) and creates a set of files in subfolder /Output for each GaugeNr and simulation Year:
CLM_GaugeNr_Year_all.csv and CLM_GaugeNr_Year_baseline.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for baseline conditions
CLM_GaugeNr_Year_CRDP.csv: the set of 9 HRVs for GaugeNr and Year for all 5000 simulations for CRDP conditions (=AWRA streamflow - MODFLOW change in SW-GW flux)
CLM_GaugeNr_Year_minMax.csv: minimum and maximum of HRVs over all 5000 simulations
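A minimal sketch, not the BA team's actual script, of the read-and-combine pattern described above; the netCDF variable name ("Qtot") and the flux column name are assumptions:

# Read AWRA streamflow and subtract the MODFLOW change in SW-GW flux
library(ncdf4)
nc <- nc_open("Qtot.nc")
qtot <- ncvar_get(nc, "Qtot")          # assumed variable name
nc_close(nc)
flux <- read.csv("flux_change.csv")
q_crdp <- qtot - flux$flux             # CRDP = AWRA streamflow - flux change
write.csv(data.frame(q_baseline = qtot, q_crdp = q_crdp),
          "Output/CLM_GaugeNr_Year_CRDP.csv", row.names = FALSE)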
Python script CLM_collate_DoE_Predictions.py collates that information into the following files, for each HRV and each maxtype (absolute maximum (amax), relative maximum (pmax) and time of absolute maximum change (tmax)):
CLM_AWRA_HRV_maxtyp_DoE_Predictions: for each simulation and each gauge_nr, the maxtyp of the HRV over the prediction period (2012 to 2102)
CLM_AWRA_HRV_DoE_Observations: for each simulation and each gauge_nr, the HRV for the years that observations are available
CLM_AWRA_HRV_Observations: summary statistics of each HRV and the observed value (based on data set CLM16swg Surface water gauging station data within the Clarence Moreton Basin)
CLM_AWRA_HRV_maxtyp_Predictions: summary statistics of each HRV
R-script CLM_CreateObjectiveFunction.R calculates for each HRV the objective function value for all simulations and stores it in CLM_AWRA_HRV_ss.csv. This file is used by python script CLM_AWRA_SI.py to generate figure CLM-2615-002-SI.png (sensitivity indices).
The AWRA objective function is combined with the overall objective function from the groundwater model in dataset CLM Modflow Uncertainty Analysis (CLM_MF_DoE_ObjFun.csv) into csv file CLM_AWRA_HRV_oo.csv. This file is used to select behavioural simulations in python script CLM-2615-001-top10.py. This script uses files CLM_NodeOrder.csv and BA_Visualisation.py to create the figures CLM-2616-001-HRV_10pct.png.
Bioregional Assessment Programme (2016) CLM AWRA HRVs Uncertainty Analysis. Bioregional Assessment Derived Dataset. Viewed 28 September 2017, http://data.bioregionalassessments.gov.au/dataset/e51a513d-fde7-44ba-830c-07563a7b2402.
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements 20131204
Derived From Qld 100K mapsheets - Mount Lindsay
Derived From Qld 100K mapsheets - Helidon
Derived From Qld 100K mapsheets - Ipswich
Derived From CLM - Woogaroo Subgroup extent
Derived From CLM - Interpolated surfaces of Alluvium depth
Derived From CLM - Extent of Logan and Albert river alluvial systems
Derived From CLM - Bore allocations NSW v02
Derived From CLM - Bore allocations NSW
Derived From CLM - Bore assignments NSW and QLD summary tables
Derived From CLM - Geology NSW & Qld combined v02
Derived From CLM - Orara-Bungawalbin bedrock
Derived From CLM16gwl NSW Office of Water_GW licence extract linked to spatial locations_CLM_v3_13032014
Derived From CLM groundwater model hydraulic property data
Derived From CLM - Koukandowie FM bedrock
Derived From GEODATA TOPO 250K Series 3, File Geodatabase format (.gdb)
Derived From NSW Office of Water - National Groundwater Information System 20140701
Derived From CLM - Gatton Sandstone extent
Derived From CLM16gwl NSW Office of Water, GW licence extract linked to spatial locations in CLM v2 28022014
Derived From Bioregional Assessment areas v03
Derived From NSW Geological Survey - geological units DRAFT line work.
Derived From Mean Annual Climate Data of Australia 1981 to 2012
Derived From CLM Preliminary Assessment Extent Definition & Report( CLM PAE)
Derived From Qld 100K mapsheets - Caboolture
Derived From CLM - AWRA Calibration Gauges SubCatchments
Derived From CLM - NSW Office of Water Gauge Data for Tweed, Richmond & Clarence rivers. Extract 20140901
Derived From Qld 100k mapsheets - Murwillumbah
Derived From AHGFContractedCatchment - V2.1 - Bremer-Warrill
Derived From Bioregional Assessment areas v01
Derived From Bioregional Assessment areas v02
Derived From QLD Current Exploration Permits for Minerals (EPM) in Queensland 6/3/2013
Derived From Pilot points for prediction interpolation of layer 1 in CLM groundwater model
Derived From CLM - Bore water level NSW
Derived From Climate model 0.05x0.05 cells and cell centroids
Derived From CLM - New South Wales Department of Trade and Investment 3D geological model layers
Derived From CLM - Metgasco 3D geological model formation top grids
Derived From State Transmissivity Estimates for Hydrogeology Cross-Cutting Project
Derived From CLM - Extent of Bremer river and Warrill creek alluvial systems
Derived From NSW Catchment Management Authority Boundaries 20130917
Derived From QLD Department of Natural Resources and Mining Groundwater Database Extract 20131111
Derived From Qld 100K mapsheets - Esk
Derived From QLD Dept of Natural Resources and Mines, Groundwater Entitlements linked to bores and NGIS v4 28072014
Derived From BILO Gridded Climate Data: Daily Climate Data for each year from 1900 to 2012
Derived From CLM - Qld Surface Geology Mapsheets
Derived From NSW Office of Water Pump Test dataset
Derived From [CLM -
Attribution 3.0 (CC BY 3.0)https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
The Australian Public Service Statistical Bulletin 2010-11 presents a summary of employment under the Public Service Act 1999 at 30 June 2011 and during the 2010-11 financial year, as well as summary data for the past 15 years. This Excel dataset consists of the tables used to create the Statistical Bulletin. You can view the Bulletin online at http://www.apsc.gov.au/about-the-apsc/parliamentary/aps-statistical-bulletin/aps-statistical-bulletin-2010-11
This data package consists of 26 years (1998-2023) of environmental data and 22 years (2000-2022) of bioclimatic data associated with CAP-LTER long-term point-count bird censusing sites (https://doi.org/10.6073/pasta/4777d7f0a899f506d6d4f9b5d535ba09), temporally aggregated by year and by four meteorological seasons (Winter, Spring, Summer, Fall). The environmental variables include land surface temperature (LST), three spectral indices of vegetation and water – the normalized difference vegetation index (NDVI), the soil adjusted vegetation index (SAVI), and modified normalized difference water index (MNDWI) – and four spectral indices of impervious surface/urbanization. Impervious surface indices include the normalized difference built-up index (NDBI), the normalized difference impervious surface index (NDISI), the enhanced normalized difference impervious surface index (ENDISI), and the normalized impervious surface index (NISI). LST and all spectral indices were derived from annual and seasonal composites of 30-m resolution Landsat 5-9 Level-2 Surface Reflectance imagery. The seven bioclimatic variables (e.g., air temperature, precipitation) were sourced from 1-km resolution gridded estimates of daily climatic data from NASA Daymet V4. We created temporally-aggregated Daymet raster images by calculating mean pixel-values for each season and year, as well as seasonally and annually summed precipitation. We summarized the values of each environmental variable by generating variously-sized (100-m, 500-m, 1000-m) buffers around each bird point count location and extracting weighted mean values of each environmental variable, with each pixel's values weighted by the proportion of its area falling within the buffer. All imagery retrieval and data processing were completed with Google Earth Engine (Gorelick et al. 2017) and program R. A complete description of data processing methods, including the aggregation of imagery by year and season and the calculation of spectral indices, can be found in the data package metadata (see 'Methods and Protocols') and accompanying Javascript and R code.
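A minimal sketch of the buffer-extraction step with terra (the file names, raster layer, and buffer radius are illustrative):

# Weighted mean of a raster variable within 500-m buffers around points
library(terra)
ndvi <- rast("ndvi_annual_composite.tif")    # hypothetical Landsat-derived NDVI
pts  <- vect("bird_count_sites.shp")
buf  <- buffer(pts, width = 500)
# weights = TRUE weights each pixel by its fraction inside the buffer
vals <- extract(ndvi, buf, fun = mean, weights = TRUE, na.rm = TRUE)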
Argumentative skills are crucial for any individual at the personal and professional levels. In recent decades, there has been increasing concern about undergraduates' weak argumentative skills and their considerable difficulty in reworking and expressing their own reflections on a topic. In turn, this has implications for being a critical thinker, able to express an original point of view. Tailored interventions in Higher Education could constitute a powerful approach to promoting argumentative skills and extending these skills to professional and personal life. In this regard, argument maps (AM) could prove to be a valuable support to the visualization process of arguments. They don’t just create associations between concepts, but trace the logical relationships between different statements, allowing you to track the reasoning chain and understand it better. We conducted an experimental study to investigate how a path with AM could support students in increasing their level of text comprehension (CoT) competence, in terms of identifying the elements of an argumentative text, and critical thinking (CT), in terms of reconstructing meaning and building their own reflection.
Our preliminary descriptive analysis suggested a positive role for AM in increasing students’ CoT and CT proficiency levels.
This Zenodo record documents the full analysis process with R (https://cran.r-project.org/bin/windows/base/) and is composed of the following datasets and scripts:
Comprehension of Text and AMs Results - ExpAM.xlsx
Critical Thinking Results - CriThink.xlsx
Argumentative skills in Forum - ExpForum.xlsx
Self-assessment Results - Dataset_Quest.xlsx
Data for Correlation and Regression - Dataset_CorRegr.xlsx
Descriptive Statistics - Preliminary Analysis.R
Inferential Statistics - Correlation and Regression.R
Any comments or improvements are welcome!
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
GWAMA summary statistics of PCSK9 levels using a fixed-effect model. Genome-wide data are given for Europeans with statin adjustment and for Europeans without statin treatment only (a subset of the population). In addition, locus-wide data of the PCSK9 gene locus for African-Americans without statin treatment are listed.
When using this data, please cite: Pott J, Gadin J, Theusch E, et al. Meta-GWAS of PCSK9 levels detects two novel loci at APOB and TM6SF2. Hum Mol Genet. 2021 Sep 30:ddab279. doi: 10.1093/hmg/ddab279. PMID: 34590679
All txt files contain the following columns:
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Bayesian inference operates under the assumption that the empirical data are a good statistical fit to the analytical model, but this assumption can be challenging to evaluate. Here, we introduce a novel R package that utilizes posterior predictive simulation to evaluate the fit of the multispecies coalescent model used to estimate species trees. We conduct a simulation study to evaluate the consistency of different summary statistics in comparing posterior and posterior predictive distributions, the use of simulation replication in reducing error rates, and the utility of parallel process invocation towards improving computation times. We also test P2C2M on two empirical data sets in which hybridization and gene flow are suspected of contributing to shared polymorphism, in violation of the coalescent model: Tamias chipmunks and Myotis bats. Our results indicate that (i) probability-based summary statistics display the lowest error rates, (ii) the implementation of simulation replication decreases the rate of type II errors, and (iii) our R package displays improved statistical power compared to previous implementations of this approach. When probabilistic summary statistics are used, P2C2M corroborates the assumption that genealogies collected from Tamias and Myotis are not a good fit to the multispecies coalescent model. Taken as a whole, our findings argue that an assessment of the fit of the multispecies coalescent model should accompany any phylogenetic analysis that estimates a species tree.