Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This book is written for statisticians, data analysts, programmers, researchers, teachers, students, professionals, and general consumers on how to perform different types of statistical data analysis for research purposes using the R programming language. R is an open-source software and object-oriented programming language with a development environment (IDE) called RStudio for computing statistics and graphical displays through data manipulation, modelling, and calculation. R packages and supported libraries provide a wide range of functions for programming and analyzing data. Unlike many existing statistical software packages, R has the added benefit of allowing users to write more efficient code by using command-line scripting and vectors. It has several built-in functions and libraries that are extensible and allow users to define their own (customized) functions for how they expect the program to behave while handling the data, which can also be stored in the simple object system.

For all intents and purposes, this book serves as both a textbook and a manual for R statistics, particularly in academic research, data analytics, and computer programming, targeted to help inform and guide the work of R users and statisticians. It provides information about the different types of statistical data analysis and methods, and the best scenarios for using each of them in R. It gives a hands-on, step-by-step practical guide on how to identify and conduct the different parametric and non-parametric procedures. This includes a description of the different conditions or assumptions that are necessary for performing the various statistical methods or tests, and how to understand the results of the methods. The book also covers the different data formats and sources, and how to test for the reliability and validity of the available datasets. Different research experiments, case scenarios, and examples are explained in this book. It is the first book to provide a comprehensive description and step-by-step practical hands-on guide to carrying out the different types of statistical analysis in R, particularly for research purposes, with examples. It ranges from how to import and store datasets in R as objects, how to code and call the methods or functions for manipulating the datasets or objects, factorization, and vectorization, to better reasoning, interpretation, and storage of the results for future use, and graphical visualizations and representations. Thus, the book represents a congruence of statistics and computer programming for research.
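As a small illustration of the command-line scripting, vectorization, and user-defined functions described above (a minimal sketch; the function and data are made up for illustration):

# Define a customized function that standardizes a numeric vector
standardize <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

scores <- c(12, 15, 9, 22, 18)    # a simple data vector
z <- standardize(scores)          # vectorized: the whole vector is processed at once
results <- list(raw = scores, standardized = z)
saveRDS(results, "results.rds")   # store the results object for future use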
This archive contains code and data for reproducing the analysis for “Replication Data for Revisiting ‘The Rise and Decline’ in a Population of Peer Production Projects”. Depending on what you hope to do with the data, you probably do not want to download all of the files. Depending on your computation resources, you may not be able to run all stages of the analysis. The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in code.tar. If you only want to run the final analysis or to play with the datasets used in the analysis of the paper, you want intermediate_data.7z or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage uses the program “wikiq” to parse MediaWiki XML dumps and create tsv files that have edit data for each wiki. The second stage generates the all.edits.RDS file, which combines these tsvs into a dataset of edits from all the wikis. This file is expensive to generate and, at 1.5 GB, is pretty big. The third stage builds smaller intermediate files that contain the analytical variables from these tsv files. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and LaTeX typeset the manuscript. A stage will only run if the outputs from the previous stages do not exist, so if the intermediate files exist they will not be regenerated and only the final analysis will run. The exception is that stage 4, fitting models and generating plots, always runs. If you only want to replicate from the second stage onward, you want wikiq_tsvs.7z. If you want to replicate everything, you want wikia_mediawiki_xml_dumps.7z.001, wikia_mediawiki_xml_dumps.7z.002, and wikia_mediawiki_xml_dumps.7z.003. These instructions work backwards from building the manuscript using knitr, loading the datasets, and running the analysis, to building the intermediate datasets.

Building the manuscript using knitr
This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run apt install r-cran-knitr latexmk texlive-latex-extra. Alternatively, you can upload the necessary files to a project on Overleaf.com. Download code.tar; this has everything you need to typeset the manuscript. Unpack the tar archive; on a unix system this can be done by running tar xf code.tar. Navigate to code/paper_source. Install the R dependencies: in R, run install.packages(c("data.table","scales","ggplot2","lubridate","texreg")). On a unix system you should be able to run make to build the manuscript generalizable_wiki.pdf. Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on Overleaf.com.

Loading intermediate datasets
The intermediate datasets are found in the intermediate_data.7z archive. They can be extracted on a unix system using the command 7z x intermediate_data.7z. The files are 95 MB uncompressed. These are RDS (R data set) files and can be loaded in R using the readRDS function, for example newcomer.ds <- readRDS("newcomers.RDS"). If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.

Running the analysis
Fitting the models may not work on machines with less than 32 GB of RAM. If you have trouble, you may find the functions in lib-01-sample-datasets.R useful to create stratified samples of data for fitting models.
See line 89 of 02_model_newcomer_survival.R for an example. Download code.tar and intermediate_data.7z to your working folder and extract both archives; on a unix system this can be done with the command tar xf code.tar && 7z x intermediate_data.7z. Install the R dependencies: install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). On a unix system you can simply run regen.all.sh to fit the models, build the plots, and create the RDS files.

Generating datasets

Building the intermediate files
The intermediate files are generated from all.edits.RDS. This process requires about 20 GB of memory. Download all.edits.RDS, userroles_data.7z, selected.wikis.csv, and code.tar. Unpack code.tar and userroles_data.7z; on a unix system this can be done using tar xf code.tar && 7z x userroles_data.7z. Install the R dependencies: in R, run install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2")). Run 01_build_datasets.R.

Building all.edits.RDS
The intermediate RDS files used in the analysis are created from all.edits.RDS. To replicate building all.edits.RDS, you only need to run 01_build_datasets.R when the int... Visit https://dataone.org/datasets/sha256%3Acfa4980c107154267d8eb6dc0753ed0fde655a73a062c0c2f5af33f237da3437 for complete metadata about this dataset.
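For example, the dependency installation and dataset loading described under "Loading intermediate datasets" above can be done in R as follows (a minimal sketch; the str() call is only for inspection):

# Install the R dependencies named above (once)
install.packages(c("data.table", "scales", "ggplot2", "lubridate", "texreg"))

# Load an intermediate dataset extracted from intermediate_data.7z
newcomer.ds <- readRDS("newcomers.RDS")

# Inspect the structure before running the analysis scripts
str(newcomer.ds)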
https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164
-----------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34 GB unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2 TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semistructured interviews.
- LICENSE - text of GPL v3, under which this dataset is published
- INSTALL.md - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):

- up to 2 TB of disk space (see Step 2 detail levels)
- at least 16 GB of RAM (64 GB preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:
  git clone https://gitlab.com/user2589/ghd.git
  git checkout 0.1.0
  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.

- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.

- create a folder `
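Separately, once survival_data.csv is available (Option 2.a above), a minimal R sketch for loading it alongside the provided modelling scripts might look like the following; whether build_model.r reads the CSV itself is not documented here, so the source() calls are only illustrative:

# Load the precomputed survival table from dataset_minimal_Jan_2018.zip
survival_data <- read.csv("survival_data.csv", stringsAsFactors = FALSE)
str(survival_data)   # inspect the columns before modelling

# The survival models themselves are built by the provided R files
source("helpers.r")
source("build_model.r")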
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
all-recipes-xs (500)
All Recipes dataset (extra small).

from datasets import load_dataset
dataset = load_dataset("AWeirdDev/all-recipes-xs")

Alternatively, load with pickle from _frozen.pkl:

import pickle
import requests

r = requests.get("https://huggingface.co/datasets/AWeirdDev/all-recipes-xs/resolve/main/_frozen.pkl")
dataset = pickle.loads(r.content)
Features
Note: Empty values are presented as "unknown" instead of None (normally, unless… See the full description on the dataset page: https://huggingface.co/datasets/AWeirdDev/all-recipes-xs.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Author: Andrew J. Felton
Date: 10/29/2024
This R project contains the primary code and data (following pre-processing in Python) used for data production, manipulation, visualization, analysis, and figure production for the study entitled:
"Global estimates of the storage and transit time of water through vegetation"
Please note that 'turnover' and 'transit' are used interchangeably. Also note that this R project has been updated multiple times as the analysis has been updated.
Data information:
The data folder contains key data sets used for analysis. In particular:
"data/turnover_from_python/updated/august_2024_lc/" contains the core datasets used in this study including global arrays summarizing five year (2016-2020) averages of mean (annual) and minimum (monthly) transit time, storage, canopy transpiration, and number of months of data able as both an array (.nc) or data table (.csv). These data were produced in python using the python scripts found in the "supporting_code" folder. The remaining files in the "data" and "data/supporting_data"" folder primarily contain ground-based estimates of storage and transit found in public databases or through a literature search, but have been extensively processed and filtered here. The "supporting_data"" folder also contains annual (2016-2020) MODIS land cover data used in the analysis and contains separate filters containing the original data (.hdf) and then the final process (filtered) data in .nc format. The resulting annual land cover distributions were used in the pre-processing of data in python.
Code information:
Python scripts can be found in the "supporting_code" folder.
Each R script in this project has a role:
"01_start.R": This script sets the working directory, loads in the tidyverse package (the remaining packages in this project are called using the `::` operator), and can run two other scripts: one that loads the customized functions (02_functions.R) and one for importing and processing the key dataset for this analysis (03_import_data.R).
"02_functions.R": This script contains custom functions. Load this using the
`source()` function in the 01_start.R script.
"03_import_data.R": This script imports and processes the .csv transit data. It joins the mean (annual) transit time data with the minimum (monthly) transit data to generate one dataset for analysis: annual_turnover_2. Load this using the
`source()` function in the 01_start.R script.
"04_figures_tables.R": This is the main workhouse for figure/table production and
supporting analyses. This script generates the key figures and summary statistics
used in the study that then get saved in the manuscript_figures folder. Note that all
maps were produced using Python code found in the "supporting_code"" folder.
"supporting_generate_data.R": This script processes supporting data used in the analysis, primarily the varying ground-based datasets of leaf water content.
"supporting_process_land_cover.R": This takes annual MODIS land cover distributions and processes them through a multi-step filtering process so that they can be used in preprocessing of datasets in python.
This module series covers how to import, manipulate, format, and plot time series data stored in .csv format in R. It was originally designed to teach researchers to use NEON plant phenology and air temperature data, and it has since been used in undergraduate classrooms.
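A minimal sketch of the kind of workflow the modules teach (the file and column names here are hypothetical, not taken from the NEON tutorials):

library(ggplot2)

# Hypothetical daily air-temperature file in .csv format
temps <- read.csv("air_temperature.csv")
temps$date <- as.Date(temps$date)            # format the date column

ggplot(temps, aes(x = date, y = temp_mean)) +
  geom_line() +
  labs(x = "Date", y = "Mean air temperature")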
https://pasteur.epa.gov/license/sciencehub-license.html
Publicly available weblinks used to develop research project.
This dataset is associated with the following publication: Hall, E., R. Hall, J. Aron, S. Swanson, M. Philbin, R. Schaefer, T. Jones-Lepp, D. Heggem, J. Lin, E. Wilson, and H. Kahan. An Ecological Function Approach to Managing Harmful Cyanobacteria in Three Oregon Lakes: Beyond Water Quality Advisories and Total Maximum Daily Loads (TMDLs). WATER. MDPI AG, Basel, SWITZERLAND, 11(6): 1125, (2019).
https://choosealicense.com/licenses/cc0-1.0/
Taken from Andrej Karpathy and repurposed for UofT's CSC 413 Deep Learning: 40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
Usage
To use for e.g. character modelling:

from datasets import load_dataset
dataset = load_dataset("r-three/shakespeare-dialogue")
for example… See the full description on the dataset page: https://huggingface.co/datasets/r-three/shakespeare-dialogue.
ABOUT DATASET
This is an R Markdown notebook. It contains a step-by-step guide for working on data analysis with R. It helps you install the relevant packages and load them. It also provides a detailed summary of the "dplyr" commands that you can use to manipulate your data in the R environment.
Anyone new to R who wishes to carry out some data analysis in R can check it out!
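For instance, a few of the dplyr verbs of the kind the notebook summarizes (a generic sketch; the data frame and columns are made up):

library(dplyr)

# A made-up data frame standing in for your own data
df <- data.frame(group = c("a", "a", "b"), value = c(1, 2, 3))

df %>%
  filter(value > 1) %>%             # keep rows that match a condition
  mutate(value2 = value * 2) %>%    # add a derived column
  group_by(group) %>%               # group for summaries
  summarise(mean_value = mean(value))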
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Wiens, S., van Berlekom, E., Szychowska, M., & Eklund, R. (2019). Visual Perceptual Load Does Not Affect the Frequency Mismatch Negativity. Frontiers in Psychology, 10(1970). doi:10.3389/fpsyg.2019.01970
We manipulated visual perceptual load (high and low load) while we recorded electroencephalography. Event-related potentials (ERPs) were computed from these data.
OSF_*.pdf contains the preregistration at the Open Science Framework (OSF): https://doi.org/10.17605/OSF.IO/EWG9X
ERP_2019_rawdata_bdf.zip contains the raw EEG data files that were recorded with a BioSemi system (www.biosemi.com). The files can be opened in MATLAB with the FieldTrip toolbox: https://www.mathworks.com/products/matlab.html, http://www.fieldtriptoolbox.org/
ERP_2019_visual_load_fieldtrip_scripts.zip contains all the MATLAB scripts that were used to process the ERP data with the FieldTrip toolbox: http://www.fieldtriptoolbox.org/
ERP_2019_fieldtrip_mat_*.zip contain the final, preprocessed individual data files. They can be opened with MATLAB.
ERP_2019_visual_load_python_scripts.zip contains the Python scripts for the main task. They need Python (https://www.python.org/) and PsychoPy (http://www.psychopy.org/).
ERP_2019_visual_load_wmc_R_scripts.zip contains the R scripts to process the working memory capacity (wmc) data: https://www.r-project.org/
ERP_2019_visual_load_R_scripts.zip contains the R scripts to analyze the data and the output files with figures (e.g., scatterplots): https://www.r-project.org/
The GitHub site contains the code and data to run two methods as shown in the manuscript. The first quantifies external load, internal load, and source-attributed impact on cyanoHAB severity; this method was written in R (version 4.2.2). The second method maps potential legacy P stores in upstream watersheds of Lake Mendota, WI (USA) and estimates the total P load per contributing area across watersheds; this method was written in Python (version 3.8.5) with GeoPandas (version 0.8.2).

1. Impact_model_method
The impact model method is a series of sequential programs that take original source data, perform quality control checks, and re-format the data as new data files for input into the impact model. The Impact_model_method folder has all programs used in this method. All code is written in R Markdown format (.Rmd). Here is the order in which to run the R programs to achieve the same results as shown in the manuscript:
cyanoHAB_severity.Rmd - calculates cyanoHAB severity (as Chl-a and cyanobacteria density)
TotalP_external_atmdep.Rmd - calculates external total P loads from atmospheric deposition
TotalP_external_inflows.Rmd - calculates external total P loads from inflows (streams)
TotalP_external_all.Rmd - calculates the sum of external total P loads (streams + atmospheric deposition)
EpiVolume.Rmd - calculates the change in volume of the epilimnion using temperature profiles
TotalP_water.Rmd - calculates the total P concentration in the epilimnion, the total P concentration in the hypolimnion, and the ratio of total P in the hypolimnion to the epilimnion (as an alternative indicator of internal P load)
TotalP_internal.Rmd - calculates internal total P loads (or alternatively the ratio of totalP_hypo:totalP_epi)
Model_inputs.Rmd - prepares the data as input for impact modeling (defines predictor and response variables)
Impact_model.Rmd - performs the statistical modeling for impact analysis on cyanoHAB severity by source (external vs internal) and outputs the partitioned sum of squares for each predictor term

2. Spatial_model_method
The spatial model method takes original source data, performs quality control checks, and re-formats the data as new data files for input into the spatial model. The model maps potential legacy P stores in upstream sub-watersheds and quantifies the total P load per contributing area across sub-watersheds. All programs used in this method are provided in the Spatial_model_method folder. All code is written in Python (.py or .ipynb). The specific order in which programs should be run to achieve the same results as shown in the manuscript is as follows:
HydroGraph_functios.py - performs the network mapping of sub-watersheds upstream of the inflows to the lake
watershed_analysis.ipynb - calculates the P export per contributing area for each of the upstream sub-watersheds
OHSA.py - performs the optimized hot spot analysis for determining statistically clustered stream monitoring sites with consistently high P concentrations.

This dataset is associated with the following publication: Knose, L.A., D.L. Cole, E. Martin-Hernandez, V.M. Zavala, M.A. Gonzalez, C. Vaneeckhaute, and G.J. Ruiz-Mercado. The Impact of Legacy Nutrient Loading from Lake Sediments on Cyanobacteria Bloom Severity. ACS ES&T Water. American Chemical Society, Washington, DC, USA, 5(9): 4997-5010, (2025).
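Since the impact-model method above is a sequence of R Markdown programs, one way to run them in the stated order is to render each file in turn (a sketch; it assumes rmarkdown::render is an acceptable way to execute the .Rmd files and that all inputs are in the working directory):

library(rmarkdown)

# The impact-model programs, in the run order given above
scripts <- c("cyanoHAB_severity.Rmd", "TotalP_external_atmdep.Rmd",
             "TotalP_external_inflows.Rmd", "TotalP_external_all.Rmd",
             "EpiVolume.Rmd", "TotalP_water.Rmd", "TotalP_internal.Rmd",
             "Model_inputs.Rmd", "Impact_model.Rmd")
for (f in scripts) render(f)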
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This project is a collection of files to allow users to reproduce the model development and benchmarking in "Dawnn: single-cell differential abundance with neural networks" (Hall and Castellano, under review). Dawnn is a tool for detecting differential abundance in single-cell RNAseq datasets. It is available as an R package here. Please contact us if you are unable to reproduce any of the analysis in our paper. The files in this collection correspond to the benchmarking dataset based on single-cell RNAseq of mouse embryo cells.
FILES:
Input data
Dataset from: "A single-cell molecular map of mouse gastrulation and early organogenesis". Nature 566, pp. 490-495 (2019). The input data is loaded from the MouseGastrulationData R package. We upload here the RDS file generated by loading the dataset in process_mouse_cells.R in case the R package becomes unavailable.
MouseGastrulationData_loaded_dataset.RDS Dataset loaded from MouseGastrulationData R package in process_mouse_cells.R (in call to EmbryoAtlasData function).
Data processing code
process_mouse_cells.R Generates benchmarking dataset from input data (loads input data; runs the standard single-cell RNAseq pipeline). Follows Dann et al. Resulting dataset saved as mouse_gastrulation_data_regen.RDS.
simulate_mouse_pc1_Rscript.R R code to simulate P(Condition_1)s for benchmarking.
simulate_mouse_pc1_bash.sh Bash script to execute simulate_mouse_pc1_Rscript.R. Outputs stored in benchmark_dataset_mouse_pc1s_regen.csv.
simulate_mouse_labels_Rscript.R R code to simulate labels for benchmarking.
simulate_mouse_labels_bash.sh Bash script to execute simulate_mouse_labels_Rscript.R. Outputs stored in benchmark_dataset_mouse.csv.
Resulting datasets
mouse_gastrulation_data_regen.RDS Seurat dataset generated by process_mouse_cells.R.
benchmark_dataset_mouse.csv Cell labels generated by simulate_mouse_labels_bash.sh.
benchmark_dataset_mouse_pc1s_regen.csv P(Condition_1)s generated by simulate_mouse_pc1_bash.sh.
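A minimal R sketch for loading the resulting files listed above for inspection (the read.csv calls assume standard comma-separated files with headers):

# Load the regenerated benchmarking outputs
seurat_obj <- readRDS("mouse_gastrulation_data_regen.RDS")
labels     <- read.csv("benchmark_dataset_mouse.csv")
pc1s       <- read.csv("benchmark_dataset_mouse_pc1s_regen.csv")

str(labels)
str(pc1s)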
This data set features a hyperlink to the New York State Department of Transportation’s Posted Bridges web page, which includes a link to the Posted Bridges interactive map. The interface provides up-to-date information for locating Load-Posted, R-Posted and Other-Posted bridges throughout New York. This site has been developed as a reference tool for operators of Commercial Vehicles to identify and locate structures in New York State where access and use is limited based on the vehicle's weight. Legal weights are established in Section 385 of the Vehicle and Traffic Law. All trucks operating over legal weight pursuant to a Divisible Load Overweight Permit are prohibited from crossing all R-Posted Bridges.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Please use the following code to load the data:

!git lfs install
!git clone https://huggingface.co/datasets/nlp-guild/non-linear-classification

import numpy as np

def load_dataset(path='dataset.npy'):
    """
    :return:
        f_and_xs: numpy array of size [sample_number, channels, sample_length]
        label_0, label_1, label_2: one-hot encodes of size [sample_number, number_bins]
    """
    r = np.load(path, allow_pickle=True).item()
    f_and_xs = r['f_and_xs']
    label_0 = r['l_0']…

See the full description on the dataset page: https://huggingface.co/datasets/nlp-guild/non-linear-classification.
Bkdckedsvfukv
Dataset Description
Dataset for classification tasks
Dataset Statistics
Total Images: 2
Total Categories: 1
Total Labels: 0
Dataset Size: 1 MB
Train Images: 1 (50.0%)
Test Images: 1 (50.0%)
Usage Examples
Loading the Dataset
from datasets import load_dataset
dataset = load_dataset("karthiyayani/bkdckedsvfukv")
train_data = dataset['train']
for… See the full description on the dataset page: https://huggingface.co/datasets/karthiyayani/bkdckedsvfukv.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and currently the escalation of the so-called Israeli-Palestinian Conflict, which strained neighbouring countries like Jordan due to the influx of Syrian refugees and increases population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the nine code chunks described under “Code structure” below, used to read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-day, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders: “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5-year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk, which extracts MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data after registration from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 09th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three (spatially different) time series and merge them later. Note that the time series are temporally consistent.
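A hedged sketch of what this extraction could look like with terra (the input folder and the NDVI sub-dataset index are assumptions, not taken from the repository code):

library(terra)

hdf_files <- list.files("MODIS_raw", pattern = "\\.hdf$", full.names = TRUE)

for (i in seq_along(hdf_files)) {
  # read one sub-dataset of the .hdf file; the NDVI layer is assumed to be sub-dataset 1
  ndvi <- rast(hdf_files[i], subds = 1)
  writeRaster(ndvi, paste0("NDVI_", i, ".tif"), overwrite = TRUE)
}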
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
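A sketch of the merge step under the same assumptions (gtools::mixedsort for natural file ordering, terra::merge for mosaicking; the per-tile folder names are placeholders):

library(terra)
library(gtools)

# Per-date NDVI files exported by the previous step, one folder per tile (placeholder names)
t1 <- mixedsort(list.files("h20v05", pattern = "\\.tif$", full.names = TRUE))
t2 <- mixedsort(list.files("h21v05", pattern = "\\.tif$", full.names = TRUE))
t3 <- mixedsort(list.files("h21v06", pattern = "\\.tif$", full.names = TRUE))

dir.create("merged", showWarnings = FALSE)
for (i in seq_along(t1)) {
  m <- merge(rast(t1[i]), rast(t2[i]))   # merge the first two tiles
  m <- merge(m, rast(t3[i]))             # then add the third tile
  writeRaster(m, file.path("merged", paste0("NDVI_final_", i, ".tif")), overwrite = TRUE)
}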
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We are using a mask, which is provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series datasets from MODIS.
The repository provides the already clipped and merged NDVI datasets.
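A sketch of the cropping step (the mask file name is the one provided in the repository; the folder names are placeholders):

library(terra)
library(gtools)

mask_vec <- vect("MERGED_LEVANT.shp")   # study-area mask provided in the repository
merged_files <- mixedsort(list.files("merged", pattern = "\\.tif$", full.names = TRUE))

dir.create("clipped", showWarnings = FALSE)
for (i in seq_along(merged_files)) {
  r <- crop(rast(merged_files[i]), mask_vec)   # crop to the bounding box of the mask
  r <- mask(r, mask_vec)                       # then mask to the polygon outline
  writeRaster(r, file.path("clipped", paste0("NDVI_merged_clip_", i, ".tif")),
              overwrite = TRUE)
}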
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis on the derived data. The data we load are tricky, as they contain 16-day return periods across a year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and characterize all values with a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from here: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 03 March 2023, 100 m resolution, global coverage). Here, one can download the temporal coverage that is aimed for and reclassify it using the code after cropping to the individual study area. Here, I summed up different rasters to characterize the built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we are using the country productivity data from the supplement in the repository “yield_productivity” (e.g., "Jordan_yield.csv"). Each of the single-country yield datasets is plotted in a ggplot and combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data come in .nc file format, and various variables can be extracted using the [“^a variable name”] command from the spatraster collection. Each time you run the code, this variable name must be adjusted to meet the requirements for the variables (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 09th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names(the spatraster collection).
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores are calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. Regarding, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and DJF (December-February, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95% confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe due to the availability of the GLDAS variables.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset holds the electric load profiles of 50 small and mid-size enterprises in Germany. The load profiles are in 15-minute time resolution for one year. The load is shown in kW as an average over 15 minutes.
The dataset is divided into two:
The IPs from the dataset for 2016 do not reappear in the dataset for 2017.
The dataset LoadProfile_20IPs_2016 is evaluated in the following publication:
Both datasets together are evaluated in multiple publications:
Enjoy.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Helios-R-6M
Helios-R-6M is a high-quality, compact reasoning dataset designed to strengthen multi-step problem solving across mathematics, computer science, and scientific inquiry. While the dataset covers a range of disciplines, math constitutes the largest share of examples and drives the reasoning complexity.
Quick Start with Hugging Face Datasets🤗
pip install -U datasets
from datasets import load_dataset
dataset = load_dataset("prithivMLmods/Helios-R-6M"… See the full description on the dataset page: https://huggingface.co/datasets/prithivMLmods/Helios-R-6M.
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
In cooperation with the City of Durham Public Works Department Stormwater Division, the U.S. Geological Survey (USGS) conducted a study to evaluate whether alternate monitoring strategies that incorporated samples collected across an increased range of streamflows would improve nutrient load estimates for Ellerbe and Sandy Creeks, two small, highly urbanized streams in the City of Durham, North Carolina. This data release provides the associated datasets described in the Scientific Investigations Report, "Assessment of Nutrient Load Estimation Approaches for Small Urban Streams in Durham, North Carolina". Water-quality and streamflow data collected between January 2009 and December 2020 were used to develop instream nutrient-load models using the U.S. Geological Survey R-LOADEST program (Runkel and others, 2004; Runkel, 2013; Lorenz and others, 2017). The datasets contain water-quality data, streamflow data, input files for model calibration and prediction, and output files for mo ...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
An archive containing 100 artificial single-cell datasets. The data of each dataset is an R data file (.rda). The file names have the following format: "splatter_<thousands_of_cells>kCells_groups_set_<set_id>.rda". For example, "splatter_1kCells_groups_set_9.rda" represents the 9th set containing 1000 cells made using the "groups" option of splatSimulate. You can import the data into R using the load() function. Each dataset includes the following R objects:
1) counts : the number of reads or UMIs for each gene in each cell
2) gene.data : a summary of the data generated by Splatter
3) params : the parameters used to generate the dataset
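For example, a minimal sketch of loading one set in R:

# Loading places counts, gene.data, and params in the workspace
load("splatter_1kCells_groups_set_9.rda")

dim(counts)       # reads or UMIs for each gene in each cell
head(gene.data)   # summary of the simulated data from Splatter
params            # parameters used to generate the dataset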