Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
# Annotated 12-lead ECG dataset

Contains 827 ECG tracings from different patients, annotated by several cardiologists, residents, and medical students. It is used as the test set in the paper: "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". It contains annotations for 6 different ECG abnormalities:
- 1st degree AV block (1dAVb);
- right bundle branch block (RBBB);
- left bundle branch block (LBBB);
- sinus bradycardia (SB);
- atrial fibrillation (AF); and,
- sinus tachycardia (ST).

## Folder content:

- `ecg_tracings.hdf5`: HDF5 file containing a single dataset named `tracings`. This dataset is a `(827, 4096, 12)` tensor. The first dimension corresponds to the 827 different exams from different patients; the second dimension corresponds to the 4096 signal samples; the third dimension to the 12 different leads of the ECG exam. The signals are sampled at 400 Hz. Some signals originally have a duration of 10 seconds (10 * 400 = 4000 samples) and others of 7 seconds (7 * 400 = 2800 samples). In order to make them all the same size (4096 samples), we pad them with zeros on both sides. For instance, a 7-second ECG signal with 2800 samples gets 648 samples added at the beginning and 648 at the end, yielding 4096 samples that are then saved in the HDF5 dataset. All signals are represented as floating-point numbers at the scale 1e-4 V, so values should be multiplied by 1e-4 to obtain the signals in V. In Python, one can read this file using the following sequence:

```python
import h5py
import numpy as np

with h5py.File("ecg_tracings.hdf5", "r") as f:
    x = np.array(f["tracings"])
```

- The file `attributes.csv` contains basic patient attributes: sex (M or F) and age. It contains 827 lines (plus the header); the i-th tracing in `ecg_tracings.hdf5` corresponds to the i-th line.
- `annotations/`: folder containing annotations in CSV format. Each CSV file contains 827 lines (plus the header), and the i-th line corresponds to the i-th tracing in `ecg_tracings.hdf5` in all CSV files. The CSV files all have 6 columns, `1dAVb, RBBB, LBBB, SB, AF, ST`, indicating whether the annotator detected the abnormality in the ECG (`=1`) or not (`=0`).
  1. `cardiologist[1,2].csv` contain annotations from two different cardiologists.
  2. `gold_standard.csv` is the gold-standard annotation for this test dataset. When cardiologist 1 and cardiologist 2 agreed, the common diagnosis was taken as the gold standard. In cases of disagreement, a third senior specialist, aware of the annotations from the other two, decided the diagnosis.
  3. `dnn.csv` contains predictions from the deep neural network described in "Automatic Diagnosis of the Short-Duration 12-Lead ECG using a Deep Neural Network". The threshold is set so that it maximizes the F1 score.
  4. `cardiology_residents.csv` contains annotations from two 4th-year cardiology residents (each annotated half of the dataset).
  5. `emergency_residents.csv` contains annotations from two 3rd-year emergency residents (each annotated half of the dataset).
  6. `medical_students.csv` contains annotations from two 5th-year medical students (each annotated half of the dataset).
License: https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html
Replication pack, FSE2018 submission #164:
------------------------------------------
**Working title:** Ecosystem-Level Factors Affecting the Survival of Open-Source Projects: A Case Study of the PyPI Ecosystem

**Note:** link to data artifacts is already included in the paper. Link to the code will be included in the Camera Ready version as well.

Content description
===================

- **ghd-0.1.0.zip** - the code archive. This code produces the dataset files described below
- **settings.py** - settings template for the code archive.
- **dataset_minimal_Jan_2018.zip** - the minimally sufficient version of the dataset. This dataset only includes stats aggregated by the ecosystem (PyPI)
- **dataset_full_Jan_2018.tgz** - full version of the dataset, including project-level statistics. It is ~34GB unpacked. This dataset still doesn't include PyPI packages themselves, which take around 2TB.
- **build_model.r, helpers.r** - R files to process the survival data (`survival_data.csv` in **dataset_minimal_Jan_2018.zip**, `common.cache/survival_data.pypi_2008_2017-12_6.csv` in **dataset_full_Jan_2018.tgz**)
- **Interview protocol.pdf** - approximate protocol used for semi-structured interviews.
- **LICENSE** - text of GPL v3, under which this dataset is published
- **INSTALL.md** - replication guide (~2 pages)
Replication guide
=================

Step 0 - prerequisites
----------------------

- Unix-compatible OS (Linux or OS X)
- Python interpreter (2.7 was used; Python 3 compatibility is highly likely)
- R 3.4 or higher (3.4.4 was used, 3.2 is known to be incompatible)

Depending on the level of detail (see Step 2 for more details):
- up to 2TB of disk space (see Step 2 detail levels)
- at least 16GB of RAM (64 preferable)
- a few hours to a few months of processing time

Step 1 - software
----------------

- unpack **ghd-0.1.0.zip**, or clone from gitlab:

      git clone https://gitlab.com/user2589/ghd.git
      git checkout 0.1.0

  `cd` into the extracted folder. All commands below assume it as the current directory.
- copy `settings.py` into the extracted folder. Edit the file:
  * set `DATASET_PATH` to some newly created folder path
  * add at least one GitHub API token to `SCRAPER_GITHUB_API_TOKENS`
- install docker. For Ubuntu Linux, the command is `sudo apt-get install docker-compose`
- install libarchive and headers: `sudo apt-get install libarchive-dev`
- (optional) to replicate on NPM, install yajl: `sudo apt-get install yajl-tools`. Without this dependency, you might get an error on the next step, but it's safe to ignore.
- install Python libraries: `pip install --user -r requirements.txt`
- disable all APIs except GitHub (Bitbucket and Gitlab support were not yet implemented when this study was in progress): edit `scraper/init.py`, comment out everything except GitHub support in `PROVIDERS`.

Step 2 - obtaining the dataset
-----------------------------

The ultimate goal of this step is to get the output of the Python function `common.utils.survival_data()` and save it into a CSV file:

    # copy and paste into a Python console
    from common import utils
    survival_data = utils.survival_data('pypi', '2008', smoothing=6)
    survival_data.to_csv('survival_data.csv')

Since full replication will take several months, here are some ways to speed up the process:

#### Option 2.a, difficulty level: easiest

Just use the precomputed data. Step 1 is not necessary under this scenario.
- extract **dataset_minimal_Jan_2018.zip**
- get `survival_data.csv`, go to the next step

#### Option 2.b, difficulty level: easy

Use precomputed longitudinal feature values to build the final table. The whole process will take 15-30 minutes.
- create a folder `
The following data shows riding information for members vs casual riders at the company Cyclistic (a made-up name). This is a dataset used as a case study for the Google Data Analytics certificate.
The Changes Done to the Data in Excel:
- Removed all duplicates (none were found)
- Added a ride_length column by subtracting started_at from ended_at using the formula "=C2-B2", then formatted the result as Time (37:30:55)
- Added a day_of_week column using the formula "=WEEKDAY(B2,1)" to show the day the ride took place, 1 = Sunday through 7 = Saturday
- Some cells display as ########; that data was left unchanged. These cells represent negative values and should simply be treated as 0.
Processing the Data in RStudio (an R sketch of steps 1-3 follows this list):
- Installed the required packages: tidyverse for data import and wrangling, lubridate for date functions, and ggplot2 for visualization.
- Step 1: Read the CSV files into R to collect the data
- Step 2: Checked that the files all contained the same column names, since I wanted to merge them into one
- Step 3: Renamed columns so they align, then merged the files into one combined data frame
- Step 4: Further data cleaning and analysis
- Step 5: Once the data was cleaned and clearly telling a story, I began to visualize it. The visualizations can be seen below.
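A minimal sketch of steps 1-3, assuming the monthly ride files follow a naming pattern like "202301-tripdata.csv" (the file names here are hypothetical):

```r
library(tidyverse)

# Step 1: read each monthly CSV into a list of data frames
files <- list.files(pattern = "tripdata\\.csv$")
rides <- lapply(files, read_csv)

# Step 2: confirm every file has the same column names before merging
stopifnot(all(sapply(rides, function(df) identical(names(df), names(rides[[1]])))))

# Step 3: merge into one combined data frame
all_rides <- bind_rows(rides)
```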
License: https://spdx.org/licenses/CC0-1.0.html
Pathogen diversity resulting in quasispecies can enable persistence and adaptation to host defenses and therapies. However, accurate quasispecies characterization can be impeded by errors introduced during sample handling and sequencing, which can require extensive optimizations to overcome. We present complete laboratory and bioinformatics workflows to overcome many of these hurdles. The Pacific Biosciences single molecule real-time platform was used to sequence PCR amplicons derived from cDNA templates tagged with universal molecular identifiers (SMRT-UMI). Optimized laboratory protocols were developed through extensive testing of different sample preparation conditions to minimize between-template recombination during PCR, and the use of UMIs allowed accurate template quantitation as well as removal of point mutations introduced during PCR and sequencing to produce a highly accurate consensus sequence from each template. Handling of the large datasets produced from SMRT-UMI sequencing was facilitated by a novel bioinformatic pipeline, Probabilistic Offspring Resolver for Primer IDs (PORPIDpipeline), that automatically filters and parses reads by sample, identifies and discards reads with UMIs likely created from PCR and sequencing errors, generates consensus sequences, checks for contamination within the dataset, and removes any sequence with evidence of PCR recombination or early cycle PCR errors, resulting in highly accurate sequence datasets. The optimized SMRT-UMI sequencing method presented here represents a highly adaptable and established starting point for accurate sequencing of diverse pathogens. These methods are illustrated through characterization of human immunodeficiency virus (HIV) quasispecies.
Methods
This serves as an overview of the analysis performed on PacBio sequence data that is summarized in Analysis Flowchart.pdf and was used as primary data for the paper by Westfall et al. "Optimized SMRT-UMI protocol produces highly accurate sequence datasets from diverse populations – application to HIV-1 quasispecies"
Five different PacBio sequencing datasets were used for this analysis: M027, M2199, M1567, M004, and M005
For the datasets which were indexed (M027, M2199), CCS reads from PacBio sequencing files and the chunked_demux_config files were used as input for the chunked_demux pipeline. Each config file lists the different Index primers added during PCR to each sample. The pipeline produces one fastq file for each Index primer combination in the config. For example, in dataset M027 there were 3–4 samples using each Index combination. The fastq files from each demultiplexed read set were moved to the sUMI_dUMI_comparison pipeline fastq folder for further demultiplexing by sample and consensus generation with that pipeline. More information about the chunked_demux pipeline can be found in the README.md file on GitHub.
The demultiplexed read collections from the chunked_demux pipeline or CCS read files from datasets which were not indexed (M1567, M004, M005) were each used as input for the sUMI_dUMI_comparison pipeline along with each dataset's config file. Each config file contains the primer sequences for each sample (including the sample ID block in the cDNA primer) and further demultiplexes the reads to prepare data tables summarizing all of the UMI sequences and counts for each family (tagged.tar.gz) as well as consensus sequences from each sUMI and rank 1 dUMI family (consensus.tar.gz). More information about the sUMI_dUMI_comparison pipeline can be found in the paper and the README.md file on GitHub.
The consensus.tar.gz and tagged.tar.gz files were moved from sUMI_dUMI_comparison pipeline directory on the server to the Pipeline_Outputs folder in this analysis directory for each dataset and appended with the dataset name (e.g. consensus_M027.tar.gz). Also in this analysis directory is a Sample_Info_Table.csv containing information about how each of the samples was prepared, such as purification methods and number of PCRs. There are also three other folders: Sequence_Analysis, Indentifying_Recombinant_Reads, and Figures. Each has an .Rmd file with the same name inside which is used to collect, summarize, and analyze the data. All of these collections of code were written and executed in RStudio to track notes and summarize results.
Sequence_Analysis.Rmd has instructions to decompress all of the consensus.tar.gz files, combine them, and create two fasta files, one with all sUMI and one with all dUMI sequences. Using these as input, two data tables were created that summarize all sequences and read counts for each sample passing various criteria. These are used to help create Table 2 and as input for Indentifying_Recombinant_Reads.Rmd and Figures.Rmd. Next, two fasta files containing all of the rank 1 dUMI sequences and the matching sUMI sequences were created. These were used as input for the python script compare_seqs.py, which identifies any matched sequences that differ between the sUMI and dUMI read collections. This information was also used to help create Table 2. Finally, to populate the table with the number of sequences and bases in each sequence subset of interest, the different sequence collections were saved and viewed in the Geneious program.
To investigate the cause of sequences where the sUMI and dUMI sequences do not match, tagged.tar.gz was decompressed, and for each family with discordant sUMI and dUMI sequences the reads from the UMI1_keeping directory were aligned using Geneious. Reads from dUMI families failing the 0.7 filter were also aligned in Geneious. The uncompressed tagged folder was then removed to save space. These read collections contain all of the reads in a UMI1 family and still include the UMI2 sequence. By examining the alignment, and specifically the UMI2 sequences, the site of the discordance and its cause were identified for each family as described in the paper. These alignments were saved as "Sequence Alignments.geneious". The counts of how many families were the result of PCR recombination were used in the body of the paper.
Using Identifying_Recombinant_Reads.Rmd, the dUMI_ranked.csv file from each sample was extracted from all of the tagged.tar.gz files, combined and used as input to create a single dataset containing all UMI information from all samples. This file dUMI_df.csv was used as input for Figures.Rmd.
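A minimal sketch of that combination step, assuming each tagged tarball was renamed with its dataset suffix (e.g. tagged_M027.tar.gz) and unpacks into per-sample folders each containing a dUMI_ranked.csv (the exact directory layout is an assumption):

```r
# combine dUMI_ranked.csv files from all samples into one table
tarballs <- list.files("Pipeline_Outputs", pattern = "^tagged_.*\\.tar\\.gz$", full.names = TRUE)
for (tb in tarballs) untar(tb, exdir = "tagged_tmp")

ranked_files <- list.files("tagged_tmp", pattern = "dUMI_ranked\\.csv$",
                           recursive = TRUE, full.names = TRUE)
dUMI_list <- lapply(ranked_files, function(f) {
  df <- read.csv(f)
  df$sample <- basename(dirname(f))  # track which sample each row came from
  df
})
dUMI_df <- do.call(rbind, dUMI_list)
write.csv(dUMI_df, "dUMI_df.csv", row.names = FALSE)
```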
Figures.Rmd used dUMI_df.csv, sequence_counts.csv, and read_counts.csv as input to create draft figures and then individual datasets for each figure. These were copied into Prism software to create the final figures for the paper.
Welcome to my Kickstarter case study! In this project I’m trying to understand what the success factors for a Kickstarter campaign are, analyzing a publicly available dataset from Web Robots. The analysis will follow the data analysis roadmap: ASK, PREPARE, PROCESS, ANALYZE, SHARE and ACT.
ASK
Different questions will guide my analysis:
1. Does the campaign duration influence the success of the project?
2. Does the chosen funding goal?
3. Which category of campaign is most likely to be successful?
PREPARE
I’m using the Kickstarter datasets publicly available on Web Robots. Data are scraped by a bot that collects them in CSV format once a month, divided into CSV files. Each table contains:
- backers_count: number of people who contributed to the campaign
- blurb: a captivating text description of the project
- category: the label categorizing the campaign (technology, art, etc.)
- country
- created_at: day and time of campaign creation
- deadline: day and time of the campaign's latest possible end
- goal: amount to be collected
- launched_at: date and time of campaign launch
- name: name of the campaign
- pledged: amount of money collected
- state: success or failure of the campaign
Each month's scraping produces a huge number of CSVs, so for an initial analysis I decided to focus on three months: November and December 2023, and January 2024. I downloaded zipped files which, once unzipped, contained respectively 7 CSVs (November 2023), 8 CSVs (December 2023), and 8 CSVs (January 2024). Each month's files were placed in their own folder.
Having a first look at the spreadsheets, it’s clear that some cleaning and modification is needed: for example, dates and times are stored as Unix timestamps, multiple columns are not helpful for the scope of my analysis, and currencies need to be standardized (some are US$, some GB£, etc.). In general, I have all the data I need to answer my initial questions, identify trends, and make predictions.
PROCESS
I decided to use R to clean and process the data. For each month I set up a new working environment in its own folder, loading the necessary libraries first:
```r
library(tidyverse)
library(lubridate)
library(ggplot2)
library(dplyr)  # already attached by tidyverse, loaded here explicitly
library(tidyr)  # already attached by tidyverse, loaded here explicitly
```
I wrote a general R script that searches for CSV files in the folder, opens each one as a separate variable, and collects them into a single list of data frames:
```r
csv_files <- list.files(pattern = "\\.csv$")
data_frames <- list()
for (file in csv_files) {
  variable_name <- sub("\\.csv$", "", file)           # strip the extension to get a variable name
  assign(variable_name, read.csv(file))               # one variable per CSV
  data_frames[[variable_name]] <- get(variable_name)  # and the same table in a named list
}
```
Next, I converted some columns to numeric values because I was running into type errors when trying to merge all the CSVs into a single comprehensive file.
```r
# convert the problematic columns to numeric in every data frame
numeric_cols <- c("converted_pledged_amount", "usd_exchange_rate", "usd_pledged")
data_frames <- lapply(data_frames, function(df) {
  df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)
  df
})
```
In each folder I then ran a command to merge the CSVs into a single file (one for November 2023, one for December 2023, and one for January 2024):
```r
all_nov_2023 <- bind_rows(data_frames)  # run in the November 2023 folder
all_dec_2023 <- bind_rows(data_frames)  # run in the December 2023 folder
all_jan_2024 <- bind_rows(data_frames)  # run in the January 2024 folder
```
After merging, I converted the Unix timestamps into readable datetimes for the “created_at”, “launched_at”, and “deadline” columns and dropped all the rows that had any of these fields set to 0. I also parsed the “slug” value out of the “category” column to keep only the campaign category, without unnecessary information for the scope of my analysis. The final table was then saved.
```r
filtered_dec_2023 <- all_dec_2023 %>%  # this was modified according to the considered month
  select(blurb, backers_count, category, country, created_at, launched_at, deadline,
         currency, usd_exchange_rate, goal, pledged, state) %>%
  filter(created_at != 0 & deadline != 0 & launched_at != 0) %>%
  mutate(category_slug = sub('.*?"slug":"(.*?)".*', '\\1', category)) %>%
  mutate(created = as.POSIXct(created_at, origin = "1970-01-01")) %>%
  mutate(launched = as.POSIXct(launched_at, origin = "1970-01-01")) %>%
  mutate(setted_deadline = as.POSIXct(deadline, origin = "1970-01-01")) %>%
  select(-category, -deadline, -launched_at, -created_at) %>%
  relocate(created, launched, setted_deadline, .before = goal)

write.csv(filtered_dec_2023, "filtered_dec_2023.csv", row.names = FALSE)
```
The three generated files were then merged into one comprehensive CSV called "kickstarter_cleaned", which was further modified, converting a...
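A minimal sketch of that final merge, assuming the three filtered CSVs produced above sit in one folder (the November and January file names follow the pattern of the December one shown earlier):

```r
library(tidyverse)

# read the three monthly filtered files and stack them into one table
monthly <- lapply(c("filtered_nov_2023.csv", "filtered_dec_2023.csv", "filtered_jan_2024.csv"),
                  read_csv)
kickstarter_cleaned <- bind_rows(monthly)
write_csv(kickstarter_cleaned, "kickstarter_cleaned.csv")
```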
License: https://opensource.org/licenses/BSD-2-Clause
This data repository feeds into the meta-repository set up for post-processing of GCAM-SAM outputs. The GitHub link of the meta-repository is: https://github.com/JGCRI/Kyle-etal_2022_EF
Folders:
- model/ is the static version of the model used to simulate 8 scenarios. See the GitHub GCAM-SAM repository to follow active development of this model.
- inputs/ contains input datasets and scripts used to prepare files during post-processing. This is to be used with the GitHub post-processing meta-repository.
- outdata/ contains GCAM-SAM output and post-processed output files used to plot figures.
Key files: SAM-matrix.dat is the consolidated GCAM-SAM output; use proj_load.R in the meta-repository to read the file. region_vals.csv has all 8 indicators in all 8 scenarios for years 2020 through 2100 at a 10-year time step.
Short introduction to the study:
In this paper, the sustainable agriculture matrix (SAM) is projected to 2100 using the Global Change Analysis Model (GCAM). We model combinatorial variations of yield intensification, dietary shift, and greenhouse gas mitigation scenarios. We find that several scenarios carry significant tradeoffs across multiple environmental, economic, and social dimensions. Assessing these multi-dimensional tradeoffs in a consistent framework improves the quality of information available for decision-making.
Should you have any questions, feel free to reach out to Page Kyle at pkyle@pnnl.gov.
To make this a seamless process, I cleaned the data and deleted many variables that I thought were not important to our dataset. I then uploaded all of those files to Kaggle for each of you to download. The rideshare_data file has both Lyft and Uber, but it is still a cleaned version of the dataset we downloaded from Kaggle.
You can easily subset the data into the car types that you will be modeling by first loading the CSV into R. Here is example code (note that the data must be read into the same variable that is later subset):

```r
uber_df <- read.csv("uber.csv")               # load the cleaned Uber data
df_black <- subset(uber_df, name == "Black")  # keep only the "Black" car type
write.csv(df_black, "nameofthefileyouwanttosaveas.csv")
getwd()  # shows the folder the file was written to
```
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This resource contains the data and scripts used for: Goeking, S. A. and D. G. Tarboton (2022). Spatially distributed overstory and understory leaf area index estimated from forest inventory data. Water. https://doi.org/10.3390/w1415241.
Abstract from the paper: Forest change affects the relative magnitudes of hydrologic fluxes such as evapotranspiration (ET) and streamflow. However, much is unknown about the sensitivity of streamflow response to forest disturbance and recovery. Several physically based models recognize the different influences that overstory versus understory canopies exert on hydrologic processes, yet most input datasets consist of total leaf area index (LAI) rather than individual canopy strata. Here, we developed stratum-specific LAI datasets with the intent of improving the representation of vegetation for ecohydrologic modeling. We applied three pre-existing methods for estimating overstory LAI, and one new method for estimating both overstory and understory LAI, to measurements collected from a probability-based plot network established by the US Forest Service’s Forest Inventory and Analysis (FIA) program, for a modeling domain in Montana, USA. We then combined plot-level LAI estimates with spatial datasets (i.e., biophysical and remote sensing predictors) in a machine learning algorithm (random forests) to produce annual gridded LAI datasets. Methods that estimate only overstory LAI tended to underestimate LAI relative to Landsat-based LAI (mean bias error ≥ 0.83), while the method that estimated both overstory and understory layers was most strongly correlated with Landsat-based LAI (r2 = 0.80 for total LAI, with mean bias error of -0.99). During 1984-2019, interannual variability of understory LAI exceeded that for overstory LAI; this variability may affect partitioning of precipitation to ET vs. runoff at annual timescales. We anticipate that distinguishing overstory and understory components of LAI will improve the ability of LAI-based models to simulate how forest change influences hydrologic processes.
This resource contains one CSV file, two shapefiles (each within a zip file), two R scripts, and multiple raster datasets. The two shapefiles represent the boundaries of the Middle Fork Flathead river and South Fork Flathead River watersheds. The raster datasets represent annual leaf area index (LAI) at 30 m resolution for the entire modeling domain used in this study. LAI was estimated using method LAI4, which produced separate overstory and understory LAI datasets. Filenames contain years, e.g., "LAI4_2019" is overstory LAI for 2019; "LAI4under_2019" is understory LAI for 2019.
The CSV files in this Resource contain annual time series of LAI and ET ratio (annual evapotranspiration divided by annual precipitation) for the South Fork Flathead River and Middle Fork Flathead River watersheds, 1984-2019. LAI methods represented in this time series are LAI1 and LAI4 from the paper. LAI1 consists of only overstory LAI, and LAI4 consists of overstory (LAI4), understory (LAI4_under), and total (LAI4_total) LAI. For each LAI estimation method, summary statistics of the entire watershed are included (min, first quartile, median, third quartile, and max).
The two R scripts (R is a language and environment for statistical computing) summarize Forest Inventory & Analysis (FIA) data from the FIA database (FIADB) to estimate LAI at FIA plots: 1) FIADB_queries_public.r: script for compiling FIA plot measurements prior to estimating LAI; 2) LAI_estimation_public: script for estimating LAI at FIA plots using the four methods described in this paper.
Before running the R scripts, users must obtain several FIADB tables (PLOT, COND, TREE, and P2VEG_SUBP_STRUCTURE; all four tables must be renamed with lower-case names, e.g., "plot"). These tables can be obtained using one of two methods: 1) by downloading CSV files for the appropriate U.S. state(s) from the FIA DataMart (https://apps.fs.usda.gov/fia/datamart/datamart.html); if this method is used, the CSV files must be imported (read) into R before proceeding; or 2) by using the R package 'rFIA' to download the tables from FIADB for the U.S. state(s) of interest.
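A minimal sketch of method 1, assuming the DataMart CSVs for the state of interest have been downloaded into the working directory and follow the usual STATE_TABLE.csv naming (the state abbreviation here is only an example):

```r
# read the four FIADB tables and give them the required lower-case names
plot <- read.csv("MT_PLOT.csv")
cond <- read.csv("MT_COND.csv")
tree <- read.csv("MT_TREE.csv")
p2veg_subp_structure <- read.csv("MT_P2VEG_SUBP_STRUCTURE.csv")
```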
Note that publicly available plot coordinates are accurate within 1 km and are not true plot locations, which are legally confidential to protect the integrity of the sample locations and the privacy of landowners. Access to true plot location data requires review by FIA's Spatial Data Services unit, who can be contacted at SM.FS.RMRSFIA_Help@usda.gov.
License: https://creativecommons.org/publicdomain/zero/1.0/
The objective behind attempting this dataset was to understand the predictors that contribute to life expectancy around the world. I used Linear Regression, Decision Tree, and Random Forest models for this purpose. Steps involved:
1) Read the CSV file.
2) Data cleaning:
- The variables Country and Status had character data types and had to be converted to factors.
- 2563 missing values were encountered, with the Population variable having the most missing values (652).
- Rows with missing values were dropped before running the analysis.
3) Run linear regression (an R sketch of the multicollinearity check follows this list):
- Before running linear regression, 3 variables were dropped because they did not have much of an effect on the dependent variable, Life Expectancy: Country, Year, and Status. This left 19 variables (1 dependent and 18 independent).
- We run the linear regression. Multiple R-squared is 83%, which means the independent variables can explain 83% of the variance in the dependent variable.
- OUTLIER DETECTION: We check for outliers using the IQR and find 54 outliers. These outliers are removed before we run the regression analysis again. Multiple R-squared increases from 83% to 86%.
- MULTICOLLINEARITY: We check for multicollinearity using VIF (Variance Inflation Factor), since two or more independent variables may show high correlation. The rule of thumb is that variables with absolute VIF values above 5 should be removed. We find 6 such variables: Infant.deaths, percentage.expenditure, Under.five.deaths, GDP, thinness1.19, and thinness5.9. Infant deaths and Under-five deaths have strong collinearity, so we drop Infant.deaths (which has the higher VIF value).
- When we run the linear regression model again, the VIF value of Under.five.deaths drops from 211.46 to 2.74, while the other variables' VIF values decrease only slightly. The variable thinness1.19 is then dropped and the regression run once more.
- The variable thinness5.9, whose absolute VIF value was 7.61, has now dropped to 1.95. GDP and Population still have VIF values above 5, but I decided against dropping them as I consider them important independent variables.
- SET THE SEED AND SPLIT THE DATA INTO TRAIN AND TEST SETS: On the training data we get a multiple R-squared of 86% and a p-value less than alpha, which means the model is statistically significant. We use the trained model to predict the test data to find the RMSE and MAPE, using library(Metrics) for this purpose.
- In linear regression, RMSE (Root Mean Squared Error) is 3.2. This indicates that on average, the predicted values have an error of 3.2 years compared to the actual life expectancy values.
- MAPE (Mean Absolute Percentage Error) is 0.037. This indicates a prediction accuracy of 96.3% (1 - 0.037).
- MAE (Mean Absolute Error) is 2.55. This indicates that on average, the predicted values deviate by approximately 2.55 years from the actual values.
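A minimal sketch of the VIF-based multicollinearity check described above, assuming the cleaned data frame is called life_exp and the dependent column is named Life.expectancy (both names are assumptions):

```r
library(car)  # provides vif()

# fit the full linear model on the cleaned data
model <- lm(Life.expectancy ~ ., data = life_exp)

# variance inflation factors; values above 5 flag multicollinearity
v <- vif(model)
print(sort(v, decreasing = TRUE))

# drop the worst offender and refit
life_exp2 <- subset(life_exp, select = -Infant.deaths)
model2 <- lm(Life.expectancy ~ ., data = life_exp2)
vif(model2)
```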
Conclusion: Random Forest is the best model for predicting the life expectancy values as it has the lowest RMSE, MAPE and MAE.
This dataset contains files reconstructing single-cell data presented in 'Reference transcriptomics of porcine peripheral immune cells created through bulk and single-cell RNA sequencing' by Herrera-Uribe & Wiarda et al. 2021. Samples of peripheral blood mononuclear cells (PBMCs) were collected from seven pigs and processed for single-cell RNA sequencing (scRNA-seq) in order to provide a reference annotation of porcine immune cell transcriptomics at enhanced, single-cell resolution. Analysis of single-cell data allowed identification of 36 cell clusters that were further classified into 13 cell types, including monocytes, dendritic cells, B cells, antibody-secreting cells, numerous populations of T cells, NK cells, and erythrocytes. Files may be used to reconstruct the data as presented in the manuscript, allowing for individual query by other users. Scripts for original data analysis are available at https://github.com/USDA-FSEPRU/PorcinePBMCs_bulkRNAseq_scRNAseq. Raw data are available at https://www.ebi.ac.uk/ena/browser/view/PRJEB43826.
Funding for this dataset was also provided by NRSP8: National Animal Genome Research Program (https://www.nimss.org/projects/view/mrp/outline/18464).
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells 10X Format.
File Name: PBMC7_AllCells.zip
Resource Description: Zipped folder containing PBMC counts matrix, gene names, and cell IDs. Files are as follows:
*The ‘raw’ count matrix is actually gene counts obtained following ambient RNA removal. During ambient RNA removal, we specified to calculate non-integer count estimations, so most gene counts are actually non-integer values in this matrix but should still be treated as raw/unnormalized data that requires further normalization/transformation.
Data can be read into R using the function Read10X().
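Read10X() is provided by the Seurat package. A minimal sketch, assuming the zip has been extracted into a folder named PBMC7_AllCells/ containing the matrix, features, and barcodes files (the folder name is an assumption):

```r
library(Seurat)

# load the 10X-format counts (see note above: values may be non-integer
# because ambient RNA removal produced count estimates)
counts <- Read10X(data.dir = "PBMC7_AllCells/")
pbmc <- CreateSeuratObject(counts = counts)
```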
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells Metadata.
File Name: PBMC7_AllCells_meta.csv
Resource Description: .csv file containing metadata for cells included in the final dataset. Metadata columns include:
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells PCA Coordinates.
File Name: PBMC7_AllCells_PCAcoord.csv
Resource Description: .csv file containing first 100 PCA coordinates for cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells t-SNE Coordinates.
File Name: PBMC7_AllCells_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells UMAP Coordinates.
File Name: PBMC7_AllCells_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for all cells.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells t-SNE Coordinates.
File Name: PBMC7_CD4only_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - CD4 T Cells UMAP Coordinates.
File Name: PBMC7_CD4only_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only CD4 T cells (clusters 0, 3, 4, 28). A dataset of only CD4 T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells UMAP Coordinates.
File Name: PBMC7_GDonly_UMAPcoord.csv
Resource Description: .csv file containing UMAP coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and UMAP coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gamma Delta T Cells t-SNE Coordinates.
File Name: PBMC7_GDonly_tSNEcoord.csv
Resource Description: .csv file containing t-SNE coordinates for only gamma delta T cells (clusters 6, 21, 24, 31). A dataset of only gamma delta T cells can be re-created from the PBMC7_AllCells.h5Seurat, and t-SNE coordinates used in publication can be re-assigned using this .csv file.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - Gene Annotation Information.
File Name: UnfilteredGeneInfo.txt
Resource Description: .txt file containing gene nomenclature information used to assign gene names in the dataset. 'Name' column corresponds to the name assigned to a feature in the dataset.
Resource Title: Herrera-Uribe & Wiarda et al. PBMCs - All Cells H5Seurat.
File Name: PBMC7.tar
Resource Description: .h5Seurat object of all cells in the PBMC dataset. The file needs to be untarred, then read into R using the function LoadH5Seurat().
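LoadH5Seurat() comes from the SeuratDisk package. A minimal sketch, assuming the tarball contains a file named PBMC7_AllCells.h5Seurat (the exact file name inside the archive is an assumption):

```r
library(SeuratDisk)

untar("PBMC7.tar")  # extract the .h5Seurat file
pbmc <- LoadH5Seurat("PBMC7_AllCells.h5Seurat")
```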
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Overview
This dataset is the repository for the following paper submitted to Data in Brief:
Kempf, M. A dataset to model Levantine landcover and land-use change connected to climate change, the Arab Spring and COVID-19. Data in Brief (submitted: December 2023).
The Data in Brief article contains the supplement information and is the related data paper to:
Kempf, M. Climate change, the Arab Spring, and COVID-19 - Impacts on landcover transformations in the Levant. Journal of Arid Environments (revision submitted: December 2023).
Description/abstract
The Levant region is highly vulnerable to climate change, experiencing prolonged heat waves that have led to societal crises and population displacement. Since 2010, the area has been marked by socio-political turmoil, including the Syrian civil war and, currently, the escalation of the so-called Israeli-Palestinian Conflict, which has strained neighbouring countries like Jordan due to the influx of Syrian refugees and increased population vulnerability to governmental decision-making. Jordan, in particular, has seen rapid population growth and significant changes in land-use and infrastructure, leading to over-exploitation of the landscape through irrigation and construction. This dataset uses climate data, satellite imagery, and land cover information to illustrate the substantial increase in construction activity and highlights the intricate relationship between climate change predictions and current socio-political developments in the Levant.
Folder structure
The main folder after download contains all data; the following subfolders are stored as zipped files:
“code” stores the 9 code chunks described below, which read, extract, process, analyse, and visualize the data.
“MODIS_merged” contains the 16-days, 250 m resolution NDVI imagery merged from three tiles (h20v05, h21v05, h21v06) and cropped to the study area, n=510, covering January 2001 to December 2022 and including January and February 2023.
“mask” contains a single shapefile, which is the merged product of administrative boundaries, including Jordan, Lebanon, Israel, Syria, and Palestine (“MERGED_LEVANT.shp”).
“yield_productivity” contains .csv files of yield information for all countries listed above.
“population” contains two files with the same name but different format. The .csv file is for processing and plotting in R. The .ods file is for enhanced visualization of population dynamics in the Levant (Socio_cultural_political_development_database_FAO2023.ods).
“GLDAS” stores the raw data of the NASA Global Land Data Assimilation System datasets that can be read, extracted (variable name), and processed using code “8_GLDAS_read_extract_trend” from the respective folder. One folder contains data from 1975-2022 and a second the additional January and February 2023 data.
“built_up” contains the landcover and built-up change data from 1975 to 2022. This folder is subdivided into two subfolders which contain the raw data and the already processed data. “raw_data” contains the unprocessed datasets and “derived_data” stores the cropped built_up datasets at 5 year intervals, e.g., “Levant_built_up_1975.tif”.
Code structure
1_MODIS_NDVI_hdf_file_extraction.R
This is the first code chunk and refers to the extraction of MODIS data from the .hdf file format. The following packages must be installed, and the raw data must be downloaded using a simple mass downloader, e.g., from Google Chrome. Packages: terra. Download MODIS data, after registration, from: https://lpdaac.usgs.gov/products/mod13q1v061/ or https://search.earthdata.nasa.gov/search (MODIS/Terra Vegetation Indices 16-Day L3 Global 250m SIN Grid V061, last accessed 9th of October 2023). The code reads a list of files, extracts the NDVI, and saves each file to a single .tif file with the indication “NDVI”. Because the study area is quite large, we have to load three different (spatially distinct) time series and merge them later. Note that the time series are temporally consistent.
2_MERGE_MODIS_tiles.R
In this code, we load and merge the three different stacks to produce a large and consistent time series of NDVI imagery across the study area. We further use the package gtools to load the files in order (1, 2, 3, 4, 5, 6, etc.). Here, we have three stacks, of which we merge the first two (stack 1, stack 2) and store the result. We then merge this stack with stack 3. We produce single files named NDVI_final_*consecutivenumber*.tif. Before saving the final output of single merged files, create a folder called “merged” and set the working directory to this folder, e.g., setwd("your directory_MODIS/merged").
3_CROP_MODIS_merged_tiles.R
Now we want to crop the derived MODIS tiles to our study area. We use a mask, provided as a .shp file in the repository, named "MERGED_LEVANT.shp". We load the merged .tif files and crop the stack with the vector. Saving to individual files, we name them “NDVI_merged_clip_*consecutivenumber*.tif”. We have now produced single cropped NDVI time series data from MODIS.
The repository provides the already clipped and merged NDVI datasets.
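A minimal sketch of the merge-and-crop steps (codes 2 and 3) using terra, assuming the per-tile GeoTIFFs follow the naming described above and sit in a single folder (the folder name is a placeholder):

```r
library(terra)
library(gtools)

# load the NDVI files in natural sort order (1, 2, 3, ..., 10, 11, ...)
files <- mixedsort(list.files("your_directory_MODIS", pattern = "NDVI.*\\.tif$",
                              full.names = TRUE))

# merge the first two tiles, then merge the result with the third (code 2)
m12  <- merge(rast(files[1]), rast(files[2]))
ndvi <- merge(m12, rast(files[3]))

# crop and mask to the study area outline (code 3)
levant <- vect("MERGED_LEVANT.shp")
ndvi_clip <- mask(crop(ndvi, levant), levant)
writeRaster(ndvi_clip, "NDVI_merged_clip_1.tif", overwrite = TRUE)
```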
4_TREND_analysis_NDVI.R
Now, we want to perform trend analysis on the derived data. The data we load is tricky, as it contains 16-day images across each year for a period of 22 years. Growing season sums contain MAM (March-May), JJA (June-August), and SON (September-November). December is represented as a single file, which means that the period DJF (December-February) is represented by 5 images instead of 6. For the last DJF period (December 2022), the data from January and February 2023 can be added. The code selects the respective images from the stack, depending on which period is under consideration. From these stacks, individual annually resolved growing season sums are generated and the slope is calculated. We can then extract the p-values of the trend and flag all values at a high confidence level (0.05). Using the ggplot2 package and the melt function from the reshape2 package, we can create a plot of the reclassified NDVI trends together with a local smoother (LOESS) of value 0.3.
To increase comparability and understand the amplitude of the trends, z-scores were calculated and plotted, which show the deviation of the values from the mean. This has been done for the NDVI values as well as the GLDAS climate variables as a normalization technique.
5_BUILT_UP_change_raster.R
Let us look at the landcover changes now. We are working with the terra package and get raster data from: https://ghsl.jrc.ec.europa.eu/download.php?ds=bu (last accessed 3 March 2023, 100 m resolution, global coverage). One can download the desired temporal coverage and reclassify it using the code, after cropping to the individual study area. Here, I summed up different rasters to characterize built-up change in continuous values between 1975 and 2022.
6_POPULATION_numbers_plot.R
For this plot, one needs to load the .csv-file “Socio_cultural_political_development_database_FAO2023.csv” from the repository. The ggplot script provided produces the desired plot with all countries under consideration.
7_YIELD_plot.R
In this section, we use the country productivity data from the supplement in the repository folder “yield_productivity” (e.g., "Jordan_yield.csv"). Each single-country yield dataset is plotted with ggplot and the plots are combined using the patchwork package in R.
8_GLDAS_read_extract_trend
The last code chunk provides the basis for the trend analysis of the climate variables used in the paper. The raw data can be accessed at https://disc.gsfc.nasa.gov/datasets?keywords=GLDAS%20Noah%20Land%20Surface%20Model%20L4%20monthly&page=1 (last accessed 9th of October 2023). The raw data comes in .nc file format, and various variables can be extracted using the [“^a variable name”] command on the spatraster collection. Each time you run the code, this variable name must be adjusted to match the desired variable (see this link for abbreviations: https://disc.gsfc.nasa.gov/datasets/GLDAS_CLSM025_D_2.0/summary, last accessed 9th of October 2023; or the respective code chunk when reading a .nc file with the ncdf4 package in R), or run print(nc) from the code, or use names() on the spatraster collection.
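A minimal sketch of that extraction step, using a hypothetical GLDAS file name and the variable prefix "Tair" purely as examples:

```r
library(terra)

# read one GLDAS .nc file and subset a variable by layer-name prefix
r <- rast("GLDAS_NOAH025_M.nc")        # hypothetical file name
print(names(r))                        # inspect available variable names
tair <- r[[grep("^Tair", names(r))]]   # e.g. the air temperature layers
```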
Choosing one variable, the code uses the MERGED_LEVANT.shp mask from the repository to crop and mask the data to the outline of the study area.
From the processed data, trend analyses are conducted and z-scores calculated following the code described above. However, annual trends require the frequency of the time series analysis to be set to value = 12. For, e.g., rainfall, which is measured as annual sums and not means, the chunk r.sum=r.sum/12 has to be removed or set to r.sum=r.sum/1 to avoid calculating annual mean values (see other variables). Seasonal subsets can be calculated as described in the code. Here, 3-month subsets were chosen for growing seasons, e.g., March-May (MAM), June-August (JJA), September-November (SON), and December-February (DJF, including Jan/Feb of the consecutive year).
From the data, mean values of 48 consecutive years are calculated and trend analyses are performed as described above. In the same way, p-values are extracted and 95% confidence level values are marked with dots on the raster plot. This analysis can be performed with a much longer time series, other variables, and different spatial extents across the globe thanks to the availability of the GLDAS variables.
This data package is associated with the publication “Investigating the impacts of solid phase extraction on dissolved organic matter optical signatures and the pairing with high-resolution mass spectrometry data in a freshwater system” submitted to Limnology and Oceanography: Methods. This data is an extension of the River Corridor and Watershed Biogeochemistry SFA’s Spatial Study 2021 (https://doi.org/10.15485/1898914). Other associated data and field metadata can be found at the link provided.

The goal of this manuscript is to assess the impact of solid phase extraction (SPE) on the ability to pair ultra-high resolution mass spectrometry data collected from SPE extracts with optical properties collected on ambient stream samples. Forty-seven samples collected from within the Yakima River Basin, Washington were analyzed for dissolved organic carbon (DOC, measured as non-purgeable organic carbon, NPOC), absorbance, and fluorescence. Samples were subsequently concentrated with SPE and reanalyzed for each measurement. The extraction efficiency for the DOC and common optical indices was calculated. In addition, SPE samples were subjected to ultra-high resolution mass spectrometry and compared with the ambient and SPE-generated optical data. Finally, in addition to this cross-platform inter-comparison, we further performed an intra-comparison among the high-resolution mass spectrometry data to determine the impact of sample preparation on the interpretability of results. Here, the SPE samples were prepared at 40 milligrams per liter (mg/L) based on the known DOC extraction efficiency of the samples (ranging from ~30 to ~75%), compared to the common practice of assuming the DOC extraction efficiency of freshwater samples to be 60%.

This data package consists of one main data folder with one subfolder (Data_Input). The main data folder contains (1) a readme; (2) a data dictionary (dd); (3) file-level metadata (flmd); (4) the final data summary output from the processing script; and (5) the processing script. The R-markdown processing script (SPE_Manuscript_Rmarkdown_Data_Package.rmd) contains all code needed to reproduce manuscript statistics and figures (with the exception of that stated below). The Data_Input folder has two subfolders: (1) FTICR and (2) Optics. Additionally, the Data_Input folder contains dissolved organic carbon (NPOC) data (SPS_NPOC_Summary.csv) and relevant supporting solid phase extraction volume information (SPS_SPE_Volumes.csv). Methods information for the optical and FTICR data is embedded in the header rows of SPS_EEMs_Methods.csv and SPS_FTICR_Methods.csv, respectively. In addition, the data dictionary (SPS_SPE_dd.csv), file-level metadata (SPS_SPE_flmd.csv), and methods codes (SPS_SPE_Methods_codes.csv) are provided. The FTICR subfolder contains all raw FTICR data as well as instructions for processing. In addition, post-processed FTICR molecular information (Processed_FTICRMS_Mol.csv) and sample data (Processed_FTICRMS_Data.csv) are provided and can be directly read into R with the associated R-markdown file. The Optics subfolder contains all absorbance and fluorescence spectra. Fluorescence spectra have been blank corrected, inner-filter corrected, and undergone scatter removal. In addition, this folder contains Matlab code used to make a portion of Figure 1 within the manuscript, derive various spectral parameters used within the manuscript, and run parallel factor analysis (PARAFAC) modeling.
Spectral indices (SPS_SpectralIndices.csv) and PARAFAC outputs (SPS_PARAFAC_Model_Loadings.csv and SPS_PARAFAC_Sample_Scores.csv) are directly read into the associated R-markdown file.
License: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
Welcome to the NBA Statistics Repository for teams and players. This repository contains a rich and diverse dataset spanning from 1996 to 2023, drawn from NBA game statistics. It's ideal for data analysts, basketball fans, researchers, and anyone interested in the detailed numbers behind the sport.
This repository contains a series of CSV files detailing the performances of teams and players from 1996 to 2023. A list of these files is provided below:
- player_index.csv: An index of all players with general information.
- player_stats_advanced_po.csv and player_stats_advanced_rs.csv: Advanced statistics for players during playoffs (po) and regular season (rs).
- player_stats_defense_po.csv and player_stats_defense_rs.csv: Defensive statistics for players during the playoffs and regular season.
- player_stats_misc_po.csv and player_stats_misc_rs.csv: Miscellaneous player statistics for the playoffs and regular season.
- player_stats_scoring_po.csv and player_stats_scoring_rs.csv: Scoring statistics for players during the playoffs and regular season.
- player_stats_traditional_po.csv and player_stats_traditionnal_rs.csv: Traditional player statistics during the playoffs and regular season.
- player_stats_usage_po.csv and player_stats_usage_rs.csv: Player usage statistics during the playoffs and regular season.
- team_stats_advanced_po.csv and team_stats_advanced_rs.csv: Advanced team statistics during the playoffs and regular season.
- team_stats_defense_po.csv and team_stats_defense_rs.csv: Defensive team statistics during the playoffs and regular season.
- team_stats_four_factors_po.csv and team_stats_four_factors_rs.csv: Four factors team statistics during the playoffs and regular season.
- team_stats_misc_po.csv and team_stats_misc_rs.csv: Miscellaneous team statistics during the playoffs and regular season.
- team_stats_opponent_po.csv and team_stats_opponent_rs.csv: Team opponent statistics during the playoffs and regular season.
- team_stats_scoring_po.csv and team_stats_scoring_rs.csv: Scoring team statistics during the playoffs and regular season.
- team_stats_traditional_po.csv and team_stats_traditional_rs.csv: Traditional team statistics during the playoffs and regular season.

To use this data, simply clone this repository and use software capable of reading CSV files, such as Excel, R, or Python (with pandas).
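For instance, a minimal sketch in R (the file name comes from the list above):

```r
# load regular-season traditional player stats and inspect the columns
stats <- read.csv("player_stats_traditional_rs.csv")
str(stats)
head(stats)
```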
Contributions to this repo are welcome. If you have additional data to add or corrections to make, please feel free to open a pull request.
These data are released under the MIT License. See the LICENSE file for more information.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The BioTIME database contains raw data on species identities and abundances in ecological assemblages through time. The database consists of 11 tables: one raw data table plus ten related metadata tables. For further information please see our associated data paper.
This data consists of several elements:
- BioTIMESQL_02_04_2018.sql - an SQL file for the full public version of BioTIME, which can be imported into any MySQL database.
- BioTIMEQuery_02_04_2018.csv - data file; although too large to view in Excel, it can be read into several software applications such as R or various database packages (see the sketch after this list).
- BioTIMEMetadata_02_04_2018.csv - file containing the metadata for all studies.
- BioTIMECitations_02_04_2018.csv - file containing the citation list for all studies.
- BioTIMECitations_02_04_2018.xlsx - file containing the citation list for all studies (some special characters are not supported in the csv format).
- BioTIMEInteractions_02_04_2018.Rmd - an R Markdown page providing a brief overview of how to interact with the database and associated .csv files (this will not work until field paths and database connections have been added/updated).
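A minimal sketch for reading the large query file in R; data.table's fread handles a file of this size much faster than base read.csv:

```r
library(data.table)

# read the full BioTIME query table (too large for Excel)
biotime <- fread("BioTIMEQuery_02_04_2018.csv")
dim(biotime)
```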
Please note: any users of any of this material should cite the associated data paper in addition to the DOI listed here.
To cite the data paper use the following:
Dornelas M, Antão LH, Moyes F, Bates, AE, Magurran, AE, et al. BioTIME: A database of biodiversity time series for the Anthropocene. Global Ecol Biogeogr. 2018; 27:760 - 786. https://doi.org/10.1111/geb.12729
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data and scripts for reproducing the analyses of Naranjo-Orrico et al., "The importance of within-log sampling replication in bark- and wood-inhabiting fungal metabarcoding studies".
The input data consists of the following four files: "Alldata.Rdata", "data_SbVenn_meta&morpho.Rdata", "Xmorpho.csv" and "Ymorpho_1.csv". The former two files are in R format and the latter two in CSV format. The R files need to be loaded using the function load, and the CSV files with the function read.csv2 in R.
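For example, a minimal sketch:

```r
# load the R-format inputs (objects are attached to the workspace)
load("Alldata.Rdata")
load("data_SbVenn_meta&morpho.Rdata")

# read the CSV inputs (read.csv2 expects semicolon-separated, comma-decimal files)
Xmorpho <- read.csv2("Xmorpho.csv")
Ymorpho <- read.csv2("Ymorpho_1.csv")
```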
"Alldata.Rdata" includes in total 15 input data matrices:
- Metadata for dataset A (meta22)
- Metadata for dataset B (meta23)
- Two sample x OTU tables for dataset A including the number of reads for each OTU (otu.table.plausible.2022 for the plausible OTU taxonomic identifications and otu.table.reliable.2022 for the reliable taxonomic identifications).
- Two sample x OTU tables for dataset B including the number of reads for each OTU (otu.table.plausible.2023 for the plausible OTU taxonomic identifications and otu.table.reliable.2023 for the reliable OTU taxonomic identifications).
- Two sample x OTU tables for dataset A including the relative read counts per OTU (otu.table.plausible.w.2022 and otu.table.reliable.w.2022).
- Two sample x OTU tables for dataset B including the relative read counts per OTU (otu.table.plausible.w.2023 and otu.table.reliable.w.2023).
- Read counts per sample during the different phases of the bioinformatics pipeline for dataset A (read.counts.plausible.2022) and for dataset B (read.counts.plausible.2023).
- Taxonomic information at all taxonomic levels (i.e., from species to phylum) for the identified OTUs (taxonomy.plausible).
- Guild assignment matrices for dataset A (Guilds_plausible_tax_2022) and for dataset B (Guilds_plausible_tax_2023).
"data_SbVenn_meta&morpho.Rdata" contains four matrices:
- Occurrence of the lichenized OTUs identified through metabarcoding including identifications at any taxonomic level (i.e., genus or family levels when species level identifications were not achieved) (SbVenn_Lmeta).
- Occurrence of the lichenized OTUs identified through metabarcoding including identifications at the species-only level (SB_Venn_clean_meta).
- Occurrence of the morphologically identified lichenized fungi, including identifications at the genus level and morphospecies (SbVenn_Lmorpho).
- Occurrences of the morphologically identified lichenized fungi, including identifications at the species-only level (SbVenn_clean_morpho).
"Xmorpho.csv" and "Ymorpho_1.csv" contain respectively the metadata and the presence-absence data of the morphologically identified lichens.
“Alldata.Rdata” is used in all the scripts, "data_SbVenn_meta&morpho.Rdata" is only needed for the script "S8_Venn Diagrams.R", and the files "Xmorpho.csv" and "Ymorpho_1.csv" are used in "S11_Meta vs Morpho species richnes between tree sp and tree part.R".
The statistical analyses consist of joint species distribution modelling with the package Hmsc, generalized linear mixed models (GLMM) with the function glmer (package lme4), and non-metric multidimensional scaling (NMDS) with the package vegan. To perform the HMSC analyses, the first four scripts need to be run consecutively from S1 (A and B) to S3. S1A defines the first model using dataset A, and S1B defines the second model using dataset B. S2 fits the models used in the study (which include presence-absence models with different sets of explanatory variables). S3 shows the parameter estimates from the fitted models, in particular the beta parameters and the variance partitioning across environmental covariates. For fitting and showing the outputs of the GLMM models, only S4 is needed. S5 runs the NMDS analyses. The rest of the scripts, S6-S11, are used to produce the different plots shown in the study of Naranjo-Orrico et al., including pie plots, boxplots, barplots, and Venn plots.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset generated for the "On server-side file access pattern matching" paper (Boito et al., HPCS 2019).
The traces were obtained following the methodology described in the paper. In addition to the two data sets discussed in the paper, we are also making available an extra data set of server traces.
Traces from I/O nodes
Each line of a trace looks like this:
277004729325 00000000eaffffffffffff1f729db77200000000000000000000000000000000 W 0 262144
The first number is an internal timestamp in nanoseconds, the second value is the file handle, and the third is the type of the request (note that the labels are inverted: "W" denotes reads and "R" denotes writes). The last two numbers give the request offset and size in bytes, respectively.
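As a small sketch, such a trace can be loaded into an R data frame with read.table; the file name and column names below are placeholders, with the field order taken from the description above:

```r
# Read an I/O node trace; the five fields are whitespace-separated.
# "ionode.trace" is a placeholder file name.
trace <- read.table("ionode.trace",
                    col.names  = c("timestamp_ns", "handle", "type",
                                   "offset", "bytes"),
                    colClasses = c("numeric", "character", "character",
                                   "numeric", "numeric"))

# Undo the inverted labels: "W" marks reads and "R" marks writes.
trace$operation <- ifelse(trace$type == "W", "read", "write")
```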
Traces from parallel file system data servers
These traces are inside the server_traces/ folder. Each experiment has two concurrent applications, "app1" and "app2", and its traces are inside a folder named accordingly:
NOOP_app1_(identification of app1)_app2_(identification of app2)_(repetition)_pvfstrace/
Each application is identified by:
(contig/noncontig)_(number and size of requests per process)_(number of processes)_(number of client machines)_(nto1/nton regarding the number of files)
Inside each folder there are eight trace files, two per data server, one for the read portion and another for the write portion. Each line looks like this:
[D 02:54:58.386900] REQ SCHED SCHEDULING, handle: 5764607523034231596, queue_element: 0x2a11360, type: 0, offset: 458752, len: 32768
The part between square brackets is a timestamp; "handle" gives the file handle; "type" is 0 for reads and 1 for writes; "offset" and "len" (length) are in bytes.
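These lines are regular enough to parse with a single regular expression; here is a sketch in R (the file name is a placeholder, and lines that do not match the request pattern are dropped):

```r
# Parse server trace lines of the form shown above.
lines <- readLines("server_trace.log")  # placeholder file name
pat <- paste0("^\\[D ([0-9:.]+)\\] REQ SCHED SCHEDULING, ",
              "handle: ([0-9]+), queue_element: (0x[0-9a-f]+), ",
              "type: ([01]), offset: ([0-9]+), len: ([0-9]+)$")

hits <- regmatches(lines, regexec(pat, lines))
hits <- Filter(length, hits)            # drop non-matching lines

requests <- as.data.frame(do.call(rbind, lapply(hits, `[`, -1)),
                          stringsAsFactors = FALSE)
names(requests) <- c("time", "handle", "queue_element",
                     "type", "offset", "len")
requests$operation <- ifelse(requests$type == "0", "read", "write")
```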
Extra traces from data servers
These traces were not used for the paper because we do not have performance measurements for them with different scheduling policies, so it would not be possible to estimate the results of using the pattern matching approach to select scheduling policies. Still, we share them in the extra_server_traces/ folder in the hope they will be useful. They were obtained in the same experimental campaign as the other data server traces and have the same format. The difference is that these traces are for single-application scenarios.
This dataset contains all the data and code needed to reproduce the analyses in the manuscript: Penn, H. J., & Read, Q. D. (2023). Stem borer herbivory dependent on interactions of sugarcane variety, associated traits, and presence of prior borer damage. Pest Management Science. https://doi.org/10.1002/ps.7843
Included are two .Rmd notebooks containing all code required to reproduce the analyses in the manuscript, two .html files of rendered notebook output, three .csv data files that are loaded and analyzed, and a .zip file of intermediate R objects that are generated during the model fitting and variable selection process.
Notebook files
- 01_boring_analysis.Rmd: This RMarkdown notebook contains R code to read and process the raw data, create exploratory data visualizations and tables, fit a Bayesian generalized linear mixed model, extract output from the statistical model, and create graphs and tables summarizing the model output, including marginal means for different varieties and contrasts between crop years.
- 02_trait_covariate_analysis.Rmd: This RMarkdown notebook contains R code to read raw variety-level trait data, perform feature selection based on correlations between traits, fit another generalized linear mixed model using traits as predictors, and create graphs and tables from that model output, including marginal means by categorical trait and marginal trends by continuous trait.
HTML files
These HTML files contain the rendered output of the two RMarkdown notebooks. They were generated by Quentin Read on 2023-08-30 and 2023-08-15.
- 01_boring_analysis.html
- 02_trait_covariate_analysis.html
CSV data files
These files contain the raw data. To recreate the notebook output, the CSV files should be at the file path project/data/ relative to where the notebook is run. Columns are described below.
BoredInternodes_26April2022_no format.csv: primary data file with sugarcane borer (SCB) damage
- Columns A-C are the year, date, and location. All location values are the same.
- Column D identifies which experiment the data point was collected from.
- Column E, Stubble, indicates the crop year (plant cane or first stubble).
- Column F indicates the variety.
- Column G indicates the plot (integer ID).
- Column H indicates the stalk within each plot (integer ID).
- Column I, # Internodes, indicates how many internodes were on the stalk.
- Columns J-AM are numbered 1-30 and indicate whether SCB damage was observed on that internode (0 if no, 1 if yes, blank cell if that internode was not present on the stalk).
- Column AN indicates the experimental treatment for those rows that are part of a manipulative experiment.
- Column AO contains notes.
variety_lookup.csv: summary information for the 16 varieties analyzed in this study
- Column A is the variety name.
- Column B is the total number of stalks assessed for SCB damage for that variety across all years.
- Column C is the number of years that variety is present in the data.
- Column D, Stubble, indicates which crop years were sampled for that variety ("PC" if only plant cane, "PC, 1S" if there are data for both plant cane and first stubble crop years).
- Column E, SCB resistance, is a categorical designation with four values: susceptible, moderately susceptible, moderately resistant, resistant.
- Column F is the literature reference for the SCB resistance value.
Select_variety_traits_12Dec2022.csv: variety-level traits for the 16 varieties analyzed in this study
- Column A is the variety name.
- Column B is the SCB resistance designation as an integer.
- Column C is the categorical SCB resistance designation (see above).
- Columns D-I are continuous traits from year 1 (plant cane), including sugar (Mg/ha), biomass or aboveground cane production (Mg/ha), TRS or theoretically recoverable sugar (g/kg), stalk weight of individual stalks (kg), stalk population density (stalks/ha), and fiber content of stalk (percent).
- Columns J-O are the same continuous traits from year 2 (first stubble).
- Columns P-V are categorical traits (in some cases continuous traits binned into categories): maturity timing, amount of stalk wax, amount of leaf sheath wax, amount of leaf sheath hair, tightness of leaf sheath, whether leaf sheath becomes necrotic with age, and amount of collar hair.
ZIP file of intermediate R objects
To recreate the notebook output without having to run computationally intensive steps, unzip the archive. The fitted model objects should be at the file path project/ relative to where the notebook is run.
intermediate_R_objects.zip: This file contains intermediate R objects that are generated during the model fitting and variable selection process. You may use the R objects in the .zip file if you would like to reproduce final output, including figures and tables, without having to refit the computationally intensive statistical models (see the sketch after this list).
- binom_fit_intxns_updated_only5yrs.rds: fitted brms model object for the main statistical model
- binom_fit_reduced.rds: fitted brms model object for the trait covariate analysis
- marginal_trends.RData: calculated values of the estimated marginal trends with respect to year and previous damage
- marginal_trend_trs.rds: calculated values of the estimated marginal trend with respect to TRS
- marginal_trend_fib.rds: calculated values of the estimated marginal trend with respect to fiber content
Resources in this dataset:
- Resource Title: Sugarcane borer damage data by internode, 1993-2021. File Name: BoredInternodes_26April2022_no format.csv
- Resource Title: Summary information for the 16 sugarcane varieties analyzed. File Name: variety_lookup.csv
- Resource Title: Variety-level traits for the 16 sugarcane varieties analyzed. File Name: Select_variety_traits_12Dec2022.csv
- Resource Title: RMarkdown notebook 2: trait covariate analysis. File Name: 02_trait_covariate_analysis.Rmd
- Resource Title: Rendered HTML output of notebook 2. File Name: 02_trait_covariate_analysis.html
- Resource Title: RMarkdown notebook 1: main analysis. File Name: 01_boring_analysis.Rmd
- Resource Title: Rendered HTML output of notebook 1. File Name: 01_boring_analysis.html
- Resource Title: Intermediate R objects. File Name: intermediate_R_objects.zip
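As a hedged sketch, the intermediate objects can be loaded in R to reproduce figures and tables without refitting; the paths assume the archive was unzipped to project/ as described above, and brms must be installed so the fitted model objects can be inspected:

```r
library(brms)

# Fitted model objects (paths assume the zip was extracted to project/)
fit_main   <- readRDS("project/binom_fit_intxns_updated_only5yrs.rds")
fit_traits <- readRDS("project/binom_fit_reduced.rds")

# Marginal trend objects
load("project/marginal_trends.RData")  # loads the trend objects by name
trend_trs <- readRDS("project/marginal_trend_trs.rds")
trend_fib <- readRDS("project/marginal_trend_fib.rds")

summary(fit_main)  # posterior summary of the main model
```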
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data publication contains data and scripts associated with the publication:
Tebbett SB, Connolly SR, Bellwood DR. Benthic composition changes on coral reefs at global scales. Nature Ecology and Evolution
In this study we wanted to gain insight into likely coral reef configurations of the near future. Specifically, we focused on two relatively straightforward questions: 1) how is the benthic composition of coral reefs changing at a global scale? and 2) how, and to what extent, do these changes vary among major marine realms? To explore these questions, we compiled an extensive, global dataset composed of observations of coral reef benthic composition (cover of six benthic categories).
The individual datapoints in our dataset were mean site-level (i.e., a unique latitude and longitude) benthic community composition data. The six benthic categories were: hard coral, soft corals, macroalgae, low-lying algae, sand, and other. We sourced all data from publicly available databases (all references are supplied in CSV file 1) and previous literature (references and data supplied in CSV files 2 and 3). Specifically, we compiled benthic composition data from six major publicly available monitoring databases: Reef Check, Reef Check Australia, Reef Life Survey, Caribbean Coastal Marine Productivity (CARICOMP), Moorea Coral Reef Long Term Ecological Research, and the National Oceanic and Atmospheric Administration (NOAA). The raw data from these six publicly available databases can be accessed through the list of references and relevant accession details listed in CSV file 1. There are 48 references listed in this file, as these detail the specific location of each individual data source within these broader databases to ensure the raw data can be easily located. To complement the data from these databases and to ensure that our dataset was comprehensive, we then undertook an extensive formal search of the literature for available data. Ultimately, this search process resulted in 83 publications with data that could be used in our study. The references and derived data from these 83 studies are provided in CSV files 2 and 3.
This data publication also includes five R scripts (RMarkdown files) that detail the statistical analyses and the compilation of the figures in the main text of the manuscript.
For full methodological details, please see the published manuscript (referenced above).
Further details of the files associated with this data publication can also be found in the 'read me' file.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset includes all the data and R code needed to reproduce the analyses in a forthcoming manuscript: Copes, W. E., Q. D. Read, and B. J. Smith. Environmental influences on drying rate of spray applied disinfestants from horticultural production surfaces. PhytoFrontiers, DOI pending.
Study description: Instructions for disinfestants typically specify a dose and a contact time to kill plant pathogens on production surfaces. A problem occurs when disinfestants are applied to large production areas where the evaporation rate is affected by weather conditions. The common contact time recommendation of 10 min may not be achieved under hot, sunny conditions that promote fast drying. This study is an investigation into how the evaporation rates of six commercial disinfestants vary when applied to six types of substrate materials under cool to hot and cloudy to sunny weather conditions. Initially, disinfestants with low surface tension spread out to provide 100% coverage, and disinfestants with high surface tension beaded up to provide about 60% coverage, when applied to hard smooth surfaces. Disinfestants applied to porous materials, such as wood and concrete, were quickly absorbed into the body of the material. Even though disinfestants evaporated faster under hot sunny conditions than under cool cloudy conditions, coverage was reduced considerably in the first 2.5 min under most weather conditions and reduced to less than or equal to 50% coverage by 5 min.
Dataset contents: This dataset includes R code to import the data and fit Bayesian statistical models using the model fitting software CmdStan, interfaced with R using the packages brms and cmdstanr. The models (one for 2022 and one for 2023) compare how quickly different spray-applied disinfestants dry, depending on what chemical was sprayed, what surface material it was sprayed onto, and what the weather conditions were at the time. Next, the statistical models are used to generate predictions and compare mean drying rates between the disinfestants, surface materials, and weather conditions. Finally, tables and figures are created. A minimal model-loading sketch follows the file list below. These files are included:
- Drying2022.csv: drying rate data for the 2022 experimental run
- Weather2022.csv: weather data for the 2022 experimental run
- Drying2023.csv: drying rate data for the 2023 experimental run
- Weather2023.csv: weather data for the 2023 experimental run
- disinfestant_drying_analysis.Rmd: RMarkdown notebook with all data processing, analysis, and table creation code
- disinfestant_drying_analysis.html: rendered output of notebook
- MS_figures.R: additional R code to create figures formatted for journal requirements
- fit2022_discretetime_weather_solar.rds: fitted brms model object for 2022. This will allow users to reproduce the model prediction results without having to refit the model, which was originally fit on a high-performance computing cluster
- fit2023_discretetime_weather_solar.rds: fitted brms model object for 2023
- data_dictionary.xlsx: descriptions of each column in the CSV data files
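A hedged R sketch of using the provided fitted model objects to regenerate predictions without refitting; the file names come from the list above, but nothing beyond loading and generic brms post-processing is taken from the actual notebook:

```r
library(brms)

# Load the fitted 2022 and 2023 models provided with the dataset
fit2022 <- readRDS("fit2022_discretetime_weather_solar.rds")
fit2023 <- readRDS("fit2023_discretetime_weather_solar.rds")

# Posterior summary and a quick posterior predictive check
summary(fit2022)
pp_check(fit2022)

# Expected values for the observed data; predictions for new conditions
# would require a newdata data frame matching the model's covariates,
# which are defined in disinfestant_drying_analysis.Rmd
head(fitted(fit2022))
```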