Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
When studying the impacts of climate change, there is a tendency to select climate data from a small set of arbitrary time periods or climate windows (e.g., spring temperature). However, these arbitrary windows may not encompass the strongest periods of climatic sensitivity and may lead to erroneous biological interpretations. Therefore, there is a need to consider a wider range of climate windows to better predict the impacts of future climate change. We introduce the R package climwin, which provides a number of methods to test the effect of different climate windows on a chosen response variable and to compare these windows to identify potential climate signals. climwin extracts the relevant data for each possible climate window and uses these data to fit a statistical model, the structure of which is chosen by the user. Models are then compared using an information criterion approach. This allows users to determine how well each window explains variation in the response variable and to compare model support between windows. climwin also contains methods to detect type I and type II errors, which are often a problem with this type of exploratory analysis. This article presents the statistical framework and technical details behind the climwin package and demonstrates the applicability of the method with a number of worked examples.
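As a rough illustration of the workflow described above, the sketch below follows the pattern from the climwin vignette, using the Mass and MassClimate example datasets bundled with the package; the slidingwin() arguments shown follow the documented interface but should be checked against the current manual.

```r
library(climwin)

# Example data shipped with climwin: chick body mass and daily climate records
data(Mass, MassClimate)

win <- slidingwin(
  xvar      = list(Temp = MassClimate$Temp),  # candidate climate variable
  cdate     = MassClimate$Date,               # climate dates (dd/mm/yyyy)
  bdate     = Mass$Date,                      # biological (response) dates
  baseline  = lm(Mass ~ 1, data = Mass),      # user-chosen baseline model
  range     = c(150, 0),                      # consider windows 0-150 days back
  cinterval = "day",
  type      = "relative",
  stat      = "mean",
  func      = "lin"
)

head(win[[1]]$Dataset)  # candidate windows ranked by deltaAICc
```

randwin() can then be run on randomised data to gauge how likely the best window is to arise by chance, which addresses the type I error issue mentioned above.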
Exploratory Data Analysis for the Physical Properties of Lakes
This lesson was adapted from educational material written by Dr. Kateri Salk for her Fall 2019 Hydrologic Data Analysis course at Duke University. This is the first part of a two-part exercise focusing on the physical properties of lakes.
Introduction
Lakes are dynamic, nonuniform bodies of water in which the physical, biological, and chemical properties interact. Lakes also contain the majority of Earth's available surface fresh water. This lesson introduces exploratory data analysis using R statistical software in the context of the physical properties of lakes.
Learning Objectives
After successfully completing this exercise, you will be able to:
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The high-resolution and mass accuracy of Fourier transform mass spectrometry (FT-MS) has made it an increasingly popular technique for discerning the composition of soil, plant and aquatic samples containing complex mixtures of proteins, carbohydrates, lipids, lignins, hydrocarbons, phytochemicals and other compounds. Thus, there is a growing demand for informatics tools to analyze FT-MS data that will aid investigators seeking to understand the availability of carbon compounds to biotic and abiotic oxidation and to compare fundamental chemical properties of complex samples across groups. We present ftmsRanalysis, an R package which provides an extensive collection of data formatting and processing, filtering, visualization, and sample and group comparison functionalities. The package provides a suite of plotting methods and enables expedient, flexible and interactive visualization of complex datasets through functions which link to a powerful and interactive visualization user interface, Trelliscope. Example analysis using FT-MS data from a soil microbiology study demonstrates the core functionality of the package and highlights the capabilities for producing interactive visualizations.
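As a hedged sketch of the package's core workflow: the constructor and function names below (as.peakData, compound_calcs, vanKrevelenPlot) are recalled from the package vignette and should be verified against the current reference manual, and edata, fdata and emeta stand in for the user's own peak, sample and molecular-formula tables.

```r
library(ftmsRanalysis)

# Assemble a peakData object from three user-supplied data frames:
# edata (peak intensities), fdata (sample info), emeta (per-peak metadata)
peakObj <- as.peakData(e_data = edata, f_data = fdata, e_meta = emeta,
                       edata_cname = "Mass", fdata_cname = "SampleID",
                       mass_cname = "Mass", mf_cname = "MolForm")

peakObj <- compound_calcs(peakObj)  # derive elemental ratios and indices
vanKrevelenPlot(peakObj)            # O/C vs H/C plot of compound classes
```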
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The zipped file contains the following:
- data (as csv, in the 'data' folder)
- R scripts (as Rmd, in the 'rro' folder)
- figures (as pdf, in the 'figs' folder)
- presentation (as html, in the root folder)
https://www.marketreportanalytics.com/privacy-policy
The Exploratory Data Analysis (EDA) tools market is experiencing robust growth, driven by the increasing volume and complexity of data across industries. The rising need for data-driven decision-making, coupled with the expanding adoption of cloud-based analytics solutions, is fueling market expansion. While precise figures for market size and CAGR are not provided, a reasonable estimation, based on the prevalent growth in the broader analytics market and the crucial role of EDA in the data science workflow, would place the 2025 market size at approximately $3 billion, with a projected Compound Annual Growth Rate (CAGR) of 15% through 2033. This growth is segmented across various applications, with large enterprises leading adoption due to their higher investment capacity and complex data needs. However, SMEs are seeing rapid growth in EDA tool adoption, driven by the increasing availability of user-friendly and cost-effective solutions. Segmentation by tool type reveals a strong preference for graphical EDA tools, which offer intuitive visualizations that facilitate better data understanding and communication of findings. Regions such as North America and Europe currently hold a significant market share, but the Asia-Pacific region shows promising potential for future growth owing to increasing digitalization and data generation. Key restraints on market growth include the need for specialized skills to use these tools effectively and the potential for data bias if not handled appropriately.

The competitive landscape is dynamic, with both established players like IBM and emerging companies specializing in niche areas vying for market share. Established players benefit from brand recognition and comprehensive enterprise solutions, while specialized vendors provide innovative features and agile development cycles. Open-source options such as KNIME, the R package Rattle, and Python's Pandas Profiling offer cost-effective alternatives, particularly attractive to academic institutions and smaller businesses. The ongoing development of advanced analytics functionality, such as automated machine learning integration within EDA platforms, will be a significant driver of future market growth. Further, the integration of EDA tools within broader data science platforms is streamlining the overall analytical workflow, contributing to increased adoption and reduced complexity. The market's evolution hinges on enhanced user experience, more robust automation features, and seamless integration with other data management and analytics tools.
http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
This is version v3.2.0.2021f of Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data.
The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, and cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files provide the station listing with IDs, names and location information.
The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20220101_v3.2.1.2021f.nc. The station codes can be found under the docs tab. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.
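To read one of these station files in R, a minimal sketch with the ncdf4 package; the station code in the path and the variable name "temperatures" are assumptions, so inspect names(nc$var) for the file's actual layout.

```r
library(ncdf4)

# Hypothetical station code; substitute one from the station listing
f <- "station_data/010010-99999_HadISD_HadOBS_19310101-20220101_v3.2.1.2021f.nc"

nc <- nc_open(f)
print(names(nc$var))                   # list the variables actually present
temp <- ncvar_get(nc, "temperatures")  # assumed variable name for temperature
time <- ncvar_get(nc, "time")
nc_close(nc)
```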
To keep informed about updates, news and announcements, follow the HadOBS team on Twitter @metofficeHadOBS.
For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/
References: When using the dataset in a paper you must cite the following papers (see Docs for links to the publications) and this dataset (using the "citable as" reference):
Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note.
Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.
Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012
Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704–708, doi:10.1175/2011BAMS3015.1
For a homogeneity assessment of HadISD, please see the following reference:
Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Reddit is a massive platform for news, content, and discussions, hosting millions of active users daily. Among its vast number of subreddits, we focus on the r/AskScience community, where users engage in science-related discussions and questions.
This dataset is derived from the r/AskScience subreddit, collected between January 1, 2016, and May 20, 2022. It includes 612,668 datapoints across 22 columns, featuring diverse information such as the content of the questions, submission descriptions, associated flairs, NSFW/SFW status, year of submission, and more. The data was extracted using Python and Pushshift's API, followed by some cleaning with NumPy and pandas. Detailed column descriptions are available for clarity.
Lecture for the course "Tehnike obrade biomedicinskih signala" (Biomedical Signal Processing Techniques) in the master's degree programme at the School of Electrical Engineering, University of Belgrade.
This is version 2.0.2.2017p of Met Office Hadley Centre's Integrated Surface Database, HadISD. These data are global sub-daily surface meteorological data that extend HadISD v2.0.1.2016f to include 2017 and so span 1931-2017. These data include an update to the stations selected and contain 8103 stations. These are the preliminary data for this version; a finalised version will be released in a few months with any station updates.

The quality controlled variables in this dataset are: temperature, dewpoint temperature, sea-level pressure, wind speed and direction, and cloud data (total, low, mid and high level). Past significant weather and precipitation data are also included, but have not been quality controlled, so their quality and completeness cannot be guaranteed. Quality control flags and data values which have been removed during the quality control process are provided in the qc_flags and flagged_values fields, and ancillary data files provide the station listing with IDs, names and location information.

The data are provided as one NetCDF file per station. Files in the station_data folder have the format "station_code"_HadISD_HadOBS_19310101-20171231_v2-0-2-2017p.nc. The station codes can be found under the docs tab or on the archive beside the station_data folder. The station codes file has five columns as follows: 1) station code, 2) station name, 3) station latitude, 4) station longitude, 5) station height.

To keep up to date with updates, news and announcements, follow the HadOBS team on Twitter @metofficeHadOBS. For more detailed information, e.g. bug fixes, routine updates and other exploratory analysis, see the HadISD blog: http://hadisd.blogspot.co.uk/

References: When using the dataset in a paper you must cite the following papers (see Docs for links to the publications) and this dataset (using the "citable as" reference):

Dunn, R. J. H., Willett, K. M., Parker, D. E., and Mitchell, L.: Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geosci. Instrum. Method. Data Syst., 5, 473-491, doi:10.5194/gi-5-473-2016, 2016.

Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Clim. Past, 8, 1649-1679, 2012, doi:10.5194/cp-8-1649-2012

Smith, A., N. Lott, and R. Vose, 2011: The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708, doi:10.1175/2011BAMS3015.1

For a homogeneity assessment of HadISD, please see the following reference:

Dunn, R. J. H., K. M. Willett, C. P. Morice, and D. E. Parker. "Pairwise homogeneity assessment of HadISD." Climate of the Past 10, no. 4 (2014): 1501-1522. doi:10.5194/cp-10-1501-2014, 2014.
https://creativecommons.org/publicdomain/zero/1.0/
Instead of a data set, this is an exercise in financial analysis built with R vectors. The intention is to demonstrate the practicality of this tool for interpreting calculations on vectors, matrices and data frames, owing to R's character as a high-level language, which allows algorithms to be expressed in a way suited to human cognitive capacity rather than to the way machines execute them.
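As a small, self-contained example of the kind of vector-based financial calculation the exercise has in mind (the cash-flow figures are invented for illustration):

```r
# Cash flows of a hypothetical project: initial outlay, then yearly inflows
cashflows <- c(-1000, 300, 420, 680)
years     <- 0:3
rate      <- 0.08   # assumed discount rate

# Net present value computed in a single vectorized expression
npv <- sum(cashflows / (1 + rate)^years)
npv  # approx. 177.67
```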
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present an R package for predictive modelling, CARRoT (Cross-validation, Accuracy, Regression, Rule of Ten). CARRoT is a tool for initial exploratory analysis of the data, which performs an exhaustive search for the regression model yielding the best predictive power, with heuristic 'rules of thumb' and expert knowledge as regularization parameters. It uses multiple hold-outs in order to internally validate the model. The package allows multiple factors to be taken into account during model selection, such as collinearity of the predictors, events per variable (EPV) rules and R-squared statistics. In addition, other constraints can be used, such as forcing specific terms into the model and restricting the complexity of the candidate models. The package can also take pairwise and three-way interactions between variables into account. Candidate models are then ranked by predictive power, which is assessed via multiple hold-out procedures and can be parallelised in order to reduce the computational time. Models which exhibit the highest average predictive power over all hold-outs are returned; predictive power is quantified as absolute and relative error for continuous outcomes, and as accuracy and AUROC values for categorical outcomes. In this paper we briefly present the statistical framework of the package and discuss the complexity of the underlying algorithm. Moreover, using CARRoT and a number of datasets available in R, we compare different model selection techniques: based on EPVs alone, on EPVs and R-squared statistics, on lasso regression, on including only statistically significant predictors, and on stepwise forward selection.
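The multiple hold-out idea at the heart of the package can be sketched in plain R. This is not CARRoT's own API, just an illustration of the validation scheme it automates, using the built-in mtcars data:

```r
set.seed(1)
d <- mtcars
n_holdouts <- 100

err <- replicate(n_holdouts, {
  test <- sample(nrow(d), size = round(0.2 * nrow(d)))  # 20% hold-out
  fit  <- lm(mpg ~ wt + hp, data = d[-test, ])          # one candidate model;
                                                        # an EPV rule would cap
                                                        # the number of predictors
  mean(abs(predict(fit, d[test, ]) - d$mpg[test]))      # absolute error
})

mean(err)  # average predictive error over all hold-outs
```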
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Reddit is a social news, content rating and discussion website, and one of the most popular sites on the internet. Reddit has 52 million daily active users and approximately 430 million monthly users. Reddit is organized into subreddits, and here we use the r/AskScience subreddit.
The dataset is extracted from the subreddit /r/AskScience on Reddit. The data was collected between 01-01-2016 and 20-05-2022. It contains 612,668 datapoints and 25 columns. The database contains information about the questions asked on the subreddit, the description of the submission, the flair of the question, NSFW or SFW status, the year of the submission, and more. The data was extracted using Python and Pushshift's API, with a little cleaning done using NumPy and pandas (see the descriptions of individual columns below).
The dataset contains the following columns and descriptions:
author - Redditor name
author_fullname - Redditor full name
contest_mode - contest mode (obscured scores and randomized sorting)
created_utc - time the submission was created, represented in Unix time
domain - domain of the submission
edited - whether the post has been edited
full_link - link to the post on the subreddit
id - ID of the submission
is_self - whether or not the submission is a self post (text-only)
link_flair_css_class - CSS class used to identify the flair
link_flair_text - the link flair's text content
locked - whether or not the submission has been locked
num_comments - the number of comments on the submission
over_18 - whether or not the submission has been marked as NSFW
permalink - a permalink for the submission
retrieved_on - time ingested
score - the number of upvotes for the submission
description - description of the submission
spoiler - whether or not the submission has been marked as a spoiler
stickied - whether or not the submission is stickied
thumbnail - thumbnail of the submission
question - question asked in the submission
url - the URL the submission links to, or the permalink if a self post
year - year of the submission
banned - whether or not banned by a moderator
This dataset can be used for flair prediction, NSFW classification, and various text mining/NLP tasks. Exploratory data analysis can also be done to gain insights and surface trends and patterns over the years.
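A minimal EDA sketch in R, assuming the dataset has been exported as askscience.csv (the file name is hypothetical; the column names come from the list above):

```r
posts <- read.csv("askscience.csv", stringsAsFactors = FALSE)

table(posts$year)                 # submissions per year
prop.table(table(posts$over_18))  # share of NSFW posts
head(sort(table(posts$link_flair_text), decreasing = TRUE))  # most common flairs
```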
https://creativecommons.org/publicdomain/zero/1.0/
Data Visualization
a. Scatter plot
i. The webapp should allow the user to select genes from datasets and plot 2D scatter plots between 2 variables (expression/copy_number/chronos) for any pair of genes.
ii. The user should be able to filter and color data points using metadata information available in the file "metadata.csv".
iii. The visualization could be interactive: it would be great if the user can hover over the data points on the plot and get the relevant information (hint: visit https://plotly.com/r/, https://plotly.com/python).
iv. Here is a quick reference for you. The scatter plot is between the chronos score for the TTBK2 gene and expression for the MORC2 gene, with coloring defined by the Gender/Sex column from the metadata file.
b. Boxplot/violin plot
i. The user should be able to select a gene and a variable (expression/chronos/copy_number) and generate a boxplot to display its distribution across multiple categories as defined by a user-selected variable (a column from the metadata file).
ii. Here is an example for your reference, where a violin plot of the CHRONOS score for gene CCL22 is plotted and grouped by 'Lineage' (see the plotly sketch after this list).
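A hedged sketch of both plot types with the R plotly package: df stands for a data frame joining per-gene values with metadata.csv, and the column names (TTBK2_chronos, MORC2_expression, cell_line) are assumptions about how that join might be named; Sex and Lineage are the metadata columns mentioned above.

```r
library(plotly)

# Interactive scatter: chronos(TTBK2) vs expression(MORC2), colored by Sex
plot_ly(df, x = ~TTBK2_chronos, y = ~MORC2_expression,
        color = ~Sex, type = "scatter", mode = "markers",
        text = ~cell_line)  # shown on hover; column name is an assumption

# Violin plot: chronos(CCL22) grouped by Lineage
plot_ly(df, x = ~Lineage, y = ~CCL22_chronos,
        type = "violin", box = list(visible = TRUE))
```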
https://spdx.org/licenses/CC0-1.0.html
Sleep deprivation affects cognitive performance and immune function, yet its mechanisms and biomarkers remain unclear. This study explored the relationships among gene expression, brain metabolism, sleep deprivation, and sex differences.

Methods: Fluorodeoxyglucose-18 positron emission tomography (18F-FDG PET) measured brain metabolism in regions of interest (ROIs), and RNA analysis of blood samples assessed gene expression pre- and post-sleep deprivation. Mixed model regression and principal component analysis (PCA) identified significant genes and regional metabolic changes.

Results: There were 23 and 28 differentially expressed probesets for the main effects of sex and sleep deprivation, respectively, and 55 probesets for their interaction (FDR-corrected p<0.05). Functional analysis revealed enrichment in nucleoplasm- and UBL conjugation-related genes. Genes showing significant sex effects mapped to chromosomal regions Y and 19 (Benjamini-Hochberg (BH) FDR p<0.05), with 11 genes (4%) and 29 genes (10.5%) involved, respectively. Differential gene expression highlighted sex-based differences in innate and adaptive immunity. For brain metabolism, sleep deprivation resulted in significant decreases in the left insula, medial prefrontal cortex (BA32), somatosensory cortex (BA1/2), and motor premotor cortex (BA6), and increases in the right inferior longitudinal fasciculus, primary visual cortex (BA17), amygdala, cerebellum, and bilateral pons. Hemispheric asymmetry in brain metabolism was observed, with BA6 decreases correlating with increased UBL conjugation gene expression.

Conclusion: Sleep deprivation broadly impacts brain metabolism, gene expression, and immune function, revealing cellular stress responses and hemispheric vulnerability. These findings enhance understanding of the molecular and functional effects of sleep deprivation.

Methods: Sleep Deprivation. Eight healthy subjects, 4 male and 4 female, were recruited from the University of California Irvine, after IRB approval. On day 1, subjects were initially assigned a 24-hour period of normal activity (e.g. walk, talk, study, watch TV, play games, use the computer, etc.). These subjects were tested on the Psychomotor Vigilance Test (PVT) and asked to rate their subjective level of sleepiness on the Stanford Sleepiness Scale (SSS) at baseline. Higher scores indicate a longer, more delayed response time on the PVT, while higher scores on the SSS indicate greater degrees of sleepiness. The SSS scale is shown in Table 1. Each subject's performance on the PVT and subjective sleepiness ratings (SSS) were recorded both before and after sleep deprivation (Table 2). There was no significant difference in age between male and female subjects (Table 3), all of whom had no prior psychiatric history. Blood samples were collected on the baseline day at 1 p.m., pre-sleep deprivation (pre-SD). Sleep deprivation activities and blood sample acquisition times are recorded in Table 4. At the end of day 1 (11 p.m.), subjects were moved to an outpatient research facility for the sleep deprivation protocol. They were requested not to nap or sleep during the sleep deprivation period, and were additionally tasked with filling out forms and answering questions about their mood every two to four hours. Staff members monitored the subjects during the sleep deprivation period. Subjects were allowed to walk, talk, study, watch TV, play games or cards, read, and use the computer, but were not allowed caffeinated foods or beverages.
A second blood sample was collected 18 hours after starting sleep deprivation activities (SD Day 2, 1 p.m.), after which subjects completed the protocol and were driven home by cab.

Gene Data Processing: Blood samples (3 ml) were drawn from each subject into Tempus tubes (ABI, ThermoFisher, Carlsbad, CA) 24 hours apart. The blood samples collected at baseline and 18 hours after starting sleep deprivation activities were processed with Affymetrix HG-U133 Plus 2.0 gene expression microarray chips according to the manufacturer's instructions (Affymetrix, ThermoFisher, Carlsbad, CA). Data processing was done using R 4.2 and Bioconductor 3.16 [32]. The Affymetrix HG-U133 Plus 2.0 microarray 'cel' files were read using the affy routine with the hgu133plus2.db package. Quantile normalization was used to standardize probeset data [33]. A linear model was fitted to the expression data for each probeset using lmFit from the limma package to eliminate weakly expressed probesets, and the top 40,000 probesets were found using the topTable function. Type III mixed ANOVA was implemented using the lmerTest library in R, with the main effects being sex, sleep deprivation, and the sleep deprivation-sex interaction. Age and RNA integrity number (RIN) were used as covariates. The top 300 probesets for each main effect from mixed ANOVA and PCA were analyzed for enrichment using the Database for Annotation, Visualization and Integrated Discovery (DAVID) [34; 35]. Principal component analysis was conducted using the pca function with normalized and scaled expression data.

F18-FDG PET Scan Processing: The pre-SD and post-SD F18 FDG-PET scans were obtained from each subject. Each F18-FDG PET scan was normalized in MATLAB (Mathworks, Sherborn, Massachusetts, USA) using Statistical Parametric Mapping (SPM) 5 software (Functional Imaging Laboratory, Wellcome Department of Cognitive Neurology, University College London, London, UK) to spatially transform the images to a template conforming to the space derived from standard brains from the Montreal Neurological Institute, and to convert them to the space of the stereotactic atlas of Talairach and Tournoux. The images were then smoothed with a Gaussian low-pass filter of 8 mm to minimize noise and improve spatial alignment. Region of interest (ROI) analysis was done by extracting metabolic values from regions of interest using VINCI ("Volume Imaging in Neurological Research, Co-Registration and ROI included") software. Supplementary Figure 1 shows ROI segmentation of FDG-PET scans labeled with brain regions and Brodmann areas (BA). A type III mixed two-way ANOVA was implemented using the lmerTest library in R. The model considered sex as a between-subjects factor and condition (pre-sleep deprivation vs. post-sleep deprivation) as a within-subjects factor. Principal component analysis was performed using the pca() function in the Bioconductor environment [32] in R. Prior to extracting principal components, all probesets were scaled by subtracting the mean value and dividing by the standard deviation for that variable in R.
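A hedged sketch of the microarray-processing steps described above; file paths and the condition/subject variables are placeholders, and the calls to ReadAffy, rma, lmFit, topTable and lmer show standard affy/limma/lmerTest usage rather than the authors' exact script.

```r
library(affy)
library(limma)

raw  <- ReadAffy(celfile.path = "cel_files/")  # read the .cel files
eset <- rma(raw)                               # RMA includes quantile normalization

# Rank probesets with a linear model; `condition` is a placeholder factor
# (pre- vs post-sleep deprivation) with one level per array
design <- model.matrix(~ condition)
fit    <- eBayes(lmFit(exprs(eset), design))
tab    <- topTable(fit, coef = 2, number = 40000)

# Per-probeset type III mixed ANOVA with subject as a random effect
library(lmerTest)
m <- lmer(expr ~ sex * sleep_dep + age + RIN + (1 | subject), data = probeset_df)
anova(m, type = 3)
```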
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset simulates health-related outputs from a smartwatch, mimicking real-world issues in data collection, making it perfect for applying data preprocessing techniques such as handling missing values, outliers, duplicates, and inconsistencies.
Dataset Overview:
- Total Rows: 10,000
- Total Columns: 7
- Use Case: Health monitoring using smartwatch sensor data
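A short preprocessing sketch in R under stated assumptions: the file name smartwatch.csv and the heart_rate column are hypothetical, since the seven column names are not listed here.

```r
sw <- read.csv("smartwatch.csv", stringsAsFactors = FALSE)

sw <- unique(sw)                        # drop exact duplicate rows
sw$heart_rate[is.na(sw$heart_rate)] <-
  median(sw$heart_rate, na.rm = TRUE)   # simple median imputation

# Flag outliers with the 1.5*IQR rule
q   <- quantile(sw$heart_rate, c(0.25, 0.75))
iqr <- q[2] - q[1]
outlier <- sw$heart_rate < q[1] - 1.5 * iqr | sw$heart_rate > q[2] + 1.5 * iqr
sum(outlier)                            # how many readings look suspect
```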
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This package contains data and code to replicate the findings presented in our paper titled "GUI Testing of Android Applications: Investigating the Impact of the Number of Testers on Different Exploratory Testing Strategies".
Abstract
Graphical User Interface (GUI) testing plays a pivotal role in ensuring the quality and functionality of mobile apps. In this context, Exploratory Testing (ET), a distinctive methodology in which individual testers pursue a creative, experience-based approach to test design, is often used as an alternative or complement to traditional scripted testing. Managing the exploratory testing process is a challenging task that can easily result either in wasteful spending or in inadequate software quality, due to the relative unpredictability of exploratory testing activities, which depend on the skills and abilities of individual testers. A number of works have investigated the diversity of testers' performance when using ET strategies, often in a crowdtesting setting. These works, however, investigated ET effectiveness in detecting bugs, not in scenarios in which the goal is also to generate a re-executable test suite. Moreover, less work has been conducted on evaluating the impact of adopting different exploratory testing strategies. As a first step towards filling this gap in the literature, in this work we conduct an empirical evaluation involving four open-source Android apps and twenty master's students, whom we believe to be representative of practitioners partaking in exploratory testing activities. The students were asked to generate test suites for the apps using a Capture and Replay tool and different exploratory testing strategies. We then compare the effectiveness, in terms of aggregate code coverage, that different-sized groups of students using different exploratory testing strategies may achieve. The results provide deeper insights into code coverage dynamics for project managers interested in using exploratory approaches to test simple Android apps, on which they can base more informed decisions.
Contents and Instructions
This package contains:
apps-under-test.zip A zip archive containing the source code of the four Android applications we considered in our study, namely MunchLife, TippyTipper, Trolly, and SimplyDo.
apps-under-test-instrumented.zip A zip archive containing the instrumented source code of the four Android applications we used to compute branch coverage.
students-test-suites.zip A zip archive containing the test suites developed by the students using Uninformed Exploratory Testing (referred to as "Black Box" in the subdirectories) and Informed Exploratory Testing (referred to as "White Box" in the subdirectories). This also includes coverage reports.
compute-coverage-unions.zip A zip archive containing Python scripts we developed to compute the aggregate LOC coverage of all possible subsets of students. The scripts have been tested on MS Windows. To compute the LOC coverage achieved by all possible subsets of testers using IET and UET strategies, run the analysisAndReport.py script. To compute the LOC coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the analysisAndReport_UET_IET_combinations_emma.py script.
branch-coverage-computation.zip A zip archive containing Python scripts we developed to compute the aggregate branch coverage of all considered subsets of students. The scripts have been tested on MS Windows. To compute the branch coverage achieved by all possible subsets of testers using UET and I+UET strategies, run the branch_coverage_analysis.py script. To compute the branch coverage achieved by mixed crowds in which some testers use a U+IET approach and others use a UET approach, run the mixed_branch_coverage_analysis.py script.
data-analysis-scripts.zip A zip archive containing R scripts to merge and manipulate coverage data, to carry out statistical analysis and draw plots. All data concerning RQ1 and RQ2 is available as a ready-to-use R data frame in the ./data/all_coverage_data.rds file. All data concerning RQ3 is available in the ./data/all_mixed_coverage_data.rds file.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The repository provides the full data and processing/analysis pipeline for the paper 'From stage to page: language independent bootstrap measures of distinctiveness in fictional speech'.
Rendered notebooks are also available through Github:
1) Preparation, energy distance and exploration (main)
2) Keyword curves & formal modeling
00_dracor_get_data.R - uses the dedicated DraCor API to get the texts spoken by characters
01_distinctiveness_energy.ipynb - does the heavy lifting of data wrangling, cleaning and preprocessing; implements the energy distance bootstrapping and the exploratory analysis
02_logodds_curves.R - calculates keyword curves for characters
03_analysis_and_models.R - explores keyword curves and fits the Bayesian models
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
In this study we present the first taxonomic revision of the ant genus Stigmatomma in the Malagasy biogeographic region, redescribe the previously known S. besucheti Baroni-Urbani, and describe seven species new to science (S. bolabola sp. n., S. irayhady sp. n., S. janovitsika sp. n., S. liebe sp. n., S. roahady sp. n., S. sakalava sp. n., and S. tsyhady sp. n.). The revision is based on the worker caste, but we provide brief descriptions of gynes and males for some species. Species descriptions, diagnoses, character discussion, an identification key, and a glossary are illustrated with 360 high-quality montage and SEM images. The distribution of Stigmatomma species in Madagascar is mapped and discussed within the context of the island's biomes and ecoregions. We also discuss how some morphometric variables describe the differences among the species in the bioregion. Open science is supported by providing access to R scripts, raw measurement data, and all specimen data used. All specimens used in this study were given unique identifiers, and holotypes were imaged. Specimens and images are made accessible on AntWeb.org.
Age is an important parameter for improving the understanding of biodemographic trends (development, survival, reproduction and environmental effects) critical for conservation. However, current age estimation methods are challenging to apply to many species, and no standardised technique has yet been adopted. This study examined the potential use of methylation-sensitive high-resolution melting (MS-HRM), a labour-, time- and cost-effective method, to estimate chronological age from DNA methylation in Asian elephants (Elephas maximus). The objective of this study was to investigate the accuracy and validity of MS-HRM for age determination in long-lived species such as Asian elephants. The average lifespan of Asian elephants is between 50 and 70 years, but some have been known to survive for more than 80 years. DNA was extracted from 53 blood samples of captive Asian elephants across 11 zoos in Japan, with known ages ranging from a few months to 65 years. Methylation rates of two candidate...

# Estimation of captive Asian elephants (Elephas maximus) age based on DNA methylation: An exploratory analysis using methylation-sensitive high-resolution melting (MS-HRM)
The raw methylation data of RALYL and TET2 used in this analysis are in csv files.
This is the dataset that we used to develop the age estimation model. The '**subject ID**' represents the same individuals. In contrast, '**sample**' represents the sample collections taken over time. In addition, cells containing '**n/a**' in the 'sample' column are samples which were collected recently and had no defined ID number at the time (please refer to the Supplementary Information in the manuscript for more details on sampling dates). '**sex**' represents the sex of the individual, where F: female and M: male. '**age**' represents the chronological age of the individual at the time of sampling. '**ralyl_methylationrate_ave**' and '**tet2_methylatio...
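A hedged sketch of fitting a simple age model from such data in R: the file name is hypothetical, and since the second column name is truncated above, `tet2_methylationrate_ave` is an assumed completion.

```r
d <- read.csv("elephant_methylation.csv")

# Linear model of chronological age on the two candidate methylation rates
fit <- lm(age ~ ralyl_methylationrate_ave + tet2_methylationrate_ave, data = d)
summary(fit)

predict(fit, newdata = d[1, ])  # estimated age for one sample
```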
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset contains 206 spatially explicit environmental variables, also termed covariates, at 25m resolution that cover the entire Netherlands (national scale). The raster data comprise covariates related to the soil-forming factors (climate, organisms/land use/land cover, relief/topography, parent material/geology) for the purpose of digital soil mapping. However, since the covariates cover a wide range of environmental variables, they can potentially be used for spatial modelling in the Netherlands outside the field of soil science as well. All covariates can also be found from their original sources, but the potential strength and practicality of this dataset lies in the broad range of readily available, collected, prepared and harmonized raster data.
The metadata of all the covariates in this dataset can be found in the "00_covariates_metadata.csv" file, including information about the names, category, value types, specific value types, type of geospatial data, file type, whether it is static or dynamic, temporal coverage, date/version, resolution (all 25m), origin, source, access/license, description, processing steps and comments. The dataset includes three different types of files: the covariate rasters themselves, the metadata table, and reclassification tables for categorical covariates.
Note that the reclassification tables contain potential ways to reclassify the data provided, but can be altered by the user. Reclassification may be useful for categorical covariates with a large number of classes/categories. Note that covariates with CC BY-ND 4.0 licenses, covariates that are not open data or for which the license was unknown are not shared in this dataset.
More information about these covariates can be found in the associated scientific paper "BIS-4D: Mapping soil properties and their uncertainties at 25m resolution in the Netherlands" (Helfenstein et al., 2024, under review). Different ways of pre-processing and preparing the covariates for subsequent modelling can be found in R scripts 20-25 in the associated code repository on GitLab. This includes assembling and preparing covariates using GDAL ("20_cov_prep_gdal.R"), computing digital elevation model (DEM) derivatives using SAGA GIS ("21_cov_dem_deriv_saga.R"), deriving spectral indices from RGBNIR bands of Sentinel 2 images ("22_cov_sensing_deriv.R"), preparing categorical covariates using GDAL ("23_cov_cat_recl_gdal.R"), deriving dynamic covariates ("24_cov_dyn_prep_gdal.R") and exploratory analysis of the covariates ("25_cov_expl_analysis_clorpt.Rmd", "25_cov_expl_analysis_cont_cat.Rmd").
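As a small illustration of one of the preparation steps mentioned above (deriving a spectral index from Sentinel 2 RGBNIR bands), a sketch with the terra package; the file names are placeholders, not those used by the repository scripts.

```r
library(terra)

red <- rast("sentinel2_red_25m.tif")  # assumed single-band red raster
nir <- rast("sentinel2_nir_25m.tif")  # assumed single-band NIR raster

ndvi <- (nir - red) / (nir + red)     # normalized difference vegetation index
writeRaster(ndvi, "ndvi_25m.tif", overwrite = TRUE)
```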