CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains a collection of 100 randomly generated data points representing the relationship between the number of hours a student spends studying and their corresponding performance, measured as a score. The data has been generated to simulate a real-world scenario where study hours are assumed to influence academic outcomes, making it an excellent resource for linear regression analysis and other machine learning tasks.
Each row in the dataset consists of:
Hours: The number of hours a student dedicates to studying, ranging between 0 and 10 hours.
Scores: The student's performance score, represented as a percentage, ranging from 0 to 100.
Use Cases: This dataset is particularly useful for:
Linear Regression: Exploring how study hours influence student performance, fitting a regression line to predict scores based on study time.
Data Science & Machine Learning: Practicing regression analysis, training models, and applying other predictive algorithms.
Educational Research: Simulating data-driven insights into student behavior and performance metrics.
Features: 100 rows of data. Continuous numerical variables suitable for regression tasks. Generated for educational purposes, making it ideal for students, teachers, and beginners in machine learning and data science.
Potential Applications: Build a linear regression model to predict student scores. Investigate the correlation between study time and performance. Apply data visualization techniques to better understand the data. Use the dataset to experiment with model evaluation metrics like Mean Squared Error (MSE) and R-squared.
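As a quick illustration of the regression workflow the description suggests, the R sketch below fits a simple linear model and computes MSE and R-squared. The file name student_scores.csv and the use of the columns Hours and Scores are assumptions based on the description, not confirmed details of the dataset.
# Minimal sketch: fit a simple linear regression and evaluate it.
# "student_scores.csv" is an assumed file name for the dataset described above.
scores <- read.csv("student_scores.csv")
fit <- lm(Scores ~ Hours, data = scores)
summary(fit)                              # slope, intercept, R-squared
pred <- predict(fit, scores)
mse  <- mean((scores$Scores - pred)^2)    # Mean Squared Error
r2   <- summary(fit)$r.squared            # R-squared
c(MSE = mse, R_squared = r2)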
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
File List: simulations.R (MD5: bdda5503ab6ec0d1374d340d60f562d6)
Description
The attached R script was used to conduct simulation studies of spatial component regression (SCR). The file simulations.R contains all code needed to run the simulations testing SCR performance for three objectives: (1) inference under the null hypothesis; (2) inference when the predictor of interest does have an effect on the outcome; and (3) matrix selection. The code simulates 16 sets of 1000 data sets each. The 16 sets represent all possible combinations of 2 spatial predictor types, 4 autocorrelation types, and 2 effect sizes for the spatial predictor.
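For orientation, the factorial structure described above (2 spatial predictor types x 4 autocorrelation types x 2 effect sizes = 16 cells) can be laid out with expand.grid; the factor labels below are placeholders, not names taken from simulations.R.
# Illustrative sketch of the 16-cell simulation design (2 x 4 x 2 = 16);
# factor labels are placeholders, not taken from simulations.R.
design <- expand.grid(
  predictor_type  = c("type1", "type2"),              # 2 spatial predictor types
  autocorrelation = c("none", "low", "med", "high"),  # 4 autocorrelation types
  effect_size     = c(0, 0.5)                         # 2 effect sizes
)
nrow(design)  # 16 combinations; each would receive 1000 simulated data sets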
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Trade-offs are a fundamental concept in evolutionary biology because they are thought to explain much of nature’s biological diversity, from variation in life-histories to differences in metabolism. Despite the predicted importance of trade-offs, they are notoriously difficult to detect. Here we contribute to the existing rich theoretical literature on trade-offs by examining how the shape of the distribution of resources or metabolites acquired in an allocation pathway influences the strength of trade-offs between traits. We further explore how variation in resource distribution interacts with two aspects of pathway complexity (i.e., the number of branches and hierarchical structure) to affect trade-offs. We simulate variation in the shape of the distribution of a resource by sampling 10^6 individuals from a beta distribution with varying parameters to alter the resource shape. In a simple “Y-model” allocation of resources to two traits, any variation in a resource leads to slopes less than -1, with left-skewed and symmetrical distributions leading to negative relationships between traits, and highly right-skewed distributions associated with positive relationships between traits. Adding more branches further weakens negative and positive relationships between traits, and the hierarchical structure of pathways typically weakens relationships between traits, although in some contexts hierarchical complexity can strengthen positive relationships between traits. Our results further illuminate how variation in the acquisition and allocation of resources, and particularly the shape of a resource distribution and how it interacts with pathway complexity, makes it challenging to detect trade-offs. We offer several practical suggestions on how to detect trade-offs given these challenges.
Methods
Overview of Flux Simulations
To study the strength and direction of trade-offs within a population, we developed a simulation of flux in a simple metabolic pathway, where a precursor metabolite emerging from node A may either be converted to metabolic products B1 or B2 (Fig. 1). This conception of a pathway is similar to De Jong and Van Noordwijk’s Y-model (Van Noordwijk & De Jong, 1986; De Jong & Van Noordwijk, 1992), but we used simulation instead of analytical statistical models to allow us to consider greater complexity in the distribution of variables and pathways. For a simple pathway (Fig. 1), the total flux Jtotal (i.e., the flux at node A, denoted as JA) for each individual (N = 10^6) was first sampled from a predetermined beta distribution as described below. The flux at node B1 (JB1) was then randomly sampled from this distribution with max = Jtotal = JA and min = 0. The flux at the remaining node, B2, was then simply the remaining flux (JB2 = JA - JB1). Simulations of more complex pathways followed the same basic approach as described above, with increased numbers of branches and hierarchical levels added to the pathway as described below under Question 2. The metabolic pathways were simulated using Python (v. 3.8.2) (Van Rossum & Drake Jr., 2009), where we could control the underlying distribution of metabolite allocation. The output flux at nodes B1 and B2 was plotted using R (v. 4.2.1) (R Core Team, 2022), with the resulting trade-off visualized as a linear regression using the ggplot2 R package (v. 3.4.2) (Wickham, 2016). While we have conceptualized the pathway as the flux of metabolites, it could be thought of as any resource being allocated to different traits.
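The original simulations were written in Python; the following R sketch only illustrates the basic Y-model step described above, assuming the flux at B1 is drawn uniformly on [0, Jtotal] (the exact allocation rule in the authors' code may differ).
# Sketch of the simple Y-model allocation (R version of the described Python step);
# the uniform allocation of J_B1 on [0, J_total] is an assumption.
set.seed(1)
n <- 1e6                                   # 10^6 individuals
alpha <- 5; beta <- 0.5                    # example left-skewed resource distribution
J_total <- rbeta(n, alpha, beta)           # total flux at node A
J_B1 <- runif(n, min = 0, max = J_total)   # flux allocated to branch B1
J_B2 <- J_total - J_B1                     # remaining flux goes to B2
coef(lm(J_B2 ~ J_B1))["J_B1"]              # slope of the apparent trade-off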
Question 1: How does variation in resource distribution within a population affect the strength and direction of trade-offs?
We first simulated the simplest scenario where all individuals had the same total flux Jtotal = 1, in which case the phenotypic trade-off is expected to be most easily detected. We then modified this initial scenario to explore how variation in the distribution of resource acquisition (Jtotal) affected the strength and direction of trade-offs. Specifically, the resource distribution was systematically varied by sampling n = 10^3 total flux levels from a beta distribution, which has two parameters, alpha and beta, that control the size and shape of the distribution (Miller & Miller, 1999). When alpha is large and beta is small, the distribution is left skewed, whereas for small alpha and large beta, the distribution is right skewed. Likewise, for alpha = beta, the curve is symmetrical and approximately normal when the parameters are sufficiently large (>2). We can thus systematically vary the underlying resource distribution of a population by iterating through values of alpha and beta from 0.5 to 5 (in increments of 0.5), which was done using the NumPy Python package (v. 1.19.1) (Harris et al., 2020). The resulting slope of each linear regression of the flux at B1 and B2 (i.e., the two branching nodes) was then calculated using the lm function in R and plotted as a contour map using the latticeExtra R package (v. 0.6-30) (Sarkar, 2008).
Question 2: How does the complexity of the pathway used to produce traits affect the strength and direction of trade-offs?
Metabolic pathways are typically more complex than what is described above. Most pathways consist of multiple branch points and multiple hierarchical levels. To understand how complexity affects the ability to detect trade-offs when combined with variation in the distribution of total flux, we systematically manipulated the number of branch points and hierarchical levels within pathways (Fig. 1). We first explored the effect of adding branches to the pathway from the same node, such that instead of only branching off to nodes B1 and B2, the pathway branched to nodes B1 through to Bn (Fig. 1B), where n is the total number of branches (maximum n = 10 branches). Flux at a node was calculated as previously described, and the remaining flux was evenly distributed amongst the remaining nodes (i.e., nodes B2 through to Bn would each receive J2-n = (Jtotal - JB1)/(n - 1) flux). For each pathway, we simulated flux using a beta distribution of Jtotal with alpha = 5, beta = 0.5 to simulate a left-skewed distribution, alpha = beta = 5 to simulate a normal distribution, and alpha = 0.5, beta = 5 to simulate a right-skewed distribution, as well as the simplest case where all individuals have total flux Jtotal = 1. We next considered how adding hierarchical levels to a metabolic pathway affected trade-offs. We modified our initial pathway with node A branching to nodes B1 and B2, and then node B2 further branching to nodes C1 and C2 (Fig. 1C). To compute the flux at the two new nodes C1 and C2, we simply repeated the same calculation as before, but using the flux at node B2, JB2, as the total flux. That is, the flux at node C1 was obtained by randomly sampling from the distribution at B2 with max = JB2 and min = 0, and the flux at node C2 is the remaining flux (JC2 = JB2 - JC1).
Much like in the previous scenario with multiple branch points, we used three beta distributions (with the same parameters as before) to represent left-skewed, normal, and right-skewed resource distributions, as well as the simplest case where Jtotal = 1 for all individuals.
Quantile Regressions
We performed quantile regression to understand whether this approach could help to detect trade-offs. Quantile regression is a form of statistical analysis that fits a curve through upper or lower quantiles of the data to assess whether an independent variable potentially sets a lower or upper limit to a response variable (Cade et al., 1999). This type of analysis is particularly useful when it is thought that an independent variable places a constraint on a response variable, yet variation in the response variable is influenced by many additional factors that add “noise” to the data, making a simple bivariate relationship difficult to detect (Thomson et al., 1996). Quantile regression is an extension of ordinary least squares regression, which regresses the best-fitting line through the 50th percentile of the data. In addition to performing ordinary least squares regression for each pairwise comparison between the four nodes (B1, B2, C1, C2), we performed a series of quantile regressions using the ggplot2 R package (v. 3.4.2), where only the qth quantile was used for the regression (q = 0.99 and 0.95 to 0.5 in increments of 0.05, see Fig. S1) (Cade et al., 1999).
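Continuing from the sketch above, quantile regressions through the chosen quantiles can be illustrated with the quantreg package (which ggplot2's geom_quantile builds on); this is only a hedged approximation of the authors' ggplot2-based analysis.
# Sketch: quantile regressions of J_B2 on J_B1 (variables from the earlier sketch),
# using quantreg directly rather than ggplot2's geom_quantile.
library(quantreg)
taus <- c(0.99, seq(0.95, 0.5, by = -0.05))
fits <- lapply(taus, function(q) rq(J_B2 ~ J_B1, tau = q))
sapply(fits, function(f) coef(f)["J_B1"])   # slope at each quantile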
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
File List: Supplement1.txt (MD5: faff226df1ad4adfa82f2ca1700d5010); Supplement2.txt (MD5: 51eeb647483834e7ea0742d81ffc4372)
Description
Supplement1.txt contains an R script that reads in the college GPA example data from Burnham and Anderson (2002:226), estimates the 16 linear least-squares regression models, computes AICc, AICc weights, variance inflation factors, and partial standard deviations of predictors, standardizes estimates by partial standard deviations, computes model-averaged standardized estimates and their standard errors, and computes the model-averaged ratio of t statistics for unstandardized estimates (equivalent to the model-averaged ratio of standardized estimates). The code is written to be transparent with respect to the mathematical operations rather than for efficiency.
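For readers unfamiliar with the quantities named above, the following R sketch shows one common way to compute AICc and AICc (Akaike) weights for a small set of linear models; it is not the Supplement1.txt code, and the data frame dat with response y and predictors x1 and x2 is hypothetical.
# Sketch (not Supplement1.txt): AICc and Akaike weights for candidate lm fits.
aicc <- function(fit) {
  k <- length(coef(fit)) + 1                 # parameters incl. residual variance
  n <- nobs(fit)
  AIC(fit) + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
}
models <- list(                              # 'dat', y, x1, x2 are hypothetical
  m1 = lm(y ~ x1, data = dat),
  m2 = lm(y ~ x1 + x2, data = dat),
  m3 = lm(y ~ x1 * x2, data = dat)
)
aiccs   <- sapply(models, aicc)
delta   <- aiccs - min(aiccs)
weights <- exp(-0.5 * delta) / sum(exp(-0.5 * delta))   # AICc weights
round(cbind(AICc = aiccs, delta = delta, weight = weights), 3)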
Supplement2.txt contains an R script that generates the simulations in Appendix B for multi-part compositional predictors within a zero-truncated Poisson regression count model, similar to the breeding sage-grouse count model of Rice et al. (2013).
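A zero-truncated Poisson count can be simulated in base R with an inverse-CDF draw restricted to values of at least 1; the sketch below is only illustrative and is not the Supplement2.txt code.
# Illustrative sketch (not Supplement2.txt): simulate zero-truncated Poisson counts
# from a log-linear predictor built on a simple two-part composition.
set.seed(123)
n <- 500
p1 <- runif(n)                        # proportion of habitat type 1 (hypothetical)
p2 <- 1 - p1                          # two-part composition sums to 1
lambda <- exp(1 + 0.8 * p1)           # mean of the untruncated Poisson
u <- runif(n, min = ppois(0, lambda), max = 1)   # restrict CDF to y >= 1
y <- qpois(u, lambda)                 # zero-truncated Poisson draws
table(y == 0)                         # all FALSE: zeros are excluded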
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes initially. The graph used to simulate Z and Y contained all nodes. Nodes with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing. “Perfect” indicates calculating from the graph without changing edges between remaining nodes. “Mismatch” indicates the percentage of direct edges between remaining nodes that were incorrect. “Complete Mismatch” indicates 100% mismatch.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Conventional genome-wide association studies (GWAS) have been proven to be a successful strategy for identifying genetic variants associated with complex human traits. However, there is still a large heritability gap between GWAS and traditional family studies. The “missing heritability” has been suggested to be due to a lack of studies focused on epistasis, also called gene–gene interactions, because individual trials have often had insufficient sample size. Meta-analysis is a common method for increasing statistical power. However, sufficiently detailed information is difficult to obtain. A previous study employed a meta-regression-based method to detect epistasis, but it faced the challenge of inconsistent estimates. Here, we describe a Markov chain Monte Carlo-based method, called “Epistasis Test in Meta-Analysis” (ETMA), which uses genotype summary data to obtain consistent estimates of epistasis effects in meta-analysis. We defined a series of conditions to generate simulation data and tested the power and type I error rates of ETMA, individual data analysis and the conventional meta-regression-based method. ETMA not only successfully facilitated consistency of evidence but also yielded acceptable type I error and higher power than conventional meta-regression. We applied ETMA to three real meta-analysis data sets. We found significant gene–gene interactions in the renin–angiotensin system and the polycyclic aromatic hydrocarbon metabolism pathway, with strong supporting evidence. In addition, glutathione S-transferase (GST) mu 1 and theta 1 were confirmed to exert independent effects on cancer. We concluded that the application of ETMA to real meta-analysis data was successful. Finally, we developed an R package, etma, for the detection of epistasis in meta-analysis [etma is available via the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/etma/index.html].
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.
This dataset was generated using R.
# Set seed for reproducibility
set.seed(42)
# Define number of observations (students)
n <- 5000
# Generate study hours (independent variable)
# Uniform distribution between 0 and 12 hours
study_hours <- runif(n, min = 0, max = 12)
# Create relationship between study hours and grade
# Base grade: 40 points
# Each study hour adds an average of 5 points
# Add normal noise (standard deviation = 10)
theoretical_grade <- 40 + 5 * study_hours
# Add normal noise to make it realistic
noise <- rnorm(n, mean = 0, sd = 10)
# Calculate final grade
grade <- theoretical_grade + noise
# Limit grades between 0 and 100
grade <- pmin(pmax(grade, 0), 100)
# Create the dataframe
dataset <- data.frame(
  student_id = 1:n,
  study_hours = round(study_hours, 2),
  grade = round(grade, 2)
)
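A quick check of the generated data: a linear fit should roughly recover the generating intercept (about 40) and slope (about 5), with some bias introduced by clipping grades to [0, 100]. The export step is optional and the file name is only a suggestion.
# Quick check (sketch): the fit should roughly recover intercept ~ 40 and slope ~ 5;
# clipping grades to [0, 100] biases this slightly.
fit <- lm(grade ~ study_hours, data = dataset)
coef(fit)
summary(fit)$r.squared
# Optionally export the synthetic data
# write.csv(dataset, "student_performance.csv", row.names = FALSE)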
This data release contains the model inputs, outputs, and source code (written in R) for the boosted regression tree (BRT) and artificial neural network (ANN) models developed for four sites in Upper Klamath Lake, which were used to simulate daily maximum pH and daily minimum dissolved oxygen (DO) from May 18th to October 4th in 2005-12 and 2015-19, and to evaluate variable effects and their importance. Simulations were not developed for 2013 and 2014 due to a large amount of missing meteorological data. The sites included: 1) Williamson River (WMR), which was located in the northern portion of the lake near the mouth of the Williamson River and had a depth between 0.7 and 2.9 meters; 2) Rattlesnake Point (RPT), which was located near the southern portion of the lake and had a depth between 1.9 and 3.4 meters; 3) Mid-North (MDN), which was located in the northwest portion of the lake and had a depth between 2.4 and 4.2 meters; 4) Mid-Trench (MDT), which was located in the trench that runs along the western portion of the lake and had a depth between 13.2 and 15 meters.
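As a hedged illustration of how a BRT like those in this release might be fit in R, the sketch below uses the gbm package; the variable names, data frames, and tuning values are placeholders, not the release's actual inputs or settings.
# Minimal BRT sketch with gbm; max_pH, water_temp, wind_speed, chl_a and the
# training/test data frames are placeholders, not the release's actual inputs.
library(gbm)
brt <- gbm(
  max_pH ~ water_temp + wind_speed + chl_a,
  data = training_data,              # hypothetical training data frame
  distribution = "gaussian",
  n.trees = 2000,
  interaction.depth = 3,
  shrinkage = 0.01,
  bag.fraction = 0.5
)
summary(brt)                                          # relative variable influence
pred <- predict(brt, newdata = test_data, n.trees = 2000)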
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
“Perfect” indicates calculating from the graph used to generate the data. “Mismatch” indicates the percentage of direct edges that were incorrect. Error rates were calculated from score tests on 1000 simulated data sets. All simulations used graphs with 15, 30, or 45 nodes. “Complete Mismatch” indicates 100% mismatch.
CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically compared with empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.
Methods
Empirical trees used in the study are trees from the literature, collected by TimeTree (timetree.org).
1. Run Tree_Selection.R to select the empirical phylogenetic trees to be included from TimeTree. The output file final_timetrees.RData contains the final subset of empirical phylogenetic TimeTree trees used for analysis, with anonymized tip labels.
2. Run Simulation_And_Analysis.R to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees (see the sketch after this list). Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. Run Supplementary_Fig_S1_Analysis.R to generate data for Supplementary Figure S1.
3. Run Meta_analysis.R to run the linear regression models to investigate the role of the prior/analysis type for the subset (n=300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3).
4. Run Imbalance_Simulation.R to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: To avoid later confusion, the three substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code. There is therefore no Rate 1; apologies in advance for any confusion. The shell scripts to run the inferences in each software are as follows:
4a. RevBayes: fasta_to_revbayes_code_rate2.sh, fasta_to_revbayes_code_rate3.sh, fasta_to_revbayes_code_rate4.sh. These shell scripts use the following .Rev files: MCMC_Revbayes_code_rate2.Rev, MCMC_Revbayes_code_rate3.Rev, MCMC_Revbayes_code_rate4.Rev, and rely on the following supplementary .Rev files: tree_BD.Rev, sub_JC.Rev, clock_global.Rev.
4b. IQ-TREE 2: fasta_to_iqtree_code.sh
5. Run Final_Figures.R to visualize the results.
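As referenced in step 2, a minimal sketch of simulating a constant-rate birth-death tree and computing a Colless-type imbalance score in R (using the ape package) is given below; it is illustrative only and is not taken from the repository scripts, which use their own index calculations.
# Sketch (not the repository code): simulate a constant-rate birth-death tree
# with ape and compute a Colless imbalance score from ape::balance().
library(ape)
set.seed(7)
tr <- rbdtree(birth = 1, death = 0.5, Tmax = 5)   # crBD simulation (rates are examples)
b  <- balance(tr)                                 # descendants on each side of every node
colless <- sum(abs(b[, 1] - b[, 2]))              # Colless imbalance
colless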
MIT License (https://opensource.org/licenses/MIT)
License information was derived automatically
Contains behavioural raw data, fMRI statistical maps and analysis scripts to reproduce the results of the study.
Description
Data
behaviour: Raw behavioural data of all 40 participants.
fmri: fMRI statistical maps underlying Figure 4, Figure 5 and Tables S2-S5. Contains individual contrast images for all 40 participants and group level unthresholded T-maps. Also contains masks of the dorsal anterior cingulate cortex (dACC) and the dorsolateral prefrontal cortex (dlPFC), used for small volume correction.
model: Contains files with posterior samples and summary statistics of the three computational models Planning (PM), Simple (SM) and Hybrid (HM) and the reaction time analysis. Contains posterior predictive simulations for HM. Also contains the leave-one-out information criterion (LOOIC) for the PM, SM and HM used for model comparison.
Code
generate_figure2.py: Generates Figure 2.
generate_figure3.py: Generates Figure 3.
generate_S2_figure.py: Generates S2 Figure.
generate_S1_table.R: Logistic regression of choice against task features underlying S1 table.
fit_choice_models.ipynb: Jupyter notebook implementing model fitting and computation of LOOIC for the three models PM, SM and HM.
fit_RT_model.ipynb: Jupyter notebook implementing hierarchical Bayesian linear regression of response times against conflict and context.
parameter_recovery.ipynb: Jupyter notebook implementing parameter recovery analysis for model HM.
backward_induction.py: Function used to compute expected long-term values via backward induction.
simulate_agents.py: Simulate behavior of agents using a planning, simple, hybrid or random strategy.
Standard indexes of poverty and deprivation are rarely sensitive to how the causes and consequences of deprivation have different impacts depending upon where a person lives. More geographically minded approaches are alert to spatial variations but are also difficult to compute using desktop PCs.
The aim of the ESRC-sponsored project was to develop a method of spatial analysis known as ‘geographically weighted regression’ (GWR) to run in the high-power computing environment offered by ‘Grid computation’ and e-social science. GWR, like many other methods of spatial analysis, is characterised by multiple repeat testing, as the data are divided into geographical regions and also randomly redistributed many times to simulate the likelihood that the results obtained from the analysis are actually due to chance. Each of these tests requires computer time, so, given a large dataset such as the UK Census statistics, running the analysis on a standard machine can take a long time! Fortunately, the computational grid is not standard but offers the possibility to speed up the process by running GWR’s sequences of calibration, analysis and non-parametric simulation in parallel.
An output is a model of the geographically varying correlates of car non-ownership fitted for the 165,665 Census Output Areas in England. Specifically, a geographically weighted regression of the relationship between the proportion of households without a car (or van) in 2001 (the dependent variable), and the following predictor variables: proportion of persons of working age unemployed; proportion of households in public housing; proportion of households that are lone parent households; proportion of persons 16 or above that are single; and proportion of persons that are white British.
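A hedged sketch of how such a geographically weighted regression could be fit on a single machine with the spgwr package is shown below; the data frame oa, its column names, and the desktop workflow are assumptions and do not reflect the project's Grid-enabled implementation.
# Sketch: GWR of car non-ownership with spgwr; 'oa' and its columns are placeholders.
library(spgwr)
coords <- cbind(oa$easting, oa$northing)                  # National Grid references
form <- no_car ~ unemployed + public_housing + lone_parent + single + white_british
bw  <- gwr.sel(form, data = oa, coords = coords)          # cross-validated bandwidth
fit <- gwr(form, data = oa, coords = coords, bandwidth = bw)
fit$SDF                                                   # local coefficients per output area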
Note - the file does not contain Census 2001 data, only National Grid references and regression coefficients.
Further information is available from the Grid Enabled Spatial Regression Models (With Application to Deprivation Indices) web page.
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of medium edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. Medium edge density graphs are created by giving any 2 nodes without a direct edge between them a 5% chance of becoming directly connected. This creates graphs with an average edge density of 0.18, 0.12, and 0.09 for graphs with 15, 30, and 45 nodes, respectively. “Perfect” indicates calculating from the graph without changing remaining edges. “Mismatch” indicates the percentage of remaining direct edges that were incorrect. “Complete Mismatch” indicates 100% mismatch.
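The graph-generation recipe can be illustrated with the igraph package: a Barabasi-Albert (preferential attachment) graph serves as the low-density graph, and each currently unconnected pair of nodes is then given a 5% chance of gaining an edge to form the medium-density graph. The specific igraph parameters below are assumptions chosen to roughly match the stated densities.
# Illustrative sketch: low-density Barabasi-Albert graph, then 5% edge additions
# between unconnected pairs to create the medium-density graph.
library(igraph)
set.seed(42)
n <- 30
g_low <- sample_pa(n, power = 1, m = 1, directed = FALSE)   # ~0.07 density for 30 nodes
g_med <- g_low
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    if (!are_adjacent(g_med, i, j) && runif(1) < 0.05) {
      g_med <- add_edges(g_med, c(i, j))                    # 5% chance of a new edge
    }
  }
}
edge_density(g_low); edge_density(g_med)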
One of the main purposes of the ICARIA project was to develop coherent, reliable and usable downscaled climate projections from the latest CMIP6 ensemble in order to provide the basis for efficient support to climate adaptation and decision-making by the related stakeholders, supporting the adaptation of critical assets within the project. These projections were also intended to be freely available for further use in subsequent studies and, hence, to foster adaptation to climate change in other areas. ICARIA’s climate information is therefore already based on CMIP6 models and incorporates the current SSPs in its workflow. The high-resolution future climate projections presented here form a unique dataset, obtained from a high-quality and high-density set of weather observations that are then interpolated to the case studies of interest on a 100x100 m resolution grid, which is the main outcome offered in this publication. These models provide the scenarios to be considered within the Risk Assessment and the design and development of all adaptation measures coming out of ICARIA.
For further details, find here a brief of the methodology followed:
The statistical downscaling methodology applied in ICARIA by FIC, named FICLIMA (Ribalaygua et al. 2013), consists of a two-step analogue/regression statistical method which has been used in national and international projects with good verification results (e.g., Monjo et al. 2016). The first step is common to all simulated climate variables and is based on an analogue stratification (Zorita et al. 1993). The analogue method rests on the hypothesis that ‘analogue’ atmospheric patterns (predictors) should cause analogue local effects (predictands); in practice, the n days most similar to the day to be downscaled were selected. The similarity between any two days was measured according to three nested synoptic windows (with different weights) and four large-scale fields, using a pseudo-Euclidean distance between the large-scale fields used as predictors. For each predictor, the weighted Euclidean distance was calculated and standardised by substituting it with the closest percentile of a reference population of weighted Euclidean distances for that predictor. This method reproduces nonlinear relationships between predictors and predictands well, but it cannot simulate values outside the range of observed values. In order to overcome this problem and obtain a better simulation, a second step was required.
For this second step, the procedures applied depend on the variable of interest. To determine the temperature, multiple linear regression analysis for the selected number of most analogous days was performed for each station and for each problem day. From a group of potential predictors, the linear regression selected those with the highest correlation, using a forward and backward stepwise approach.
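A minimal sketch of the forward-and-backward stepwise selection described for temperature is shown below, using base R's step(); the data frame analog_days and its columns are placeholders rather than FICLIMA's actual predictor set.
# Sketch: forward-and-backward stepwise regression over the analogue days;
# 'analog_days' and its temp/predictor columns are hypothetical.
full <- lm(temp ~ ., data = analog_days)     # all candidate predictors
null <- lm(temp ~ 1, data = analog_days)
sel  <- step(null,
             scope = list(lower = null, upper = full),
             direction = "both", trace = FALSE)
summary(sel)        # regression retained for this station and problem day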
For precipitation, a group of m problem days (we use the whole days of a month) is downscaled. For each problem day we obtain a “preliminary precipitation amount” averaging the rain amount of its n most analogous days, so we can sort the m problem days from the highest to the lowest “preliminary precipitation amount”. For assigning the final precipitation amount, all amounts of the m×n analogous days are sorted and clustered in m groups. Every quantity is finally assigned, orderly, to the m days previously sorted by the “preliminary precipitation amount”.
For wind or relative humidity, the second step is a transfer function between the observed probability distribution and the simulated one using the averaged values from the n = 30 analogous days. Particularly, a parametric bias correction was performed to the time series obtained from the analogue stratification (first step). In order to estimate the improvement of this procedure, the bias correction was also applied to the direct model outputs.
This second step, performed at a daily scale with a thorough internal verification procedure, is essential and is the main differentiating feature of the FICLIMA method. It extends beyond mean values to include extremes and covers all time scales, including daily intervals. The verification shows whether the method correctly simulates changes from one day to the next, indicating an effective capture of the underlying physical connections between predictors and predictands. These physical links remain relatively consistent, even in the face of climate change (as opposed to purely empirical relationships that might shift). In essence, this approach theoretically addresses the primary challenge in statistical downscaling, known as the non-stationarity problem: whether predictor/predictand relationships established in the past will persist in the future.
The dataset shared here includes information for the three case studies tackled in ICARIA: Barcelona Metropolitan Area (AMB), Salzburg Region (SLZ), and South Aegean Region (SAR). The information provided covers data and outcomes from 10 models belonging to CMIP6. Each model has a historical archive, from 01/01/1950 to 31/12/2014, and 4 future scenarios (ssp126, ssp245, ssp370 and ssp585) ranging from 01/01/2015 to 31/12/2100. The selected models are detailed in the following table:
Table 1. Information about the 10 climate models belonging to the sixth phase of the Coupled Model Intercomparison Project (CMIP6), corresponding to the IPCC AR6. Models were retrieved from the Earth System Grid Federation (ESGF) portal in support of the Program for Climate Model Diagnosis and Intercomparison (PCMDI).
CMIP6 MODELS | Resolution | Responsible Centre | References
ACCESS-CM2 | 1.875° x 1.250° | Australian Community Climate and Earth System Simulator (ACCESS), Australia | Bi, D. et al. (2020)
BCC-CSM2-MR | 1.125° x 1.121° | Beijing Climate Center (BCC), China Meteorological Administration, China | Wu, T. et al. (2019)
CanESM5 | 2.812° x 2.790° | Canadian Centre for Climate Modeling and Analysis (CC-CMA), Canada | Swart, N.C. et al. (2019)
CMCC-ESM2 | 1.000° x 1.000° | Centro Mediterraneo sui Cambiamenti Climatici (CMCC) | Cherchi et al. (2018)
CNRM-ESM2-1 | 1.406° x 1.401° | CNRM (Centre National de Recherches Meteorologiques), Meteo-France, France | Seferian, R. (2019)
EC-EARTH3 | 0.703° x 0.702° | EC-EARTH Consortium | EC-Earth Consortium (2019)
MPI-ESM1-2-HR | 0.938° x 0.935° | Max-Planck Institute for Meteorology (MPI-M), Germany | Müller et al. (2018)
MRI-ESM2-0 | 1.125° x 1.121° | Meteorological Research Institute (MRI), Japan | Yukimoto, S. et al. (2019)
NorESM2-MM | 1.250° x 0.942° | Norwegian Climate Centre (NCC), Norway | Bentsen, M. et al. (2019)
UKESM1-0-LL | 1.875° x 1.250° | UK Met Office, Hadley Centre, United Kingdom | Good, P. et al. (2019)
The climate projections have been developed over each of the observational locations that were retrieved to run the statistical downscaling. The results from these projections have been spatially interpolated onto a 100x100 m grid with a multiple linear regression model considering diverse adjustments and topographic corrections. The results presented here are the median of the 10 models used, obtained for each of the 4 SSPs and each of the time periods considered in ICARIA until the year 2100. The variables treated are the main climate variables and their related extreme indicators as defined during the ICARIA project. A summary table of all the variables and indicators used to develop the projections is given below. Table 2. Summary of selected thermal and precipitation indicators, grouped according to the main hazards they feed into. “nd” = number of days; “ne” = number of events.
Index/name | Short description | Source | Variable | Units | Threshold
Thermal indicators
TX90 / TX10 | Warm/cold days | Zhang et al. (2011) | TX | nd | 90 / 10%
HD | Heat day | ICARIA | TX | nd | 30 °C
EHD | Extreme heat day | ICARIA | TX | nd | 35 °C
TR | Tropical nights | Zhang et al. (2011) | TN | nd | 20 °C
EQ | Equatorial nights | AEMet 2020, ICARIA | TN | nd | 25 °C
IN | Infernal nights | ICARIA | TN | nd | 30 °C
FD | Frost days | Zhang et al. (2011) | TN | nd | < 0 °C
Max consec | Max spell length for the above thermal indicators | ICARIA | - | nd | -
Nº events | Number of events of the above thermal indicators | ICARIA | - | ne | 3 days
TXm | Mean maximum temperatures | ICARIA | TX | °C | -
TNm | Mean minimum temperatures | ICARIA | TN | °C | -
TM | Mean temperatures | ICARIA | TA | °C | -
HWle | Heatwave length | ICARIA | TX | nd | 3d > 95% TX
HWim / HWix | Mean and maximum heatwave intensity | ICARIA | TX | °C | 3d > 95% TX
HWf | Heatwave frequency | ICARIA | TX | ne | 3d > 95% TX
HWd | Heatwave days | ICARIA | TX | nd | 3d > 95% TX
HI - P90 | Heat Index (90th percentile) | NWS (1994) | TX, RH | °C | TX > 27 °C, RH > 40%
UTCI | Universal Thermal Climate Index | Bröde et al. (2012) | TA, RH, W | - | -
UHI | Urban heat island (BCN), annual and seasonal | AMB, Metrobs 2015 | T | °C | TM1-TM2 > 0 °C
Precipitation indicators
R20 | Number of heavy precipitation days | Zhang et al. (2011) | P | nd | 20 mm
R50, R100 | Days with extreme heavy rain | AMB et al. (2017) | P | nd | 50 mm / 100 mm
Ra | Yearly and seasonal rainfall relative change | ICARIA | P | mm | ≥ 0.1 mm
IDF - CCF | IDF Curves - Climate Change Factor | Arnbjerg-Nielsen (2012) | P | - | ≥ 0.1 mm
Forest fire
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Numerical data used to generate all graphs and figures.