15 datasets found
  1. Study Hours ,Student Scores for Linear Regression

    • kaggle.com
    Updated Sep 23, 2024
    Cite
    douaa bennoune (2024). Study Hours ,Student Scores for Linear Regression [Dataset]. https://www.kaggle.com/datasets/douaabennoune/study-hours-student-scores-for-linear-regression
    Explore at:
    Croissant — a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 23, 2024
    Dataset provided by
    Kaggle
    Authors
    douaa bennoune
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a collection of 100 randomly generated data points representing the relationship between the number of hours a student spends studying and their corresponding performance, measured as a score. The data has been generated to simulate a real-world scenario where study hours are assumed to influence academic outcomes, making it an excellent resource for linear regression analysis and other machine learning tasks.

    Each row in the dataset consists of:

    • Hours: The number of hours a student dedicates to studying, ranging between 0 and 10 hours.
    • Scores: The student's performance score, represented as a percentage, ranging from 0 to 100.

    Use Cases: This dataset is particularly useful for:

    • Linear Regression: Exploring how study hours influence student performance, fitting a regression line to predict scores based on study time.
    • Data Science & Machine Learning: Practicing regression analysis, training models, and applying other predictive algorithms.
    • Educational Research: Simulating data-driven insights into student behavior and performance metrics.

    Features: 100 rows of data; continuous numerical variables suitable for regression tasks; generated for educational purposes, making it ideal for students, teachers, and beginners in machine learning and data science.

    Potential Applications:

    • Build a linear regression model to predict student scores.
    • Investigate the correlation between study time and performance.
    • Apply data visualization techniques to better understand the data.
    • Experiment with model evaluation metrics like Mean Squared Error (MSE) and R-squared (see the sketch below).
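
    A minimal R sketch of the intended use, assuming the CSV's two columns are named Hours and Scores as in the description (the actual file and column names may differ):

    # Load the dataset (file name assumed; adjust to the downloaded file).
    scores <- read.csv("study_hours_scores.csv")

    # Fit the regression line: Scores ~ Hours.
    fit <- lm(Scores ~ Hours, data = scores)
    summary(fit)            # slope, intercept, R-squared

    # Mean Squared Error of the fitted line.
    mean(residuals(fit)^2)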

  2. Supplement 1. R code used to perform simulations.

    • wiley.figshare.com
    html
    Updated Jun 1, 2023
    Cite
    Sarah Emerson; Charlotte Wickham; Kenneth J. Ruzicka Jr. (2023). Supplement 1. R code used to perform simulations. [Dataset]. http://doi.org/10.6084/m9.figshare.3562644.v1
    Explore at:
    html (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Sarah Emerson; Charlotte Wickham; Kenneth J. Ruzicka Jr.
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    File List simulations.R (MD5: bdda5503ab6ec0d1374d340d60f562d6)

      Description
        This comment used the attached R script to conduct simulation studies of spatial component regression (SCR). The file simulations.R contains all code needed to run the simulations testing SCR performance for three objectives: (1) inference under the null hypothesis; (2) inference when the predictor of interest does have an effect on the outcome; and (3) matrix selection. The code will simulate 16 sets of 1000 data sets each; the 16 sets represent all possible combinations of 2 spatial predictor types, 4 autocorrelation types, and 2 effect sizes for the spatial predictor.
    
  3. Data from: The improbability of detecting trade-offs and some practical...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Jul 19, 2024
    Cite
    Marc Johnson (2024). The improbability of detecting trade-offs and some practical solutions [Dataset]. http://doi.org/10.5061/dryad.xpnvx0kq5
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    University of Toronto
    Authors
    Marc Johnson
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Trade-offs are a fundamental concept in evolutionary biology because they are thought to explain much of nature’s biological diversity, from variation in life histories to differences in metabolism. Despite their predicted importance, trade-offs are notoriously difficult to detect. Here we contribute to the rich theoretical literature on trade-offs by examining how the shape of the distribution of resources or metabolites acquired in an allocation pathway influences the strength of trade-offs between traits. We further explore how variation in resource distribution interacts with two aspects of pathway complexity (i.e., the number of branches and hierarchical structure) to affect trade-offs. We simulate variation in the shape of the distribution of a resource by sampling 10^6 individuals from a beta distribution with varying parameters to alter the resource shape. In a simple “Y-model” allocation of resources to two traits, any variation in a resource leads to slopes less steep than -1, with left-skewed and symmetrical distributions leading to negative relationships between traits, and highly right-skewed distributions associated with positive relationships between traits. Adding more branches further weakens negative and positive relationships between traits, and the hierarchical structure of pathways typically weakens relationships between traits, although in some contexts hierarchical complexity can strengthen positive relationships. Our results further illuminate how variation in the acquisition and allocation of resources, and particularly the shape of a resource distribution and how it interacts with pathway complexity, makes it challenging to detect trade-offs. We offer several practical suggestions on how to detect trade-offs given these challenges.

    Methods

    Overview of Flux Simulations

    To study the strength and direction of trade-offs within a population, we developed a simulation of flux in a simple metabolic pathway, where a precursor metabolite emerging from node A may be converted to either metabolic product B1 or B2 (Fig. 1). This conception of a pathway is similar to De Jong and Van Noordwijk’s Y-model (Van Noordwijk & De Jong, 1986; De Jong & Van Noordwijk, 1992), but we used simulation instead of analytical statistical models, which let us consider greater complexity in the distributions of variables and pathways. For a simple pathway (Fig. 1), the total flux Jtotal (i.e., the flux at node A, denoted JA) for each individual (N = 10^6) was first sampled from a predetermined beta distribution as described below. The flux at node B1 (JB1) was then randomly sampled from this distribution with max = Jtotal = JA and min = 0. The flux at the remaining node, B2, was simply the remaining flux (JB2 = JA - JB1). Simulations of more complex pathways followed the same basic approach, with increased numbers of branches and hierarchical levels added to the pathway as described below under Question 2. The metabolic pathways were simulated using Python (v. 3.8.2) (Van Rossum & Drake Jr., 2009), where we could control the underlying distribution of metabolite allocation. The output flux at nodes B1 and B2 was plotted using R (v. 4.2.1) (R Core Team, 2022), with the resulting trade-off visualized as a linear regression using the ggplot2 R package (v. 3.4.2) (Wickham, 2016). While we have conceptualized the pathway as the flux of metabolites, it could be thought of as any resource being allocated to different traits.

    Question 1: How does variation in resource distribution within a population affect the strength and direction of trade-offs?

    We first simulated the simplest scenario, in which all individuals had the same total flux Jtotal = 1 and the phenotypic trade-off is expected to be most easily detected. We then modified this initial scenario to explore how variation in the distribution of resource acquisition (Jtotal) affected the strength and direction of trade-offs. Specifically, the resource distribution was systematically varied by sampling n = 10^3 total flux levels from a beta distribution, which has two parameters, alpha and beta, that control the size and shape of the distribution (Miller & Miller, 1999). When alpha is large and beta is small, the distribution is left-skewed, whereas for small alpha and large beta, the distribution is right-skewed. For alpha = beta, the curve is symmetrical and approximately normal when the parameters are sufficiently large (>2). We can thus systematically vary the underlying resource distribution of a population by iterating through values of alpha and beta from 0.5 to 5 (in increments of 0.5), which was done using the NumPy Python package (v. 1.19.1) (Harris et al., 2020). The slope of each linear regression of the flux at B1 and B2 (i.e., the two branching nodes) was then calculated using the lm function in R and plotted as a contour map using the latticeExtra R package (v. 0.6-30) (Sarkar, 2008).

    Question 2: How does the complexity of the pathway used to produce traits affect the strength and direction of trade-offs?

    Metabolic pathways are typically more complex than described above: most consist of multiple branch points and multiple hierarchical levels. To understand how complexity affects the ability to detect trade-offs when combined with variation in the distribution of total flux, we systematically manipulated the number of branch points and hierarchical levels within pathways (Fig. 1). We first explored the effect of adding branches to the pathway from the same node, such that instead of branching only to nodes B1 and B2, the pathway branched to nodes B1 through Bn (Fig. 1B), where n is the total number of branches (maximum n = 10). Flux at a node was calculated as previously described, and the remaining flux was evenly distributed among the remaining nodes (i.e., nodes B2 through Bn would each receive J2-n = (Jtotal - JB1)/(n - 1) flux). For each pathway, we simulated flux using a beta distribution of Jtotal with alpha = 5, beta = 0.5 (left-skewed), alpha = beta = 5 (approximately normal), and alpha = 0.5, beta = 5 (right-skewed), as well as the simplest case where all individuals have total flux Jtotal = 1. We next considered how adding hierarchical levels to a metabolic pathway affected trade-offs. We modified our initial pathway so that node A branched to nodes B1 and B2, and node B2 further branched to nodes C1 and C2 (Fig. 1C). To compute the flux at the two new nodes C1 and C2, we repeated the same calculation as before, but using the flux at node B2, JB2, as the total flux. That is, the flux at node C1 was obtained by randomly sampling from the distribution at B2 with max = JB2 and min = 0, and the flux at node C2 was the remaining flux (JC2 = JB2 - JC1). Much like in the previous scenario with multiple branch points, we used three beta distributions (with the same parameters as before) to represent left-skewed, approximately normal, and right-skewed resource distributions, as well as the simplest case where Jtotal = 1 for all individuals.

    Quantile Regressions

    We performed quantile regression to understand whether this approach could help detect trade-offs. Quantile regression fits a curve through upper or lower quantiles of the data to assess whether an independent variable potentially sets a lower or upper limit on a response variable (Cade et al., 1999). This type of analysis is particularly useful when an independent variable is thought to place a constraint on a response variable, yet variation in the response variable is influenced by many additional factors that add “noise” to the data, making a simple bivariate relationship difficult to detect (Thomson et al., 1996). Quantile regression is an extension of ordinary least squares regression, which regresses the best-fitting line through the 50th percentile of the data. In addition to performing ordinary least squares regression for each pairwise comparison between the four nodes (B1, B2, C1, C2), we performed a series of quantile regressions using the ggplot2 R package (v. 3.4.2), where only the qth quantile was used for the regression (q = 0.99 and 0.95 to 0.5 in increments of 0.05; see Fig. S1) (Cade et al., 1999).
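
    A minimal R sketch of the Y-model simulation and a quantile-regression check as described above (the study ran its simulations in Python and fitted quantiles via ggplot2; quantreg is swapped in here, and the split of Jtotal is read as uniform on [0, Jtotal]):

    set.seed(1)
    n <- 1e4                                      # far smaller than the study's 10^6
    Jtotal <- rbeta(n, shape1 = 0.5, shape2 = 5)  # right-skewed resource distribution
    JB1 <- runif(n, min = 0, max = Jtotal)        # flux allocated to trait B1
    JB2 <- Jtotal - JB1                           # remaining flux goes to trait B2

    # Slope of the trait-trait relationship (a trade-off if negative).
    coef(lm(JB2 ~ JB1))["JB1"]

    # Upper-quantile regression (q = 0.95) to probe a constraint boundary.
    library(quantreg)
    coef(rq(JB2 ~ JB1, tau = 0.95))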

  4. Supplement 1. R code to compute model-averaged regression coefficients for...

    • wiley.figshare.com
    html
    Updated Jun 2, 2023
    Cite
    Brian S. Cade (2023). Supplement 1. R code to compute model-averaged regression coefficients for Burnham and Anderson (2002) college gpa example and to simulate multi-part compositional predictors and model-averaged estimates similar to Rice et al (2013) zero-truncated Poisson regression count model for Greater Sage-Grouse. [Dataset]. http://doi.org/10.6084/m9.figshare.3562905.v1
    Explore at:
    html (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Brian S. Cade
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    File List Supplement1.txt (MD5: faff226df1ad4adfa82f2ca1700d5010) Supplement2.txt (MD5: 51eeb647483834e7ea0742d81ffc4372)

      Description
        Supplement1.txt contains an R script that reads in the college GPA example data from Burnham and Anderson (2002:226), estimates the 16 linear least-squares regression models, computes AICc, AICc weights, variance inflation factors, and partial standard deviations of predictors, standardizes estimates by partial standard deviations, computes model-averaged standardized estimates and their standard errors, and computes the model-averaged ratio of t statistics for unstandardized estimates (equivalent to the model-averaged ratio of standardized estimates). The code is written to be transparent with respect to the mathematical operations rather than for efficiency.
    
        Supplement2.txt contains an R script that generates the simulations in Appendix B for multi-part compositional predictors within a zero-truncated Poisson regression count model similar to the breeding sage-grouse count model of Rice et al. (2013).
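
    A minimal sketch of the AICc-weight computation described above (not the supplement's code; the data frame gpa and predictors x1, x2 are hypothetical stand-ins for the GPA example):

    # AICc for a fitted lm: AIC plus the small-sample correction term.
    aicc <- function(fit) {
      k <- length(coef(fit)) + 1   # parameters, incl. residual variance
      n <- nobs(fit)
      AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
    }

    fits <- list(m1  = lm(y ~ x1, data = gpa),
                 m2  = lm(y ~ x2, data = gpa),
                 m12 = lm(y ~ x1 + x2, data = gpa))

    aiccs <- sapply(fits, aicc)
    delta <- aiccs - min(aiccs)                   # AICc differences
    w <- exp(-delta / 2) / sum(exp(-delta / 2))   # Akaike weights
    round(w, 3)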
    
  5. Type 1 error rates using pathways with dropped nodes.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    + more versions
    Cite
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh (2023). Type 1 error rates using pathways with dropped nodes. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008986.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes initially. The graph used to simulate Z and Y contained all nodes. Nodes with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing. “Perfect” indicates calculating from the graph without changing edges between remaining nodes. “Mismatch” indicates the percentage of direct edges between remaining nodes that were incorrect. “Complete Mismatch” indicates 100% mismatch.

  6. Epistasis Test in Meta-Analysis: A Multi-Parameter Markov Chain Monte Carlo...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chin Lin; Chi-Ming Chu; Sui-Lung Su (2023). Epistasis Test in Meta-Analysis: A Multi-Parameter Markov Chain Monte Carlo Model for Consistency of Evidence [Dataset]. http://doi.org/10.1371/journal.pone.0152891
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Chin Lin; Chi-Ming Chu; Sui-Lung Su
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Conventional genome-wide association studies (GWAS) have proven to be a successful strategy for identifying genetic variants associated with complex human traits. However, there is still a large heritability gap between GWAS and traditional family studies. This “missing heritability” has been suggested to be due to a lack of studies focused on epistasis, also called gene–gene interactions, because individual trials have often had insufficient sample size. Meta-analysis is a common method for increasing statistical power, but sufficiently detailed information is difficult to obtain. A previous study employed a meta-regression-based method to detect epistasis, but it faced the challenge of inconsistent estimates. Here, we describe a Markov chain Monte Carlo-based method, called “Epistasis Test in Meta-Analysis” (ETMA), which uses genotype summary data to obtain consistent estimates of epistasis effects in meta-analysis. We defined a series of conditions to generate simulation data and tested the power and type I error rates of ETMA, individual data analysis, and the conventional meta-regression-based method. ETMA not only successfully facilitated consistency of evidence but also yielded acceptable type I error and higher power than conventional meta-regression. We applied ETMA to three real meta-analysis data sets. We found significant gene–gene interactions in the renin–angiotensin system and the polycyclic aromatic hydrocarbon metabolism pathway, with strong supporting evidence. In addition, glutathione S-transferase (GST) mu 1 and theta 1 were confirmed to exert independent effects on cancer. We concluded that the application of ETMA to real meta-analysis data was successful. Finally, we developed an R package, etma, for the detection of epistasis in meta-analysis [etma is available via the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/etma/index.html].
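
    Since etma is on CRAN, a minimal session is just the following; function-level usage is documented in the package itself:

    install.packages("etma")
    library(etma)
    help(package = "etma")   # epistasis-testing functions and examples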

  7. Study Hours vs Grades Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Andrey Silva (2025). Study Hours vs Grades Dataset [Dataset]. https://www.kaggle.com/datasets/andreylss/study-hours-vs-grades-dataset
    Explore at:
    zip, 33,964 bytes (available download formats)
    Dataset updated
    Oct 12, 2025
    Authors
    Andrey Silva
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.

    Dataset Features

    • student_id: Unique identifier for each student (1-5000)
    • study_hours: Hours spent studying (0-12 hours, continuous)
    • grade: Final exam score (0-100 points, continuous)

    Potential Use Cases

    • Linear regression modeling and practice
    • Data visualization exercises
    • Statistical analysis tutorials
    • Machine learning for beginners
    • Educational research simulations

    Data Quality

    • No missing values
    • Normally distributed residuals
    • Realistic educational scenario
    • Ready for immediate analysis

    Data Generation Code

    This dataset was generated using R.

    R Code

    # Set seed for reproducibility
    set.seed(42)
    
    # Define number of observations (students)
    n <- 5000
    
    # Generate study hours (independent variable)
    # Uniform distribution between 0 and 12 hours
    study_hours <- runif(n, min = 0, max = 12)
    
    # Create relationship between study hours and grade
    # Base grade: 40 points
    # Each study hour adds an average of 5 points
    # Add normal noise (standard deviation = 10)
    theoretical_grade <- 40 + 5 * study_hours
    
    # Add normal noise to make it realistic
    noise <- rnorm(n, mean = 0, sd = 10)
    
    # Calculate final grade
    grade <- theoretical_grade + noise
    
    # Limit grades between 0 and 100
    grade <- pmin(pmax(grade, 0), 100)
    
    # Create the dataframe
    dataset <- data.frame(
     student_id = 1:n,
     study_hours = round(study_hours, 2),
     grade = round(grade, 2)
    )
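
    A quick check of the generated data (not part of the original script): fit the intended regression and recover the generation parameters. Because grades are clipped to [0, 100], the estimates will deviate slightly from the true intercept of 40 and slope of 5.

    fit <- lm(grade ~ study_hours, data = dataset)
    summary(fit)   # intercept near 40, slope near 5, residual SD near 10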
    
  8. Data from: Input and results from boosted regression tree and artificial...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Oct 22, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Input and results from boosted regression tree and artificial neural network models that predict daily maximum pH and daily minimum dissolved oxygen in Upper Klamath Lake, 2005-2019 [Dataset]. https://catalog.data.gov/dataset/input-and-results-from-boosted-regression-tree-and-artificial-neural-network-models-t-2005
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Upper Klamath Lake
    Description

    This data release contains the model inputs, outputs, and source code (written in R) for the boosted regression tree (BRT) and artificial neural network (ANN) models developed for four sites in Upper Klamath Lake. The models were used to simulate daily maximum pH and daily minimum dissolved oxygen (DO) from May 18th to October 4th in 2005-12 and 2015-19, and to evaluate variable effects and their importance. Simulations were not developed for 2013 and 2014 because of a large amount of missing meteorological data. The sites were: 1) Williamson River (WMR), in the northern portion of the lake near the mouth of the Williamson River, with a depth between 0.7 and 2.9 meters; 2) Rattlesnake Point (RPT), near the southern portion of the lake, with a depth between 1.9 and 3.4 meters; 3) Mid-North (MDN), in the northwest portion of the lake, with a depth between 2.4 and 4.2 meters; and 4) Mid-Trench (MDT), in the trench that runs along the western portion of the lake, with a depth between 13.2 and 15 meters.
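
    A minimal sketch of a boosted regression tree of this kind (the actual source code ships with the data release; the gbm package, the data frame uk, and the column names here are assumptions for illustration):

    library(gbm)

    # Daily maximum pH against meteorological/limnological predictors.
    brt <- gbm(max_pH ~ ., data = uk,
               distribution = "gaussian",
               n.trees = 2000, interaction.depth = 3,
               shrinkage = 0.01, cv.folds = 5)

    best <- gbm.perf(brt, method = "cv")   # choose tree count by cross-validation
    summary(brt, n.trees = best)           # relative influence of predictors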

  9. Type 1 error rates using all pathway information, i.e., no nodes or edges...

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh (2023). Type 1 error rates using all pathway information, i.e., no nodes or edges were dropped for these simulations. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008986.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Error rates were calculated from score tests on 1000 simulated data sets; all simulations used graphs with 15, 30, or 45 nodes. “Perfect” indicates calculating from the graph used to generate the data. “Mismatch” indicates the percentage of direct edges that were incorrect; “Complete Mismatch” indicates 100% mismatch.

  10. Data from: The limits of the constant-rate birth-death prior for...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Nov 6, 2023
    + more versions
    Cite
    Mark Poulsen Khurana; Neil Scheidwasser-Clow; Matthew Penn; Samir Bhatt; David A. Duchêne (2023). The limits of the constant-rate birth-death prior for phylogenetic tree topology inference [Dataset]. http://doi.org/10.5061/dryad.2fqz612vg
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    University of Copenhagen
    University of Oxford
    Authors
    Mark Poulsen Khurana; Neil Scheidwasser-Clow; Matthew Penn; Samir Bhatt; David A. Duchêne
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically from empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.

    Methods

    Empirical trees used in the study are trees from the literature, collected by TimeTree (timetree.org).

    1. Run Tree_Selection.R to select the empirical phylogenetic trees to be included from TimeTree. The output file final_timetrees.RData contains the final subset of empirical phylogenetic TimeTree trees used for analysis, with anonymized tip labels.
    2. Run Simulation_And_Analysis.R to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees. Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. Run Supplementary_Fig_S1_Analysis.R to generate data for Supplementary Figure S1.
    3. Run Meta_analysis.R to run the linear regression models investigating the role of the prior/analysis type for the subset (n = 300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3).
    4. Run Imbalance_Simulation.R to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: to avoid later confusion, the three substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code; there is no Rate 1 (apologies in advance for any confusion). The shell scripts to run the inferences in each software are as follows:
       4a. RevBayes: fasta_to_revbayes_code_rate2.sh, fasta_to_revbayes_code_rate3.sh, fasta_to_revbayes_code_rate4.sh. These use the .Rev files MCMC_Revbayes_code_rate2.Rev, MCMC_Revbayes_code_rate3.Rev, and MCMC_Revbayes_code_rate4.Rev, and rely on the supplementary .Rev files tree_BD.Rev, sub_JC.Rev, and clock_global.Rev.
       4b. IQ-TREE 2: fasta_to_iqtree_code.sh.
    5. Run Final_Figures.R to visualize the results.
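
    A minimal R sketch (not the study's code) of the core comparison described above: simulate constant-rate birth-death trees and compare a Colless-style imbalance index against an empirical tree. Rates and tree counts here are illustrative.

    # Simulate crBD trees with ape (the study fitted birth/death parameters
    # to each empirical tree first; fixed illustrative rates used here).
    library(ape)
    set.seed(1)
    sims <- replicate(100, rbdtree(birth = 1.0, death = 0.5, Tmax = 10),
                      simplify = FALSE)

    # Colless-style imbalance: per internal node, |tips left - tips right|,
    # summed over the tree (ape::balance gives the two tip counts per node).
    colless <- function(phy) {
      b <- balance(phy)
      sum(abs(b[, 1] - b[, 2]))
    }

    sim_idx <- sapply(sims, colless)
    # z-score of an empirical tree against the crBD expectation:
    # z <- (colless(empirical_tree) - mean(sim_idx)) / sd(sim_idx)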

  11. fmott/context_dependent_planning: v1.0

    • zenodo.org
    zip
    Updated May 9, 2022
    Cite
    fmott (2022). fmott/context_dependent_planning: v1.0 [Dataset]. http://doi.org/10.5281/zenodo.5112966
    Explore at:
    zip (available download formats)
    Dataset updated
    May 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    fmott
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Contains behavioural raw data, fMRI statistical maps and analysis scripts to reproduce the result of the study.

    Description

    • Data

      • behaviour: Raw behavioural data of all 40 participants.

      • fmri: fMRI statistical maps underlying Figure 4, Figure 5 and Tables S2-S5. Contains individual contrast images for all 40 participants and group level unthresholded T-maps. Also contains masks of the dorsal anterior cingulate cortex (dACC) and the dorsolateral prefrontal cortex (dlPFC), used for small volume correction.

      • model: Contains files with posterior samples and summary statistics of the three computational models Planning (PM), Simple (SM) and Hybrid (HM) and the reaction time analysis. Contains posterior predictive simulations for HM. Also contains the leave-one-out information criterion (LOOIC) for the PM, SM and HM used for model comparison.

    • Code

      • generate_figure2.py: Generates Figure 2.

      • generate_figure3.py: Generates Figure 3.

      • generate_S2_figure.py: Generates S2 Figure.

      • generate_S1_table.R: Logistic regression of choice against task features underlying S1 table (see the sketch after this list).

      • fit_choice_models.ipynb: Jupyter notebook implementing model fitting and computation of LOOIC for the three models PM, SM and HM.

      • fit_RT_model.ipynb: Jupyter notebook implementing hierarchical Bayesian linear regression of response times against conflict and context.

      • parameter_recovery.ipynb: Jupyter notebook implementing parameter recovery analysis for model HM.

      • backward_induction.py: Function used to compute expected long-term values via backward induction.

      • simulate_agents.py: Simulate behavior of agents using a planning, simple, hybrid or random strategy.
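
    A minimal sketch of the S1-table logistic regression named above (column names are hypothetical; the real ones are in the behavioural data files):

    # Binary choice regressed on task features with a logit link.
    fit <- glm(choice ~ conflict + context + trial_number,
               family = binomial(link = "logit"), data = trials)
    summary(fit)     # coefficients on the log-odds scale
    exp(coef(fit))   # odds ratios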

  12. Geographically Varying Correlates of Car Non-Ownership in Census Output...

    • datacatalogue.ukdataservice.ac.uk
    Updated Apr 2, 2009
    + more versions
    Cite
    Harris, R., University of Bristol, School of Geographical Sciences; Grose, D., Lancaster University, Centre for e-Science (2009). Geographically Varying Correlates of Car Non-Ownership in Census Output Areas of England, 2001 [Dataset]. http://doi.org/10.5255/UKDA-SN-6100-1
    Explore at:
    Dataset updated
    Apr 2, 2009
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Harris, R., University of Bristol, School of Geographical Sciences; Grose, D., Lancaster University, Centre for e-Science
    Area covered
    England
    Description

    Standard indexes of poverty and deprivation are rarely sensitive to how the causes and consequences of deprivation have different impacts depending upon where a person lives. More geographically minded approaches are alert to spatial variations but are also difficult to compute using desktop PCs.

    The aim of the ESRC sponsored project was to develop a method of spatial analysis known as ‘geographically weighted regression’ (GWR) to run in the high power computing environment offered by ‘Grid computation’ and e-social science. GWR, like many other methods of spatial analysis, is characterised by multiple repeat testing as the data are divided into geographical regions and also randomly redistributed many times to simulate the likelihood that the results obtained from the analysis are actually due to chance. Each of these tests requires computer time so, given a large dataset such as the UK Census statistics, running the analysis on a standard machine can take a long time! Fortunately, the computational grid is not standard but offers the possibility to speed up the process by running GWR’s sequences of calibration, analysis and non-parametric simulation in parallel.

    An output is a model of the geographically varying correlates of car non-ownership fitted for the 165,665 Census Output Areas in England. Specifically, a geographically weighted regression of the relationship between the proportion of households without a car (or van) in 2001 (the dependent variable), and the following predictor variables: proportion of persons of working age unemployed; proportion of households in public housing; proportion of households that are lone parent households; proportion of persons 16 or above that are single; and proportion of persons that are white British.
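
    A minimal sketch of such a model using the spgwr package (this is not the project's Grid-enabled implementation; the data frame oa, its columns, and the coordinate fields are hypothetical names):

    library(spgwr)

    # Output-area centroids (hypothetical column names).
    coords <- cbind(oa$easting, oa$northing)

    # Calibrate the kernel bandwidth by cross-validation...
    bw <- gwr.sel(no_car ~ unemployed + public_housing + lone_parent +
                    single + white_british,
                  data = oa, coords = coords)

    # ...then fit locally varying coefficients at each output area.
    fit <- gwr(no_car ~ unemployed + public_housing + lone_parent +
                 single + white_british,
               data = oa, coords = coords, bandwidth = bw)
    fit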

    Note - the file does not contain Census 2001 data, only National Grid references and regression coefficients.

    Further information is available from the Grid Enabled Spatial Regression Models (With Application to Deprivation Indices) web page.

  13. Type 1 error rates using pathways with 5% missing edges.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    + more versions
    Cite
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh (2023). Type 1 error rates using pathways with 5% missing edges. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008986.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of medium edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. Medium edge density graphs are created by giving any 2 nodes without a direct edge between them a 5% chance of becoming directly connected. This creates graphs with an average edge density of 0.18, 0.12, and 0.09 for graphs with 15, 30, and 45 nodes, respectively. “Perfect” indicates calculating from the graph without changing remaining edges. “Mismatch” indicates the percentage of remaining direct edges that were incorrect. “Complete Mismatch” indicates 100% mismatch.
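
    A minimal igraph sketch (not the authors' code) of the graph construction just described: draw a Barabasi-Albert graph, then give each non-adjacent node pair a 5% chance of a new edge to produce the medium-density version.

    library(igraph)
    set.seed(1)

    g <- sample_pa(30, directed = FALSE)   # low-density scale-free graph
    edge_density(g)

    # 5% chance of directly connecting each pair of non-adjacent nodes.
    pairs <- t(combn(vcount(g), 2))
    for (i in seq_len(nrow(pairs))) {
      a <- pairs[i, 1]; b <- pairs[i, 2]
      if (!are_adjacent(g, a, b) && runif(1) < 0.05)
        g <- add_edges(g, c(a, b))
    }
    edge_density(g)                        # medium-density version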

  14. Data from: ICARIA: spatially distributed climate projections from...

    • produccioncientifica.ucm.es
    • data.niaid.nih.gov
    • +2more
    Updated 2024
    + more versions
    Cite
    Paradinas, C; Prado, C.; Galiano, L.; Redolat, Darío; Monjo, R.; Gaitan, E. (2024). ICARIA: spatially distributed climate projections from statistical downscaling [Dataset]. https://produccioncientifica.ucm.es/documentos/67321d3daea56d4af0484391
    Explore at:
    Dataset updated
    2024
    Authors
    Paradinas, C; Prado, C.; Galiano, L.; Redolat, Darío; Monjo, R.; Gaitan, E.
    Description

    One of the main purposes of the ICARIA project was to develop coherent, reliable and usable downscaled climate projections from the latest CMIP6 ensemble, in order to build the basis for efficient support of climate adaptation and of decision-making by the related stakeholders, supporting the adaptation of critical assets within the project. These projections were also obtained with the purpose of being freely available for further use in subsequent studies and, hence, of fostering adaptation to climate change in more areas. ICARIA's climate information is therefore based on CMIP6 models and incorporates the current SSPs in its workflow. The high-resolution future climate projections presented here form a unique dataset, obtained from a high-quality and high-density set of weather observations that are then interpolated to the case studies of interest on a 100x100 m resolution grid, which is the main outcome offered in this publication. These models provide the scenarios to be considered within the Risk Assessment and the design and development of all adaptation measures coming out of ICARIA.

    For further details, here is a brief summary of the methodology followed:

    The statistical downscaling methodology applied in ICARIA by FIC, named FICLIMA (Ribalaygua et al. 2013), consists of a two-step analogue/regression statistical method that has been used in national and international projects with good verification results (e.g., Monjo et al. 2016). The first step is common to all simulated climate variables and is based on an analogue stratification (Zorita et al. 1993): under the hypothesis that 'analogue' atmospheric patterns (predictors) should cause analogue local effects (predictands), the days most similar to the day to be downscaled are selected. The similarity between any two days was measured across three nested synoptic windows (with different weights) and four large-scale predictor fields, using a pseudo-Euclidean distance between the fields. For each predictor, the weighted Euclidean distance was calculated and standardised by substituting it with the closest percentile of a reference population of weighted Euclidean distances for that predictor. The analogue approach reproduces nonlinear relationships between predictors and predictands well, but it cannot simulate values outside the range of observed values. To overcome this problem and obtain a better simulation, a second step was required.

    For this second step, the procedure applied depends on the variable of interest. For temperature, a multiple linear regression over the selected most analogous days was performed for each station and each problem day; from a group of potential predictors, the regression selected those with the highest correlation, using a forward-and-backward stepwise approach.
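
    As a minimal sketch of that temperature step (not FICLIMA's actual code), base R's step() performs a forward-and-backward stepwise selection, here by AIC rather than correlation; the data frame analogs and its predictor columns are invented names for the most analogous days:

    # Candidate predictors for one station and one problem day (names invented).
    full <- lm(temp ~ geopotential_500 + thickness + wind_u + wind_v,
               data = analogs)

    # Forward-and-backward stepwise selection.
    sel <- step(full, direction = "both", trace = 0)
    summary(sel)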

    For precipitation, a group of m problem days (all the days of a month) is downscaled at once. For each problem day we obtain a “preliminary precipitation amount” by averaging the rain amounts of its n most analogous days, so the m problem days can be sorted from the highest to the lowest “preliminary precipitation amount”. To assign the final precipitation amounts, all amounts of the m×n analogous days are sorted and clustered into m groups; each quantity is then assigned, in order, to the m days previously sorted by “preliminary precipitation amount”.

    For wind or relative humidity, the second step is a transfer function between the observed probability distribution and the simulated one using the averaged values from the n = 30 analogous days. Particularly, a parametric bias correction was performed to the time series obtained from the analogue stratification (first step). In order to estimate the improvement of this procedure, the bias correction was also applied to the direct model outputs.

    This second step, done at a daily scale with a thorough inner verification procedure, is essential and is the main differentiating feature of the FICLIMA method. It extends beyond mean values to include extremes and covers all time scales, including daily intervals. The verification shows whether the method correctly simulates changes from one day to the next, indicating an effective capture of the underlying physical connections between predictors and predictands. These physical links remain relatively consistent even in the face of climate change (as opposed to purely empirical relationships, which might shift). In essence, this approach theoretically addresses the primary challenge in statistical downscaling, known as the non-stationarity problem: whether predictor/predictand relationships established in the past will persist in the future.

    The dataset shared here includes information for the three case studies tackled in ICARIA: Barcelona Metropolitan Area (AMB), Salzburg Region (SLZ), and South Aegean Region (SAR). The information provided covers data and outcomes from 10 CMIP6 models. Each model has a historical archive from 01/01/1950 to 31/12/2014 and four future scenarios (ssp126, ssp245, ssp370 and ssp585) ranging from 01/01/2015 to 31/12/2100. The selected models are detailed in Table 1:

    Table 1. Information about the 10 climate models from the Coupled Model Intercomparison Project Phase 6 (CMIP6) corresponding to the IPCC AR6. Models were retrieved from the Earth System Grid Federation (ESGF) portal in support of the Program for Climate Model Diagnosis and Intercomparison (PCMDI).

    | CMIP6 model   | Resolution      | Responsible centre                                                           | Reference                  |
    |---------------|-----------------|------------------------------------------------------------------------------|----------------------------|
    | ACCESS-CM2    | 1.875° x 1.250° | Australian Community Climate and Earth System Simulator (ACCESS), Australia  | Bi, D. et al. (2020)       |
    | BCC-CSM2-MR   | 1.125° x 1.121° | Beijing Climate Center (BCC), China Meteorological Administration, China     | Wu, T. et al. (2019)       |
    | CanESM5       | 2.812° x 2.790° | Canadian Centre for Climate Modelling and Analysis (CC-CMA), Canada          | Swart, N.C. et al. (2019)  |
    | CMCC-ESM2     | 1.000° x 1.000° | Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC)                    | Cherchi et al. (2018)      |
    | CNRM-ESM2-1   | 1.406° x 1.401° | Centre National de Recherches Météorologiques (CNRM), Météo-France, France   | Séférian, R. (2019)        |
    | EC-EARTH3     | 0.703° x 0.702° | EC-EARTH Consortium                                                          | EC-Earth Consortium (2019) |
    | MPI-ESM1-2-HR | 0.938° x 0.935° | Max Planck Institute for Meteorology (MPI-M), Germany                        | Müller et al. (2018)       |
    | MRI-ESM2-0    | 1.125° x 1.121° | Meteorological Research Institute (MRI), Japan                               | Yukimoto, S. et al. (2019) |
    | NorESM2-MM    | 1.250° x 0.942° | Norwegian Climate Centre (NCC), Norway                                       | Bentsen, M. et al. (2019)  |
    | UKESM1-0-LL   | 1.875° x 1.250° | UK Met Office, Hadley Centre, United Kingdom                                 | Good, P. et al. (2019)     |

    The climate projections have been developed over each of the observational locations retrieved to run the statistical downscaling. The results from these projections have been spatially interpolated onto a 100x100 m grid with a multiple linear regression model incorporating diverse adjustments and topographic corrections. The results presented here are the median of the 10 models, obtained for each of the 4 SSPs and each of the time periods considered in ICARIA until the year 2100. The variables treated are the main climate variables and their related extreme indicators as defined during the ICARIA project. A summary of all the variables and indicators used to develop the projections is given in Table 2.

    Table 2. Summary of selected thermal and precipitation indicators, grouped according to the main hazards they feed. “nd” = number of days; “ne” = number of events.

    | Index/name   | Short description                              | Source                  | Variable  | Units | Threshold             |
    |--------------|------------------------------------------------|-------------------------|-----------|-------|-----------------------|
    | Thermal indicators |                                          |                         |           |       |                       |
    | TX90 / TX10  | Warm/cold days                                 | Zhang et al. (2011)     | TX        | nd    | 90 / 10%              |
    | HD           | Heat day                                       | ICARIA                  | TX        | nd    | 30 °C                 |
    | EHD          | Extreme heat day                               | ICARIA                  | TX        | nd    | 35 °C                 |
    | TR           | Tropical nights                                | Zhang et al. (2011)     | TN        | nd    | 20 °C                 |
    | EQ           | Equatorial nights                              | AEMet 2020, ICARIA      | TN        | nd    | 25 °C                 |
    | IN           | Infernal nights                                | ICARIA                  | TN        | nd    | 30 °C                 |
    | FD           | Frost days                                     | Zhang et al. (2011)     | TN        | nd    | < 0 °C                |
    | Max consec   | Max spell length for above thermal indicators  | ICARIA                  | -         | nd    | -                     |
    | Nº events    | Number of above thermal indicator events       | ICARIA                  | -         | ne    | 3 days                |
    | TXm          | Mean maximum temperatures                      | ICARIA                  | TX        | °C    | -                     |
    | TNm          | Mean minimum temperatures                      | ICARIA                  | TN        | °C    | -                     |
    | TM           | Mean temperatures                              | ICARIA                  | TA        | °C    | -                     |
    | HWle         | Heatwave length                                | ICARIA                  | TX        | nd    | 3d > 95% TX           |
    | HWim/HWix    | Mean and maximum heatwave intensity            | ICARIA                  | TX        | °C    | 3d > 95% TX           |
    | HWf          | Heatwave frequency                             | ICARIA                  | TX        | ne    | 3d > 95% TX           |
    | HWd          | Heatwave days                                  | ICARIA                  | TX        | nd    | 3d > 95% TX           |
    | HI - P90     | Heat Index (90th percentile)                   | NWS (1994)              | TX, RH    | °C    | TX > 27 °C, RH > 40%  |
    | UTCI         | Universal Thermal Climate Index                | Bröde et al. (2012)     | TA, RH, W | -     | -                     |
    | UHI          | Urban heat island (BCN), annual and seasonal   | AMB, Metrobs 2015       | T         | °C    | TM1-TM2 > 0 °C        |
    | Precipitation indicators |                                    |                         |           |       |                       |
    | R20          | Number of heavy precipitation days             | Zhang et al. (2011)     | P         | nd    | 20 mm                 |
    | R50, R100    | Days with extreme heavy rain                   | AMB et al. (2017)       | P         | nd    | 50 mm / 100 mm        |
    | Ra           | Yearly and seasonal rainfall relative change   | ICARIA                  | P         | mm    | ≥ 0.1 mm              |
    | IDF - CCF    | IDF Curves - Climate Change Factor             | Arnbjerg-Nielsen (2012) | P         | -     | ≥ 0.1 mm              |
    | Forest fire  |                                                |                         |           |       |                       |

  15. Numerical data used to generate all graphs and figures.

    • plos.figshare.com
    xlsx
    Updated Jun 2, 2025
    Cite
    Vijaykumar S. Jatti; A. Saiyathibrahim; Arvind Yadav; Murali Krishnan R.; B. Jayaprakash; Sumit Kaushal; Vinaykumar S. Jatti; Ashwini V. Jatti; Savita V. Jatti; Abhinav Kumar; Soumaya Gouadria; Ebenezer Bonyah (2025). Numerical data used to generate all graphs and figures. [Dataset]. http://doi.org/10.1371/journal.pone.0324049.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Vijaykumar S. Jatti; A. Saiyathibrahim; Arvind Yadav; Murali Krishnan R.; B. Jayaprakash; Sumit Kaushal; Vinaykumar S. Jatti; Ashwini V. Jatti; Savita V. Jatti; Abhinav Kumar; Soumaya Gouadria; Ebenezer Bonyah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numerical data used to generate all graphs and figures.

