15 datasets found
  1. Study Hours ,Student Scores for Linear Regression

    • kaggle.com
    Updated Sep 23, 2024
    Cite
    douaa bennoune (2024). Study Hours ,Student Scores for Linear Regression [Dataset]. https://www.kaggle.com/datasets/douaabennoune/study-hours-student-scores-for-linear-regression
    Explore at:
    Croissant — a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Sep 23, 2024
    Dataset provided by
    Kaggle
    Authors
    douaa bennoune
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains a collection of 100 randomly generated data points representing the relationship between the number of hours a student spends studying and their corresponding performance, measured as a score. The data has been generated to simulate a real-world scenario where study hours are assumed to influence academic outcomes, making it an excellent resource for linear regression analysis and other machine learning tasks.

    Each row in the dataset consists of:

    • Hours: The number of hours a student dedicates to studying, ranging between 0 and 10 hours.
    • Scores: The student's performance score, represented as a percentage, ranging from 0 to 100.

    Use Cases: This dataset is particularly useful for:

    • Linear Regression: Exploring how study hours influence student performance, fitting a regression line to predict scores based on study time.
    • Data Science & Machine Learning: Practicing regression analysis, training models, and applying other predictive algorithms.
    • Educational Research: Simulating data-driven insights into student behavior and performance metrics.

    Features: 100 rows of data; continuous numerical variables suitable for regression tasks; generated for educational purposes, making it ideal for students, teachers, and beginners in machine learning and data science.

    Potential Applications:

    • Build a linear regression model to predict student scores.
    • Investigate the correlation between study time and performance.
    • Apply data visualization techniques to better understand the data.
    • Experiment with model evaluation metrics like Mean Squared Error (MSE) and R-squared (see the sketch below).
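
    A minimal R sketch of the intended use, assuming the CSV's two columns are named Hours and Scores as in the description (the actual file and column names may differ):

    # Load the dataset (file name assumed; adjust to the downloaded file).
    scores <- read.csv("study_hours_scores.csv")

    # Fit the regression line: Scores ~ Hours.
    fit <- lm(Scores ~ Hours, data = scores)
    summary(fit)            # slope, intercept, R-squared

    # Mean Squared Error of the fitted line.
    mean(residuals(fit)^2)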

  2. Supplement 1. R code used to perform simulations.

    • wiley.figshare.com
    html
    Updated Jun 1, 2023
    Cite
    Sarah Emerson; Charlotte Wickham; Kenneth J. Ruzicka Jr. (2023). Supplement 1. R code used to perform simulations. [Dataset]. http://doi.org/10.6084/m9.figshare.3562644.v1
    Explore at:
    html (available download formats)
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Sarah Emerson; Charlotte Wickham; Kenneth J. Ruzicka Jr.
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    File List simulations.R (MD5: bdda5503ab6ec0d1374d340d60f562d6)

      Description
        This comment used the attached R script to conduct simulation studies of spatial component regression (SCR). The file simulations.R contains all code needed to run the simulations testing SCR performance for three objectives: (1) inference under the null hypothesis; (2) inference when the predictor of interest does have an effect on the outcome; and (3) matrix selection. The code will simulate 16 sets of 1000 data sets each; the 16 sets represent all possible combinations of 2 spatial predictor types, 4 autocorrelation types, and 2 effect sizes for the spatial predictor.
    
  3. Data from: The improbability of detecting trade-offs and some practical...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Jul 19, 2024
    Cite
    Marc Johnson (2024). The improbability of detecting trade-offs and some practical solutions [Dataset]. http://doi.org/10.5061/dryad.xpnvx0kq5
    Explore at:
    zip (available download formats)
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    University of Toronto
    Authors
    Marc Johnson
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Trade-offs are a fundamental concept in evolutionary biology because they are thought to explain much of nature’s biological diversity, from variation in life histories to differences in metabolism. Despite their predicted importance, trade-offs are notoriously difficult to detect. Here we contribute to the rich theoretical literature on trade-offs by examining how the shape of the distribution of resources or metabolites acquired in an allocation pathway influences the strength of trade-offs between traits. We further explore how variation in resource distribution interacts with two aspects of pathway complexity (i.e., the number of branches and hierarchical structure) to affect trade-offs. We simulate variation in the shape of the distribution of a resource by sampling 10^6 individuals from a beta distribution with varying parameters to alter the resource shape. In a simple “Y-model” allocation of resources to two traits, any variation in a resource leads to slopes less steep than -1, with left-skewed and symmetrical distributions leading to negative relationships between traits, and highly right-skewed distributions associated with positive relationships between traits. Adding more branches further weakens negative and positive relationships between traits, and the hierarchical structure of pathways typically weakens relationships between traits, although in some contexts hierarchical complexity can strengthen positive relationships. Our results further illuminate how variation in the acquisition and allocation of resources, and particularly the shape of a resource distribution and how it interacts with pathway complexity, makes it challenging to detect trade-offs. We offer several practical suggestions on how to detect trade-offs given these challenges.

    Methods

    Overview of Flux Simulations

    To study the strength and direction of trade-offs within a population, we developed a simulation of flux in a simple metabolic pathway, where a precursor metabolite emerging from node A may be converted to either metabolic product B1 or B2 (Fig. 1). This conception of a pathway is similar to De Jong and Van Noordwijk’s Y-model (Van Noordwijk & De Jong, 1986; De Jong & Van Noordwijk, 1992), but we used simulation instead of analytical statistical models, which let us consider greater complexity in the distributions of variables and pathways. For a simple pathway (Fig. 1), the total flux Jtotal (i.e., the flux at node A, denoted JA) for each individual (N = 10^6) was first sampled from a predetermined beta distribution as described below. The flux at node B1 (JB1) was then randomly sampled from this distribution with max = Jtotal = JA and min = 0. The flux at the remaining node, B2, was simply the remaining flux (JB2 = JA - JB1). Simulations of more complex pathways followed the same basic approach, with increased numbers of branches and hierarchical levels added to the pathway as described below under Question 2. The metabolic pathways were simulated using Python (v. 3.8.2) (Van Rossum & Drake Jr., 2009), where we could control the underlying distribution of metabolite allocation. The output flux at nodes B1 and B2 was plotted using R (v. 4.2.1) (R Core Team, 2022), with the resulting trade-off visualized as a linear regression using the ggplot2 R package (v. 3.4.2) (Wickham, 2016). While we have conceptualized the pathway as the flux of metabolites, it could be thought of as any resource being allocated to different traits.

    Question 1: How does variation in resource distribution within a population affect the strength and direction of trade-offs?

    We first simulated the simplest scenario, in which all individuals had the same total flux Jtotal = 1 and the phenotypic trade-off is expected to be most easily detected. We then modified this initial scenario to explore how variation in the distribution of resource acquisition (Jtotal) affected the strength and direction of trade-offs. Specifically, the resource distribution was systematically varied by sampling n = 10^3 total flux levels from a beta distribution, which has two parameters, alpha and beta, that control the size and shape of the distribution (Miller & Miller, 1999). When alpha is large and beta is small, the distribution is left-skewed, whereas for small alpha and large beta, the distribution is right-skewed. For alpha = beta, the curve is symmetrical and approximately normal when the parameters are sufficiently large (>2). We can thus systematically vary the underlying resource distribution of a population by iterating through values of alpha and beta from 0.5 to 5 (in increments of 0.5), which was done using the NumPy Python package (v. 1.19.1) (Harris et al., 2020). The slope of each linear regression of the flux at B1 and B2 (i.e., the two branching nodes) was then calculated using the lm function in R and plotted as a contour map using the latticeExtra R package (v. 0.6-30) (Sarkar, 2008).

    Question 2: How does the complexity of the pathway used to produce traits affect the strength and direction of trade-offs?

    Metabolic pathways are typically more complex than described above: most consist of multiple branch points and multiple hierarchical levels. To understand how complexity affects the ability to detect trade-offs when combined with variation in the distribution of total flux, we systematically manipulated the number of branch points and hierarchical levels within pathways (Fig. 1). We first explored the effect of adding branches to the pathway from the same node, such that instead of branching only to nodes B1 and B2, the pathway branched to nodes B1 through Bn (Fig. 1B), where n is the total number of branches (maximum n = 10). Flux at a node was calculated as previously described, and the remaining flux was evenly distributed among the remaining nodes (i.e., nodes B2 through Bn would each receive J2-n = (Jtotal - JB1)/(n - 1) flux). For each pathway, we simulated flux using a beta distribution of Jtotal with alpha = 5, beta = 0.5 (left-skewed), alpha = beta = 5 (approximately normal), and alpha = 0.5, beta = 5 (right-skewed), as well as the simplest case where all individuals have total flux Jtotal = 1. We next considered how adding hierarchical levels to a metabolic pathway affected trade-offs. We modified our initial pathway so that node A branched to nodes B1 and B2, and node B2 further branched to nodes C1 and C2 (Fig. 1C). To compute the flux at the two new nodes C1 and C2, we repeated the same calculation as before, but using the flux at node B2, JB2, as the total flux. That is, the flux at node C1 was obtained by randomly sampling from the distribution at B2 with max = JB2 and min = 0, and the flux at node C2 was the remaining flux (JC2 = JB2 - JC1). Much like in the previous scenario with multiple branch points, we used three beta distributions (with the same parameters as before) to represent left-skewed, approximately normal, and right-skewed resource distributions, as well as the simplest case where Jtotal = 1 for all individuals.

    Quantile Regressions

    We performed quantile regression to understand whether this approach could help detect trade-offs. Quantile regression fits a curve through upper or lower quantiles of the data to assess whether an independent variable potentially sets a lower or upper limit on a response variable (Cade et al., 1999). This type of analysis is particularly useful when an independent variable is thought to place a constraint on a response variable, yet variation in the response variable is influenced by many additional factors that add “noise” to the data, making a simple bivariate relationship difficult to detect (Thomson et al., 1996). Quantile regression is an extension of ordinary least squares regression, which regresses the best-fitting line through the 50th percentile of the data. In addition to performing ordinary least squares regression for each pairwise comparison between the four nodes (B1, B2, C1, C2), we performed a series of quantile regressions using the ggplot2 R package (v. 3.4.2), where only the qth quantile was used for the regression (q = 0.99 and 0.95 to 0.5 in increments of 0.05; see Fig. S1) (Cade et al., 1999).
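
    A minimal R sketch of the Y-model simulation and a quantile-regression check as described above (the study ran its simulations in Python and fitted quantiles via ggplot2; quantreg is swapped in here, and the split of Jtotal is read as uniform on [0, Jtotal]):

    set.seed(1)
    n <- 1e4                                      # far smaller than the study's 10^6
    Jtotal <- rbeta(n, shape1 = 0.5, shape2 = 5)  # right-skewed resource distribution
    JB1 <- runif(n, min = 0, max = Jtotal)        # flux allocated to trait B1
    JB2 <- Jtotal - JB1                           # remaining flux goes to trait B2

    # Slope of the trait-trait relationship (a trade-off if negative).
    coef(lm(JB2 ~ JB1))["JB1"]

    # Upper-quantile regression (q = 0.95) to probe a constraint boundary.
    library(quantreg)
    coef(rq(JB2 ~ JB1, tau = 0.95))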

  4. Supplement 1. R code to compute model-averaged regression coefficients for...

    • wiley.figshare.com
    html
    Updated Jun 2, 2023
    Cite
    Brian S. Cade (2023). Supplement 1. R code to compute model-averaged regression coefficients for Burnham and Anderson (2002) college gpa example and to simulate multi-part compositional predictors and model-averaged estimates similar to Rice et al (2013) zero-truncated Poisson regression count model for Greater Sage-Grouse. [Dataset]. http://doi.org/10.6084/m9.figshare.3562905.v1
    Explore at:
    html (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Brian S. Cade
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    File List Supplement1.txt (MD5: faff226df1ad4adfa82f2ca1700d5010) Supplement2.txt (MD5: 51eeb647483834e7ea0742d81ffc4372)

      Description
        Supplement1.txt contains an R script that reads in the college GPA example data from Burnham and Anderson (2002:226), estimates the 16 linear least-squares regression models, computes AICc, AICc weights, variance inflation factors, and partial standard deviations of predictors, standardizes estimates by partial standard deviations, computes model-averaged standardized estimates and their standard errors, and computes the model-averaged ratio of t statistics for unstandardized estimates (equivalent to the model-averaged ratio of standardized estimates). The code is written to be transparent with respect to the mathematical operations rather than for efficiency.
    
        Supplement2.txt contains an R script that generates the simulations in Appendix B for multi-part compositional predictors within a zero-truncated Poisson regression count model similar to the breeding sage-grouse count model of Rice et al. (2013).
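
    A minimal sketch of the AICc-weight computation described above (not the supplement's code; the data frame gpa and predictors x1, x2 are hypothetical stand-ins for the GPA example):

    # AICc for a fitted lm: AIC plus the small-sample correction term.
    aicc <- function(fit) {
      k <- length(coef(fit)) + 1   # parameters, incl. residual variance
      n <- nobs(fit)
      AIC(fit) + 2 * k * (k + 1) / (n - k - 1)
    }

    fits <- list(m1  = lm(y ~ x1, data = gpa),
                 m2  = lm(y ~ x2, data = gpa),
                 m12 = lm(y ~ x1 + x2, data = gpa))

    aiccs <- sapply(fits, aicc)
    delta <- aiccs - min(aiccs)                   # AICc differences
    w <- exp(-delta / 2) / sum(exp(-delta / 2))   # Akaike weights
    round(w, 3)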
    
  5. Type 1 error rates using pathways with dropped nodes.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    + more versions
    Cite
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh (2023). Type 1 error rates using pathways with dropped nodes. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008986.t004
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes initially. The graph used to simulate Z and Y contained all nodes. Nodes with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing. “Perfect” indicates calculating from the graph without changing edges between remaining nodes. “Mismatch” indicates the percentage of direct edges between remaining nodes that were incorrect. “Complete Mismatch” indicates 100% mismatch.

  6. Epistasis Test in Meta-Analysis: A Multi-Parameter Markov Chain Monte Carlo...

    • plos.figshare.com
    pdf
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Chin Lin; Chi-Ming Chu; Sui-Lung Su (2023). Epistasis Test in Meta-Analysis: A Multi-Parameter Markov Chain Monte Carlo Model for Consistency of Evidence [Dataset]. http://doi.org/10.1371/journal.pone.0152891
    Explore at:
    pdf (available download formats)
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Chin Lin; Chi-Ming Chu; Sui-Lung Su
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Conventional genome-wide association studies (GWAS) have proven to be a successful strategy for identifying genetic variants associated with complex human traits. However, there is still a large heritability gap between GWAS and traditional family studies. This “missing heritability” has been suggested to be due to a lack of studies focused on epistasis, also called gene–gene interactions, because individual trials have often had insufficient sample size. Meta-analysis is a common method for increasing statistical power, but sufficiently detailed information is difficult to obtain. A previous study employed a meta-regression-based method to detect epistasis, but it faced the challenge of inconsistent estimates. Here, we describe a Markov chain Monte Carlo-based method, called “Epistasis Test in Meta-Analysis” (ETMA), which uses genotype summary data to obtain consistent estimates of epistasis effects in meta-analysis. We defined a series of conditions to generate simulation data and tested the power and type I error rates of ETMA, individual data analysis, and the conventional meta-regression-based method. ETMA not only successfully facilitated consistency of evidence but also yielded acceptable type I error and higher power than conventional meta-regression. We applied ETMA to three real meta-analysis data sets. We found significant gene–gene interactions in the renin–angiotensin system and the polycyclic aromatic hydrocarbon metabolism pathway, with strong supporting evidence. In addition, glutathione S-transferase (GST) mu 1 and theta 1 were confirmed to exert independent effects on cancer. We concluded that the application of ETMA to real meta-analysis data was successful. Finally, we developed an R package, etma, for the detection of epistasis in meta-analysis [etma is available via the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/etma/index.html].
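
    Since etma is on CRAN, a minimal session is just the following; function-level usage is documented in the package itself:

    install.packages("etma")
    library(etma)
    help(package = "etma")   # epistasis-testing functions and examples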

  7. Study Hours vs Grades Dataset

    • kaggle.com
    zip
    Updated Oct 12, 2025
    Cite
    Andrey Silva (2025). Study Hours vs Grades Dataset [Dataset]. https://www.kaggle.com/datasets/andreylss/study-hours-vs-grades-dataset
    Explore at:
    zip, 33,964 bytes (available download formats)
    Dataset updated
    Oct 12, 2025
    Authors
    Andrey Silva
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    This synthetic dataset contains 5,000 student records exploring the relationship between study hours and academic performance.

    Dataset Features

    • student_id: Unique identifier for each student (1-5000)
    • study_hours: Hours spent studying (0-12 hours, continuous)
    • grade: Final exam score (0-100 points, continuous)

    Potential Use Cases

    • Linear regression modeling and practice
    • Data visualization exercises
    • Statistical analysis tutorials
    • Machine learning for beginners
    • Educational research simulations

    Data Quality

    • No missing values
    • Normally distributed residuals
    • Realistic educational scenario
    • Ready for immediate analysis

    Data Generation Code

    This dataset was generated using R.

    R Code

    # Set seed for reproducibility
    set.seed(42)
    
    # Define number of observations (students)
    n <- 5000
    
    # Generate study hours (independent variable)
    # Uniform distribution between 0 and 12 hours
    study_hours <- runif(n, min = 0, max = 12)
    
    # Create relationship between study hours and grade
    # Base grade: 40 points
    # Each study hour adds an average of 5 points
    # Add normal noise (standard deviation = 10)
    theoretical_grade <- 40 + 5 * study_hours
    
    # Add normal noise to make it realistic
    noise <- rnorm(n, mean = 0, sd = 10)
    
    # Calculate final grade
    grade <- theoretical_grade + noise
    
    # Limit grades between 0 and 100
    grade <- pmin(pmax(grade, 0), 100)
    
    # Create the dataframe
    dataset <- data.frame(
     student_id = 1:n,
     study_hours = round(study_hours, 2),
     grade = round(grade, 2)
    )
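
    A quick check of the generated data (not part of the original script): fit the intended regression and recover the generation parameters. Because grades are clipped to [0, 100], the estimates will deviate slightly from the true intercept of 40 and slope of 5.

    fit <- lm(grade ~ study_hours, data = dataset)
    summary(fit)   # intercept near 40, slope near 5, residual SD near 10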
    
  8. Data from: Input and results from boosted regression tree and artificial...

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Oct 22, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Input and results from boosted regression tree and artificial neural network models that predict daily maximum pH and daily minimum dissolved oxygen in Upper Klamath Lake, 2005-2019 [Dataset]. https://catalog.data.gov/dataset/input-and-results-from-boosted-regression-tree-and-artificial-neural-network-models-t-2005
    Explore at:
    Dataset updated
    Oct 22, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    Upper Klamath Lake
    Description

    This data release contains the model inputs, outputs, and source code (written in R) for the boosted regression tree (BRT) and artificial neural network (ANN) models developed for four sites in Upper Klamath Lake. The models were used to simulate daily maximum pH and daily minimum dissolved oxygen (DO) from May 18th to October 4th in 2005-12 and 2015-19, and to evaluate variable effects and their importance. Simulations were not developed for 2013 and 2014 because of a large amount of missing meteorological data. The sites were: 1) Williamson River (WMR), in the northern portion of the lake near the mouth of the Williamson River, with a depth between 0.7 and 2.9 meters; 2) Rattlesnake Point (RPT), near the southern portion of the lake, with a depth between 1.9 and 3.4 meters; 3) Mid-North (MDN), in the northwest portion of the lake, with a depth between 2.4 and 4.2 meters; and 4) Mid-Trench (MDT), in the trench that runs along the western portion of the lake, with a depth between 13.2 and 15 meters.
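
    A minimal sketch of a boosted regression tree of this kind (the actual source code ships with the data release; the gbm package, the data frame uk, and the column names here are assumptions for illustration):

    library(gbm)

    # Daily maximum pH against meteorological/limnological predictors.
    brt <- gbm(max_pH ~ ., data = uk,
               distribution = "gaussian",
               n.trees = 2000, interaction.depth = 3,
               shrinkage = 0.01, cv.folds = 5)

    best <- gbm.perf(brt, method = "cv")   # choose tree count by cross-validation
    summary(brt, n.trees = best)           # relative influence of predictors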

  9. Type 1 error rates using all pathway information, i.e., no nodes or edges...

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    Cite
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh (2023). Type 1 error rates using all pathway information, i.e., no nodes or edges were dropped for these simulations. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008986.t001
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Error rates were calculated from score tests on 1000 simulated data sets; all simulations used graphs with 15, 30, or 45 nodes. “Perfect” indicates calculating from the graph used to generate the data. “Mismatch” indicates the percentage of direct edges that were incorrect; “Complete Mismatch” indicates 100% mismatch.

  10. Data from: The limits of the constant-rate birth-death prior for...

    • data.niaid.nih.gov
    • search.dataone.org
    • +2more
    zip
    Updated Nov 6, 2023
    + more versions
    Cite
    Mark Poulsen Khurana; Neil Scheidwasser-Clow; Matthew Penn; Samir Bhatt; David A. Duchêne (2023). The limits of the constant-rate birth-death prior for phylogenetic tree topology inference [Dataset]. http://doi.org/10.5061/dryad.2fqz612vg
    Explore at:
    zip (available download formats)
    Dataset updated
    Nov 6, 2023
    Dataset provided by
    University of Copenhagen
    University of Oxford
    Authors
    Mark Poulsen Khurana; Neil Scheidwasser-Clow; Matthew Penn; Samir Bhatt; David A. Duchêne
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Birth-death models are stochastic processes describing speciation and extinction through time and across taxa and are widely used in biology for inference of evolutionary timescales. Previous research has highlighted how the expected trees under constant-rate birth-death (crBD) tend to differ from empirical trees, for example with respect to the amount of phylogenetic imbalance. However, our understanding of how trees differ between crBD and the signal in empirical data remains incomplete. In this Point of View, we aim to expose the degree to which crBD differs from empirically inferred phylogenies and test the limits of the model in practice. Using a wide range of topology indices to compare crBD expectations against a comprehensive dataset of 1189 empirically estimated trees, we confirm that crBD trees frequently differ topologically from empirical trees. To place this in the context of standard practice in the field, we conducted a meta-analysis for a subset of the empirical studies. When comparing studies that used crBD priors with those that used other non-BD Bayesian and non-Bayesian methods, we do not find any significant differences in tree topology inferences. To scrutinize this finding for the case of highly imbalanced trees, we selected the 100 trees with the greatest imbalance from our dataset, simulated sequence data for these tree topologies under various evolutionary rates, and re-inferred the trees under maximum likelihood and using crBD in a Bayesian setting. We find that when the substitution rate is low, the crBD prior results in overly balanced trees, but the tendency is negligible when substitution rates are sufficiently high. Overall, our findings demonstrate the general robustness of crBD priors across a broad range of phylogenetic inference scenarios but also highlight that empirically observed phylogenetic imbalance is highly improbable under crBD, leading to systematic bias in data sets with limited information content.

    Methods

    Empirical trees used in the study are trees from the literature, collected by TimeTree (timetree.org).

    1. Run Tree_Selection.R to select the empirical phylogenetic trees to be included from TimeTree. The output file final_timetrees.RData contains the final subset of empirical phylogenetic TimeTree trees used for analysis, with anonymized tip labels.
    2. Run Simulation_And_Analysis.R to fit birth and death parameters (assuming rho = 1) for each of the 1189 empirical trees, simulate 1000 trees per empirical tree, calculate tree index values for both empirical and simulated trees, and calculate z-scores comparing the simulated and empirical trees. Note that calculating the tree index values for the simulated trees is VERY time-consuming due to the number of trees. Run Supplementary_Fig_S1_Analysis.R to generate data for Supplementary Figure S1.
    3. Run Meta_analysis.R to run the linear regression models investigating the role of the prior/analysis type for the subset (n = 300) of the included empirical trees. The metadata for the 300 trees can be found in the supplementary files (Table S3).
    4. Run Imbalance_Simulation.R to run the simulations for the imbalanced data subset (100 trees). Simulated sequences for each tree were run through RevBayes and IQ-TREE 2, as mentioned previously. Note: to avoid later confusion, the three substitution rates used (0.5, 0.05, 0.005) are referred to as Rates 2-4 in the code; there is no Rate 1 (apologies in advance for any confusion). The shell scripts to run the inferences in each software are as follows:
       4a. RevBayes: fasta_to_revbayes_code_rate2.sh, fasta_to_revbayes_code_rate3.sh, fasta_to_revbayes_code_rate4.sh. These use the .Rev files MCMC_Revbayes_code_rate2.Rev, MCMC_Revbayes_code_rate3.Rev, and MCMC_Revbayes_code_rate4.Rev, and rely on the supplementary .Rev files tree_BD.Rev, sub_JC.Rev, and clock_global.Rev.
       4b. IQ-TREE 2: fasta_to_iqtree_code.sh.
    5. Run Final_Figures.R to visualize the results.
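
    A minimal R sketch (not the study's code) of the core comparison described above: simulate constant-rate birth-death trees and compare a Colless-style imbalance index against an empirical tree. Rates and tree counts here are illustrative.

    # Simulate crBD trees with ape (the study fitted birth/death parameters
    # to each empirical tree first; fixed illustrative rates used here).
    library(ape)
    set.seed(1)
    sims <- replicate(100, rbdtree(birth = 1.0, death = 0.5, Tmax = 10),
                      simplify = FALSE)

    # Colless-style imbalance: per internal node, |tips left - tips right|,
    # summed over the tree (ape::balance gives the two tip counts per node).
    colless <- function(phy) {
      b <- balance(phy)
      sum(abs(b[, 1] - b[, 2]))
    }

    sim_idx <- sapply(sims, colless)
    # z-score of an empirical tree against the crBD expectation:
    # z <- (colless(empirical_tree) - mean(sim_idx)) / sd(sim_idx)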

  11. fmott/context_dependent_planning: v1.0

    • zenodo.org
    zip
    Updated May 9, 2022
    Cite
    fmott (2022). fmott/context_dependent_planning: v1.0 [Dataset]. http://doi.org/10.5281/zenodo.5112966
    Explore at:
    zip (available download formats)
    Dataset updated
    May 9, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    fmott
    License

    MIT License (https://opensource.org/licenses/MIT)
    License information was derived automatically

    Description

    Contains behavioural raw data, fMRI statistical maps and analysis scripts to reproduce the result of the study.

    Description

    • Data

      • behaviour: Raw behavioural data of all 40 participants.

      • fmri: fMRI statistical maps underlying Figure 4, Figure 5 and Tables S2-S5. Contains individual contrast images for all 40 participants and group level unthresholded T-maps. Also contains masks of the dorsal anterior cingulate cortex (dACC) and the dorsolateral prefrontal cortex (dlPFC), used for small volume correction.

      • model: Contains files with posterior samples and summary statistics of the three computational models Planning (PM), Simple (SM) and Hybrid (HM) and the reaction time analysis. Contains posterior predictive simulations for HM. Also contains the leave-one-out information criterion (LOOIC) for the PM, SM and HM used for model comparison.

    • Code

      • generate_figure2.py: Generates Figure 2.

      • generate_figure3.py: Generates Figure 3.

      • generate_S2_figure.py: Generates S2 Figure.

      • generate_S1_table.R: Logistic regression of choice against task features underlying S1 table (see the sketch after this list).

      • fit_choice_models.ipynb: Jupyter notebook implementing model fitting and computation of LOOIC for the three models PM, SM and HM.

      • fit_RT_model.ipynb: Jupyter notebook implementing hierarchical Bayesian linear regression of response times against conflict and context.

      • parameter_recovery.ipynb: Jupyter notebook implementing parameter recovery analysis for model HM.

      • backward_induction.py: Function used to compute expected long-term values via backward induction.

      • simulate_agents.py: Simulate behavior of agents using a planning, simple, hybrid or random strategy.
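
    A minimal sketch of the S1-table logistic regression named above (column names are hypothetical; the real ones are in the behavioural data files):

    # Binary choice regressed on task features with a logit link.
    fit <- glm(choice ~ conflict + context + trial_number,
               family = binomial(link = "logit"), data = trials)
    summary(fit)     # coefficients on the log-odds scale
    exp(coef(fit))   # odds ratios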

  12. Geographically Varying Correlates of Car Non-Ownership in Census Output...

    • datacatalogue.ukdataservice.ac.uk
    Updated Apr 2, 2009
    + more versions
    Cite
    Harris, R., University of Bristol, School of Geographical Sciences; Grose, D., Lancaster University, Centre for e-Science (2009). Geographically Varying Correlates of Car Non-Ownership in Census Output Areas of England, 2001 [Dataset]. http://doi.org/10.5255/UKDA-SN-6100-1
    Explore at:
    Dataset updated
    Apr 2, 2009
    Dataset provided by
    UK Data Service (https://ukdataservice.ac.uk/)
    Authors
    Harris, R., University of Bristol, School of Geographical Sciences; Grose, D., Lancaster University, Centre for e-Science
    Area covered
    England
    Description

    Standard indexes of poverty and deprivation are rarely sensitive to how the causes and consequences of deprivation have different impacts depending upon where a person lives. More geographically minded approaches are alert to spatial variations but are also difficult to compute using desktop PCs.

    The aim of the ESRC sponsored project was to develop a method of spatial analysis known as ‘geographically weighted regression’ (GWR) to run in the high power computing environment offered by ‘Grid computation’ and e-social science. GWR, like many other methods of spatial analysis, is characterised by multiple repeat testing as the data are divided into geographical regions and also randomly redistributed many times to simulate the likelihood that the results obtained from the analysis are actually due to chance. Each of these tests requires computer time so, given a large dataset such as the UK Census statistics, running the analysis on a standard machine can take a long time! Fortunately, the computational grid is not standard but offers the possibility to speed up the process by running GWR’s sequences of calibration, analysis and non-parametric simulation in parallel.

    An output is a model of the geographically varying correlates of car non-ownership fitted for the 165,665 Census Output Areas in England. Specifically, a geographically weighted regression of the relationship between the proportion of households without a car (or van) in 2001 (the dependent variable), and the following predictor variables: proportion of persons of working age unemployed; proportion of households in public housing; proportion of households that are lone parent households; proportion of persons 16 or above that are single; and proportion of persons that are white British.
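
    A minimal sketch of such a model using the spgwr package (this is not the project's Grid-enabled implementation; the data frame oa, its columns, and the coordinate fields are hypothetical names):

    library(spgwr)

    # Output-area centroids (hypothetical column names).
    coords <- cbind(oa$easting, oa$northing)

    # Calibrate the kernel bandwidth by cross-validation...
    bw <- gwr.sel(no_car ~ unemployed + public_housing + lone_parent +
                    single + white_british,
                  data = oa, coords = coords)

    # ...then fit locally varying coefficients at each output area.
    fit <- gwr(no_car ~ unemployed + public_housing + lone_parent +
                 single + white_british,
               data = oa, coords = coords, bandwidth = bw)
    fit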

    Note - the file does not contain Census 2001 data, only National Grid references and regression coefficients.

    Further information is available from the Grid Enabled Spatial Regression Models (With Application to Deprivation Indices) web page.

  13. Type 1 error rates using pathways with 5% missing edges.

    • plos.figshare.com
    xls
    Updated Jun 9, 2023
    + more versions
    Cite
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh (2023). Type 1 error rates using pathways with 5% missing edges. [Dataset]. http://doi.org/10.1371/journal.pcbi.1008986.t002
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Charlie M. Carpenter; Weiming Zhang; Lucas Gillenwater; Cameron Severn; Tusharkanti Ghosh; Russell Bowler; Katerina Kechris; Debashis Ghosh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of medium edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. Medium edge density graphs are created by giving any 2 nodes without a direct edge between them a 5% chance of becoming directly connected. This creates graphs with an average edge density of 0.18, 0.12, and 0.09 for graphs with 15, 30, and 45 nodes, respectively. “Perfect” indicates calculating from the graph without changing remaining edges. “Mismatch” indicates the percentage of remaining direct edges that were incorrect. “Complete Mismatch” indicates 100% mismatch.
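
    A minimal igraph sketch (not the authors' code) of the graph construction just described: draw a Barabasi-Albert graph, then give each non-adjacent node pair a 5% chance of a new edge to produce the medium-density version.

    library(igraph)
    set.seed(1)

    g <- sample_pa(30, directed = FALSE)   # low-density scale-free graph
    edge_density(g)

    # 5% chance of directly connecting each pair of non-adjacent nodes.
    pairs <- t(combn(vcount(g), 2))
    for (i in seq_len(nrow(pairs))) {
      a <- pairs[i, 1]; b <- pairs[i, 2]
      if (!are_adjacent(g, a, b) && runif(1) < 0.05)
        g <- add_edges(g, c(a, b))
    }
    edge_density(g)                        # medium-density version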

  14. Data from: ICARIA: spatially distributed climate projections from...

    • produccioncientifica.ucm.es
    • data.niaid.nih.gov
    • +2more
    Updated 2024
    + more versions
    Cite
    Paradinas, C; Prado, C.; Galiano, L.; Redolat, Darío; Monjo, R.; Gaitan, E. (2024). ICARIA: spatially distributed climate projections from statistical downscaling [Dataset]. https://produccioncientifica.ucm.es/documentos/67321d3daea56d4af0484391
    Explore at:
    Dataset updated
    2024
    Authors
    Paradinas, C; Prado, C.; Galiano, L.; Redolat, Darío; Monjo, R.; Gaitan, E.
    Description

    One of the main purposes of the ICARIA project was to develop coherent, reliable and usable downscaled climate projections from the latest CMIP6 ensemble, in order to build the basis for efficient support of climate adaptation and of decision-making by the related stakeholders, supporting the adaptation of critical assets within the project. These projections were also obtained with the purpose of being freely available for further use in subsequent studies and, hence, of fostering adaptation to climate change in more areas. ICARIA's climate information is therefore based on CMIP6 models and incorporates the current SSPs in its workflow. The high-resolution future climate projections presented here form a unique dataset, obtained from a high-quality and high-density set of weather observations that are then interpolated to the case studies of interest on a 100x100 m resolution grid, which is the main outcome offered in this publication. These models provide the scenarios to be considered within the Risk Assessment and the design and development of all adaptation measures coming out of ICARIA.

    For further details, here is a brief summary of the methodology followed:

    The statistical downscaling methodology applied in ICARIA by FIC, named FICLIMA (Ribalaygua et al. 2013), consists of a two-step analogue/regression statistical method that has been used in national and international projects with good verification results (e.g., Monjo et al. 2016). The first step is common to all simulated climate variables and is based on an analogue stratification (Zorita et al. 1993): under the hypothesis that 'analogue' atmospheric patterns (predictors) should cause analogue local effects (predictands), the days most similar to the day to be downscaled are selected. The similarity between any two days was measured across three nested synoptic windows (with different weights) and four large-scale predictor fields, using a pseudo-Euclidean distance between the fields. For each predictor, the weighted Euclidean distance was calculated and standardised by substituting it with the closest percentile of a reference population of weighted Euclidean distances for that predictor. The analogue approach reproduces nonlinear relationships between predictors and predictands well, but it cannot simulate values outside the range of observed values. To overcome this problem and obtain a better simulation, a second step was required.

    For this second step, the procedure applied depends on the variable of interest. For temperature, a multiple linear regression over the selected most analogous days was performed for each station and each problem day; from a group of potential predictors, the regression selected those with the highest correlation, using a forward-and-backward stepwise approach.
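
    As a minimal sketch of that temperature step (not FICLIMA's actual code), base R's step() performs a forward-and-backward stepwise selection, here by AIC rather than correlation; the data frame analogs and its predictor columns are invented names for the most analogous days:

    # Candidate predictors for one station and one problem day (names invented).
    full <- lm(temp ~ geopotential_500 + thickness + wind_u + wind_v,
               data = analogs)

    # Forward-and-backward stepwise selection.
    sel <- step(full, direction = "both", trace = 0)
    summary(sel)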

    For precipitation, a group of m problem days (all the days of a month) is downscaled at once. For each problem day we obtain a “preliminary precipitation amount” by averaging the rain amounts of its n most analogous days, so the m problem days can be sorted from the highest to the lowest “preliminary precipitation amount”. To assign the final precipitation amounts, all amounts of the m×n analogous days are sorted and clustered into m groups; each quantity is then assigned, in order, to the m days previously sorted by “preliminary precipitation amount”.

    For wind or relative humidity, the second step is a transfer function between the observed probability distribution and the simulated one using the averaged values from the n = 30 analogous days. Particularly, a parametric bias correction was performed to the time series obtained from the analogue stratification (first step). In order to estimate the improvement of this procedure, the bias correction was also applied to the direct model outputs.

    This second step, done at a daily scale with a thorough inner verification procedure, is essential and is the main differentiating feature of the FICLIMA method. It extends beyond mean values to include extremes and covers all time scales, including daily intervals. The verification shows whether the method correctly simulates changes from one day to the next, indicating an effective capture of the underlying physical connections between predictors and predictands. These physical links remain relatively consistent even in the face of climate change (as opposed to purely empirical relationships, which might shift). In essence, this approach theoretically addresses the primary challenge in statistical downscaling, known as the non-stationarity problem: whether predictor/predictand relationships established in the past will persist in the future.

    The dataset shared here includes information for the three case studies tackled in ICARIA: Barcelona Metropolitan Area (AMB), Salzburg Region (SLZ), and South Aegean Region (SAR). The information provided covers data and outcomes from 10 CMIP6 models. Each model has a historical archive from 01/01/1950 to 31/12/2014 and four future scenarios (ssp126, ssp245, ssp370 and ssp585) ranging from 01/01/2015 to 31/12/2100. The selected models are detailed in Table 1:

    Table 1. Information about the 10 climate models from the Coupled Model Intercomparison Project Phase 6 (CMIP6) corresponding to the IPCC AR6. Models were retrieved from the Earth System Grid Federation (ESGF) portal in support of the Program for Climate Model Diagnosis and Intercomparison (PCMDI).

    | CMIP6 model   | Resolution      | Responsible centre                                                           | Reference                  |
    |---------------|-----------------|------------------------------------------------------------------------------|----------------------------|
    | ACCESS-CM2    | 1.875° x 1.250° | Australian Community Climate and Earth System Simulator (ACCESS), Australia  | Bi, D. et al. (2020)       |
    | BCC-CSM2-MR   | 1.125° x 1.121° | Beijing Climate Center (BCC), China Meteorological Administration, China     | Wu, T. et al. (2019)       |
    | CanESM5       | 2.812° x 2.790° | Canadian Centre for Climate Modelling and Analysis (CC-CMA), Canada          | Swart, N.C. et al. (2019)  |
    | CMCC-ESM2     | 1.000° x 1.000° | Centro Euro-Mediterraneo sui Cambiamenti Climatici (CMCC)                    | Cherchi et al. (2018)      |
    | CNRM-ESM2-1   | 1.406° x 1.401° | Centre National de Recherches Météorologiques (CNRM), Météo-France, France   | Séférian, R. (2019)        |
    | EC-EARTH3     | 0.703° x 0.702° | EC-EARTH Consortium                                                          | EC-Earth Consortium (2019) |
    | MPI-ESM1-2-HR | 0.938° x 0.935° | Max Planck Institute for Meteorology (MPI-M), Germany                        | Müller et al. (2018)       |
    | MRI-ESM2-0    | 1.125° x 1.121° | Meteorological Research Institute (MRI), Japan                               | Yukimoto, S. et al. (2019) |
    | NorESM2-MM    | 1.250° x 0.942° | Norwegian Climate Centre (NCC), Norway                                       | Bentsen, M. et al. (2019)  |
    | UKESM1-0-LL   | 1.875° x 1.250° | UK Met Office, Hadley Centre, United Kingdom                                 | Good, P. et al. (2019)     |

    The climate projections have been developed over each of the observational locations retrieved to run the statistical downscaling. The results from these projections have been spatially interpolated onto a 100x100 m grid with a multiple linear regression model incorporating diverse adjustments and topographic corrections. The results presented here are the median of the 10 models, obtained for each of the 4 SSPs and each of the time periods considered in ICARIA until the year 2100. The variables treated are the main climate variables and their related extreme indicators as defined during the ICARIA project. A summary of all the variables and indicators used to develop the projections is given in Table 2.

    Table 2. Summary of selected thermal and precipitation indicators, grouped according to the main hazards they feed. “nd” = number of days; “ne” = number of events.

    | Index/name   | Short description                              | Source                  | Variable  | Units | Threshold             |
    |--------------|------------------------------------------------|-------------------------|-----------|-------|-----------------------|
    | Thermal indicators |                                          |                         |           |       |                       |
    | TX90 / TX10  | Warm/cold days                                 | Zhang et al. (2011)     | TX        | nd    | 90 / 10%              |
    | HD           | Heat day                                       | ICARIA                  | TX        | nd    | 30 °C                 |
    | EHD          | Extreme heat day                               | ICARIA                  | TX        | nd    | 35 °C                 |
    | TR           | Tropical nights                                | Zhang et al. (2011)     | TN        | nd    | 20 °C                 |
    | EQ           | Equatorial nights                              | AEMet 2020, ICARIA      | TN        | nd    | 25 °C                 |
    | IN           | Infernal nights                                | ICARIA                  | TN        | nd    | 30 °C                 |
    | FD           | Frost days                                     | Zhang et al. (2011)     | TN        | nd    | < 0 °C                |
    | Max consec   | Max spell length for above thermal indicators  | ICARIA                  | -         | nd    | -                     |
    | Nº events    | Number of above thermal indicator events       | ICARIA                  | -         | ne    | 3 days                |
    | TXm          | Mean maximum temperatures                      | ICARIA                  | TX        | °C    | -                     |
    | TNm          | Mean minimum temperatures                      | ICARIA                  | TN        | °C    | -                     |
    | TM           | Mean temperatures                              | ICARIA                  | TA        | °C    | -                     |
    | HWle         | Heatwave length                                | ICARIA                  | TX        | nd    | 3d > 95% TX           |
    | HWim/HWix    | Mean and maximum heatwave intensity            | ICARIA                  | TX        | °C    | 3d > 95% TX           |
    | HWf          | Heatwave frequency                             | ICARIA                  | TX        | ne    | 3d > 95% TX           |
    | HWd          | Heatwave days                                  | ICARIA                  | TX        | nd    | 3d > 95% TX           |
    | HI - P90     | Heat Index (90th percentile)                   | NWS (1994)              | TX, RH    | °C    | TX > 27 °C, RH > 40%  |
    | UTCI         | Universal Thermal Climate Index                | Bröde et al. (2012)     | TA, RH, W | -     | -                     |
    | UHI          | Urban heat island (BCN), annual and seasonal   | AMB, Metrobs 2015       | T         | °C    | TM1-TM2 > 0 °C        |
    | Precipitation indicators |                                    |                         |           |       |                       |
    | R20          | Number of heavy precipitation days             | Zhang et al. (2011)     | P         | nd    | 20 mm                 |
    | R50, R100    | Days with extreme heavy rain                   | AMB et al. (2017)       | P         | nd    | 50 mm / 100 mm        |
    | Ra           | Yearly and seasonal rainfall relative change   | ICARIA                  | P         | mm    | ≥ 0.1 mm              |
    | IDF - CCF    | IDF Curves - Climate Change Factor             | Arnbjerg-Nielsen (2012) | P         | -     | ≥ 0.1 mm              |
    | Forest fire  |                                                |                         |           |       |                       |

  15. Numerical data used to generate all graphs and figures.

    • plos.figshare.com
    xlsx
    Updated Jun 2, 2025
    Cite
    Vijaykumar S. Jatti; A. Saiyathibrahim; Arvind Yadav; Murali Krishnan R.; B. Jayaprakash; Sumit Kaushal; Vinaykumar S. Jatti; Ashwini V. Jatti; Savita V. Jatti; Abhinav Kumar; Soumaya Gouadria; Ebenezer Bonyah (2025). Numerical data used to generate all graphs and figures. [Dataset]. http://doi.org/10.1371/journal.pone.0324049.s001
    Explore at:
    xlsx (available download formats)
    Dataset updated
    Jun 2, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Vijaykumar S. Jatti; A. Saiyathibrahim; Arvind Yadav; Murali Krishnan R.; B. Jayaprakash; Sumit Kaushal; Vinaykumar S. Jatti; Ashwini V. Jatti; Savita V. Jatti; Abhinav Kumar; Soumaya Gouadria; Ebenezer Bonyah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Numerical data used to generate all graphs and figures.

