Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
A single regression model is unlikely to hold throughout a large and complex spatial domain. A finite mixture of regression models can address this issue by clustering the data and assigning a regression model to explain each homogeneous group. However, a typical finite mixture of regressions does not account for spatial dependencies. Furthermore, the number of components selected can be too high in the presence of skewed data and/or heavy tails. Here, we propose a mixture of regression models on a Markov random field with skewed distributions. The proposed model identifies the locations wherein the relationship between the predictors and the response is similar and estimates the model within each group as well as the number of groups. Overfitting is addressed by using skewed distributions, such as the skew-t or normal inverse Gaussian, in the error term of each regression model. Model estimation is carried out using an EM algorithm, and the performance of the estimators and model selection are illustrated through an extensive simulation study and two case studies.
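The full model described above (Markov random field spatial prior, skew-t or normal inverse Gaussian errors) is involved; as a minimal sketch of the EM idea only, the following fits a two-component mixture of simple linear regressions with Gaussian errors. The spatial prior and the skewed error terms are deliberately omitted, and all names are hypothetical.

```python
import math

def em_mixture_regression(x, y, iters=200):
    """Minimal EM for a 2-component mixture of simple linear regressions
    with Gaussian errors (no spatial prior, no skewed errors)."""
    n = len(x)
    # Initialize both components at the pooled OLS fit, with offset intercepts.
    mx, my = sum(x) / n, sum(y) / n
    b0 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    a0 = my - b0 * mx
    comps = [{"a": a0 - 2.0, "b": b0, "s": 1.0, "pi": 0.5},
             {"a": a0 + 2.0, "b": b0, "s": 1.0, "pi": 0.5}]
    for _ in range(iters):
        # E-step: responsibilities from the component Gaussian densities.
        resp = []
        for xi, yi in zip(x, y):
            dens = [c["pi"] / (c["s"] * math.sqrt(2 * math.pi)) *
                    math.exp(-0.5 * ((yi - c["a"] - c["b"] * xi) / c["s"]) ** 2)
                    for c in comps]
            tot = sum(dens) or 1e-300
            resp.append([d / tot for d in dens])
        # M-step: weighted least squares per component.
        for k, c in enumerate(comps):
            w = [r[k] for r in resp]
            sw = sum(w)
            mxw = sum(wi * xi for wi, xi in zip(w, x)) / sw
            myw = sum(wi * yi for wi, yi in zip(w, y)) / sw
            num = sum(wi * (xi - mxw) * (yi - myw)
                      for wi, xi, yi in zip(w, x, y))
            den = sum(wi * (xi - mxw) ** 2 for wi, xi in zip(w, x))
            c["b"] = num / max(den, 1e-12)
            c["a"] = myw - c["b"] * mxw
            sse = sum(wi * (yi - c["a"] - c["b"] * xi) ** 2
                      for wi, xi, yi in zip(w, x, y))
            c["s"] = max(math.sqrt(sse / sw), 1e-6)
            c["pi"] = sw / n
    return comps
```

Each E-step computes responsibilities from the component densities; each M-step refits a weighted least-squares line per component. The proposed model replaces the Gaussian density with a skewed one and couples the responsibilities across neighboring locations.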
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
We propose a multivariate normality test against skew normal distributions using higher-order log-likelihood derivatives, which is asymptotically equivalent to the likelihood ratio but only requires estimation under the null. Numerically, it is the supremum of the univariate skewness coefficient test over all linear combinations of the variables. We can simulate its exact finite sample distribution for any multivariate dimension and sample size. Our Monte Carlo exercises confirm its power advantages over alternative approaches. Finally, we apply it to the joint distribution of US city sizes in two consecutive censuses, finding that non-normality is very clearly seen in their growth rates.
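The characterization above, the supremum of the univariate skewness coefficient over all linear combinations, can be illustrated by brute force in two dimensions: scan directions on the half-circle, project the data, and take the largest absolute skewness. This is an illustrative sketch with hypothetical names, not the paper's test procedure.

```python
import math

def skewness(z):
    """Univariate skewness coefficient m3 / m2**1.5."""
    n = len(z)
    m = sum(z) / n
    m2 = sum((v - m) ** 2 for v in z) / n
    m3 = sum((v - m) ** 3 for v in z) / n
    return m3 / m2 ** 1.5

def max_projected_skewness(xs, ys, n_dirs=360):
    """Brute-force the supremum of |skewness| of the projection u'X over
    unit directions u for 2-D data; directions in [0, pi) suffice because
    skewness only changes sign when u is negated."""
    best = 0.0
    for k in range(n_dirs):
        t = math.pi * k / n_dirs
        proj = [math.cos(t) * x + math.sin(t) * y for x, y in zip(xs, ys)]
        best = max(best, abs(skewness(proj)))
    return best
```

Simulating the distribution of this maximized statistic under the null, for the given sample size and dimension, is what yields exact finite-sample critical values of the kind the abstract describes.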
While classical measurement error in the dependent variable in a linear regression framework results only in a loss of precision, nonclassical measurement error can lead to estimates which are biased and inference which lacks power. Here, we consider a particular type of nonclassical measurement error: skewed errors. Unfortunately, skewed measurement error is likely to be a relatively common feature of many outcomes of interest in political science research. This study highlights the bias that can result even from relatively "small" amounts of skewed measurement error, particularly if the measurement error is heteroskedastic. We also assess potential solutions to this problem, focusing on the stochastic frontier model and nonlinear least squares. Simulations and three replications highlight the importance of thinking carefully about skewed measurement error, as well as appropriate solutions.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Supplementary Material 2: A supplementary file with examples of Stata script for all models that have been fitted in this paper.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This project examines whether people have an intrinsic preference for negatively skewed or positively skewed information structures and how these preferences relate to intrinsic preferences for informativeness. It reports results from 5 studies (3 lab experiments, 2 online studies).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Supplementary Material 3: A supplementary file with examples of SAS script for all models that have been fitted in this paper.
This dataset is part of a series of datasets, where batteries are continuously cycled with randomly generated current profiles. Reference charging and discharging cycles are also performed after a fixed interval of randomized usage to provide reference benchmarks for battery state of health. In this dataset, four 18650 Li-ion batteries (identified as RW25, RW26, RW27, and RW28) were continuously operated by repeatedly charging them to 4.2 V and then discharging them to 3.2 V using a randomized sequence of discharging currents between 0.5 A and 5 A. This type of discharging profile is referred to here as random walk (RW) discharging. A customized probability distribution is used in this experiment to select a new load setpoint every 1 minute during RW discharging operation. The custom probability distribution was designed to be skewed towards selecting higher currents. The ambient temperature at which the batteries were cycled was held at approximately 40 °C for these experiments.
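The exact custom distribution used in these experiments is not reproduced in this description, so the sketch below stands in for it with selection weights proportional to the current, which skews draws toward higher loads as described; the helper name and the weighting rule are assumptions, not the dataset's actual protocol.

```python
import random

# Candidate discharge currents (A), spanning the dataset's 0.5 A to 5 A range.
CURRENTS = [0.5 * k for k in range(1, 11)]  # 0.5, 1.0, ..., 5.0

def next_setpoint(rng=random):
    """Draw one random-walk load setpoint (selected once per minute during
    RW discharging). Weighting each current by its own value is a
    hypothetical stand-in for the unpublished skewed distribution."""
    return rng.choices(CURRENTS, weights=CURRENTS, k=1)[0]
```

Under this weighting the expected setpoint is about 3.5 A rather than the 2.75 A midpoint, illustrating the skew toward higher currents.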
Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold’s (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or additive genetic covariances between relative fitness and trait (Robe...
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Trade-offs are a fundamental concept in evolutionary biology because they are thought to explain much of nature's biological diversity, from variation in life-histories to differences in metabolism. Despite the predicted importance of trade-offs, they are notoriously difficult to detect. Here we contribute to the existing rich theoretical literature on trade-offs by examining how the shape of the distribution of resources or metabolites acquired in an allocation pathway influences the strength of trade-offs between traits. We further explore how variation in resource distribution interacts with two aspects of pathway complexity (i.e., the number of branches and hierarchical structure) to affect trade-offs. We simulate variation in the shape of the distribution of a resource by sampling 10^6 individuals from a beta distribution with varying parameters to alter the resource shape. In a simple "Y-model" allocation of resources to two traits, any variation in a resource leads to slopes shallower than -1, with left-skewed and symmetrical distributions leading to negative relationships between traits, and highly right-skewed distributions associated with positive relationships between traits. Adding more branches further weakens negative and positive relationships between traits, and the hierarchical structure of pathways typically weakens relationships between traits, although in some contexts hierarchical complexity can strengthen positive relationships between traits. Our results further illuminate how variation in the acquisition and allocation of resources, and particularly the shape of a resource distribution and how it interacts with pathway complexity, makes it challenging to detect trade-offs. We offer several practical suggestions on how to detect trade-offs given these challenges.
Methods

Overview of Flux Simulations

To study the strength and direction of trade-offs within a population, we developed a simulation of flux in a simple metabolic pathway, where a precursor metabolite emerging from node A may either be converted to metabolic products B1 or B2 (Fig. 1). This conception of a pathway is similar to De Jong and Van Noordwijk's Y-model (Van Noordwijk & De Jong, 1986; De Jong & Van Noordwijk, 1992), but we used simulation instead of analytical statistical models to allow us to consider greater complexity in the distribution of variables and pathways. For a simple pathway (Fig. 1), the total flux Jtotal (i.e., the flux at node A, denoted as JA) for each individual (N = 10^6) was first sampled from a predetermined beta distribution as described below. The flux at node B1 (JB1) was then randomly sampled from this distribution with max = Jtotal = JA and min = 0. The flux at the remaining node, B2, was then simply the remaining flux (JB2 = JA - JB1). Simulations of more complex pathways followed the same basic approach as described above, with increased numbers of branches and hierarchical levels added to the pathway as described below under Question 2. The metabolic pathways were simulated using Python (v. 3.8.2) (Van Rossum & Drake Jr., 2009), where we could control the underlying distribution of metabolite allocation. The output flux at nodes B1 and B2 was plotted using R (v. 4.2.1) (R Core Team, 2022), with the resulting trade-off visualized as a linear regression using the ggplot2 R package (v. 3.4.2) (Wickham, 2016). While we have conceptualized the pathway as the flux of metabolites, it could be thought of as any resource being allocated to different traits.

Question 1: How does variation in resource distribution within a population affect the strength and direction of trade-offs?
We first simulated the simplest scenario where all individuals had the same total flux Jtotal = 1, in which case the phenotypic trade-off is expected to be most easily detected. We then modified this initial scenario to explore how variation in the distribution of resource acquisition (Jtotal) affected the strength and direction of trade-offs. Specifically, the resource distribution was systematically varied by sampling n = 10^3 total flux levels from a beta distribution, which has two parameters alpha and beta that control the shape of the distribution (Miller & Miller, 1999). When alpha is large and beta is small, the distribution is left-skewed, whereas for small alpha and large beta, the distribution is right-skewed. Likewise, for alpha = beta, the curve is symmetrical and approximately normal when the parameters are sufficiently large (>2). We can thus systematically vary the underlying resource distribution of a population by iterating through values of alpha and beta from 0.5 to 5 (in increments of 0.5), which was done using the NumPy Python package (v. 1.19.1) (Harris et al., 2020). The resulting slope of each linear regression of the flux at B1 and B2 (i.e., the two branching nodes) was then calculated using the lm function in R and plotted as a contour map using the latticeExtra R package (v. 0.6-30) (Sarkar, 2008).

Question 2: How does the complexity of the pathway used to produce traits affect the strength and direction of trade-offs?

Metabolic pathways are typically more complex than what is described above. Most pathways consist of multiple branch points and multiple hierarchical levels. To understand how complexity affects the ability to detect trade-offs when combined with variation in the distribution of total flux, we systematically manipulated the number of branch points and hierarchical levels within pathways (Fig. 1).
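The flux simulation at the heart of Questions 1 and 2 can be sketched as follows. One simplifying assumption to flag: here JB1 is drawn uniformly on [0, Jtotal], whereas the paper resamples from the beta distribution rescaled to that range; the sample size is also reduced, and the function name is hypothetical.

```python
import random

def branch_slope(alpha, beta, n_branches=2, n=10000, seed=0):
    """Sample total flux J_total ~ Beta(alpha, beta) for each individual,
    allocate a share to branch B1 (uniform on [0, J_total] -- a simplifying
    assumption), split the remainder evenly among the other branches, and
    return the OLS slope of J_B2 on J_B1."""
    rng = random.Random(seed)
    jb1, jb2 = [], []
    for _ in range(n):
        j_total = rng.betavariate(alpha, beta)
        j1 = rng.uniform(0.0, j_total)                 # J_B1: min = 0, max = J_total
        jb1.append(j1)
        jb2.append((j_total - j1) / (n_branches - 1))  # even split of the remainder
    m1, m2 = sum(jb1) / n, sum(jb2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(jb1, jb2)) / n
    var = sum((a - m1) ** 2 for a in jb1) / n
    return cov / var
```

Sweeping alpha and beta over 0.5 to 5 in increments of 0.5 and recording `branch_slope(alpha, beta)` reproduces the kind of slope grid the contour maps are built from; increasing `n_branches` shows the weakening effect of extra branches.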
We first explored the effect of adding branches to the pathway from the same node, such that instead of only branching off to nodes B1 and B2, the pathway branched to nodes B1 through to Bn (Fig. 1B), where n is the total number of branches (maximum n = 10 branches). Flux at a node was calculated as previously described, and the remaining flux was evenly distributed amongst the remaining nodes (i.e., nodes B2 through to Bn would each receive J2-n = (Jtotal - JB1)/(n - 1) flux). For each pathway, we simulated flux using a beta distribution of Jtotal with alpha = 5, beta = 0.5 to simulate a left-skewed distribution, alpha = beta = 5 to simulate a normal distribution, and with alpha = 0.5, beta = 5 to simulate a right-skewed distribution, as well as the simplest case where all individuals have total flux Jtotal = 1. We next considered how adding hierarchical levels to a metabolic pathway affected trade-offs. We modified our initial pathway with node A branching to nodes B1 and B2, and then node B2 further branched to nodes C1 and C2 (Fig. 1C). To compute the flux at the two new nodes C1 and C2, we simply repeated the same calculation as before, but using the flux at node B2, JB2, as the total flux. That is, the flux at node C1 was obtained by randomly sampling from the distribution at B2 with max = JB2 and min = 0, and the flux at node C2 is the remaining flux (JC2 = JB2 - JC1). Much like in the previous scenario with multiple branch points, we used three beta distributions (with the same parameters as before) to represent left-skewed, normal, and right-skewed resource distributions, as well as the simplest case where Jtotal = 1 for all individuals.

Quantile Regressions

We performed quantile regression to understand whether this approach could help to detect trade-offs.
Quantile regression is a form of statistical analysis that fits a curve through upper or lower quantiles of the data to assess whether an independent variable potentially sets a lower or upper limit to a response variable (Cade et al., 1999). This type of analysis is particularly useful when it is thought that an independent variable places a constraint on a response variable, yet variation in the response variable is influenced by many additional factors that add "noise" to the data, making a simple bivariate relationship difficult to detect (Thomson et al., 1996). Quantile regression is an extension of ordinary least squares regression: whereas ordinary least squares fits the best line through the conditional mean of the data, the quantile regression at the 50th percentile fits the conditional median. In addition to performing ordinary least squares regression for each pairwise comparison between the four nodes (B1, B2, C1, C2), we performed a series of quantile regressions using the ggplot2 R package (v. 3.4.2), where only the qth quantile was used for the regression (q = 0.99 and 0.95 to 0.5 in increments of 0.05, see Fig. S1) (Cade et al., 1999).
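Quantile regression minimizes the pinball (check) loss at quantile q. As a toy illustration of that objective only, written in Python rather than the R machinery used in the analysis, a subgradient-descent fit might look like this; the function name and tuning constants are assumptions.

```python
def pinball_quantile_fit(x, y, q, lr=0.1, steps=8000):
    """Fit y ~ a + b*x at quantile q by subgradient descent on the pinball
    (check) loss; a toy stand-in for proper quantile regression solvers."""
    a = b = 0.0
    n = len(x)
    for t in range(steps):
        step = lr / (1 + 0.01 * t)       # decaying step size
        ga = gb = 0.0
        for xi, yi in zip(x, y):
            r = yi - (a + b * xi)
            w = q if r > 0 else q - 1.0  # subgradient of the check loss
            ga -= w
            gb -= w * xi
        a -= step * ga / n
        b -= step * gb / n
    return a, b
```

With q = 0.5 this recovers a median fit; moving q toward 0.99 traces the upper edge of the point cloud, which is what makes the method useful for detecting a constraint boundary rather than a mean trend.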
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
In the fixed-effects stochastic frontier model an efficiency measure relative to the best firm in the sample is universally employed. This paper considers a new measure relative to the worst firm in the sample. We find that estimates of this measure have smaller bias than those of the traditional measure when the sample consists of many firms near the efficient frontier. Moreover, a two-sided measure relative to both the best and the worst firms is proposed. Simulations suggest that the new measures may be preferred depending on the skewness of the inefficiency distribution and the scale of efficiency differences.
This dataset is part of a series of datasets, where batteries are continuously cycled with randomly generated current profiles. Reference charging and discharging cycles are also performed after a fixed interval of randomized usage to provide reference benchmarks for battery state of health. In this dataset, four 18650 Li-ion batteries (identified as RW17, RW18, RW19, and RW20) were continuously operated by repeatedly charging them to 4.2 V and then discharging them to 3.2 V using a randomized sequence of discharging currents between 0.5 A and 5 A. This type of discharging profile is referred to here as random walk (RW) discharging. A customized probability distribution is used in this experiment to select a new load setpoint every 1 minute during RW discharging operation. The custom probability distribution was designed to be skewed towards selecting higher currents.
To improve flood-frequency estimates at rural streams in Mississippi, annual exceedance probability (AEP) flows at gaged streams and regional-regression equations for estimating AEP flows at ungaged streams were developed using current geospatial data, additional statistical methods, and annual peak-flow data through the 2013 water year. The regional-regression equations were derived from statistical analyses of peak-flow data, basin characteristics associated with 281 streamgages, the generalized skew from Bulletin 17B (Interagency Advisory Committee on Water Data, 1982), and a newly developed study-specific skew for select four-digit hydrologic unit code (HUC4) watersheds in Mississippi. Four flood regions were identified based on residuals from the regional-regression analyses. No analysis was conducted for streams in the Mississippi Alluvial Plain flood region because of a lack of long-term streamflow data and poorly defined basin characteristics. Flood regions containing sites with similar basin and climatic characteristics yielded better regional-regression equations with lower error percentages. The generalized least-squares method was used to develop the final regression models for each flood region for AEP flows. The peak-flow statistics were estimated by fitting a log-Pearson Type III distribution to records of annual peak flows and then applying two additional statistical methods: (1) the expected moments algorithm, to help describe uncertainty in annual peak flows and to better represent missing and historical record; and (2) the generalized multiple Grubbs-Beck test, to screen out potentially influential low outliers and to better fit the upper end of the peak-flow distribution. Standard errors of prediction of the generalized least-squares models ranged from 28 to 46 percent. Pseudo coefficients of determination of the models ranged from 91 to 96 percent.
Flood Region A, located in north-central Mississippi, contained 27 streamgages with drainage areas that ranged from 1.41 to 612 square miles. The 1% annual exceedance probability had a standard error of prediction of 31 percent, which was lower than the prediction errors in Flood Regions B and C.
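For context, the core log-Pearson Type III fit can be sketched by method of moments with the Wilson-Hilferty frequency-factor approximation. This is a textbook simplification, not the Bulletin 17B / expected-moments-algorithm procedure with the generalized multiple Grubbs-Beck test implemented in PeakFQ; the function names are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def sample_skew(v):
    """Bias-corrected sample skewness coefficient."""
    n = len(v)
    m = statistics.mean(v)
    s = statistics.stdev(v)
    return (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in v)

def lp3_quantile(peaks, aep, skew=None):
    """Log-Pearson Type III flow quantile for a given annual exceedance
    probability, via method of moments on log10 peaks and the
    Wilson-Hilferty frequency-factor approximation. A textbook sketch,
    not the USGS Bulletin 17B/EMA procedure."""
    logs = [math.log10(q) for q in peaks]
    m = statistics.mean(logs)
    s = statistics.stdev(logs)
    g = skew if skew is not None else sample_skew(logs)
    z = NormalDist().inv_cdf(1 - aep)   # standard normal quantile
    if abs(g) < 1e-9:
        k = z                           # zero skew reduces to log-normal
    else:
        k = (2 / g) * ((1 + g * z / 6 - g * g / 36) ** 3 - 1)
    return 10 ** (m + k * s)
```

In practice the `skew` argument would carry a weighted combination of the station skew and a generalized or study-specific regional skew of the kind the study develops.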
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
This section presents a discussion of the research data. The data were received as secondary data; however, they were originally collected using time-study techniques. Data validation is a crucial step in the data analysis process to ensure that the data are accurate, complete, and reliable. Descriptive statistics were used to validate the data: the mean, mode, standard deviation, variance, and range provide a summary of the data distribution and assist in identifying outliers or unusual patterns. The dataset presents the measures of central tendency, which include the mean, median, and mode. The mean signifies the average value of each of the factors presented in the tables; it is the balance point of the dataset and its typical value. The median is the middle value of the dataset for each factor: half of the values lie below it and half lie above it, which makes it especially informative for skewed distributions. The mode shows the most common value in the dataset and was used to describe the most typical observation. Together these values describe the central value around which the data are distributed. The mean, median, and mode indicate a skewed distribution when, as here, they are neither similar nor close to one another. The dataset also presents the results and a discussion of them. This section focuses on the customisation of the DMAIC (Define, Measure, Analyse, Improve, Control) framework to address the specific concerns outlined in the problem statement. To gain a comprehensive understanding of the current process, value stream mapping was employed, further enhanced by measuring the factors that contribute to inefficiencies. These factors are then analysed and ranked based on their impact, utilising factor analysis.
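The skew diagnostic described above, the mean, median, and mode drifting apart, takes only a few lines to compute; the durations below are hypothetical stand-ins for the time-study data, which is not reproduced here.

```python
import statistics

# Hypothetical right-skewed task durations (minutes), standing in for the
# time-study measurements.
durations = [12, 13, 13, 13, 14, 15, 16, 18, 22, 35, 60]

mean = statistics.mean(durations)      # pulled upward by the long right tail
median = statistics.median(durations)  # middle value, robust to the tail
mode = statistics.mode(durations)      # most common value

# For a right-skewed distribution: mode < median < mean, mirroring the
# divergence of the three measures noted in the text.
print(mode, median, mean)
```

When the three measures roughly coincide the distribution is approximately symmetric; the wider the gap, the stronger the skew.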
To mitigate the impact of the most influential factor on project inefficiencies, a solution is proposed using the EOQ (Economic Order Quantity) model. The implementation of the 'CiteOps' software facilitates improved scheduling, monitoring, and task delegation in the construction project through digitalisation. Furthermore, project progress and efficiency are monitored remotely and in real time. In summary, the DMAIC framework was tailored to suit the requirements of the specific project, incorporating techniques from inventory management, project management, and statistics to effectively minimise inefficiencies within the construction project.
U.S. Government Works: https://www.usa.gov/government-works
"NewEngland_pkflows.PRT" is a text file that contains results of flood-frequency analysis of annual peak flows from 186 selected streamflow gaging stations (streamgages) operated by the U.S. Geological Survey (USGS) in the New England region (Maine, Connecticut, Massachusetts, Rhode Island, New York, New Hampshire, and Vermont). Only streamgages in the region that were also in the USGS "GAGES II" database (https://water.usgs.gov/GIS/metadata/usgswrd/XML/gagesII_Sept2011.xml) were considered for use in the study. The file was generated by combining PeakFQ output (.PRT) files created using version 7.0 of USGS software PeakFQ (https://water.usgs.gov/software/PeakFQ/; Veilleux and others, 2014) to conduct flood-frequency analyses using the Expected Moments Algorithm (England and others, 2018). The peak-flow files used as input to PeakFQ were obtained from the USGS National Water Information System (NWIS) database (https://nwis.waterdata.usgs.gov/usa/nwis/peak) and contained annual ...
CC0 1.0: https://spdx.org/licenses/CC0-1.0.html
Quantitative-genetic models of differentiation under migration-selection balance often rely on the assumption of normally distributed genotypic and phenotypic values. When a population is subdivided into demes with selection toward different local optima, migration between demes may result in asymmetric, or skewed, local distributions. Using a simplified two-habitat model, we derive formulas without a priori assuming a Gaussian distribution of genotypic values, and we find expressions that naturally incorporate higher moments, such as skew. These formulas yield predictions of the expected divergence under migration-selection balance that are more accurate than models assuming Gaussian distributions, which illustrates the importance of incorporating these higher moments to assess the response to selection in heterogeneous environments. We further show with simulations that traits with loci of large effect display the largest skew in their distribution at migration-selection balance.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Data for the paper, "Preference patterns for skewed gambles in rhesus monkeys."
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
It is a widely accepted fact that evolving software systems change and grow. However, it is less well-understood how change is distributed over time, specifically in object-oriented software systems. The patterns and techniques used to measure growth permit developers to identify specific releases where significant change took place as well as to inform them of the longer term trend in the distribution profile. This knowledge assists developers in recording systemic and substantial changes to a release, as well as to provide useful information as input into a potential release retrospective. However, these analysis methods can only be applied after a mature release of the code has been developed. In order to manage the evolution of complex software systems effectively, it is important to identify change-prone classes as early as possible. Specifically, developers need to know where they can expect change, the likelihood of a change, and the magnitude of these modifications in order to take proactive steps and mitigate any potential risks arising from these changes. Previous research into change-prone classes has identified some common aspects, with different studies suggesting that complex and large classes tend to undergo more changes and classes that changed recently are likely to undergo modifications in the near future. Though the guidance provided is helpful, developers need more specific guidance in order for it to be applicable in practice. Furthermore, the information needs to be available at a level that can help in developing tools that highlight and monitor evolution prone parts of a system as well as support effort estimation activities. The specific research questions that we address in this chapter are: (1) What is the likelihood that a class will change from a given version to the next? (a) Does this probability change over time? (b) Is this likelihood project specific, or general?
(2) How is modification frequency distributed for classes that change? (3) What is the distribution of the magnitude of change? Are most modifications minor adjustments, or substantive modifications? (4) Does structural complexity make a class susceptible to change? (5) Does popularity make a class more change-prone? We make recommendations that can help developers to proactively monitor and manage change. These are derived from a statistical analysis of change in approximately 55,000 unique classes across all projects under investigation. The analysis methods that we applied took into consideration the highly skewed nature of the metric data distributions. The raw metric data (4 .txt files and 4 .log files in a .zip file measuring ~2 MB in total) are provided as comma-separated values (CSV) files; the first line of each CSV file contains the header. A detailed output of the statistical analysis undertaken is provided as log files generated directly from Stata (statistical analysis software).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Reproducibility package for the article "Reaction times and other skewed distributions: problems with the mean and the median" by Guillaume A. Rousselet & Rand R. Wilcox. Preprint: https://psyarxiv.com/3y54r (doi: 10.31234/osf.io/3y54r). This package contains all the code and data to reproduce the figures and analyses in the article.