34 datasets found
  1. Data Sheet 1_The impact of distribution properties on sampling behavior.docx...

    • figshare.com
    docx
    Updated Sep 30, 2025
    Cite
    Thai Quoc Cao; Benjamin Scheibehenne (2025). Data Sheet 1_The impact of distribution properties on sampling behavior.docx [Dataset]. http://doi.org/10.3389/fpsyg.2025.1597227.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Frontiers
    Authors
    Thai Quoc Cao; Benjamin Scheibehenne
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Objective: People's decisions are often influenced by rare outcomes, such as buying a lottery ticket in the belief they will win, or not buying a product because of a few negative reviews. Previous research attributes this tendency to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people's search behavior is biased by rare outcomes, and that they can adjust their estimates of option value toward the true mean, reflecting cognitive processes that correct for sampling bias.

    Methods: We recruited 180 participants through Prolific for an online shopping task. On each trial, participants saw a histogram with five bins representing the percentage of one- to five-star ratings given to a product by previous customers. They could click on each bin of the histogram to examine an individual review that gave the product the corresponding star rating; the review was represented by a number from 0–100 called the positivity score. Participants' goal was to sample the bins so as to estimate the average positivity score as closely as possible, and they were incentivized based on the accuracy of their estimates. We varied the shape of the histograms within subjects and the number of samples between subjects to examine how rare outcomes in skewed distributions influence sampling behavior and whether having more samples helps people adjust their estimates toward the true mean.

    Results: Binomial tests confirmed sampling biases toward rare outcomes. Compared with the 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and number of samples on estimation adjustment, defined as the difference between experienced/observed means and participants' estimates. In the negatively skewed distribution, estimates were on average 7% closer to the true mean than the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants' estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.

    Conclusion: These findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human-algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.
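    The reported contrast between the 1% unbiased-sampling expectation and the ~11% observed allocation can be illustrated with a one-sided binomial test. A minimal sketch, assuming scipy and hypothetical counts (not the study's raw data):

```python
from scipy.stats import binomtest

# Hypothetical counts: 2000 total samples, 220 of them (~11%) on the rarest
# bin, tested against the 1% expected under unbiased sampling.
n_samples = 2000
k_rare = 220
result = binomtest(k_rare, n=n_samples, p=0.01, alternative='greater')
print(result.pvalue)  # a tiny p-value indicates oversampling of the rare bin
```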

  2. Normalization of High Dimensional Genomics Data Where the Distribution of...

    • plos.figshare.com
    tiff
    Updated Jun 1, 2023
    Cite
    Mattias Landfors; Philge Philip; Patrik Rydén; Per Stenberg (2023). Normalization of High Dimensional Genomics Data Where the Distribution of the Altered Variables Is Skewed [Dataset]. http://doi.org/10.1371/journal.pone.0027942
    Explore at:
    Available download formats: tiff
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Mattias Landfors; Philge Philip; Patrik Rydén; Per Stenberg
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increases. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate if the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher sensitivity and lower bias than can be attained using standard and invariant normalization methods.
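    Step (2) of this work-flow asks whether the distribution of the truly altered variables is skewed. The DSE-test itself is not reproduced here; the sketch below substitutes a generic sample-skewness check on synthetic log fold changes, purely to illustrate the idea:

```python
import numpy as np
from scipy.stats import skew, skewtest

# Synthetic example: 90% unchanged variables plus 10% one-sided alterations,
# giving a right-skewed distribution of log2 fold changes.
rng = np.random.default_rng(0)
log_fc = np.concatenate([rng.normal(0.0, 0.2, 9000),   # unchanged variables
                         rng.normal(1.5, 0.5, 1000)])  # positively altered
print(skew(log_fc))             # clearly positive sample skewness
print(skewtest(log_fc).pvalue)  # rejects symmetry -> consider re-normalizing
```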

  3. Data from: bicycle store dataset

    • kaggle.com
    zip
    Updated Sep 11, 2020
    Cite
    Rohit Sahoo (2020). bicycle store dataset [Dataset]. https://www.kaggle.com/rohitsahoo/bicycle-store-dataset
    Explore at:
    Available download formats: zip (682639 bytes)
    Dataset updated
    Sep 11, 2020
    Authors
    Rohit Sahoo
    License

    GNU GPL 2.0: http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html

    Description

    Context

    Perform Exploratory Data Analysis on the Bicycle Store Dataset!

    DATA EXPLORATION Understand the characteristics of given fields in the underlying data, such as variable distributions, whether the dataset is skewed towards a certain demographic, and the data validity of the fields. For example, a training dataset may be highly skewed towards the younger age bracket; if so, how will this impact your results when using it to predict over the remaining customer base? Identify limitations surrounding the data and gather external data which may be useful for modelling purposes. This may include bringing in ABS data at different geographic levels and creating additional features for the model. For example, the geographic remoteness of different postcodes may be used as an indicator of proximity when considering whether a customer needs a bike to ride to work.

    MODEL DEVELOPMENT Determine a hypothesis related to the business question that can be answered with the data. Perform statistical testing to determine if the hypothesis is valid or not. Create calculated fields based on existing data; for example, convert the D.O.B into an age bracket. Other fields that may be engineered include 'High Margin Product', an indicator of whether the product purchased by the customer was in a high-margin category in the past three months, based on the fields 'list_price' and 'standard cost'. Other examples include calculating the distance from office to home address as a factor in determining whether customers may purchase a bicycle for transportation purposes. Additionally, this may include deciding what the predicted variable actually is: for example, whether results are predicted as ordinal buckets, nominal categories, binary outcomes, or continuous values. Test the performance of the model using factors relevant for the given model chosen (i.e. residual deviance, AIC, ROC curves, R squared). Appropriately document model performance, assumptions and limitations.
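    As a hedged illustration of the calculated fields mentioned above, assuming pandas and hypothetical column names ('DOB', 'list_price', 'standard_cost'):

```python
import pandas as pd

# Toy tables; column names are assumptions about the bicycle-store data.
customers = pd.DataFrame({'DOB': ['1985-04-02', '1999-11-20', '1960-07-15']})
products = pd.DataFrame({'list_price': [1200.0, 450.0],
                         'standard_cost': [500.0, 400.0]})

# D.O.B -> age -> age bracket.
age_days = pd.Timestamp('2020-09-11') - pd.to_datetime(customers['DOB'])
customers['age'] = age_days.dt.days // 365
customers['age_bracket'] = pd.cut(customers['age'], bins=[0, 25, 40, 60, 120],
                                  labels=['<25', '25-40', '40-60', '60+'])

# 'High Margin Product' flag: margin over list price above 50%.
margin = (products['list_price'] - products['standard_cost']) / products['list_price']
products['high_margin'] = margin > 0.5
print(customers, products, sep='\n')
```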

    INTERPRETATION AND REPORTING Visualisation and presentation of findings. This may involve interpreting the significant variables and coefficients from a business perspective. These slides should tell a compelling story around the business issue and support your case with quantitative and qualitative observations. Please refer to the module below for further details.

    Content

    The dataset is easy to understand and self-explanatory!

    Inspiration

    It is important to keep in mind the business context when presenting your findings: 1. What are the trends in the underlying data? 2. Which customer segment has the highest customer value? 3. What do you propose should be the marketing and growth strategy?

  4. Data from: Evolution of quantitative traits under a migration-selection...

    • data.niaid.nih.gov
    • dataone.org
    • +1more
    zip
    Updated Jul 21, 2015
    Cite
    Florence Débarre; Sam Yeaman; Frédéric Guillaume (2015). Evolution of quantitative traits under a migration-selection balance: when does skew matter? [Dataset]. http://doi.org/10.5061/dryad.ms52b
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 21, 2015
    Authors
    Florence Débarre; Sam Yeaman; Frédéric Guillaume
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Quantitative-genetic models of differentiation under migration-selection balance often rely on the assumption of normally distributed genotypic and phenotypic values. When a population is subdivided into demes with selection toward different local optima, migration between demes may result in asymmetric, or skewed, local distributions. Using a simplified two-habitat model, we derive formulas without a priori assuming a Gaussian distribution of genotypic values, and we find expressions that naturally incorporate higher moments, such as skew. These formulas yield predictions of the expected divergence under migration-selection balance that are more accurate than models assuming Gaussian distributions, which illustrates the importance of incorporating these higher moments to assess the response to selection in heterogeneous environments. We further show with simulations that traits with loci of large effect display the largest skew in their distribution at migration-selection balance.

  5. Data from: The improbability of detecting trade-offs and some practical...

    • data.niaid.nih.gov
    • dataone.org
    • +2more
    zip
    Updated Jul 19, 2024
    Cite
    Marc Johnson (2024). The improbability of detecting trade-offs and some practical solutions [Dataset]. http://doi.org/10.5061/dryad.xpnvx0kq5
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 19, 2024
    Dataset provided by
    University of Toronto
    Authors
    Marc Johnson
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Trade-offs are a fundamental concept in evolutionary biology because they are thought to explain much of nature's biological diversity, from variation in life-histories to differences in metabolism. Despite the predicted importance of trade-offs, they are notoriously difficult to detect. Here we contribute to the existing rich theoretical literature on trade-offs by examining how the shape of the distribution of resources or metabolites acquired in an allocation pathway influences the strength of trade-offs between traits. We further explore how variation in resource distribution interacts with two aspects of pathway complexity (i.e., the number of branches and hierarchical structure) to affect trade-offs. We simulate variation in the shape of the distribution of a resource by sampling 10^6 individuals from a beta distribution with varying parameters to alter the resource shape. In a simple "Y-model" allocation of resources to two traits, any variation in a resource leads to slopes shallower than -1, with left-skewed and symmetrical distributions leading to negative relationships between traits, and highly right-skewed distributions associated with positive relationships between traits. Adding more branches further weakens negative and positive relationships between traits, and the hierarchical structure of pathways typically weakens relationships between traits, although in some contexts hierarchical complexity can strengthen positive relationships between traits. Our results further illuminate how variation in the acquisition and allocation of resources, and particularly the shape of a resource distribution and how it interacts with pathway complexity, makes it challenging to detect trade-offs. We offer several practical suggestions on how to detect trade-offs given these challenges.

    Methods: Overview of Flux Simulations. To study the strength and direction of trade-offs within a population, we developed a simulation of flux in a simple metabolic pathway, where a precursor metabolite emerging from node A may either be converted to metabolic products B1 or B2 (Fig. 1). This conception of a pathway is similar to De Jong and Van Noordwijk's Y-model (Van Noordwijk & De Jong, 1986; De Jong & Van Noordwijk, 1992), but we used simulation instead of analytical statistical models to allow us to consider greater complexity in the distribution of variables and pathways. For a simple pathway (Fig. 1), the total flux J_total (i.e., the flux at node A, denoted J_A) for each individual (N = 10^6) was first sampled from a predetermined beta distribution as described below. The flux at node B1 (J_B1) was then randomly sampled from this distribution with max = J_total = J_A and min = 0. The flux at the remaining node, B2, was then simply the remaining flux (J_B2 = J_A - J_B1). Simulations of more complex pathways followed the same basic approach, with increased numbers of branches and hierarchical levels added to the pathway as described below under Question 2. The metabolic pathways were simulated using Python (v. 3.8.2) (Van Rossum & Drake Jr., 2009), where we could control the underlying distribution of metabolite allocation. The output flux at nodes B1 and B2 was plotted using R (v. 4.2.1) (R Core Team, 2022), with the resulting trade-off visualized as a linear regression using the ggplot2 R package (v. 3.4.2) (Wickham, 2016). While we have conceptualized the pathway as the flux of metabolites, it could be thought of as any resource being allocated to different traits.

    Question 1: How does variation in resource distribution within a population affect the strength and direction of trade-offs? We first simulated the simplest scenario where all individuals had the same total flux J_total = 1, in which case the phenotypic trade-off is expected to be most easily detected. We then modified this initial scenario to explore how variation in the distribution of resource acquisition (J_total) affected the strength and direction of trade-offs. Specifically, the resource distribution was systematically varied by sampling n = 10^3 total flux levels from a beta distribution, which has two parameters, alpha and beta, that control the size and shape of the distribution (Miller & Miller, 1999). When alpha is large and beta is small, the distribution is left-skewed, whereas for small alpha and large beta, the distribution is right-skewed. Likewise, for alpha = beta, the curve is symmetrical and approximately normal when the parameters are sufficiently large (> 2). We can thus systematically vary the underlying resource distribution of a population by iterating through values of alpha and beta from 0.5 to 5 (in increments of 0.5), which was done using the NumPy Python package (v. 1.19.1) (Harris et al., 2020). The resulting slope of each linear regression of the flux at B1 and B2 (i.e., the two branching nodes) was then calculated using the lm function in R and plotted as a contour map using the latticeExtra R package (v. 0.6-30) (Sarkar, 2008).

    Question 2: How does the complexity of the pathway used to produce traits affect the strength and direction of trade-offs? Metabolic pathways are typically more complex than what is described above. Most pathways consist of multiple branch points and multiple hierarchical levels. To understand how complexity affects the ability to detect trade-offs when combined with variation in the distribution of total flux, we systematically manipulated the number of branch points and hierarchical levels within pathways (Fig. 1). We first explored the effect of adding branches to the pathway from the same node, such that instead of only branching off to nodes B1 and B2, the pathway branched to nodes B1 through Bn (Fig. 1B), where n is the total number of branches (maximum n = 10 branches). Flux at a node was calculated as previously described, and the remaining flux was evenly distributed amongst the remaining nodes (i.e., nodes B2 through Bn would each receive J_2-n = (J_total - J_B1)/(n - 1) flux). For each pathway, we simulated flux using a beta distribution of J_total with alpha = 5, beta = 0.5 to simulate a left-skewed distribution, alpha = beta = 5 to simulate a normal distribution, and alpha = 0.5, beta = 5 to simulate a right-skewed distribution, as well as the simplest case where all individuals have total flux J_total = 1. We next considered how adding hierarchical levels to a metabolic pathway affected trade-offs. We modified our initial pathway with node A branching to nodes B1 and B2, and then node B2 further branching to nodes C1 and C2 (Fig. 1C). To compute the flux at the two new nodes C1 and C2, we simply repeated the same calculation as before, but using the flux at node B2, J_B2, as the total flux. That is, the flux at node C1 was obtained by randomly sampling from the distribution at B2 with max = J_B2 and min = 0, and the flux at node C2 is the remaining flux (J_C2 = J_B2 - J_C1). Much like in the previous scenario with multiple branch points, we used three beta distributions (with the same parameters as before) to represent left-skewed, normal, and right-skewed resource distributions, as well as the simplest case where J_total = 1 for all individuals.

    Quantile Regressions: We performed quantile regression to understand whether this approach could help to detect trade-offs. Quantile regression is a form of statistical analysis that fits a curve through upper or lower quantiles of the data to assess whether an independent variable potentially sets a lower or upper limit on a response variable (Cade et al., 1999). This type of analysis is particularly useful when it is thought that an independent variable places a constraint on a response variable, yet variation in the response variable is influenced by many additional factors that add "noise" to the data, making a simple bivariate relationship difficult to detect (Thomson et al., 1996). Quantile regression is an extension of ordinary least squares regression, which regresses the best-fitting line through the 50th percentile of the data. In addition to performing ordinary least squares regression for each pairwise comparison between the four nodes (B1, B2, C1, C2), we performed a series of quantile regressions using the ggplot2 R package (v. 3.4.2), where only the qth quantile was used for the regression (q = 0.99 and 0.95 to 0.5 in increments of 0.05, see Fig. S1) (Cade et al., 1999).
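    Under the stated assumptions (beta-distributed total flux, uniform allocation at the branch), a minimal sketch of the simple Y-model simulation might look like the following; this is an illustration, not the authors' code:

```python
import numpy as np

# Y-model sketch: J_total ~ Beta(alpha, beta), J_B1 ~ Uniform(0, J_total),
# J_B2 takes the remainder; the trade-off is the slope of B2 on B1.
rng = np.random.default_rng(1)
n = 10**6

def branch_slope(alpha, beta):
    j_total = rng.beta(alpha, beta, n)    # resource (acquisition) distribution
    j_b1 = rng.uniform(0.0, j_total)      # allocation to trait B1
    j_b2 = j_total - j_b1                 # remaining flux goes to trait B2
    return np.polyfit(j_b1, j_b2, 1)[0]   # regression slope of B2 ~ B1

print(branch_slope(5.0, 0.5))   # left-skewed resource: negative relationship
print(branch_slope(5.0, 5.0))   # symmetrical resource: negative relationship
print(branch_slope(0.5, 5.0))   # highly right-skewed: can turn positive
```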

  6. Supplementary data for the paper "Why psychologists should not default to...

    • data.4tu.nl
    zip
    Updated Apr 28, 2025
    Cite
    Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Joost de Winter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
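    A minimal sketch of the kind of simulation described, assuming scipy and a lognormal distribution as the skewed null (identical in both groups, so every rejection is a false positive):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
n_sims, n1, n2 = 5000, 20, 100
fp_student = fp_welch = 0
for _ in range(n_sims):
    x = rng.lognormal(0.0, 1.0, n1)  # skewed null distribution, small group
    y = rng.lognormal(0.0, 1.0, n2)  # same distribution, larger group
    fp_student += ttest_ind(x, y, equal_var=True).pvalue < 0.05   # Student's
    fp_welch += ttest_ind(x, y, equal_var=False).pvalue < 0.05    # Welch's
print(fp_student / n_sims, fp_welch / n_sims)  # compare against nominal 0.05
```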

  7. Data from: Food web interaction strength distributions are conserved by...

    • search.datacite.org
    • datadryad.org
    Updated 2019
    Cite
    Daniel L. Preston; Landon P. Falke; Jeremy S. Henderson; Mark Novak (2019). Data from: Food web interaction strength distributions are conserved by greater variation between than within predator-prey pairs [Dataset]. http://doi.org/10.5061/dryad.sr6888t
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Dryad
    Authors
    Daniel L. Preston; Landon P. Falke; Jeremy S. Henderson; Mark Novak
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    National Science Foundation
    Description

    Species interactions in food webs are usually recognized as dynamic, varying across species, space and time due to biotic and abiotic drivers. Yet food webs also show emergent properties that appear consistent, such as a skewed frequency distribution of interaction strengths (many weak, few strong). Reconciling these two properties requires an understanding of the variation in pairwise interaction strengths and its underlying mechanisms. We estimated stream sculpin feeding rates in three seasons at nine sites in Oregon to examine variation in trophic interaction strengths both across and within predator-prey pairs. Predator and prey densities, prey body mass, and abiotic factors were considered as putative drivers of within-pair variation over space and time. We hypothesized that consistently skewed interaction strength distributions could result if individual interaction strengths show relatively little variation, or alternatively, if interaction strengths vary but shift in ways that conserve their overall frequency distribution. Feeding rate distributions remained consistently and positively skewed across all sites and seasons. The mean coefficient of variation in feeding rates within each of 25 focal species pairs across surveys was less than half the mean coefficient of variation seen across species pairs within a survey. The rank order of feeding rates also remained conserved across streams, seasons and individual surveys. On average, feeding rates on each prey taxon nonetheless varied by a hundredfold, with some feeding rates showing more variation in space and others in time. In general, feeding rates increased with prey density and decreased with high stream flows and low water temperatures, although for nearly half of all species pairs, factors other than prey density explained the most variation. Our findings show that although individual interaction strengths exhibit considerable variation in space and time, they can nonetheless remain relatively consistent, and thus predictable, compared to the even larger variation that occurs across species pairs. These results highlight how the ecological scale of inference can strongly shape conclusions about interaction strength consistency and collectively help reconcile how the skewed nature of interaction strength distributions can persist in highly dynamic food webs.

  8. Data from: Selection on skewed characters and the paradox of stasis

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 8, 2017
    Cite
    Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin (2017). Selection on skewed characters and the paradox of stasis [Dataset]. http://doi.org/10.5061/dryad.pt07g
    Explore at:
    Available download formats: zip
    Dataset updated
    Sep 8, 2017
    Dataset provided by
    Centre National de la Recherche Scientifique
    Authors
    Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin
    License

    CC0 1.0: https://spdx.org/licenses/CC0-1.0.html

    Description

    Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold's (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or additive genetic covariances between relative fitness and trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate sample sizes. Ecologically relevant traits may commonly have skewed distributions, as we here exemplify with avian laying date – repeatedly described as more evolutionarily stable than expected – so this skewness should be accounted for when investigating evolutionary dynamics in the wild.

  9. Impact of limited data availability on the accuracy of project duration...

    • data.mendeley.com
    Updated Nov 22, 2022
    + more versions
    Cite
    Naimeh Sadeghi (2022). Impact of limited data availability on the accuracy of project duration estimation in project networks [Dataset]. http://doi.org/10.17632/bjfdw6xbxw.3
    Explore at:
    Dataset updated
    Nov 22, 2022
    Authors
    Naimeh Sadeghi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations. Number of variables: 20. Number of cases/rows: 114240.

    Variable List:
    • Experiment ID: The ID of the experiment
    • Experiment for network: The ID of the experiment for each of the synthesized networks
    • Network ID: ID of the synthesized network
    • #Activities: Number of activities in the network, including start and finish activities
    • Variability: Variance of the activities in the network (high, low, medium, or rand, where rand is a random combination of low, medium, and high variance across the network activities)
    • Skewness: Skewness of the activities in the network (right, left, none, or rand, where rand is a random combination of right-skewed, left-skewed, and non-skewed activities)
    • Fitted distribution type: Distribution type used to fit on sampled data
    • Sample size: Number of sampled data points used for the experiment, resembling the limited-data condition
    • Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network
    • Benchmark 50th percentile: 50th percentile of project duration in the benchmark stochastic project network
    • Benchmark 90th percentile: 90th percentile of project duration in the benchmark stochastic project network
    • Benchmark mean: Mean project duration in the benchmark stochastic project network
    • Benchmark variance: Variance of project duration in the benchmark stochastic project network
    • Experiment 10th percentile: 10th percentile of the project duration distribution for the experiment
    • Experiment 50th percentile: 50th percentile of the project duration distribution for the experiment
    • Experiment 90th percentile: 90th percentile of the project duration distribution for the experiment
    • Experiment mean: Mean of the project duration distribution for the experiment
    • Experiment variance: Variance of the project duration distribution for the experiment
    • K-S: Kolmogorov–Smirnov test comparing the benchmark distribution and the project duration distribution of the experiment
    • P_value: The p-value based on the distance calculated in the K-S test
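    The K-S comparison recorded in the last two columns can be sketched as follows; the lognormal choice and all values are illustrative, not the database's actual distributions:

```python
import numpy as np
from scipy.stats import lognorm, ks_2samp

rng = np.random.default_rng(3)
benchmark = rng.lognormal(3.0, 0.5, 10**4)   # benchmark project durations
sample = rng.choice(benchmark, size=30)      # limited-data condition
shape, loc, scale = lognorm.fit(sample)      # fitted distribution type
experiment = lognorm.rvs(shape, loc, scale, size=10**4, random_state=4)
stat, p_value = ks_2samp(benchmark, experiment)  # K-S distance and p-value
print(stat, p_value)
```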

  10. Censored density forecasts: Production and evaluation (replication data)

    • resodate.org
    Updated Oct 2, 2025
    Cite
    James Mitchell; Martin Robert Weale (2025). Censored density forecasts: Production and evaluation (replication data) [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC9jZW5zb3JlZC1kZW5zaXR5LWZvcmVjYXN0cy1wcm9kdWN0aW9uLWFuZC1ldmFsdWF0aW9uLXJlcGxpY2F0aW9uLWRhdGE=
    Explore at:
    Dataset updated
    Oct 2, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    James Mitchell; Martin Robert Weale
    Description

    This paper develops methods for the production and evaluation of censored density forecasts. The focus is on censored density forecasts that quantify forecast risks in a middle region of the density covering a specified probability, and ignore the magnitude but not the frequency of outlying observations. We propose a fixed-point algorithm that fits a potentially skewed and fat-tailed density to the inner observations, acknowledging that the outlying observations may be drawn from a different but unknown distribution. We also introduce a new test for calibration of censored density forecasts. An application using historical forecast errors from the Federal Reserve Board and the Monetary Policy Committee (MPC) at the Bank of England suggests that the use of censored density functions to represent the pattern of forecast errors results in much greater parameter stability than do uncensored densities. We illustrate the utility of censored density forecasts when quantifying forecast risks after shocks such as the global financial crisis and the COVID-19 pandemic and find that these outperform the official forecasts produced by the MPC.

  11. Ames Housing Engineered Dataset

    • kaggle.com
    Updated Sep 27, 2025
    Cite
    Atefeh Amjadian (2025). Ames Housing Engineered Dataset [Dataset]. https://www.kaggle.com/datasets/atefehamjadian/ameshousing-engineered
    Explore at:
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Sep 27, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Atefeh Amjadian
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Area covered
    Ames
    Description

    This dataset is an engineered version of the original Ames Housing dataset from the "House Prices: Advanced Regression Techniques" Kaggle competition. The goal of this engineering was to clean the data, handle missing values, encode categorical features, scale numeric features, manage outliers, reduce skewness, select useful features, and create new features to improve model performance for house price prediction.

    The original dataset contains information on 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, with the target variable being SalePrice. This engineered version has undergone several preprocessing steps to make it ready for machine learning models.

    Preprocessing Steps Applied

    1. Missing Value Handling: Missing values in categorical columns with meaningful absence (e.g., no pool for PoolQC) were filled with "None". Numeric columns were filled with median, and other categorical columns with mode.
    2. Correlation-based Feature Selection: Numeric features with absolute correlation < 0.1 with SalePrice were removed.
    3. Encoding Categorical Variables: Ordinal features (e.g., quality ratings) were encoded using OrdinalEncoder, and nominal features (e.g., neighborhoods) using OneHotEncoder.
    4. Outlier Handling: Outliers in numeric features were detected using IQR and capped (Winsorized) to IQR bounds to preserve data while reducing extreme values.
    5. Skewness Handling: Highly skewed numeric features (|skew| > 1) were transformed using Yeo-Johnson to make distributions more normal-like.
    6. Additional Feature Selection: Low-variance one-hot features (variance < 0.01) and highly collinear features (|corr| > 0.8) were removed.
    7. Feature Scaling: Numeric features were scaled using RobustScaler to handle outliers.
    8. Duplicate Removal: Duplicate rows were checked and removed if found (none in this dataset).

    After one-hot encoding expanded the original 81 columns to approximately 250, feature selection reduced the column count again, yielding a final dataset with improved quality for modeling.
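    Steps 4, 5, and 7 above might look like the following sklearn-based sketch for a single numeric column; it is a simplification under the thresholds quoted in the description, not the exact pipeline that produced this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer, RobustScaler

rng = np.random.default_rng(10)
col = pd.Series(rng.lognormal(9.0, 0.5, 1000), name='LotArea')  # right-skewed

# Step 4: cap (Winsorize) outliers to the 1.5*IQR bounds.
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
capped = col.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Step 5: Yeo-Johnson transform if the column is still highly skewed.
if abs(capped.skew()) > 1:
    values = PowerTransformer(method='yeo-johnson').fit_transform(capped.to_frame())
    capped = pd.Series(values.ravel(), name=col.name)

# Step 7: robust scaling (median/IQR based, resistant to remaining outliers).
scaled = RobustScaler().fit_transform(capped.to_frame())
print(float(col.skew()), float(capped.skew()))
```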

    New Features Created

    To add more predictive power, the following new features were created based on domain knowledge:

    1. HouseAge: Age of the house at the time of sale, calculated as YrSold - YearBuilt. This captures how old the house is, which can negatively affect price due to depreciation.
       - Example: A house built in 2000 and sold in 2008 has HouseAge = 8.
    2. Quality_x_Size: Interaction term between overall quality and living area, calculated as OverallQual * GrLivArea. This combines quality and size to capture the value of high-quality large homes.
       - Example: A house with OverallQual = 7 and GrLivArea = 1500 has Quality_x_Size = 10500.
    3. TotalSF: Total square footage of the house, calculated as GrLivArea + TotalBsmtSF + 1stFlrSF + 2ndFlrSF (if available). This aggregates area features into a single metric for better price prediction.
       - Example: If GrLivArea = 1500 and TotalBsmtSF = 1000, TotalSF = 2500.
    4. Log_LotArea: Log-transformed lot area to reduce skewness, calculated as np.log1p(LotArea). This makes the distribution of lot sizes more normal, helping models handle extreme values.
       - Example: A lot area of 10000 becomes Log_LotArea ≈ 9.21.

    These new features were created using the original (unscaled) values to maintain interpretability, then scaled with RobustScaler to match the rest of the dataset.
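    A minimal pandas sketch of the four feature computations, matching the worked examples above:

```python
import numpy as np
import pandas as pd

# One toy row standing in for the Ames data.
df = pd.DataFrame({'YrSold': [2008], 'YearBuilt': [2000], 'OverallQual': [7],
                   'GrLivArea': [1500], 'TotalBsmtSF': [1000],
                   'LotArea': [10000]})

df['HouseAge'] = df['YrSold'] - df['YearBuilt']             # 8
df['Quality_x_Size'] = df['OverallQual'] * df['GrLivArea']  # 10500
df['TotalSF'] = df['GrLivArea'] + df['TotalBsmtSF']         # 2500 (floor areas added if available)
df['Log_LotArea'] = np.log1p(df['LotArea'])                 # ~9.21
print(df[['HouseAge', 'Quality_x_Size', 'TotalSF', 'Log_LotArea']])
```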

    Data Dictionary

    • Original Numeric Features: Kept features with |corr| > 0.1 with SalePrice, such as:
      • OverallQual: Material and finish quality (scaled, 1-10).
      • GrLivArea: Above grade (ground) living area square feet (scaled).
      • GarageCars: Size of garage in car capacity (scaled).
      • TotalBsmtSF: Total square feet of basement area (scaled).
      • And others like FullBath, YearBuilt, etc. (see the code for the full list).
    • Ordinal Encoded Features: Quality and condition ratings, e.g.:
      • ExterQual: Exterior material quality (encoded as 0=Po to 4=Ex).
      • BsmtQual: Basement quality (encoded as 0=None to 5=Ex).
    • One-Hot Encoded Features: Nominal categorical features, e.g.:
      • MSZoning_RL: 1 if residential low density, 0 otherwise.
      • Neighborhood_NAmes: 1 if in NAmes neighborhood, 0 otherwise.
    • New Engineered Features (as described above):
      • HouseAge: Age of the house (scaled).
      • Quality_x_Size: Overall quality times living area (scaled).
      • TotalSF: Total square footage (scaled).
      • Log_LotArea: Log-transformed lot area (scaled).
    • Target: SalePrice - The property's sale price in dollars (not scaled, as it's the target).

    Total columns: Approximately 200-250 (after one-hot encoding and feature selection).

    License

    This dataset is derived from the Ames Housing...

  12. fire-detection-data

    • kaggle.com
    zip
    Updated Nov 1, 2022
    + more versions
    Cite
    jake_proj (2022). fire-detection-data [Dataset]. https://www.kaggle.com/datasets/jakeproj/fire-detection-data
    Explore at:
    Available download formats: zip (764557317 bytes)
    Dataset updated
    Nov 1, 2022
    Authors
    jake_proj
    Description

    About Dataset Context: The dataset was created by my team during the NASA Space Apps Challenge in 2018; the goal was to use the dataset to develop a model that can recognize images containing fire. If you want more information about the context or the challenge, you can visit our team page.

    Content: Data was collected to train a model to distinguish between images that contain fire (fire images) and regular images (non-fire images), so the problem is binary classification.

    Data is divided into 2 folders: the fireimages folder contains 755 outdoor-fire images, some of which contain heavy smoke, and the non-fireimages folder contains 244 nature images (e.g. forest, tree, grass, river, people, foggy forest, lake, animal, road, and waterfall).

    Hint: The data is skewed, meaning the 2 classes (folders) don't have an equal number of samples, so make sure you have a validation set with an equal number of images per class (e.g. 40 images each for the fire and non-fire classes), as in the sketch below.
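    A sketch of such a split, with placeholder file names standing in for the two folders:

```python
import random

random.seed(0)
fire = [f"fire_{i}.jpg" for i in range(755)]         # stand-in for fireimages
non_fire = [f"nonfire_{i}.jpg" for i in range(244)]  # stand-in for non-fireimages

# Hold out an equal number of validation images per class despite the skew.
val_per_class = 40
val = random.sample(fire, val_per_class) + random.sample(non_fire, val_per_class)
val_set = set(val)
train = [f for f in fire + non_fire if f not in val_set]
print(len(train), len(val))  # 919 training images, 80 validation images
```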

    Acknowledgements Team Members: 1-Ahmed Gamaleldin: https://www.linkedin.com/in/ahmedgamal1496/ 2-Ahmed Atef: https://www.linkedin.com/in/ahmed-atef-a081aa141/ 3-Heba Saker: https://www.linkedin.com/in/heba-sakr/ 4-Ahmed Shaheen: https://www.linkedin.com/in/ahmed-a-shaheen/

  13. Geophone Sensor Dataset

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    Furkan Sezgin (2024). Geophone Sensor Dataset [Dataset]. https://www.kaggle.com/datasets/sezginfurkan/geophone-sensor-dataset
    Explore at:
    Available download formats: zip (75617 bytes)
    Dataset updated
    Dec 26, 2024
    Authors
    Furkan Sezgin
    License

    CC0 1.0: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains vibration data collected from a geophone sensor to analyze human activities (walking, running, and waiting). The data is segmented into 3-second time windows, with each window containing 120 rows of data per person. The dataset consists of 1800 rows of data from five individuals: Furkan, Enes, Yusuf, Alihan, and Emir.

    Each person’s activity is classified into one of the three categories: walking, running, or standing still. The data includes both statistical and frequency-domain features extracted from the raw vibration signals, detailed below:

    Statistical Features:
    - Mean: The average value of the signal over the time window.
    - Median: The middle value of the signal, dividing the data into two equal halves.
    - Standard Deviation: A measure of how much the signal deviates from its mean, indicating the signal's variability.
    - Minimum: The smallest value in the signal during the time window.
    - Maximum: The largest value in the signal during the time window.
    - First Quartile (Q1): The median of the lower half of the data, representing the 25th percentile.
    - Third Quartile (Q3): The median of the upper half of the data, representing the 75th percentile.
    - Skewness: A measure of the asymmetry of the signal distribution, showing whether the data is skewed to the left or right.

    Frequency-Domain Features:
    - Dominant Frequency: The frequency with the highest power, providing insights into the primary periodicity of the signal.
    - Signal Energy: The total energy of the signal, representing the sum of the squared signal values over the time window.
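    A sketch of extracting these features for one window; the 40 Hz sampling rate is an assumption chosen so that a 3-second window yields 120 samples:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(5)
fs = 40                            # assumed sampling rate (samples/second)
signal = rng.normal(0, 1, 3 * fs)  # one 3-second window of vibration data

features = {
    'mean': signal.mean(), 'median': np.median(signal),
    'std': signal.std(), 'min': signal.min(), 'max': signal.max(),
    'q1': np.percentile(signal, 25), 'q3': np.percentile(signal, 75),
    'skewness': skew(signal),
    'energy': np.sum(signal**2),   # sum of squared signal values
}
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
power = np.abs(np.fft.rfft(signal))**2
features['dominant_freq'] = freqs[np.argmax(power[1:]) + 1]  # skip the DC bin
print(features)
```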

    Dataset Overview:
    - Total Rows: 1800
    - Number of Individuals: 5 (Furkan, Enes, Yusuf, Alihan, Emir)
    - Activity Types: Walking, Running, Waiting (Standing Still)
    - Time Frame: 3-second time windows (120 rows per individual for each activity)
    - Features: Statistical and frequency-domain features (as described above)

    This dataset is suitable for training models on activity recognition, user identification, and other related tasks. It provides rich, detailed features that can be used for various classification and analysis applications.

  14. Uncertainty, skewness, and the business cycle through the MIDAS lens:...

    • resodate.org
    Updated Oct 6, 2025
    Cite
    Efrem Castelnuovo; Lorenzo Mori (2025). Uncertainty, skewness, and the business cycle through the MIDAS lens: replication data [Dataset]. https://resodate.org/resources/aHR0cHM6Ly9qb3VybmFsZGF0YS56YncuZXUvZGF0YXNldC91bmNlcnRhaW50eS1za2V3bmVzcy1hbmQtdGhlLWJ1c2luZXNzLWN5Y2xlLXRocm91Z2gtdGhlLW1pZGFzLWxlbnMtcmVwbGljYXRpb24tZGF0YQ==
    Explore at:
    Dataset updated
    Oct 6, 2025
    Dataset provided by
    Journal of Applied Econometrics
    ZBW
    ZBW Journal Data Archive
    Authors
    Efrem Castelnuovo; Lorenzo Mori
    Description

    Data and replication information for "Uncertainty, skewness, and the business cycle through the MIDAS lens" by Efrem Castelnuovo and Lorenzo Mori; published in Journal of Applied Econometrics, 2024. We employ a mixed-frequency quantile regression approach to model the time-varying conditional distribution of the US real GDP growth rate. We show that monthly information on financial conditions improves the predictive power of an otherwise quarterly-only model. We combine selected quantiles of the estimated conditional distribution to produce novel measures of uncertainty and skewness. Embedding these measures in a VAR framework, we show that unexpected changes in uncertainty are associated with an increase in (left) skewness and a downturn in real activity. Business cycle effects are significantly downplayed if we consider a quarterly-only quantile regression model. We find the endogenous response of skewness to substantially amplify the recessionary effects of uncertainty shocks. Finally, we construct a monthly-frequency version of our uncertainty measure and document the robustness of our findings.
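    The quantile-regression core of this approach (without the mixed-frequency MIDAS terms) can be sketched with statsmodels on synthetic stand-in data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-ins: a financial-conditions index and GDP growth whose
# downside reacts more strongly than its upside.
rng = np.random.default_rng(6)
df = pd.DataFrame({'nfci': rng.normal(0, 1, 300)})
df['gdp_growth'] = 2.0 - 1.5 * df['nfci'] * (df['nfci'] > 0) + rng.normal(0, 1, 300)

for q in (0.05, 0.5, 0.95):
    fit = smf.quantreg('gdp_growth ~ nfci', df).fit(q=q)
    print(q, fit.params['nfci'])  # quantile-specific slope coefficients

# The spread between upper and lower quantiles proxies uncertainty; the
# asymmetry around the median proxies skewness.
```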

  15. Data from: Social contact patterns can buffer costs of forgetting in the...

    • zenodo.org
    • data.niaid.nih.gov
    • +1more
    txt, zip
    Updated May 30, 2022
    Cite
    Jeffrey R. Stevens; Jan K. Woike; Lael J. Schooler; Stefan Lindner; Thorsten Pachur (2022). Data from: Social contact patterns can buffer costs of forgetting in the evolution of cooperation [Dataset]. http://doi.org/10.5061/dryad.4cd6042
    Explore at:
    txt, zipAvailable download formats
    Dataset updated
    May 30, 2022
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jeffrey R. Stevens; Jan K. Woike; Lael J. Schooler; Stefan Lindner; Thorsten Pachur
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Analyses of the evolution of cooperation often rely on two simplifying assumptions: (i) individuals interact equally frequently with all social network members and (ii) they accurately remember each partner's past cooperation or defection. Here, we examine how more realistic, skewed patterns of contact, in which individuals interact primarily with only a subset of their network's members, influence cooperation. In addition, we test whether skewed contact patterns can counteract the decrease in cooperation caused by memory errors (i.e., forgetting). Finally, we compare two types of memory error that vary in whether forgotten interactions are replaced with random actions or with actions from previous encounters. We use evolutionary simulations of repeated prisoner's dilemma games that vary agents' contact patterns, forgetting rates, and types of memory error. We find that highly skewed contact patterns foster cooperation and also buffer the detrimental effects of forgetting. The type of memory error used also influences cooperation rates. Our findings reveal previously neglected but important roles of contact patterns, type of memory error, and the interaction of contact pattern and memory on cooperation. Although cognitive limitations may constrain the evolution of cooperation, social contact patterns can counteract some of these constraints.
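    A toy sketch of what a skewed contact pattern can look like in simulation; the Zipf-like weights are an illustrative choice, not necessarily the paper's parameterization:

```python
import numpy as np

rng = np.random.default_rng(7)
n_partners = 20
weights = 1.0 / np.arange(1, n_partners + 1)  # skewed contact probabilities
weights /= weights.sum()

encounters = rng.choice(n_partners, size=1000, p=weights)
counts = np.bincount(encounters, minlength=n_partners)
print(counts[:5], counts[-5:])  # a few frequent partners, many rare ones
```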

  16. Data for: The learnability consequences of Zipfian distributions: Word...

    • demo-b2find.dkrz.de
    Updated Sep 21, 2025
    Cite
    (2025). Data for: The learnability consequences of Zipfian distributions: Word Segmentation is Facilitated in More Predictable Distributions [Dataset]. http://demo-b2find.dkrz.de/dataset/298b62a8-d152-591a-9d52-eff38012505e
    Explore at:
    Dataset updated
    Sep 21, 2025
    Description

    Data from Hebrew-speaking children and adults in an auditory statistical learning experiment examining the effect of distribution predictability on segmentation. While the languages of the world differ in many respects, they share certain commonalities, which can provide insight into our shared cognition. Here, we explore the learnability consequences of one of the striking commonalities between languages. Across languages, word frequencies follow a Zipfian distribution, showing a power-law relation between a word's frequency and its rank. While the source of these distributions in language has been studied extensively, less work has explored their learnability consequences for language learners. We propose that the greater predictability of words in this distribution (relative to less skewed distributions) can facilitate word segmentation, a crucial aspect of early language acquisition. To explore this, we quantify word predictability using unigram entropy, assess it across languages using naturalistic corpora of child-directed speech, and then ask whether similar unigram predictability facilitates word segmentation in the lab. We find similar unigram entropy in child-directed speech across 15 languages. We then use an auditory word segmentation task to show that the unigram predictability levels found in natural language are uniquely facilitative for word segmentation for both children and adults. These findings illustrate the facilitative impact of skewed input distributions on learning and raise questions about the possible role of cognitive pressures in the prevalence of Zipfian distributions in language.
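    Unigram entropy, the predictability measure used here, can be computed directly from token counts; a toy sketch:

```python
import math
from collections import Counter

# Toy corpus; real analyses would use child-directed speech corpora.
corpus = "the dog saw the cat and the cat saw the dog the end".split()
counts = Counter(corpus)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print(round(entropy, 3))  # bits per token; more skewed (Zipf-like) counts lower it
```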

  17. Table_1_Application of robust regression in translational neuroscience...

    • frontiersin.figshare.com
    • datasetcatalog.nlm.nih.gov
    docx
    Updated Jan 24, 2024
    Cite
    Michael Malek-Ahmadi; Stephen D. Ginsberg; Melissa J. Alldred; Scott E. Counts; Milos D. Ikonomovic; Eric E. Abrahamson; Sylvia E. Perez; Elliott J. Mufson (2024). Table_1_Application of robust regression in translational neuroscience studies with non-Gaussian outcome data.DOCX [Dataset]. http://doi.org/10.3389/fnagi.2023.1299451.s001
    Explore at:
    Available download formats: docx
    Dataset updated
    Jan 24, 2024
    Dataset provided by
    Frontiers Media (http://www.frontiersin.org/)
    Authors
    Michael Malek-Ahmadi; Stephen D. Ginsberg; Melissa J. Alldred; Scott E. Counts; Milos D. Ikonomovic; Eric E. Abrahamson; Sylvia E. Perez; Elliott J. Mufson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Linear regression is one of the most used statistical techniques in neuroscience, including the study of the neuropathology of Alzheimer's disease (AD) dementia. However, the practical utility of this approach is often limited because dependent variables are often highly skewed and fail to meet the assumption of normality. Applying linear regression analyses to highly skewed datasets can generate imprecise results, leading to erroneous estimates derived from statistical models. Furthermore, the presence of outliers can introduce unwanted bias, which affects estimates derived from linear regression models. Although a variety of data transformations can be utilized to mitigate these problems, these approaches are also associated with various caveats. By contrast, a robust regression approach does not impose distributional assumptions on the data, allowing results to be interpreted in a similar manner to those derived using a linear regression analysis. Here, we demonstrate the utility of applying robust regression to the analysis of data derived from studies of human brain neurodegeneration where the error distribution of a dependent variable does not meet the assumption of normality. We show that the application of a robust regression approach to two independent published human clinical neuropathologic data sets provides reliable estimates of associations. We also demonstrate that results from a linear regression analysis can be biased if the dependent variable is significantly skewed, further indicating robust regression as a suitable alternative approach.
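    A sketch of the comparison on synthetic skewed, outlier-contaminated data, using statsmodels' RLM with a Huber M-estimator as one common robust-regression choice:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(0, 1, 200)
y = 2.0 + 0.5 * x + rng.lognormal(0.0, 1.0, 200)  # skewed error distribution
y[:5] += 40                                        # a few gross outliers

X = sm.add_constant(x)
print(sm.OLS(y, X).fit().params)                              # pulled by outliers
print(sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit().params)  # robust estimate
```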

  18. Data from: Geographically skewed recruitment and COVID-19 seroprevalence...

    • data.niaid.nih.gov
    • immport.org
    url
    Updated Feb 29, 2024
    Cite
    (2024). Geographically skewed recruitment and COVID-19 seroprevalence estimates: a cross-sectional serosurveillance study and mathematical modelling analysis [Dataset]. http://doi.org/10.21430/M3RQBC30KW
    Explore at:
    Available download formats: url
    Dataset updated
    Feb 29, 2024
    License

    https://www.immport.org/agreement

    Description

    Objectives: Convenience sampling is an imperfect but important tool for seroprevalence studies. For COVID-19, local geographic variation in cases or vaccination can confound studies that rely on the geographically skewed recruitment inherent to convenience sampling. The objectives of this study were: (1) quantifying how geographically skewed recruitment influences SARS-CoV-2 seroprevalence estimates obtained via convenience sampling and (2) developing new methods that employ Global Positioning System (GPS)-derived foot traffic data to measure and minimise bias and uncertainty due to geographically skewed recruitment.

    Design: We used data from a local convenience-sampled seroprevalence study to map the geographic distribution of study participants' reported home locations and compared this to the geographic distribution of reported COVID-19 cases across the study catchment area. Using a numerical simulation, we quantified bias and uncertainty in SARS-CoV-2 seroprevalence estimates obtained using different geographically skewed recruitment scenarios. We employed GPS-derived foot traffic data to estimate the geographic distribution of participants for different recruitment locations and used this data to identify recruitment locations that minimise bias and uncertainty in resulting seroprevalence estimates.

    Results: The geographic distribution of participants in convenience-sampled seroprevalence surveys can be strongly skewed towards individuals living near the study recruitment location. Uncertainty in seroprevalence estimates increased when neighbourhoods with higher disease burden or larger populations were undersampled. Failure to account for undersampling or oversampling across neighbourhoods also resulted in biased seroprevalence estimates. GPS-derived foot traffic data correlated with the geographic distribution of serosurveillance study participants.

    Conclusions: Local geographic variation in seropositivity is an important concern in SARS-CoV-2 serosurveillance studies that rely on geographically skewed recruitment strategies. Using GPS-derived foot traffic data to select recruitment sites and recording participants' home locations can improve study design and interpretation.
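    A toy numerical sketch of the bias mechanism (all numbers illustrative): recruitment skewed toward a low-prevalence neighbourhood biases the naive estimate, while reweighting by population share corrects it:

```python
import numpy as np

rng = np.random.default_rng(9)
prevalence = np.array([0.02, 0.10])   # neighbourhoods A and B
population = np.array([0.5, 0.5])     # equal population shares
true_prev = float(prevalence @ population)

recruit_share = np.array([0.9, 0.1])  # convenience sample skewed toward A
neighbourhood = rng.choice(2, size=2000, p=recruit_share)
positive = rng.random(2000) < prevalence[neighbourhood]

naive = positive.mean()               # biased seroprevalence estimate
weights = population[neighbourhood] / recruit_share[neighbourhood]
weighted = float(np.average(positive.astype(float), weights=weights))  # post-stratified
print(true_prev, naive, weighted)
```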

  19. Data and reproducible analysis files from: Latitudinal clines in floral...

    • search.dataone.org
    • data.niaid.nih.gov
    • +1more
    Updated Jan 9, 2025
    Cite
    Mia Akbar; Dale Moskoff; Spencer Barrett; Robert Colautti (2025). Data and reproducible analysis files from: Latitudinal clines in floral display associated with adaptive evolution during a biological invasion [Dataset]. http://doi.org/10.5061/dryad.jdfn2z3jz
    Explore at:
    Dataset updated
    Jan 9, 2025
    Dataset provided by
    Dryad Digital Repository
    Authors
    Mia Akbar; Dale Moskoff; Spencer Barrett; Robert Colautti
    Description

    Premise: Flowering phenology strongly influences reproductive success in plants. Days to first flower is easy to quantify and widely used to characterize phenology, but reproductive fitness depends on the full schedule of flower production over time.

    Methods: We examined floral display traits associated with rapid adaptive evolution and range expansion among thirteen populations of Lythrum salicaria, sampled along a 10-degree latitudinal gradient in eastern North America. We grew these collections in a common garden field experiment at a mid-latitude site and quantified variation in flowering schedule shape using Principal Coordinates Analysis (PCoA) and quantitative metrics analogous to central moments of probability distributions (i.e., mean, variance, skew, and kurtosis).

    Key Results: Consistent with earlier evidence for adaptation to shorter growing seasons, we found that populations from higher latitudes had earlier start and mean flowering day, on average, when compared to popul...

    # Data and analysis files from: Latitudinal clines in floral display associated with adaptive evolution during a biological invasion

    https://doi.org/10.5061/dryad.jdfn2z3jz

    Reference Information

    Provenance for this README

    • File name: README.md
    • Authors: Mia Akbar
    • Other contributors: Dale Moskoff, Spencer C.H. Barrett, Robert I. Colautti
    • Date created: 2024-05-30

    Dataset Version and Release History

    • Current Version:
      • Number: 1.0.0
      • Date: 2024-05-30
      • Persistent identifier: n/a
      • Summary of changes: n/a
    • Embargo Provenance: n/a
      • Scope of embargo: n/a
      • Embargo period: n/a

    Description of the data and file structure

    Methodological Information

    • Methods of data collection/generation: see publication for details

    Data and File Overview

    Summary Metrics

    • Data File count: 1
    • Total file size: 37 KB
    • File formats: .csv

    Naming Conventions

    • File naming scheme: The data file within the "Data" f...
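    The "central moments" metrics in the description above have a direct numerical reading: treat each plant's flowering schedule as a discrete distribution of flowers over days. The sketch below uses a made-up schedule, and the function and variable names are illustrative rather than taken from the archived analysis files.

    ```python
    import numpy as np

    def schedule_moments(days, flowers):
        """Treat a flowering schedule as a discrete distribution of flowers
        over days and compute analogues of its first four central moments."""
        days = np.asarray(days, dtype=float)
        w = np.asarray(flowers, dtype=float)
        w = w / w.sum()                                   # flower-count weights
        mean = np.sum(w * days)                           # mean flowering day
        var = np.sum(w * (days - mean) ** 2)              # spread of the schedule
        sd = np.sqrt(var)
        skew = np.sum(w * ((days - mean) / sd) ** 3)      # early- vs late-heavy asymmetry
        kurt = np.sum(w * ((days - mean) / sd) ** 4) - 3  # excess kurtosis (peakedness)
        return mean, var, skew, kurt

    # Made-up season: daily open-flower counts peaking early with a long right
    # tail, the shape expected for a plant adapted to a short growing season.
    days = np.arange(1, 61)
    flowers = days**2 * np.exp(-days / 8.0)

    print(schedule_moments(days, flowers))
    ```

    Comparing these four numbers across populations is what allows latitudinal clines in schedule shape, not just in first-flower day, to be tested.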
  20. f

    Data from: Goodness-of-fit tests for Laplace, Gaussian and exponential power...

    • tandf.figshare.com
    pdf
    Updated Jun 1, 2023
    Cite
    Alain Desgagné; Pierre Lafaye de Micheaux; Frédéric Ouimet (2023). Goodness-of-fit tests for Laplace, Gaussian and exponential power distributions based on λ-th power skewness and kurtosis [Dataset]. http://doi.org/10.6084/m9.figshare.22308928.v3
    Explore at:
    Available download formats: pdf
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Taylor & Francis
    Authors
    Alain Desgagné; Pierre Lafaye de Micheaux; Frédéric Ouimet
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Temperature data, like many other measurements in quantitative fields, are usually modelled using a normal distribution. However, some distributions can offer a better fit while avoiding underestimation of tail event probabilities. To this end, we extend Pearson's notions of skewness and kurtosis to build a powerful family of goodness-of-fit tests based on Rao's score for the exponential power distribution EPD_λ(μ, σ), including tests for normality and Laplacity when λ is set to 2 or 1. We find the asymptotic distribution of our test statistic, which is the sum of the squares of two Z-scores, under the null and under local alternatives. We also develop an innovative regression strategy to obtain Z-scores that are nearly independent and distributed as standard Gaussians, resulting in a χ²₂ distribution valid for any sample size (up to very high precision for n ≥ 20). The case λ = 1 leads to a powerful test of fit for the Laplace(μ, σ) distribution, whose empirical power is superior to all 39 competitors in the literature, over a wide range of 400 alternatives. Theoretical proofs in this case are particularly challenging and substantial. We applied our tests to three temperature datasets. The new tests are implemented in the R package PoweR.
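    The λ-th power statistics themselves are available in the R package PoweR, as the description notes. For the flavour of the construction, the classical Jarque–Bera normality test follows the same recipe the abstract describes: two asymptotically standard-normal scores, one for skewness and one for kurtosis, whose squares are summed and referred to a χ²₂ distribution. The sketch below is that classical special case in Python; it is not the paper's λ-th power generalization or its finite-sample regression correction.

    ```python
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.laplace(size=500)        # Laplace data: heavier-tailed than the normal

    n = x.size
    z = (x - x.mean()) / x.std()     # standardize using biased moments, as in JB
    g1 = np.mean(z**3)               # sample skewness
    g2 = np.mean(z**4) - 3.0         # sample excess kurtosis

    # Under normality, each standardized moment is asymptotically N(0, 1).
    z_skew = g1 / np.sqrt(6.0 / n)
    z_kurt = g2 / np.sqrt(24.0 / n)

    jb = z_skew**2 + z_kurt**2         # sum of two squared Z-scores
    p_value = stats.chi2.sf(jb, df=2)  # refer to a chi-squared distribution, 2 df

    # The Laplace distribution has excess kurtosis 3, so JB should be large
    # and the p-value tiny for this sample.
    print(f"JB = {jb:.1f}, p = {p_value:.3g}")
    ```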
