42 datasets found
  1. Dataset for: Some Remarks on the R2 for Clustering

    • wiley.figshare.com
    txt
    Updated Jun 1, 2023
    Cite
    Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Wiley (https://www.wiley.com/)
    Authors
    Nicola Loperfido; Thaddeus Tarpey
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
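
    The inflation-by-stretching effect described above can be illustrated with a minimal sketch. It is not taken from the dataset or the paper: the simulated points, the use of known labels instead of a fitted clustering, and the helper name `clustering_r2` are all assumptions made for illustration.

```python
import random

def clustering_r2(points, labels):
    """R^2 = 1 - (within-cluster sum of squares) / (total sum of squares)."""
    dim = len(points[0])
    n = len(points)
    grand_mean = [sum(p[j] for p in points) / n for j in range(dim)]
    total_ss = sum((p[j] - grand_mean[j]) ** 2 for p in points for j in range(dim))

    within_ss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        centre = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        within_ss += sum((p[j] - centre[j]) ** 2 for p in members for j in range(dim))
    return 1.0 - within_ss / total_ss

rng = random.Random(0)
# Hypothetical data: two groups separated along the first coordinate;
# the second coordinate is pure noise.
labels = [0] * 100 + [1] * 100
points = [(rng.gauss(3.0 * lab, 1.0), rng.gauss(0.0, 1.0)) for lab in labels]

# "Stretch" the separating coordinate by a factor of 10 (a linear transformation).
stretched = [(10.0 * x, y) for x, y in points]

r2_original = clustering_r2(points, labels)
r2_stretched = clustering_r2(stretched, labels)
print(f"R^2 original:  {r2_original:.3f}")
print(f"R^2 stretched: {r2_stretched:.3f}")
```

    The group memberships are identical in both calls, so any increase in $R^2$ comes from the rescaling alone, not from better-separated clusters.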

  2. Data from: Body temperature distributions of active diurnal lizards in three...

    • data.niaid.nih.gov
    • zenodo.org
    • +1more
    zip
    Updated Aug 4, 2018
    Cite
    Raymond B. Huey; Eric R. Pianka (2018). Body temperature distributions of active diurnal lizards in three deserts: skewed up or skewed down? [Dataset]. http://doi.org/10.5061/dryad.45g3s
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 4, 2018
    Dataset provided by
    University of Washington
    The University of Texas at Austin
    Authors
    Raymond B. Huey; Eric R. Pianka
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Africa, Australia, North America
    Description
    1. The performance of ectotherms integrated over time depends in part on the position and shape of the distribution of body temperatures (Tb) experienced during activity. For several complementary reasons, physiological ecologists have long expected that Tb distributions during activity should have a long left tail (left-skewed); but only infrequently have they quantified the magnitude and direction of Tb skewness in nature.
    2. To evaluate whether left-skewed Tb distributions are general for diurnal desert lizards, we compiled and analyzed Tb (∑ = 9,023 temperatures) from our own prior studies of active desert lizards on three continents (25 species in Western Australia, 10 in the Kalahari Desert of Africa, and 10 species in western North America). We gathered these data over several decades, using standardized techniques.
    3. Many species showed significantly left-skewed Tb distributions, even when records were restricted to summer months. However, magnitudes of skewness were always small, such that mean Tb were never more than 1°C lower than median Tb. The significance of Tb skewness was sensitive to sample size, and power tests reinforced this sensitivity.
    4. The magnitude of skewness was not obviously related to phylogeny, desert, body size, or median body temperature. Moreover, formal phylogenetic analysis is inappropriate because geography and phylogeny are confounded (that is, are highly collinear).
    5. Skewness might be limited if lizards pre-warm inside retreats before emerging in the morning, emerge only when operative temperatures are high enough to speed warming to activity Tb, or if cold lizards are especially wary and difficult to spot or catch. Telemetry studies may help evaluate these possibilities.
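
    As a rough illustration of the two quantities compared in point 3 (skewness, and the gap between mean and median), the sketch below computes both for simulated values. The data are hypothetical and are not drawn from the lizard dataset; the gamma-based generator and the 38 °C ceiling are assumptions chosen only to produce a long left tail.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(1)
# Hypothetical body temperatures (deg C): a ceiling minus a right-skewed
# deficit, which yields a left-skewed distribution.
tb = [38.0 - rng.gammavariate(2.0, 1.0) for _ in range(2000)]

skew = sample_skewness(tb)
gap = statistics.mean(tb) - statistics.median(tb)
print(f"skewness = {skew:.2f}, mean - median = {gap:.2f} deg C")
```

    A negative skewness together with a mean below the median is the left-skewed pattern the description refers to; the size of the gap depends entirely on the assumed generator.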
  3. Data Sheet 1_The impact of distribution properties on sampling behavior.docx...

    • figshare.com
    docx
    Updated Sep 30, 2025
    Cite
    Thai Quoc Cao; Benjamin Scheibehenne (2025). Data Sheet 1_The impact of distribution properties on sampling behavior.docx [Dataset]. http://doi.org/10.3389/fpsyg.2025.1597227.s001
    Explore at:
    docxAvailable download formats
    Dataset updated
    Sep 30, 2025
    Dataset provided by
    Frontiers
    Authors
    Thai Quoc Cao; Benjamin Scheibehenne
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    Objective: People often have their decisions influenced by rare outcomes, such as buying a lottery ticket and believing they will win, or not buying a product because of a few negative reviews. Previous research has pointed out that this tendency is due to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people’s search behavior is biased by rare outcomes, and they can adjust the estimation of option value to be closer to the true mean, reflecting cognitive processes to adjust for sampling bias.

    Methods: We recruited 180 participants through Prolific to take part in an online shopping task. On each trial, participants saw a histogram with five bins, representing the percentage of one- to five-star ratings of previous customers on a product. They could click on each bin of the histogram to examine an individual review that gave that product the corresponding star rating; the review was represented using a number from 0–100 called the positivity score. The goal of the participants was to sample the bins so that they could get the closest possible estimate of the average positivity score, and they were incentivized based on accuracy of estimation. We varied the shape of the histograms within subjects and the number of samples they had between subjects to examine how rare outcomes in skewed distributions influenced sampling behavior and whether having more samples would help people adjust their estimation to be closer to the true mean.

    Results: Binomial tests confirmed sampling biases toward rare outcomes. Compared with 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and samples on estimation adjustment, defined as the difference between experienced/observed means and participants’ estimates. In the negatively skewed distribution, estimates were on average 7% closer to the true mean compared with the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants’ estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.

    Conclusion: These findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human–algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.

  4. Data from: Adjusting Median and Trimmed-Mean Inflation Rates for Bias Based...

    • clevelandfed.org
    Updated Mar 24, 2022
    Cite
    Federal Reserve Bank of Cleveland (2022). Adjusting Median and Trimmed-Mean Inflation Rates for Bias Based on Skewness [Dataset]. https://www.clevelandfed.org/publications/economic-commentary/2022/ec-202205-adjusting-median-and-trimmed-mean-inflation-rates-for-bias-based-on-skewness
    Explore at:
    Dataset updated
    Mar 24, 2022
    Dataset authored and provided by
    Federal Reserve Bank of Cleveland (https://www.clevelandfed.org/)
    Description

    Median and trimmed-mean inflation rates tend to be useful estimates of trend inflation over long periods, but they can exhibit persistent departures from the underlying trend over shorter horizons. In this Commentary, we document that the extent of this bias is related to the degree of skewness in the distribution of price changes. The shift in the skewness of the cross-sectional price-change distribution during the pandemic means that median PCE and trimmed-mean PCE inflation rates have recently been understating the trend in PCE inflation by about 15 and 35 basis points, respectively.
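
    The link between skewness and the gap separating median or trimmed-mean measures from the mean can be sketched as follows. This is an illustration only: the price changes are simulated and unweighted, and the symmetric 16% trim is an assumption, not the weighting or trimming scheme used for the published PCE measures.

```python
import random
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping `trim_fraction` of observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

rng = random.Random(2)
# Hypothetical cross-section of component price changes (percent),
# right-skewed by construction.
changes = [rng.lognormvariate(0.0, 0.75) for _ in range(5000)]

mean_change = statistics.mean(changes)
median_change = statistics.median(changes)
trimmed_change = trimmed_mean(changes, 0.16)

print(f"mean         = {mean_change:.3f}")
print(f"trimmed mean = {trimmed_change:.3f}")
print(f"median       = {median_change:.3f}")
```

    With a right-skewed cross-section, both robust measures sit below the mean, which is the direction of understatement the Commentary describes.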

  5. Data from: Improving structured population models with more realistic...

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Jun 14, 2019
    Cite
    Megan L. Peterson; William Morris; Cristina Linares; Daniel Doak (2019). Improving structured population models with more realistic representations of non-normal growth [Dataset]. http://doi.org/10.5061/dryad.t6c3573
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jun 14, 2019
    Dataset provided by
    Duke University
    Universitat de Barcelona
    University of Colorado Boulder
    Authors
    Megan L. Peterson; William Morris; Cristina Linares; Daniel Doak
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    NW Mediterranean Sea, Alaska, Niwot Ridge, USA, Kennicott Valley, Colorado
    Description
    1. Structured population models are among the most widely used tools in ecology and evolution. Integral projection models (IPMs) use continuous representations of how survival, reproduction, and growth change as functions of state variables such as size, requiring fewer parameters to be estimated than projection matrix models (PPMs). Yet almost all published IPMs make an important assumption: that size-dependent growth transitions are or can be transformed to be normally distributed. In fact, many organisms exhibit highly skewed size transitions. Small individuals can grow more than they can shrink, and large individuals may often shrink more dramatically than they can grow. Yet the implications of such skew for inference from IPMs have not been explored, nor have general methods been developed to incorporate skewed size transitions into IPMs, or deal with other aspects of real growth rates, including bounds on possible growth or shrinkage.
    2. Here we develop a flexible approach to modeling skewed growth data using a modified beta regression model. We propose that sizes first be converted to a (0,1) interval by estimating size-dependent minimum and maximum sizes through quantile regression. Transformed data can then be modeled using beta regression with widely available statistical tools. We demonstrate the utility of this approach using demographic data for a long-lived plant, gorgonians, and an epiphytic lichen. Specifically, we compare inferences of population parameters from discrete PPMs to those from IPMs that either assume normality or incorporate skew using beta regression or, alternatively, a skewed normal model.
    3. The beta and skewed normal distributions accurately capture the mean, variance, and skew of real growth distributions. Incorporating skewed growth into IPMs decreases population growth and estimated lifespan relative to IPMs that assume normally-distributed growth, and more closely approximates the parameters of PPMs that do not assume a particular growth distribution. A bounded distribution, such as the beta, also avoids the eviction problem caused by predicting some growth outside the modeled size range.
    4. Incorporating biologically relevant skew in growth data has important consequences for inference from IPMs. The approaches we outline here are flexible and easy to implement with existing statistical tools.
  6. Data from: Selection on skewed characters and the paradox of stasis

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated Sep 8, 2017
    Cite
    Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin (2017). Selection on skewed characters and the paradox of stasis [Dataset]. http://doi.org/10.5061/dryad.pt07g
    Explore at:
    zipAvailable download formats
    Dataset updated
    Sep 8, 2017
    Dataset provided by
    Centre National de la Recherche Scientifique
    Authors
    Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold’s (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or additive genetic covariances between relative fitness and trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate sample sizes. Ecologically relevant traits may commonly have skewed distributions, as we here exemplify with avian laying date (repeatedly described as more evolutionarily stable than expected), so this skewness should be accounted for when investigating evolutionary dynamics in the wild.

  7. Impact of limited data availability on the accuracy of project duration...

    • data.mendeley.com
    Updated Nov 22, 2022
    Cite
    Naimeh Sadeghi (2022). Impact of limited data availability on the accuracy of project duration estimation in project networks [Dataset]. http://doi.org/10.17632/bjfdw6xbxw.3
    Explore at:
    Dataset updated
    Nov 22, 2022
    Authors
    Naimeh Sadeghi
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations.
    Number of variables: 20
    Number of cases/rows: 114240
    Variable List:
    • Experiment ID: The ID of the experiment
    • Experiment for network: The ID of the experiment for each of the synthesized networks
    • Network ID: ID of the synthesized network
    • #Activities: Number of activities in the network, including start and finish activities
    • Variability: Variance of the activities in the network (high, low, medium, or rand, where rand denotes a random combination of low, high, and medium variance across the network activities)
    • Skewness: Skewness of the activities in the network (right, left, None, or rand, where rand denotes a random combination of right-skewed, left-skewed, and unskewed network activities)
    • Fitted distribution type: Distribution type fitted to the sampled data
    • Sample size: Number of sampled data points used for the experiment, resembling the limited-data condition
    • Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network
    • Benchmark 50th percentile: 50th percentile of project duration in the benchmark stochastic project network
    • Benchmark 90th percentile: 90th percentile of project duration in the benchmark stochastic project network
    • Benchmark mean: Mean project duration in the benchmark stochastic project network
    • Benchmark variance: Variance of project duration in the benchmark stochastic project network
    • Experiment 10th percentile: 10th percentile of project duration distribution for the experiment
    • Experiment 50th percentile: 50th percentile of project duration distribution for the experiment
    • Experiment 90th percentile: 90th percentile of project duration distribution for the experiment
    • Experiment mean: Mean of project duration distribution for the experiment
    • Experiment variance: Variance of project duration distribution for the experiment
    • K-S: Kolmogorov–Smirnov test comparing the benchmark distribution and the project duration distribution of the experiment
    • P_value: The p-value based on the distance calculated in the K-S test

  8. Skewness project raw data files and codes

    • figshare.com
    xlsx
    Updated Mar 14, 2022
    Cite
    Raunak Dey; Sreekanth K Manikandan (2022). Skewness project raw data files and codes [Dataset]. http://doi.org/10.6084/m9.figshare.17703269.v2
    Explore at:
    xlsxAvailable download formats
    Dataset updated
    Mar 14, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    figshare
    Authors
    Raunak Dey; Sreekanth K Manikandan
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This repository contains raw data files and base codes to analyze them.
    A. The 'powerx_y.xlsx' files are the data files with the one-dimensional trajectory of optically trapped probes modulated by an Ornstein-Uhlenbeck noise of given amplitude 'x'. For the corresponding diffusion amplitude $A = 0.1 \times (0.6 \times 10^{-6})^2$ m$^2$/s, x is labelled as '1'.
    B. The codes are of three types. The skewness codes are used to calculate the skewness of the trajectory. The error_in_fit codes are used to calculate deviations from arcsine behavior. The sigma_exp codes point to the deviation of the mean from 0.5. All the codes are written three times to look at T+, Tlast and Tmax.
    C. More information can be found in the manuscript.

  9. Supplementary data for the paper "Why psychologists should not default to...

    • data.4tu.nl
    zip
    Updated Apr 28, 2025
    Cite
    Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Apr 28, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Joost de Winter
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
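
    For readers unfamiliar with how the two tests differ, the sketch below computes both test statistics and their degrees of freedom from the standard textbook formulas. The sample values are hypothetical and are not taken from the supplementary data; the function names are illustrative.

```python
import math
import statistics

def student_t(a, b):
    """Independent-samples t statistic with pooled variance; df = n1 + n2 - 2."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2)), n1 + n2 - 2

def welch_t(a, b):
    """Welch's t statistic with Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(a), len(b)
    se1 = statistics.variance(a) / n1
    se2 = statistics.variance(b) / n2
    diff = statistics.fmean(a) - statistics.fmean(b)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return diff / math.sqrt(se1 + se2), df

# Hypothetical samples with unequal sizes and unequal spread.
group_a = [4.1, 5.0, 5.6, 6.2, 4.8, 5.3]
group_b = [5.9, 7.4, 3.2, 9.1, 6.6, 8.0, 4.4, 10.3, 7.7, 5.1]

t_student, df_student = student_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
print(f"Student: t = {t_student:.3f}, df = {df_student}")
print(f"Welch:   t = {t_welch:.3f}, df = {df_welch:.2f}")
```

    When the two groups have equal sizes the two statistics coincide and only the degrees of freedom differ; with unequal sizes and variances, as here, both the statistic and the reference distribution change.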

  10. Data from: Food web interaction strength distributions are conserved by...

    • search.datacite.org
    • datadryad.org
    Updated 2019
    Cite
    Daniel L. Preston; Landon P. Falke; Jeremy S. Henderson; Mark Novak (2019). Data from: Food web interaction strength distributions are conserved by greater variation between than within predator-prey pairs [Dataset]. http://doi.org/10.5061/dryad.sr6888t
    Explore at:
    Dataset updated
    2019
    Dataset provided by
    DataCite (https://www.datacite.org/)
    Dryad
    Authors
    Daniel L. Preston; Landon P. Falke; Jeremy S. Henderson; Mark Novak
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Dataset funded by
    National Science Foundation
    Description

    Species interactions in food webs are usually recognized as dynamic, varying across species, space and time due to biotic and abiotic drivers. Yet food webs also show emergent properties that appear consistent, such as a skewed frequency distribution of interaction strengths (many weak, few strong). Reconciling these two properties requires an understanding of the variation in pairwise interaction strengths and its underlying mechanisms. We estimated stream sculpin feeding rates in three seasons at nine sites in Oregon to examine variation in trophic interaction strengths both across and within predator-prey pairs. Predator and prey densities, prey body mass, and abiotic factors were considered as putative drivers of within-pair variation over space and time. We hypothesized that consistently skewed interaction strength distributions could result if individual interaction strengths show relatively little variation, or alternatively, if interaction strengths vary but shift in ways that conserve their overall frequency distribution. Feeding rate distributions remained consistently and positively skewed across all sites and seasons. The mean coefficient of variation in feeding rates within each of 25 focal species pairs across surveys was less than half the mean coefficient of variation seen across species pairs within a survey. The rank order of feeding rates also remained conserved across streams, seasons and individual surveys. On average, feeding rates on each prey taxon nonetheless varied by a hundredfold, with some feeding rates showing more variation in space and others in time. In general, feeding rates increased with prey density and decreased with high stream flows and low water temperatures, although for nearly half of all species pairs, factors other than prey density explained the most variation. 
Our findings show that although individual interaction strengths exhibit considerable variation in space and time, they can nonetheless remain relatively consistent, and thus predictable, compared to the even larger variation that occurs across species pairs. These results highlight how the ecological scale of inference can strongly shape conclusions about interaction strength consistency and collectively help reconcile how the skewed nature of interaction strength distributions can persist in highly dynamic food webs.

  11. Writing_vs_Tapping(Arabic_English)

    • narcis.nl
    • data.mendeley.com
    Updated Dec 8, 2020
    Cite
    Lee, B (via Mendeley Data) (2020). Writing_vs_Tapping(Arabic_English) [Dataset]. http://doi.org/10.17632/j4mvtjmp5j.1
    Explore at:
    Dataset updated
    Dec 8, 2020
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Lee, B (via Mendeley Data)
    Description

    This dataset reflects the recorded times that it took for 72 participants to transcribe an Arabic text, and 78 participants to transcribe an English text, both by paper and by smartphone. (*Note that Participant 48 in the English subgroup was identified as an outlier, as times for smartphone entry were over 5 SD away from the mean.) All data points are times (in seconds).

    It was hypothesized, based on precursor research, that handwriting would be faster than smartphone entry for participants writing in their second language. This hypothesis was supported by these data. Also, the non-normal distributions of the English subgroups (English being the second language of the participants) are typical of research based on self-paced actions (in this case, self-paced writing). Both subgroups of the English data were positively skewed.

  12. Geophone Sensor Dataset

    • kaggle.com
    zip
    Updated Dec 26, 2024
    Cite
    Furkan Sezgin (2024). Geophone Sensor Dataset [Dataset]. https://www.kaggle.com/datasets/sezginfurkan/geophone-sensor-dataset
    Explore at:
    zip(75617 bytes)Available download formats
    Dataset updated
    Dec 26, 2024
    Authors
    Furkan Sezgin
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains vibration data collected from a geoscope sensor to analyze human activities (walking, running, and waiting). The data is segmented into 3-second time windows, with each window containing 120 rows of data per person. The dataset consists of 1800 rows of data from five individuals: Furkan, Enes, Yusuf, Alihan and Emir.

    Each person’s activity is classified into one of the three categories: walking, running, or standing still. The data includes both statistical and frequency-domain features extracted from the raw vibration signals, detailed below:

    Statistical Features:
    - Mean: The average value of the signal over the time window.
    - Median: The middle value of the signal, dividing the data into two equal halves.
    - Standard Deviation: A measure of how much the signal deviates from its mean, indicating the signal's variability.
    - Minimum: The smallest value in the signal during the time window.
    - Maximum: The largest value in the signal during the time window.
    - First Quartile (Q1): The median of the lower half of the data, representing the 25th percentile.
    - Third Quartile (Q3): The median of the upper half of the data, representing the 75th percentile.
    - Skewness: A measure of the asymmetry of the signal distribution, showing whether the data is skewed to the left or right.

    Frequency-Domain Features:
    - Dominant Frequency: The frequency with the highest power, providing insights into the primary periodicity of the signal.
    - Signal Energy: The total energy of the signal, representing the sum of the squared signal values over the time window.

    Dataset Overview:
    - Total Rows: 1800
    - Number of Individuals: 5 (Furkan, Enes, Yusuf, Alihan, Emir)
    - Activity Types: Walking, Running, Waiting (Standing Still)
    - Time Frame: 3-second time windows (120 rows per individual for each activity)
    - Features: Statistical and frequency-domain features (as described above)

    This dataset is suitable for training models on activity recognition, user identification, and other related tasks. It provides rich, detailed features that can be used for various classification and analysis applications.
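
    The features listed above could be computed from one raw window roughly as in the sketch below. It is an assumption-laden illustration, not the dataset author's extraction code: the 100 Hz sample rate, the synthetic 5 Hz test signal, the quartile convention of `statistics.quantiles`, and the function name `window_features` are all hypothetical.

```python
import cmath
import math
import statistics

def window_features(signal, sample_rate_hz):
    """Statistical and frequency-domain features for one window of samples."""
    n = len(signal)
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    # Default "exclusive" quartile method; other conventions give slightly
    # different Q1/Q3 values.
    q1, _, q3 = statistics.quantiles(signal, n=4)
    m3 = sum((x - mean) ** 3 for x in signal) / n
    skewness = m3 / std ** 3 if std > 0 else 0.0

    # Naive DFT over positive frequencies (DC bin skipped); fine for short windows.
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power

    return {
        "mean": mean,
        "median": statistics.median(signal),
        "std": std,
        "min": min(signal),
        "max": max(signal),
        "q1": q1,
        "q3": q3,
        "skewness": skewness,
        "dominant_frequency_hz": best_bin * sample_rate_hz / n,
        "energy": sum(x * x for x in signal),
    }

# Hypothetical 3-second window sampled at 100 Hz containing a 5 Hz vibration.
SAMPLE_RATE_HZ = 100
window = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE_HZ) for t in range(300)]
features = window_features(window, SAMPLE_RATE_HZ)
print(features["dominant_frequency_hz"], round(features["energy"], 3))
```

    For real recordings an FFT routine would normally replace the quadratic-time loop; the naive transform is used here only to keep the sketch dependency-free.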

  13. Loan Default Risk Prediction Dataset

    • kaggle.com
    zip
    Updated Feb 1, 2025
    Cite
    Himel Sarder (2025). Loan Default Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/himelsarder/loan-default-risk-prediction-dataset/code
    Explore at:
    zip(3531 bytes)Available download formats
    Dataset updated
    Feb 1, 2025
    Authors
    Himel Sarder
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    📖 Dataset Overview

    This dataset is designed for financial risk assessment and loan default prediction using machine learning techniques. It includes 300 records, each representing an individual with financial attributes that influence the likelihood of loan default.

    📊 Features & Data Structure

    The dataset contains the following columns:

    Column Name | Type | Description
    Retirement_Age | float | Age at which the individual retires (left-skewed distribution).
    Debt_Amount | float | Total debt held by the individual in dollars (right-skewed distribution).
    Monthly_Savings | float | Average monthly savings in dollars (normally distributed).
    Loan_Default_Risk | int (0/1) | Target variable: 1 = Default, 0 = No Default.
    • Highly Left-Skewed Column: Retirement Age – Most people retire at older ages, with fewer early retirees.
    • Highly Right-Skewed Column: Debt Amount – Most people have low debt, but a few have very high debt.
    • Totally Symmetric Column: Monthly Savings – Normally distributed around an average.

    📌 Data Generation & Logic

    The dataset was synthetically created using statistical distributions that mimic real-world financial behavior:

    🔹 Retirement Age (Left-Skewed): Generated using a transformed normal distribution to ensure most values are high (60-85).
    🔹 Debt Amount (Right-Skewed): Generated using a log-normal distribution, where most people have low debt, but a few have very high debt.
    🔹 Monthly Savings (Symmetric): Normally distributed with mean $2000 and standard deviation $500, clipped between $500 and $5000.
    🔹 Loan Default Risk (Target Variable): Computed using a logistic function, where:
    - Lower retirement age ⬆ default risk
    - Higher debt ⬆ default risk
    - Higher savings ⬇ default risk
    - The probability threshold was adjusted to balance 0s and 1s.
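
    The generation logic described above might look roughly like the following sketch. The description does not give the actual transforms or coefficients, so every numeric parameter except the savings mean, standard deviation, and clipping bounds is an assumption, as are the function names; only the signs of the effects follow the text.

```python
import math
import random

rng = random.Random(42)

def make_record():
    """Draw one hypothetical (retirement_age, debt, savings) record."""
    # Left-skewed: an upper bound minus a right-skewed amount, kept in 60-85.
    # The exact transform used for the dataset is not stated; this is a guess.
    retirement_age = min(85.0, max(60.0, 85.0 - abs(rng.gauss(0.0, 6.0))))
    # Right-skewed: log-normal debt (location and scale are assumptions).
    debt = rng.lognormvariate(9.0, 1.0)
    # Symmetric: normal savings, mean 2000, sd 500, clipped to [500, 5000].
    savings = min(5000.0, max(500.0, rng.gauss(2000.0, 500.0)))
    return retirement_age, debt, savings

def default_probability(retirement_age, debt, savings):
    """Logistic score with hypothetical coefficients; only the signs are sourced."""
    score = (-0.10 * (retirement_age - 75.0)
             + 0.80 * (math.log(debt) - 9.0)
             - 0.001 * (savings - 2000.0))
    return 1.0 / (1.0 + math.exp(-score))

records = [make_record() for _ in range(300)]
probabilities = [default_probability(*r) for r in records]

# Use the median probability as the cut-off so the two classes are balanced.
threshold = sorted(probabilities)[len(probabilities) // 2]
default_labels = [1 if p >= threshold else 0 for p in probabilities]
print(sum(default_labels), "defaults out of", len(default_labels))
```

    Choosing the median predicted probability as the threshold is one simple way to balance the classes; the dataset may have used a different adjustment.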

  14. Model evaluation for COVID-19 deaths.

    • plos.figshare.com
    xls
    Updated Jun 6, 2024
    Cite
    Teresa-Thuong Le; Xiyue Liao (2024). Model evaluation for COVID-19 deaths. [Dataset]. http://doi.org/10.1371/journal.pone.0302324.t003
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Teresa-Thuong Le; Xiyue Liao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COVID-19 prediction has been essential in the aid of prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, which is compiled from 5 data sources from Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part includes machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand what factors are most relevant to COVID-19’s occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.

  15. A Correction for Structural Equation Modeling Fit Indices Under Missingness:...

    • dataverse.harvard.edu
    • dataone.org
    Updated Jan 12, 2015
    Cite
    Cailey E. Fitzgerald (2015). A Correction for Structural Equation Modeling Fit Indices Under Missingness: Adapting the Root Mean Squared Error of Approximation to Conditions of Missing Data [Dataset]. http://doi.org/10.7910/DVN/28657
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 12, 2015
    Dataset provided by
    Harvard Dataverse
    Authors
    Cailey E. Fitzgerald
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Missing data is a frequent occurrence in both small and large datasets. Among other things, missingness may be a result of coding or computer error, participant absences, or it may be intentional, as in a planned missing design. Whatever the cause, the problem of how to approach a dataset with holes is of much relevance in scientific research. First, missingness is approached as a theoretical construct, and its impacts on data analysis are encountered. I discuss missingness as it relates to structural equation modeling and model fit indices, specifically its interaction with the Root Mean Square Error of Approximation (RMSEA). Data simulation is used to show that RMSEA has a downward bias with missing data, yielding skewed fit indices. Two alternative formulas for RMSEA calculation are proposed: one correcting degrees of freedom and one using Kullback-Leibler divergence to result in an RMSEA calculation which is relatively independent of missingness. Simulations are conducted in Java, with results indicating that the Kullback-Leibler divergence provides a better correction for RMSEA calculation. Next, I approach missingness in an applied manner with an existing large dataset examining ideology measures. The researchers assessed ideology using a planned missingness design, resulting in high proportions of missing data. Factor analysis was performed to gauge uniqueness of ideology measures.

  16. Drugs

    • data.cincinnati-oh.gov
    csv, xlsx, xml
    Updated Nov 26, 2025
    Cite
    City of Cincinnati (2025). Drugs [Dataset]. https://data.cincinnati-oh.gov/Safety/Drugs/3gx7-se9a
    Explore at:
    Available download formats: xlsx, xml, csv
    Dataset updated
    Nov 26, 2025
    Authors
    City of Cincinnati
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Calls For Service are the events captured in an agency’s Computer-Aided Dispatch (CAD) system used to facilitate incident response.

    This dataset includes both proactive and reactive police incident data.

    The source of this data is the City of Cincinnati's computer-aided dispatch (CAD) database.

    This data is updated daily.

    DISCLAIMER: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.
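
To make the address rule in the disclaimer concrete, here is a small sketch of how such masking could work. It is not the City of Cincinnati's actual procedure: the function name `redact_address` and the assumption that an address begins with its street number are hypothetical, and the latitude/longitude adjustment is not modeled because the disclaimer does not describe how it is done.

```python
import re


def redact_address(address: str) -> str:
    """Mask the leading street number as the disclaimer describes.

    Assumes the address starts with its street number; anything else
    is returned unchanged.
    """
    match = re.match(r"(\d+)(.*)", address, flags=re.DOTALL)
    if match is None:
        return address  # no leading street number to mask
    number, rest = match.groups()
    if len(number) == 1:
        # Single-digit street number: replace the whole number with "X".
        return "X" + rest
    # Otherwise replace the last two digits with "XX".
    return number[:-2] + "XX" + rest


if __name__ == "__main__":
    for example in ("1234 Main St", "45 Oak Ave", "7 Elm St"):
        print(example, "->", redact_address(example))
```

When working with these datasets, addresses should therefore be treated as block-level identifiers rather than exact locations.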

  17. Northern Ireland Annual Descriptive House Price Statistics (LGD Level) -...

    • ckan.publishing.service.gov.uk
    Updated Feb 22, 2020
    Cite
    (2020). Northern Ireland Annual Descriptive House Price Statistics (LGD Level) - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/northern-ireland-annual-descriptive-house-price-statistics-lgd-level
    Explore at:
    Dataset updated
    Feb 22, 2020
    License

    Open Government Licence 3.0: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    License information was derived automatically

    Area covered
    Ireland, Northern Ireland
    Description

    Annual descriptive price statistics for each calendar year 2005 – 2024 for 11 Local Government Districts in Northern Ireland. The statistics include:
    • Minimum sale price
    • Lower quartile sale price
    • Median sale price
    • Simple mean sale price
    • Upper quartile sale price
    • Maximum sale price
    • Number of verified sales

    Prices are available where at least 30 sales that could be included in the regression model were recorded in the area within the calendar year, i.e. the following sales are excluded:
    • Non-arm's-length sales
    • Sales of properties where the habitable space is less than 30 m² or greater than 1000 m²
    • Sales of less than £20,000

    Annual median or simple mean prices should not be used to calculate the property price change over time. The quality (where quality refers to the combination of all characteristics of a residential property, both physical and locational) of the properties that are sold may differ from one time period to another. For example, sales in one quarter could be disproportionately skewed towards low-quality properties, therefore producing a biased estimate of average price. The median and simple mean prices are not ‘standardised’ and so the varying mix of properties sold in each quarter could give a false impression of the actual change in prices. In order to calculate the pure property price change over time it is necessary to compare like with like, and this can only be achieved if the ‘characteristics-mix’ of properties traded is standardised. To calculate pure property price change over time, please use the standardised prices in the NI House Price Index Detailed Statistics file.

  18. Computation time comparison.

    • plos.figshare.com
    xls
    Updated Jun 6, 2024
    + more versions
    Cite
    Teresa-Thuong Le; Xiyue Liao (2024). Computation time comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0302324.t004
    Explore at:
    Available download formats: xls
    Dataset updated
    Jun 6, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Teresa-Thuong Le; Xiyue Liao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    COVID-19 prediction has been essential in the aid of prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, which is compiled from 5 data sources from Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part includes machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand what factors are most relevant to COVID-19’s occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.

  19. Data from: A broader flight season for Norway’s Odonata across a century and...

    • data.niaid.nih.gov
    • datadryad.org
    zip
    Updated May 5, 2023
    + more versions
    Cite
    Michael Patten; Brittany Benson (2023). A broader flight season for Norway’s Odonata across a century and a half [Dataset]. http://doi.org/10.5061/dryad.8pk0p2nsw
    Explore at:
    Available download formats: zip
    Dataset updated
    May 5, 2023
    Dataset provided by
    Nord University
    Authors
    Michael Patten; Brittany Benson
    License

    https://spdx.org/licenses/CC0-1.0.html

    Area covered
    Norway
    Description

    As global climate continues to change, so too will phenology of a wide range of insects. Changes in flight season usually are characterised as shifts to earlier dates or means, with attention less often paid to flight season breadth or whether seasons are now skewed. We amassed flight season data for the insect order Odonata, the dragonflies and damselflies, for Norway over the past century-and-a-half to examine the form of flight season change. By means of Bayesian analyses that incorporated uncertainty relative to annual variability in survey effort, we estimated shifts in flight season mean, breadth, and skew. We focussed on flight season breadth, positing that it will track documented growing season expansion. A specific mechanism explored was shifts in voltinism, the number of generations per year, which tends to increase with warming. We found strong evidence for an increase in flight season breadth but much less for a shift in mean, with any shift of the latter tending toward a later mean. Skew has become rightward for suborder Zygoptera, the damselflies, but not for Anisoptera, the dragonflies, or for the Odonata as a whole. We found weak support for voltinism as a predictor of broader flight season; instead, voltinism acted interactively with use of human-modified habitats, including decrease in shading (e.g., from timber extraction). Other potential mechanisms that link warming with broadening of flight season include protracted emergence and cohort splitting, both of which have been documented in the Odonata. It is likely that warming-induced broadening of flight seasons of these widespread insect predators will have wide-ranging consequences for freshwater ecosystems.

    Methods: Data were extracted from Artsdatabanken, a public database for Norway. Data were cleaned, and useable records served as the basis for analyses.

  20. Overdose Data (CFD)

    • data.cincinnati-oh.gov
    csv, xlsx, xml
    Updated Dec 2, 2025
    Cite
    City of Cincinnati (2025). Overdose Data (CFD) [Dataset]. https://data.cincinnati-oh.gov/Safety/Overdose-Data-CFD-/n6qn-tghq
    Explore at:
    Available download formats: csv, xml, xlsx
    Dataset updated
    Dec 2, 2025
    Dataset authored and provided by
    City of Cincinnati
    License

    U.S. Government Works: https://www.usa.gov/government-works
    License information was derived automatically

    Description

    Data Description: Fire Incident data includes all fire incident responses. This includes emergency medical services (EMS) calls, fires, rescue incidents, and all other services handled by the Fire Department. All runs are coded according to classification: for EMS, this includes ALS (advanced life support); BLS (basic life support); etc.

    Data Creation: This data is created when a run is entered into the City of Cincinnati’s computer-aided dispatch (CAD) database.

    Data Created By: The source of this data is the City of Cincinnati's computer aided dispatch (CAD) database.

    Refresh Frequency: This data is updated daily.

    CincyInsights: The City of Cincinnati maintains an interactive dashboard portal, CincyInsights in addition to our Open Data in an effort to increase access and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/6jrc-cmn5

    Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.

    Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit the Office of Performance and Data Analytics facilitates standard processing of most raw data prior to publication. Processing includes but is not limited to: address verification, geocoding, decoding attributes, and addition of administrative areas (i.e. Census, neighborhoods, police districts, etc.).

    Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad

    Disclaimer: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.
