42 datasets found

Dataset for: Some Remarks on the R2 for Clustering
wiley.figshare.com
txt
Updated Jun 1, 2023
Cite
Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
Explore at:
Available download formats: txt
Unique identifier
https://doi.org/10.6084/m9.figshare.6124508.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley (https://www.wiley.com/)
Authors
Nicola Loperfido; Thaddeus Tarpey
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Description
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
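As a rough illustration of the stretching effect described in this abstract, the sketch below computes the clustering $R^2$ as one minus the within-cluster sum of squares divided by the total sum of squares, for a fixed two-cluster labelling, and then rescales one coordinate. The points, the labels and the `r2_clustering` helper are hypothetical and are not taken from the dataset.

```python
def r2_clustering(points, labels):
    """Proportion of total variance explained by cluster means: 1 - WSS/TSS."""
    n = len(points)
    dim = len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    tss = sum((p[d] - grand[d]) ** 2 for p in points for d in range(dim))
    wss = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        wss += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))
    return 1.0 - wss / tss

# Hypothetical data: two clusters separated along x, with noise in y
points = [(0.0, 0.0), (0.0, 2.0), (1.0, -2.0), (1.0, 1.0),
          (4.0, 0.5), (4.0, -1.5), (5.0, 2.0), (5.0, -1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Stretch the x coordinate by a factor of 10; the clusters are unchanged
stretched = [(10.0 * x, y) for x, y in points]

r2_raw = r2_clustering(points, labels)
r2_stretched = r2_clustering(stretched, labels)
```

The labelling is identical in both calls, so any increase in `r2_stretched` over `r2_raw` comes purely from the linear transformation and not from better clustering.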
Data from: Body temperature distributions of active diurnal lizards in three...
data.niaid.nih.gov
zenodo.org
zip
Updated Aug 4, 2018
Cite
Raymond B. Huey; Eric R. Pianka (2018). Body temperature distributions of active diurnal lizards in three deserts: skewed up or skewed down? [Dataset]. http://doi.org/10.5061/dryad.45g3s
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.45g3s
Dataset updated
Aug 4, 2018
Dataset provided by
University of Washington
The University of Texas at Austin
Authors
Raymond B. Huey; Eric R. Pianka
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Africa, Australia, North America
Description
The performance of ectotherms integrated over time depends in part on the position and shape of the distribution of body temperatures (Tb) experienced during activity. For several complementary reasons, physiological ecologists have long expected that Tb distributions during activity should have a long left tail (left-skewed); but only infrequently have they quantified the magnitude and direction of Tb skewness in nature.

To evaluate whether left-skewed Tb distributions are general for diurnal desert lizards, we compiled and analyzed Tb (∑ = 9,023 temperatures) from our own prior studies of active desert lizards on three continents (25 species in Western Australia, 10 in the Kalahari Desert of Africa, and 10 species in western North America). We gathered these data over several decades, using standardized techniques.

Many species showed significantly left-skewed Tb distributions, even when records were restricted to summer months. However, magnitudes of skewness were always small, such that mean Tb were never more than 1°C lower than median Tb. The significance of Tb skewness was sensitive to sample size, and power tests reinforced this sensitivity.

The magnitude of skewness was not obviously related to phylogeny, desert, body size, or median body temperature. Moreover, formal phylogenetic analysis is inappropriate because geography and phylogeny are confounded (that is, are highly collinear).

Skewness might be limited if lizards pre-warm inside retreats before emerging in the morning, emerge only when operative temperatures are high enough to speed warming to activity Tb, or if cold lizards are especially wary and difficult to spot or catch. Telemetry studies may help evaluate these possibilities.
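To make the link between skewness and the gap between mean and median concrete, here is a minimal sketch using a moment-based skewness estimate. The temperature values are invented for illustration and do not come from the dataset; the `sample_skewness` helper is likewise hypothetical.

```python
import statistics

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (negative means a longer left tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical body temperatures (°C) with a few cool readings in the left tail
tb = [36.8, 37.5, 38.0, 37.2, 36.9, 37.8, 38.1, 37.4, 33.0, 31.5, 37.6, 37.9]

skew = sample_skewness(tb)
mean_minus_median = statistics.mean(tb) - statistics.median(tb)
```

With a left-skewed sample like this one, both `skew` and `mean_minus_median` come out negative, which is the pattern the description refers to when it compares mean and median Tb.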
Data Sheet 1_The impact of distribution properties on sampling behavior.docx...
figshare.com
docx
Updated Sep 30, 2025
Cite
Thai Quoc Cao; Benjamin Scheibehenne (2025). Data Sheet 1_The impact of distribution properties on sampling behavior.docx [Dataset]. http://doi.org/10.3389/fpsyg.2025.1597227.s001
Explore at:
Available download formats: docx
Unique identifier
https://doi.org/10.3389/fpsyg.2025.1597227.s001
Dataset updated
Sep 30, 2025
Dataset provided by
Frontiers
Authors
Thai Quoc Cao; Benjamin Scheibehenne
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
Objective: People often have their decisions influenced by rare outcomes, such as buying a lottery ticket and believing they will win, or not buying a product because of a few negative reviews. Previous research has pointed out that this tendency is due to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people’s search behavior is biased by rare outcomes, and they can adjust the estimation of option value to be closer to the true mean, reflecting cognitive processes to adjust for sampling bias.
Methods: We recruited 180 participants through Prolific to take part in an online shopping task. On each trial, participants saw a histogram with five bins, representing the percentage of one- to five-star ratings of previous customers on a product. They could click on each bin of the histogram to examine an individual review that gave that product the corresponding star rating; the review was represented using a number from 0–100 called the positivity score. The goal of the participants was to sample the bins so that they could get as close an estimate of the average positivity score as possible, and they were incentivized based on accuracy of estimation. We varied the shape of the histograms within subject and the number of samples they had between subjects to examine how rare outcomes in skewed distributions influenced sampling behavior and whether having more samples would help people adjust their estimation to be closer to the true mean.
Results: Binomial tests confirmed sampling biases toward rare outcomes. Compared with 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and samples on estimation adjustment, defined as the difference between experienced/observed means and participants’ estimates. In the negatively skewed distribution, estimates were on average 7% closer to the true mean compared with the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants’ estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.
Conclusion: These findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human–algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.
Data from: Adjusting Median and Trimmed-Mean Inflation Rates for Bias Based...
clevelandfed.org
Updated Mar 24, 2022
Cite
Federal Reserve Bank of Cleveland (2022). Adjusting Median and Trimmed-Mean Inflation Rates for Bias Based on Skewness [Dataset]. https://www.clevelandfed.org/publications/economic-commentary/2022/ec-202205-adjusting-median-and-trimmed-mean-inflation-rates-for-bias-based-on-skewness
Explore at:
Dataset updated
Mar 24, 2022
Dataset authored and provided by
Federal Reserve Bank of Cleveland (https://www.clevelandfed.org/)
Description
Median and trimmed-mean inflation rates tend to be useful estimates of trend inflation over long periods, but they can exhibit persistent departures from the underlying trend over shorter horizons. In this Commentary, we document that the extent of this bias is related to the degree of skewness in the distribution of price changes. The shift in the skewness of the cross-sectional price-change distribution during the pandemic means that median PCE and trimmed-mean PCE inflation rates have recently been understating the trend in PCE inflation by about 15 and 35 basis points, respectively.
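The relationship between skewness and the gap separating the mean, the trimmed mean and the median can be sketched on a small right-skewed sample. This is an unweighted toy example with made-up price changes and a hypothetical `trimmed_mean` helper; the published median and trimmed-mean PCE measures use expenditure weights and different trimming proportions, which this sketch does not attempt to reproduce.

```python
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the observations from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.mean(kept)

# Hypothetical cross-section of price changes (percent), skewed to the right
price_changes = [0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 6.0, 12.0]

mean_all = statistics.mean(price_changes)
median = statistics.median(price_changes)
trimmed = trimmed_mean(price_changes, 0.1)
```

In this right-skewed sample the median sits below the trimmed mean, which in turn sits below the full mean, so both robust measures understate the mean when the distribution has a long right tail.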
Data from: Improving structured population models with more realistic...
data.niaid.nih.gov
search.dataone.org
zip
Updated Jun 14, 2019
Cite
Megan L. Peterson; William Morris; Cristina Linares; Daniel Doak (2019). Improving structured population models with more realistic representations of non-normal growth [Dataset]. http://doi.org/10.5061/dryad.t6c3573
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.t6c3573
Dataset updated
Jun 14, 2019
Dataset provided by
Duke University
Universitat de Barcelona
University of Colorado Boulder
Authors
Megan L. Peterson; William Morris; Cristina Linares; Daniel Doak
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
NW Mediterranean Sea, Alaska, Niwot Ridge, USA, Kennicott Valley, Colorado
Description
1. Structured population models are among the most widely used tools in ecology and evolution. Integral projection models (IPMs) use continuous representations of how survival, reproduction, and growth change as functions of state variables such as size, requiring fewer parameters to be estimated than projection matrix models (PPMs). Yet almost all published IPMs make an important assumption: that size-dependent growth transitions are or can be transformed to be normally distributed. In fact, many organisms exhibit highly skewed size transitions. Small individuals can grow more than they can shrink, and large individuals may often shrink more dramatically than they can grow. Yet the implications of such skew for inference from IPMs have not been explored, nor have general methods been developed to incorporate skewed size transitions into IPMs, or deal with other aspects of real growth rates, including bounds on possible growth or shrinkage.
2. Here we develop a flexible approach to modeling skewed growth data using a modified beta regression model. We propose that sizes first be converted to a (0,1) interval by estimating size-dependent minimum and maximum sizes through quantile regression. Transformed data can then be modeled using beta regression with widely available statistical tools. We demonstrate the utility of this approach using demographic data for a long-lived plant, gorgonians, and an epiphytic lichen. Specifically, we compare inferences of population parameters from discrete PPMs to those from IPMs that either assume normality or incorporate skew using beta regression or, alternatively, a skewed normal model.
3. The beta and skewed normal distributions accurately capture the mean, variance, and skew of real growth distributions. Incorporating skewed growth into IPMs decreases population growth and estimated lifespan relative to IPMs that assume normally-distributed growth, and more closely approximates the parameters of PPMs that do not assume a particular growth distribution. A bounded distribution, such as the beta, also avoids the eviction problem caused by predicting some growth outside the modeled size range.
4. Incorporating biologically relevant skew in growth data has important consequences for inference from IPMs. The approaches we outline here are flexible and easy to implement with existing statistical tools.
Data from: Selection on skewed characters and the paradox of stasis
data.niaid.nih.gov
datadryad.org
zip
Updated Sep 8, 2017
Cite
Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin (2017). Selection on skewed characters and the paradox of stasis [Dataset]. http://doi.org/10.5061/dryad.pt07g
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.pt07g
Dataset updated
Sep 8, 2017
Dataset provided by
Centre National de la Recherche Scientifique
Authors
Suzanne Bonamour; Céline Teplitsky; Anne Charmantier; Pierre-André Crochet; Luis-Miguel Chevin
License
https://spdx.org/licenses/CC0-1.0.html
Description
Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold’s (1983) directional gradient is in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or additive genetic covariances between relative fitness and trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate samples size. Ecologically relevant traits may commonly have skewed distributions, as we here exemplify with avian laying date – repeatedly described as more evolutionarily stable than expected –, so this skewness should be accounted for when investigating evolutionary dynamics in the wild.
Impact of limited data availability on the accuracy of project duration...
data.mendeley.com
Updated Nov 22, 2022
Cite
Naimeh Sadeghi (2022). Impact of limited data availability on the accuracy of project duration estimation in project networks [Dataset]. http://doi.org/10.17632/bjfdw6xbxw.3
Explore at:
Unique identifier
https://doi.org/10.17632/bjfdw6xbxw.3
Dataset updated
Nov 22, 2022
Authors
Naimeh Sadeghi
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations.
Number of variables: 20
Number of cases/rows: 114240
Variable List:
• Experiment ID: The ID of the experiment
• Experiment for network: The ID of the experiment for each of the synthesized networks
• Network ID: ID of the synthesized network
• #Activities: Number of activities in the network, including start and finish activities
• Variability: Variance of the activities in the network (this value can be either high, low, medium or rand, where rand shows a random combination of low, high and medium variance in the network activities)
• Skewness: Skewness of the activities in the network (skewness can be either right, left, None or rand, where rand shows a random combination of right-skewed, left-skewed and non-skewed network activities)
• Fitted distribution type: Distribution type used to fit on sampled data
• Sample size: Number of sampled data used for the experiment resembling limited data condition
• Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network
• Benchmark 50th percentile: 50th percentile of project duration in the benchmark stochastic project network
• Benchmark 90th percentile: 90th percentile of project duration in the benchmark stochastic project network
• Benchmark mean: Mean project duration in the benchmark stochastic project network
• Benchmark variance: Variance of project duration in the benchmark stochastic project network
• Experiment 10th percentile: 10th percentile of project duration distribution for the experiment
• Experiment 50th percentile: 50th percentile of project duration distribution for the experiment
• Experiment 90th percentile: 90th percentile of project duration distribution for the experiment
• Experiment mean: Mean of project duration distribution for the experiment
• Experiment variance: Variance of project duration distribution for the experiment
• K-S: Kolmogorov–Smirnov test comparing the benchmark distribution and the project duration distribution of the experiment
• P_value: The P-value based on the distance calculated in the K-S test
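The K-S variable compares a benchmark duration distribution with an experimental one. A minimal sketch of the two-sample Kolmogorov–Smirnov distance, the largest gap between the two empirical distribution functions, is shown below. The duration samples and the `ks_statistic` helper are hypothetical, and the P-value computation is left out.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S distance: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance

# Hypothetical project durations (days)
benchmark = [40, 42, 45, 47, 50, 52, 55, 58, 60, 65]
experiment = [41, 44, 46, 49, 53, 57, 62, 66, 70, 75]

d_same = ks_statistic(benchmark, benchmark)
d_diff = ks_statistic(benchmark, experiment)
```

A distance of zero means the two empirical distributions coincide, while a distance of one means the samples do not overlap at all.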
Skewness project raw data files and codes
figshare.com
xlsx
Updated Mar 14, 2022
Cite
Raunak Dey; Sreekanth K Manikandan (2022). Skewness project raw data files and codes [Dataset]. http://doi.org/10.6084/m9.figshare.17703269.v2
Explore at:
Available download formats: xlsx
Unique identifier
https://doi.org/10.6084/m9.figshare.17703269.v2
Dataset updated
Mar 14, 2022
Dataset provided by
Figshare (http://figshare.com/)
Authors
Raunak Dey; Sreekanth K Manikandan
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This repository contains raw data files and the base codes used to analyze them.
A. The 'powerx_y.xlsx' files are the data files with the one-dimensional trajectory of optically trapped probes modulated by an Ornstein-Uhlenbeck noise of given amplitude 'x'. For the corresponding diffusion amplitude A = 0.1 × (0.6 × 10⁻⁶)² m²/s, x is labelled as '1'.
B. The codes are of three types. The skewness codes are used to calculate the skewness of the trajectory. The error_in_fit codes are used to calculate deviations from arcsine behavior. The sigma_exp codes point to the deviation of the mean from 0.5. All the codes are written three times to look at T+, Tlast and Tmax.
C. More information can be found in the manuscript.
Supplementary data for the paper "Why psychologists should not default to...
data.4tu.nl
zip
Updated Apr 28, 2025
Cite
Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
Dataset updated
Apr 28, 2025
Dataset provided by
4TU.ResearchData
Authors
Joost de Winter
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
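For readers unfamiliar with the two tests being compared, the sketch below computes the pooled-variance (Student) t statistic and the Welch t statistic with Welch–Satterthwaite degrees of freedom. The two groups are invented for illustration, the helper names are hypothetical, and the P-value lookup is omitted; this is not the simulation code from the supplementary data.

```python
import math
import statistics

def student_t(x, y):
    """Pooled-variance (Student) t statistic and its degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    pooled = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def welch_t(x, y):
    """Welch t statistic with Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    sx, sy = statistics.variance(x) / nx, statistics.variance(y) / ny
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(sx + sy)
    df = (sx + sy) ** 2 / (sx ** 2 / (nx - 1) + sy ** 2 / (ny - 1))
    return t, df

# Hypothetical groups with unequal sizes and unequal spread
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [4.2, 6.9, 3.1, 7.5, 5.0, 2.8, 8.1, 4.4, 6.2, 3.7]

t_student, df_student = student_t(group_a, group_b)
t_welch, df_welch = welch_t(group_a, group_b)
```

When the two groups have equal sizes and equal sample variances, the two statistics coincide; they diverge as sizes and variances become unequal, which is the setting the paper examines.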
Data from: Food web interaction strength distributions are conserved by...
search.datacite.org
datadryad.org
Updated 2019
Cite
Daniel L. Preston; Landon P. Falke; Jeremy S. Henderson; Mark Novak (2019). Data from: Food web interaction strength distributions are conserved by greater variation between than within predator-prey pairs [Dataset]. http://doi.org/10.5061/dryad.sr6888t
Explore at:
Unique identifier
https://doi.org/10.5061/dryad.sr6888t
Dataset updated
2019
Dataset provided by
DataCite (https://www.datacite.org/)
Dryad
Authors
Daniel L. Preston; Landon P. Falke; Jeremy S. Henderson; Mark Novak
License
CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Dataset funded by
National Science Foundation
Description
Species interactions in food webs are usually recognized as dynamic, varying across species, space and time due to biotic and abiotic drivers. Yet food webs also show emergent properties that appear consistent, such as a skewed frequency distribution of interaction strengths (many weak, few strong). Reconciling these two properties requires an understanding of the variation in pairwise interaction strengths and its underlying mechanisms. We estimated stream sculpin feeding rates in three seasons at nine sites in Oregon to examine variation in trophic interaction strengths both across and within predator-prey pairs. Predator and prey densities, prey body mass, and abiotic factors were considered as putative drivers of within-pair variation over space and time. We hypothesized that consistently skewed interaction strength distributions could result if individual interaction strengths show relatively little variation, or alternatively, if interaction strengths vary but shift in ways that conserve their overall frequency distribution. Feeding rate distributions remained consistently and positively skewed across all sites and seasons. The mean coefficient of variation in feeding rates within each of 25 focal species pairs across surveys was less than half the mean coefficient of variation seen across species pairs within a survey. The rank order of feeding rates also remained conserved across streams, seasons and individual surveys. On average, feeding rates on each prey taxon nonetheless varied by a hundredfold, with some feeding rates showing more variation in space and others in time. In general, feeding rates increased with prey density and decreased with high stream flows and low water temperatures, although for nearly half of all species pairs, factors other than prey density explained the most variation. 
Our findings show that although individual interaction strengths exhibit considerable variation in space and time, they can nonetheless remain relatively consistent, and thus predictable, compared to the even larger variation that occurs across species pairs. These results highlight how the ecological scale of inference can strongly shape conclusions about interaction strength consistency and collectively help reconcile how the skewed nature of interaction strength distributions can persist in highly dynamic food webs.
Writing_vs_Tapping(Arabic_English)
narcis.nl
data.mendeley.com
Updated Dec 8, 2020
Cite
Lee, B (via Mendeley Data) (2020). Writing_vs_Tapping(Arabic_English) [Dataset]. http://doi.org/10.17632/j4mvtjmp5j.1
Explore at:
Unique identifier
https://doi.org/10.17632/j4mvtjmp5j.1
Dataset updated
Dec 8, 2020
Dataset provided by
Data Archiving and Networked Services (DANS)
Authors
Lee, B (via Mendeley Data)
Description
This dataset reflects the recorded times that it took 72 participants to transcribe an Arabic text, and 78 participants to transcribe an English text, both on paper and by smartphone. (*Note that Participant 48 in the English subgroup was identified as an outlier, as times for smartphone entry were over 5 SD away from the mean.) All data points are times (in seconds).

It was hypothesized, based on precursor research, that handwriting would be faster than smartphone entry for participants writing in their second language. This hypothesis was supported by these data. Also, the non-normal distributions of the English subgroups (English being the second language of the participants) are typical of research based on self-paced actions (in this case, self-paced writing). Both subgroups of the English data were positively skewed.
Geophone Sensor Dataset
kaggle.com
zip
Updated Dec 26, 2024
Cite
Furkan Sezgin (2024). Geophone Sensor Dataset [Dataset]. https://www.kaggle.com/datasets/sezginfurkan/geophone-sensor-dataset
Explore at:
Available download formats: zip (75617 bytes)
Dataset updated
Dec 26, 2024
Authors
Furkan Sezgin
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
This dataset contains vibration data collected from a geoscope sensor to analyze human activities (walking, running, and waiting). The data is segmented into 3-second time windows, with each window containing 120 rows of data per person. The dataset consists of 1800 rows of data from five individuals: Furkan, Enes, Yusuf, Alihan and Emir.

Each person’s activity is classified into one of the three categories: walking, running, or standing still. The data includes both statistical and frequency-domain features extracted from the raw vibration signals, detailed below:

Statistical Features:
- Mean: The average value of the signal over the time window.
- Median: The middle value of the signal, dividing the data into two equal halves.
- Standard Deviation: A measure of how much the signal deviates from its mean, indicating the signal's variability.
- Minimum: The smallest value in the signal during the time window.
- Maximum: The largest value in the signal during the time window.
- First Quartile (Q1): The median of the lower half of the data, representing the 25th percentile.
- Third Quartile (Q3): The median of the upper half of the data, representing the 75th percentile.
- Skewness: A measure of the asymmetry of the signal distribution, showing whether the data is skewed to the left or right.

Frequency-Domain Features:
- Dominant Frequency: The frequency with the highest power, providing insights into the primary periodicity of the signal.
- Signal Energy: The total energy of the signal, representing the sum of the squared signal values over the time window.

Dataset Overview:
- Total Rows: 1800
- Number of Individuals: 5 (Furkan, Enes, Yusuf, Alihan, Emir)
- Activity Types: Walking, Running, Waiting (Standing Still)
- Time Frame: 3-second time windows (120 rows per individual for each activity)
- Features: Statistical and frequency-domain features (as described above)

This dataset is suitable for training models on activity recognition, user identification, and other related tasks. It provides rich, detailed features that can be used for various classification and analysis applications.
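The features listed in this description could be computed from one window of raw samples roughly as follows. This is a sketch under stated assumptions: the 40 Hz sampling rate, the synthetic 5 Hz test signal and the `window_features` helper are all hypothetical, the quartiles use Python's default interpolation rather than the "median of each half" rule, and the dataset's own extraction code is not shown in the source.

```python
import cmath
import math
import statistics

def window_features(signal, sample_rate_hz):
    """Statistical and frequency-domain features for one window of samples."""
    n = len(signal)
    mean = statistics.mean(signal)
    std = statistics.pstdev(signal)
    q1, _, q3 = statistics.quantiles(signal, n=4)
    skew = sum((v - mean) ** 3 for v in signal) / n / std ** 3 if std > 0 else 0.0
    # Naive DFT power over positive frequencies (the constant DC bin is skipped)
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(signal))
        power.append(abs(coeff) ** 2)
    dominant_bin = 1 + power.index(max(power))
    return {
        "mean": mean, "median": statistics.median(signal), "std": std,
        "min": min(signal), "max": max(signal), "q1": q1, "q3": q3,
        "skewness": skew,
        "dominant_frequency": dominant_bin * sample_rate_hz / n,
        "signal_energy": sum(v * v for v in signal),
    }

# Hypothetical 3-second window sampled at 40 Hz containing a 5 Hz vibration
rate = 40
window = [math.sin(2 * math.pi * 5 * i / rate) for i in range(3 * rate)]
features = window_features(window, rate)
```

The naive transform is quadratic in the window length, which is acceptable for short windows like this one; a fast Fourier transform from a numerical library would be the usual choice for longer signals.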

Loan Default Risk Prediction Dataset

kaggle.com

zip

Updated Feb 1, 2025

Cite

Himel Sarder (2025). Loan Default Risk Prediction Dataset [Dataset]. https://www.kaggle.com/datasets/himelsarder/loan-default-risk-prediction-dataset/code

Explore at:

Available download formats: zip (3531 bytes)

Dataset updated

Feb 1, 2025

Authors

Himel Sarder

License

https://creativecommons.org/publicdomain/zero/1.0/

Description

📖 Dataset Overview

This dataset is designed for financial risk assessment and loan default prediction using machine learning techniques. It includes 300 records, each representing an individual with financial attributes that influence the likelihood of loan default.

📊 Features & Data Structure

The dataset contains the following columns:

Column Name	Type	Description
Retirement_Age	float	Age at which the individual retires (left-skewed distribution).
Debt_Amount	float	Total debt held by the individual in dollars (right-skewed distribution).
Monthly_Savings	float	Average monthly savings in dollars (normally distributed).
Loan_Default_Risk	int (0/1)	Target variable: 1 = Default, 0 = No Default.

Highly Left-Skewed Column: Retirement Age – Most people retire at older ages, with fewer early retirees.
Highly Right-Skewed Column: Debt Amount – Most people have low debt, but a few have very high debt.
Totally Symmetric Column: Monthly Savings – Normally distributed around an average.

📌 Data Generation & Logic

The dataset was synthetically created using statistical distributions that mimic real-world financial behavior:

🔹 Retirement Age (Left-Skewed): Generated using a transformed normal distribution to ensure most values are high (60-85).
🔹 Debt Amount (Right-Skewed): Generated using a log-normal distribution, where most people have low debt, but a few have very high debt.
🔹 Monthly Savings (Symmetric): Normally distributed with mean $2000 and standard deviation $500, clipped to the range $500–$5000.
🔹 Loan Default Risk (Target Variable): Computed using a logistic function, where:
- Lower retirement age ⬆ default risk
- Higher debt ⬆ default risk
- Higher savings ⬇ default risk
- The probability threshold was adjusted to balance 0s and 1s.
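The generation logic above can be sketched as follows. Only the distribution families, the savings parameters, the record count and the signs of the effects come from the description; the specific age transform, the log-normal parameters, the logistic coefficients and the median cut-off are assumptions made for illustration, not the dataset author's actual code.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
N_RECORDS = 300

ages, debts, savings, probs = [], [], [], []
for _ in range(N_RECORDS):
    # Left-skewed: subtract a squared normal from an upper bound (assumed transform)
    age = max(60.0, 85.0 - 2.5 * random.gauss(0.0, 1.0) ** 2)
    # Right-skewed: log-normal (parameters assumed)
    debt = random.lognormvariate(9.0, 1.0)
    # Symmetric: normal(2000, 500) clipped to [500, 5000], as in the description
    saving = min(5000.0, max(500.0, random.gauss(2000.0, 500.0)))
    # Logistic link; coefficients are illustrative, only their signs follow the text
    score = -0.15 * (age - 80.0) + 0.00005 * (debt - 10000.0) - 0.001 * (saving - 2000.0)
    ages.append(age)
    debts.append(debt)
    savings.append(saving)
    probs.append(1.0 / (1.0 + math.exp(-score)))

# Use the median probability as the cut-off so 0s and 1s are roughly balanced
threshold = statistics.median(probs)
loan_default_risk = [1 if p > threshold else 0 for p in probs]
```

Choosing the median of the predicted probabilities as the threshold is one simple way to obtain balanced classes; the description only says the threshold was adjusted, so the original method may differ.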

Model evaluation for COVID-19 deaths.
plos.figshare.com
xls
Updated Jun 6, 2024
Cite
Teresa-Thuong Le; Xiyue Liao (2024). Model evaluation for COVID-19 deaths. [Dataset]. http://doi.org/10.1371/journal.pone.0302324.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302324.t003
Dataset updated
Jun 6, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Teresa-Thuong Le; Xiyue Liao
License
Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Description
COVID-19 prediction has been essential in aiding the prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, which is compiled from 5 data sources from Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part includes machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand what factors are most relevant to COVID-19’s occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.
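The two evaluation criteria named here, together with one possible way of combining the two parts of such a model, can be sketched as below. The source does not state how the classifier and the second-stage predictions are combined, so the thresholding rule in `two_part_predict`, the helper names and all numbers are assumptions for illustration only.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def two_part_predict(prob_positive, amount_if_positive, threshold=0.5):
    """Predict zero unless the classifier's probability reaches the threshold (assumed rule)."""
    return [amt if p >= threshold else 0.0
            for p, amt in zip(prob_positive, amount_if_positive)]

# Hypothetical observed deaths, part-one probabilities and part-two predictions
deaths = [0, 0, 3, 0, 120, 7]
prob_positive = [0.1, 0.2, 0.8, 0.4, 0.95, 0.7]
amount_if_positive = [2.0, 1.0, 5.0, 4.0, 100.0, 6.0]

predicted = two_part_predict(prob_positive, amount_if_positive)
error_rmse = rmse(deaths, predicted)
error_mae = mae(deaths, predicted)
```

Because RMSE squares the errors, the single large miss on the outlying observation dominates it far more than it dominates MAE, which is why both criteria are informative for highly skewed targets.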
A Correction for Structural Equation Modeling Fit Indices Under Missingness:...
dataverse.harvard.edu
dataone.org
Updated Jan 12, 2015
Cite
Cailey E. Fitzgerald (2015). A Correction for Structural Equation Modeling Fit Indices Under Missingness: Adapting the Root Mean Squared Error of Approximation to Conditions of Missing Data [Dataset]. http://doi.org/10.7910/DVN/28657
Unique identifier
https://doi.org/10.7910/DVN/28657
Dataset updated
Jan 12, 2015
Dataset provided by
Harvard Dataverse
Authors
Cailey E. Fitzgerald
License
CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
Missing data is a frequent occurrence in both small and large datasets. Among other things, missingness may be a result of coding or computer error, participant absences, or it may be intentional, as in a planned missing design. Whatever the cause, the problem of how to approach a dataset with holes is of much relevance in scientific research. First, missingness is approached as a theoretical construct, and its impacts on data analysis are encountered. I discuss missingness as it relates to structural equation modeling and model fit indices, specifically its interaction with the Root Mean Square Error of Approximation (RMSEA). Data simulation is used to show that RMSEA has a downward bias with missing data, yielding skewed fit indices. Two alternative formulas for RMSEA calculation are proposed: one correcting degrees of freedom and one using Kullback-Leibler divergence to result in an RMSEA calculation which is relatively independent of missingness. Simulations are conducted in Java, with results indicating that the Kullback-Leibler divergence provides a better correction for RMSEA calculation. Next, I approach missingness in an applied manner with an existing large dataset examining ideology measures. The researchers assessed ideology using a planned missingness design, resulting in high proportions of missing data. Factor analysis was performed to gauge uniqueness of ideology measures.
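
As background for the index discussed in this description, here is a minimal sketch of the conventional RMSEA point estimate computed from a model chi-square. It is the common textbook form, not either of the corrected formulas proposed in the dataset's source work; some software divides by N rather than N - 1, and the numbers used are invented.

```python
import math


def rmsea(chi_square, df, n):
    """Conventional RMSEA point estimate.

    sqrt(max(chi_square - df, 0) / (df * (n - 1))), where n is the
    sample size. The n - 1 divisor is one common convention; some
    programs use n instead.
    """
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))


# Invented example values
print(rmsea(chi_square=85.0, df=40, n=301))
```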
Drugs
data.cincinnati-oh.gov
csv, xlsx, xml
Updated Nov 26, 2025
Cite
City of Cincinnati (2025). Drugs [Dataset]. https://data.cincinnati-oh.gov/Safety/Drugs/3gx7-se9a
Available download formats: xlsx, xml, csv
Dataset updated
Nov 26, 2025
Authors
City of Cincinnati
License
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Description
Calls For Service are the events captured in an agency’s Computer-Aided Dispatch (CAD) system used to facilitate incident response.

This dataset includes both proactive and reactive police incident data.

The source of this data is the City of Cincinnati's computer-aided dispatch (CAD) database.

This data is updated daily.

DISCLAIMER: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.
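
The redaction rules in the disclaimer can be illustrated roughly as follows. This is a sketch under assumptions: the city's actual procedure is not published here, the function names are hypothetical, and the uniform offset used to "skew" coordinates (and its size) is a stand-in for whatever method is really applied.

```python
import random
import re


def mask_address(address):
    """Mask the leading street number of an address string.

    Multi-digit numbers have their last two digits replaced with "XX";
    a single-digit number is replaced entirely with "X".
    """
    match = re.match(r"(\d+)(.*)", address, flags=re.DOTALL)
    if match is None:
        return address  # no leading street number to mask
    number, rest = match.groups()
    masked = "X" if len(number) == 1 else number[:-2] + "XX"
    return masked + rest


def jitter_coordinate(value, max_offset=0.0005, rng=random):
    """Shift a latitude or longitude by a small random amount.

    max_offset (in degrees) is an assumed value meant to stay roughly
    within a city block; it is not taken from the source.
    """
    return value + rng.uniform(-max_offset, max_offset)


print(mask_address("1234 Main St"))
print(mask_address("7 Elm Ave"))
print(jitter_coordinate(39.1031))
```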
Northern Ireland Annual Descriptive House Price Statistics (LGD Level) -...
ckan.publishing.service.gov.uk
Updated Feb 22, 2020
Cite
(2020). Northern Ireland Annual Descriptive House Price Statistics (LGD Level) - Dataset - data.gov.uk [Dataset]. https://ckan.publishing.service.gov.uk/dataset/northern-ireland-annual-descriptive-house-price-statistics-lgd-level
Dataset updated
Feb 22, 2020
License
Open Government Licence 3.0, http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
License information was derived automatically
Area covered
Ireland, Northern Ireland
Description
Annual descriptive price statistics for each calendar year 2005 – 2024 for 11 Local Government Districts in Northern Ireland. The statistics include:
• Minimum sale price
• Lower quartile sale price
• Median sale price
• Simple mean sale price
• Upper quartile sale price
• Maximum sale price
• Number of verified sales

Prices are available where at least 30 sales that could be included in the regression model were recorded in the area within the calendar year, i.e. the following sales are excluded:
• Non-arm's-length sales
• Sales of properties where the habitable space is less than 30 m² or greater than 1,000 m²
• Sales of less than £20,000

Annual median or simple mean prices should not be used to calculate the property price change over time. The quality of the properties that are sold (where quality refers to the combination of all characteristics of a residential property, both physical and locational) may differ from one time period to another. For example, sales in one quarter could be disproportionately skewed towards low-quality properties, producing a biased estimate of average price. The median and simple mean prices are not ‘standardised’, so the varying mix of properties sold in each quarter could give a false impression of the actual change in prices. To calculate the pure property price change over time it is necessary to compare like with like, and this can only be achieved if the ‘characteristics-mix’ of properties traded is standardised. To calculate pure property price change over time, please use the standardised prices in the NI House Price Index Detailed Statistics file.
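
The summary statistics and exclusion rules described above could be computed along these lines. This is a sketch only: the record keys (`price`, `area_m2`, `arms_length`), the function names, and the choice of the "inclusive" quartile method are assumptions, since the publisher's exact calculation is not given.

```python
import statistics

MIN_SALES = 30  # prices are only published with at least 30 eligible sales


def is_eligible(sale):
    """Apply the three exclusion rules to one sale record (hypothetical keys)."""
    return (
        sale["arms_length"]
        and 30 <= sale["area_m2"] <= 1000
        and sale["price"] >= 20000
    )


def price_summary(sales):
    """Return descriptive price statistics, or None below the 30-sale minimum."""
    prices = sorted(s["price"] for s in sales if is_eligible(s))
    if len(prices) < MIN_SALES:
        return None
    # The quartile interpolation method is an assumption.
    lower_q, median, upper_q = statistics.quantiles(prices, n=4, method="inclusive")
    return {
        "min": prices[0],
        "lower_quartile": lower_q,
        "median": median,
        "mean": statistics.mean(prices),
        "upper_quartile": upper_q,
        "max": prices[-1],
        "count": len(prices),
    }


# Invented sales: 30 eligible records plus one excluded for its low price
sales = [
    {"price": 100000 + 1000 * i, "area_m2": 90, "arms_length": True}
    for i in range(30)
]
sales.append({"price": 15000, "area_m2": 90, "arms_length": True})
summary = price_summary(sales)
print(summary)
```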
Computation time comparison.
plos.figshare.com
xls
Updated Jun 6, 2024
Cite
Teresa-Thuong Le; Xiyue Liao (2024). Computation time comparison. [Dataset]. http://doi.org/10.1371/journal.pone.0302324.t004
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0302324.t004
Dataset updated
Jun 6, 2024
Dataset provided by
PLOS ONE
Authors
Teresa-Thuong Le; Xiyue Liao
License
Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
COVID-19 prediction has been essential to the prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, compiled from 5 data sources on Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part includes machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand which factors are most relevant to COVID-19’s occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.
Data from: A broader flight season for Norway’s Odonata across a century and...
data.niaid.nih.gov
datadryad.org
zip
Updated May 5, 2023
Cite
Michael Patten; Brittany Benson (2023). A broader flight season for Norway’s Odonata across a century and a half [Dataset]. http://doi.org/10.5061/dryad.8pk0p2nsw
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.8pk0p2nsw
Dataset updated
May 5, 2023
Dataset provided by
Nord University
Authors
Michael Patten; Brittany Benson
License
https://spdx.org/licenses/CC0-1.0.html
Area covered
Norway
Description
As global climate continues to change, so too will phenology of a wide range of insects. Changes in flight season usually are characterised as shifts to earlier dates or means, with attention less often paid to flight season breadth or whether seasons are now skewed. We amassed flight season data for the insect order Odonata, the dragonflies and damselflies, for Norway over the past century-and-a-half to examine the form of flight season change. By means of Bayesian analyses that incorporated uncertainty relative to annual variability in survey effort, we estimated shifts in flight season mean, breadth, and skew. We focussed on flight season breadth, positing that it will track documented growing season expansion. A specific mechanism explored was shifts in voltinism, the number of generations per year, which tends to increase with warming. We found strong evidence for an increase in flight season breadth but much less for a shift in mean, with any shift of the latter tending toward a later mean. Skew has become rightward for suborder Zygoptera, the damselflies, but not for Anisoptera, the dragonflies, or for the Odonata as a whole. We found weak support for voltinism as a predictor of broader flight season; instead, voltinism acted interactively with use of human-modified habitats, including decrease in shading (e.g., from timber extraction). Other potential mechanisms that link warming with broadening of flight season include protracted emergence and cohort splitting, both of which have been documented in the Odonata. It is likely that warming-induced broadening of flight seasons of these widespread insect predators will have wide-ranging consequences for freshwater ecosystems.

Methods: Data were extracted from Artsdatabanken, a public database for Norway. Data were cleaned, and useable records served as the basis for analyses.
Overdose Data (CFD)
data.cincinnati-oh.gov
csv, xlsx, xml
Updated Dec 2, 2025
Cite
City of Cincinnati (2025). Overdose Data (CFD) [Dataset]. https://data.cincinnati-oh.gov/Safety/Overdose-Data-CFD-/n6qn-tghq
Available download formats: csv, xml, xlsx
Dataset updated
Dec 2, 2025
Dataset authored and provided by
City of Cincinnati
License
U.S. Government Works, https://www.usa.gov/government-works
License information was derived automatically
Description
Data Description: Fire Incident data includes all fire incident responses. This includes emergency medical services (EMS) calls, fires, rescue incidents, and all other services handled by the Fire Department. All runs are coded according to classification: for EMS, this includes ALS (advanced life support); BLS (basic life support); etc.

Data Creation: This data is created when a run is entered into the City of Cincinnati’s computer-aided dispatch (CAD) database.

Data Created By: The source of this data is the City of Cincinnati's computer aided dispatch (CAD) database.

Refresh Frequency: This data is updated daily.

CincyInsights: In addition to our Open Data, the City of Cincinnati maintains an interactive dashboard portal, CincyInsights, in an effort to increase access to and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/6jrc-cmn5

Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.

Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit, the Office of Performance and Data Analytics applies standard processing to most raw data prior to publication. Processing includes but is not limited to: address verification, geocoding, decoding attributes, and addition of administrative areas (i.e. Census, neighborhoods, police districts, etc.).

Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad

Disclaimer: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.
