License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Example data for normally distributed and skewed datasets.
License: CC0 1.0 Universal Public Domain Dedication, https://creativecommons.org/publicdomain/zero/1.0/
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering under misspecified models. Several simulation illustrations are provided highlighting weaknesses of the clustering $R^2$, especially in high-dimensional settings. A functional data example shows how the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
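The inflation-by-stretching effect can be illustrated with a short simulation. The sketch below is not the note's own code: it computes the clustering $R^2$ as one minus the ratio of within-cluster to total sum of squares for fixed cluster labels on synthetic two-cluster data, then linearly stretches the axis that separates the clusters.

```python
import numpy as np

def clustering_r2(X, labels):
    """R^2 for a clustering: proportion of total variance explained by cluster means."""
    grand_mean = X.mean(axis=0)
    sst = ((X - grand_mean) ** 2).sum()
    ssw = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
              for k in np.unique(labels))
    return 1.0 - ssw / sst

rng = np.random.default_rng(0)
# Two clusters separated along the first axis only.
X = np.vstack([rng.normal([0, 0], 1, size=(100, 2)),
               rng.normal([5, 0], 1, size=(100, 2))])
labels = np.repeat([0, 1], 100)

r2_raw = clustering_r2(X, labels)
# "Stretch" the separating axis by a factor of 10; cluster membership is unchanged,
# yet the between-cluster variance grows faster than the total variance.
X_stretched = X * np.array([10.0, 1.0])
r2_stretched = clustering_r2(X_stretched, labels)
print(r2_raw, r2_stretched)
```

The stretched configuration reports a higher $R^2$ even though nothing about the clustering itself improved, which is the inflation the note describes.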
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Negatively skewed data arise occasionally in statistical practice; perhaps the most familiar example is the distribution of human longevity. Although other generalizations of the normal distribution exist, we demonstrate a new alternative that apparently fits human longevity data better. We propose an alternative: a normal distribution whose scale parameter is conditioned on attained age. This approach is consistent with previous findings that longevity conditioned on survival to the modal age behaves like a normal distribution. We derive such a distribution and demonstrate its accuracy in modeling human longevity data from life tables. The new distribution is characterized by (1) an intuitively straightforward genesis; (2) closed forms for the pdf, cdf, mode, quantile, and hazard functions; and (3) accessibility to non-statisticians, based on its close relationship to the normal distribution.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Objective: People often have their decisions influenced by rare outcomes, such as buying a lottery ticket and believing they will win, or not buying a product because of a few negative reviews. Previous research has attributed this tendency to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people's search behavior is biased by rare outcomes, and that they can adjust their estimates of option value to be closer to the true mean, reflecting cognitive processes that correct for sampling bias.

Methods: We recruited 180 participants through Prolific to take part in an online shopping task. On each trial, participants saw a histogram with five bins, representing the percentage of one- to five-star ratings given to a product by previous customers. They could click on each bin of the histogram to examine an individual review that gave the product the corresponding star rating; the review was represented by a number from 0–100 called the positivity score. Participants' goal was to sample the bins so as to estimate the average positivity score as closely as possible, and they were incentivized based on estimation accuracy. We varied the shape of the histograms within subjects and the number of samples between subjects to examine how rare outcomes in skewed distributions influenced sampling behavior, and whether having more samples would help people adjust their estimates toward the true mean.

Results: Binomial tests confirmed sampling biases toward rare outcomes. Compared with the 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and number of samples on estimation adjustment, defined as the difference between experienced/observed means and participants' estimates. In the negatively skewed distribution, estimates were on average 7% closer to the true mean than the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants' estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.

Conclusion: These findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human–algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.
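The core mechanism can be illustrated numerically. The sketch below uses hypothetical bin proportions and per-bin positivity scores (not the study's data), with the rare bin oversampled at roughly the 12% rate reported above, to show how biased sampling drags the experienced mean away from the true mean.

```python
import numpy as np

# Hypothetical positively skewed rating distribution: true proportions of
# 1- to 5-star reviews (the 1-star bin is the rare outcome at 1%).
true_props = np.array([0.01, 0.04, 0.10, 0.35, 0.50])
# Assumed mean positivity score (0-100) of reviews within each star bin.
bin_scores = np.array([10.0, 30.0, 50.0, 70.0, 90.0])

true_mean = true_props @ bin_scores

# Biased sampling: ~12% of clicks go to the rare bin, as reported above,
# with the remainder spread proportionally over the other bins.
sample_props = true_props.copy()
sample_props[0] = 0.12
sample_props[1:] *= (1 - 0.12) / true_props[1:].sum()

experienced_mean = sample_props @ bin_scores
# Oversampling the rare low-star bin drags the experienced mean below the
# true mean; accurate estimation requires adjusting back toward it.
print(true_mean, experienced_mean)
```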
License: CC0 1.0, https://spdx.org/licenses/CC0-1.0.html
Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold's (1983) directional gradient points in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or on additive genetic covariances between relative fitness and the trait (Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate sample sizes. Ecologically relevant traits may commonly have skewed distributions, as we exemplify here with avian laying date, a trait repeatedly described as more evolutionarily stable than expected, so this skewness should be accounted for when investigating evolutionary dynamics in the wild.
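The appearance of directional selection from skew alone can be reproduced in a minimal simulation. The sketch below is illustrative, not the authors' code: it draws a positively skewed phenotype whose mean sits exactly at the optimum of a Gaussian (stabilising) fitness function, and estimates the selection differential as the covariance between relative fitness and the trait (the Robertson-Price identity).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Positively skewed phenotype centred on the optimum (theta = 0):
# a lognormal shifted to have mean zero (lognormal(0, 0.5) has mean e**0.125).
z = rng.lognormal(mean=0.0, sigma=0.5, size=n) - np.exp(0.125)

# Stabilising (Gaussian) fitness around the optimum.
w = np.exp(-z**2 / 2)
w_rel = w / w.mean()

# Selection differential via the Robertson-Price identity: cov(relative
# fitness, trait). The long right tail receives low fitness, so despite the
# mean sitting at the optimum, apparent directional selection emerges in the
# direction opposite to the skew.
S = np.cov(w_rel, z)[0, 1]
print(S)
```

With a positive skew the estimated differential is negative, matching the prediction that the detected directional selection opposes the skew.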
License: GNU GPL v2.0, http://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html
DATA EXPLORATION Understand the characteristics of the given fields in the underlying data, such as variable distributions, whether the dataset is skewed towards a certain demographic, and the validity of the fields. For example, a training dataset may be highly skewed towards the younger age bracket; if so, how will this impact your results when using it to predict over the remaining customer base? Identify limitations surrounding the data and gather external data which may be useful for modelling purposes. This may include bringing in ABS data at different geographic levels and creating additional features for the model. For example, the geographic remoteness of different postcodes may be used as an indicator of proximity when considering whether a customer needs a bike to ride to work.
MODEL DEVELOPMENT Determine a hypothesis related to the business question that can be answered with the data, and perform statistical testing to determine whether the hypothesis holds. Create calculated fields based on existing data; for example, convert the D.O.B into an age bracket. Other fields that may be engineered include 'High Margin Product', an indicator of whether the product purchased by the customer was in a high-margin category in the past three months, based on the fields 'list_price' and 'standard cost'. Another example is calculating the distance from office to home address as a factor in determining whether customers may purchase a bicycle for transportation purposes. Additionally, this may include deciding what the predicted variable actually is: are results predicted as ordinal buckets, nominal categories, binary outcomes, or continuous values? Test the performance of the model using measures relevant to the chosen model (e.g. residual deviance, AIC, ROC curves, R squared). Appropriately document model performance, assumptions and limitations.
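As a concrete illustration of one engineered field, the sketch below converts a D.O.B column into age brackets with pandas; the column names, snapshot date, and bracket boundaries are all illustrative assumptions, not prescribed by the brief.

```python
import pandas as pd

# Hypothetical customer records; 'DOB' values and the snapshot date are illustrative.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "DOB": ["1990-06-15", "1965-01-02", "2001-11-30"]})
snapshot = pd.Timestamp("2024-01-01")

dob = pd.to_datetime(customers["DOB"])
age = (snapshot - dob).dt.days // 365  # approximate age in whole years

# Convert the continuous age into ordinal brackets for modelling.
customers["age_bracket"] = pd.cut(age,
                                  bins=[0, 25, 40, 55, 70, 120],
                                  labels=["<25", "25-40", "40-55", "55-70", "70+"])
print(customers["age_bracket"].tolist())
```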
INTERPRETATION AND REPORTING Visualisation and presentation of findings. This may involve interpreting the significant variables and coefficients from a business perspective. These slides should tell a compelling story around the business issue and support your case with quantitative and qualitative observations. Please refer to the module below for further details.
The dataset is easy to understand and self-explanatory!
It is important to keep in mind the business context when presenting your findings: 1. What are the trends in the underlying data? 2. Which customer segment has the highest customer value? 3. What do you propose should be the marketing and growth strategy?
Included is the supplementary data for Smith, B. T., Mauck, W. M., Benz, B., & Andersen, M. J. (2018). Uneven missing data skews phylogenomic relationships within the lories and lorikeets. BioRxiv, 398297. The resolution of the Tree of Life has accelerated with advances in DNA sequencing technology. To achieve dense taxon sampling, it is often necessary to obtain DNA from historical museum specimens to supplement modern genetic samples. However, DNA from historical material is generally degraded, which presents various challenges. In this study, we evaluated how the coverage at variant sites and missing data among historical and modern samples impact phylogenomic inference. We explored these patterns in the brush-tongued parrots (lories and lorikeets) of Australasia by sampling ultraconserved elements in 105 taxa. Trees estimated with low coverage characters had several clades where relationships appeared to be influenced by whether the sample came from historical or modern specimens, which were not observed when more stringent filtering was applied. To assess if the topologies were affected by missing data, we performed an outlier analysis of sites and loci, and a data reduction approach where we excluded sites based on data completeness. Depending on the outlier test, 0.15% of total sites or 38% of loci were driving the topological differences among trees, and at these sites, historical samples had 10.9x more missing data than modern ones. In contrast, 70% data completeness was necessary to avoid spurious relationships. Predictive modeling found that outlier analysis scores were correlated with parsimony informative sites in the clades whose topologies changed the most by filtering. After accounting for biased loci and understanding the stability of relationships, we inferred a more robust phylogenetic hypothesis for lories and lorikeets.
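The data-reduction step described above (excluding sites below a completeness threshold) can be sketched as follows. The alignment, missing-data rates, and 70% cutoff mirror the description, but the data are synthetic stand-ins, not the study's matrices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy matrix: 20 samples x 1000 sites; True = genotype called, False = missing.
# "Historical" samples (rows 0-9) get a much higher missing rate, mimicking
# degraded museum DNA; "modern" samples (rows 10-19) are mostly complete.
called = np.ones((20, 1000), dtype=bool)
called[:10] &= rng.random((10, 1000)) > 0.40   # historical: ~40% missing
called[10:] &= rng.random((10, 1000)) > 0.05   # modern: ~5% missing

# Data-reduction approach: retain only sites with >= 70% completeness,
# the threshold reported to avoid spurious relationships.
completeness = called.mean(axis=0)
kept = completeness >= 0.70
print(kept.sum(), "of", kept.size, "sites retained")
```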
License: Apache License, v2.0, https://www.apache.org/licenses/LICENSE-2.0
This dataset is an engineered version of the original Ames Housing dataset from the "House Prices: Advanced Regression Techniques" Kaggle competition. The goal of this engineering was to clean the data, handle missing values, encode categorical features, scale numeric features, manage outliers, reduce skewness, select useful features, and create new features to improve model performance for house price prediction.
The original dataset contains information on 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, with the target variable being SalePrice. This engineered version has undergone several preprocessing steps to make it ready for machine learning models.
Missing values in categorical columns where absence is meaningful (e.g., PoolQC) were filled with "None". Numeric columns were filled with the median, and other categorical columns with the mode. Outliers with respect to SalePrice were removed. The column count changed substantially: the original 81 columns expanded to approximately 250 after one-hot encoding, then were further reduced by feature selection, yielding a dataset with improved quality for modeling.
To add more predictive power, the following new features were created based on domain knowledge:
1. HouseAge: Age of the house at the time of sale. Calculated as YrSold - YearBuilt. This captures how old the house is, which can negatively affect price due to depreciation.
- Example: A house built in 2000 and sold in 2008 has HouseAge = 8.
2. Quality_x_Size: Interaction term between overall quality and living area. Calculated as OverallQual * GrLivArea. This combines quality and size to capture the value of high-quality large homes.
- Example: A house with OverallQual = 7 and GrLivArea = 1500 has Quality_x_Size = 10500.
3. TotalSF: Total square footage of the house. Calculated as GrLivArea + TotalBsmtSF + 1stFlrSF + 2ndFlrSF (if available). This aggregates area features into a single metric for better price prediction.
- Example: If GrLivArea = 1500 and TotalBsmtSF = 1000, TotalSF = 2500.
4. Log_LotArea: Log-transformed lot area to reduce skewness. Calculated as np.log1p(LotArea). This makes the distribution of lot sizes more normal, helping models handle extreme values.
- Example: A lot area of 10000 becomes Log_LotArea ≈ 9.21.
These new features were created using the original (unscaled) values to maintain interpretability, then scaled with RobustScaler to match the rest of the dataset.
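The four engineered features can be reproduced with a few lines of pandas. The sketch below uses the original Ames column names on illustrative toy rows (the RobustScaler step is omitted for clarity).

```python
import numpy as np
import pandas as pd

# Toy rows with the original Ames column names; values are illustrative,
# not real records from the dataset.
df = pd.DataFrame({"YrSold": [2008, 2010],
                   "YearBuilt": [2000, 1950],
                   "OverallQual": [7, 5],
                   "GrLivArea": [1500, 1100],
                   "TotalBsmtSF": [1000, 0],
                   "LotArea": [10000, 7500]})

df["HouseAge"] = df["YrSold"] - df["YearBuilt"]              # age at sale
df["Quality_x_Size"] = df["OverallQual"] * df["GrLivArea"]   # quality/size interaction
df["TotalSF"] = df["GrLivArea"] + df["TotalBsmtSF"]          # + 1stFlrSF/2ndFlrSF if available
df["Log_LotArea"] = np.log1p(df["LotArea"])                  # skew-reducing log transform

print(df[["HouseAge", "Quality_x_Size", "TotalSF", "Log_LotArea"]])
```

The first toy row reproduces the worked examples above: HouseAge = 8, Quality_x_Size = 10500, TotalSF = 2500, and Log_LotArea ≈ 9.21.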
The selected columns include features most associated with SalePrice, such as:

- OverallQual: Material and finish quality (scaled, 1-10).
- GrLivArea: Above grade (ground) living area square feet (scaled).
- GarageCars: Size of garage in car capacity (scaled).
- TotalBsmtSF: Total square feet of basement area (scaled).
- FullBath, YearBuilt, etc. (see the code for the full list).
- ExterQual: Exterior material quality (encoded as 0=Po to 4=Ex).
- BsmtQual: Basement quality (encoded as 0=None to 5=Ex).
- MSZoning_RL: 1 if residential low density, 0 otherwise.
- Neighborhood_NAmes: 1 if in NAmes neighborhood, 0 otherwise.
- HouseAge: Age of the house (scaled).
- Quality_x_Size: Overall quality times living area (scaled).
- TotalSF: Total square footage (scaled).
- Log_LotArea: Log-transformed lot area (scaled).
- SalePrice: The property's sale price in dollars (not scaled, as it's the target).

Total columns: Approximately 200-250 (after one-hot encoding and feature selection).
This dataset is derived from the Ames Housing...
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
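The type of simulation described above can be sketched with scipy. The example below is an illustrative false-positive-rate comparison under a skewed null (both groups exponential with equal means, unequal sample sizes), not the paper's exact design or parameter grid.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
reps, alpha = 2000, 0.05

fp_welch = fp_student = 0
for _ in range(reps):
    # Skewed null hypothesis: both groups from the same exponential
    # distribution (equal means), with unequal sample sizes.
    a = rng.exponential(scale=1.0, size=20)
    b = rng.exponential(scale=1.0, size=60)
    fp_welch += ttest_ind(a, b, equal_var=False).pvalue < alpha   # Welch's t-test
    fp_student += ttest_ind(a, b, equal_var=True).pvalue < alpha  # Student's t-test

print(fp_welch / reps, fp_student / reps)
```

Comparing the two empirical rates against the nominal 0.05 level across distributions and sample-size configurations is the basic design the paper's simulations follow.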
In dental epidemiology, the decayed (D), missing (M), and filled (F) teeth or surfaces index (DMF index) is a frequently used measure. The DMF index is characterized by a strongly positively skewed distribution with a large stack of zero counts for those individuals without caries experience. Therefore, standard generalized linear models often lead to a poor fit. The hurdle regression model is a highly suitable class for modeling a DMF index, but it remains underused. We aim to close the gap between the suitability of the hurdle model for fitting DMF indices and the frequency of its use in caries research. A theoretical introduction to the hurdle model is provided, and an extensive comparison with the zero-inflated model is given. Using an illustrative data example, both types of models are compared, with a special focus on the interpretation of their parameters. Accompanying R code and example data are provided as online supplementary material.
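Although the paper supplies R code, the two parts of a hurdle model are easy to sketch in Python: a binary model for whether the hurdle is crossed at all, and a zero-truncated Poisson for the positive counts. The intercept-only example below uses simulated DMF-like data; a real analysis would add covariates to both parts.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(4)
n, p_true, lam_true = 5000, 0.4, 3.0

# Part 1 (hurdle): does the individual have any caries experience at all?
positive = rng.random(n) < p_true
# Part 2: zero-truncated Poisson counts for those crossing the hurdle,
# sampled by redrawing any zeros.
y = np.zeros(n, dtype=int)
idx = np.flatnonzero(positive)
while idx.size:
    draw = rng.poisson(lam_true, idx.size)
    y[idx] = draw
    idx = idx[draw == 0]

# Fit both parts by maximum likelihood.
p_hat = (y > 0).mean()                      # binary hurdle part (intercept only)

pos = y[y > 0]
def neg_loglik(lam):
    # Zero-truncated Poisson log-likelihood: pmf = e^{-lam} lam^y / (y! (1 - e^{-lam}))
    return -(pos * np.log(lam) - lam - gammaln(pos + 1)
             - np.log1p(-np.exp(-lam))).sum()
lam_hat = minimize_scalar(neg_loglik, bounds=(0.1, 20), method="bounded").x

print(p_hat, lam_hat)
```

Because the two likelihood components are separable, each part can be estimated independently, which is what makes the hurdle model straightforward to fit and interpret.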
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Supplementary Material 2: A supplementary file with examples of STATA script for all models that have been fitted in this paper.
This paper contains the description of the algorithm STCSSP and its interface. STCSSP is a FORTRAN subroutine that computes a structured staircase form for a real (skew-)symmetric / (skew-)symmetric matrix pencil, i.e., a pencil where each of the two matrices is either symmetric or skew-symmetric. An example of how to call the subroutine is given.
We discuss the numerical solution of eigenvalue problems for matrix polynomials whose coefficient matrices alternate between symmetric and skew-symmetric, or between Hamiltonian and skew-Hamiltonian. We discuss several applications that lead to such structures. Matrix polynomials of this type have a symmetry in the spectrum that is the same as that of Hamiltonian matrices or skew-Hamiltonian/Hamiltonian pencils. The numerical methods that we derive are designed to preserve this eigenvalue symmetry. We also discuss linearization techniques that transform the polynomial into a skew-Hamiltonian/Hamiltonian linear eigenvalue problem with a specific substructure. For this linear eigenvalue problem we discuss special factorizations that are useful in shift-and-invert Krylov subspace methods for the solution of the eigenvalue problem. We present a numerical example that demonstrates the effectiveness of our approach.
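The eigenvalue symmetry discussed above can be checked numerically for the linear (Hamiltonian) case. The sketch below builds a random real Hamiltonian matrix and verifies that its spectrum is symmetric under negation; it is an illustration of the structure, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4

# A real Hamiltonian matrix has the block form H = [[A, G], [Q, -A^T]]
# with G and Q symmetric.
A = rng.standard_normal((m, m))
G = rng.standard_normal((m, m)); G = G + G.T
Q = rng.standard_normal((m, m)); Q = Q + Q.T
H = np.block([[A, G], [Q, -A.T]])

ev = np.linalg.eigvals(H)
# Spectral symmetry: for every eigenvalue lambda of H, -lambda is also an
# eigenvalue. Structure-preserving methods aim to keep this pairing exact
# in finite precision, which a general unstructured solver does not guarantee.
pairing_error = np.abs(ev[:, None] + ev[None, :]).min(axis=1).max()
print(pairing_error)
```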
Sediment particle size frequency distributions from the USNL (United States Naval Laboratory) box cores were determined optically using a Malvern Mastersizer 2000 He-Ne laser diffraction sizer and were used to resolve mean particle size, sorting, skewness and kurtosis.
Samples were collected on cruises JR16006 and JR17007.
Funding was provided by "The Changing Arctic Ocean Seafloor (ChAOS) - how changing sea ice conditions impact biological communities, biogeochemical processes and ecosystems" project (NE/N015894/1 and NE/P006426/1, 2017-2021), part of the NERC funded Changing Arctic Ocean programme.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Supplementary Material 3: A supplementary file with examples of SAS script for all models that have been fitted in this paper.
License: MIT License, https://opensource.org/licenses/MIT
Dataset Card: Loan Prediction Dataset
Overview The Loan Prediction Dataset is designed for binary classification tasks where the goal is to predict whether a loan should be approved or denied based on applicant information. This dataset includes demographic and financial features commonly used by financial institutions to assess loan eligibility.
Dataset Contents The dataset consists of 1,000 samples, each representing an individual loan application with the following features:
Grain size of 139 unconsolidated sediment samples from seven DSDP sites in the Guaymas Basin and at the southeastern tip of the Baja California Peninsula was determined by sieve and pipette techniques. The Shepard (1954) classification and Inman (1952) parameters were used for all samples. Sediment texture ranged from sand to silty clay. On the basis of grain-size parameters, the sediments can be divided into three broad groups: (1) very fine sands and coarse silts; (2) medium to very fine silts; and (3) clays and coarse silts.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
Principal component analysis (PCA) is a popular dimension-reduction method to reduce the complexity and obtain the informative aspects of high-dimensional datasets. When the data distribution is skewed, data transformation is commonly used prior to applying PCA. Such transformation is usually obtained from previous studies, prior knowledge, or trial-and-error. In this work, we develop a model-based method that integrates data transformation in PCA and finds an appropriate data transformation using the maximum profile likelihood. Extensions of the method to handle functional data and missing values are also developed. Several numerical algorithms are provided for efficient computation. The proposed method is illustrated using simulated and real-world data examples. Supplementary materials for this article are available online.
License: Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
We introduce a novel framework that leverages the Box-Cox transformation to address the skewed distribution of samples and incorporates additional critical predictors to enhance the accuracy of soil salinity (measured as electrical conductivity of the saturated soil extract, ECe) and sodicity (measured as exchangeable sodium percentage, ESP) estimates.
We provide high-resolution (1 km × 1 km) global maps of soil salinity and sodicity from 1980 to 2022 in GeoTIFF file format for each year.
Note: the scale factor for ECe and ESP maps is 0.001
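A minimal sketch of working with these data, combining the 0.001 scale factor from the note above with the Box-Cox transformation mentioned in the framework. The values here are synthetic stand-ins, not read from the actual GeoTIFFs, and the lognormal shape is only an assumed example of a positively skewed salinity sample.

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(6)

# Stand-in for stored map values: the maps store integers that must be
# multiplied by the 0.001 scale factor to recover ECe (or ESP) values.
stored = rng.lognormal(mean=8.0, sigma=1.0, size=10_000).astype(np.int64)
ece = stored * 0.001

# Box-Cox transformation, as in the framework above, to reduce the strong
# positive skew of the sample before further modeling.
transformed, lmbda = boxcox(ece)
print(skew(ece), skew(transformed))
```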