License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that the $R^2$ can generally be artificially inflated by linearly transforming the data, both by "stretching" and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering under misspecified models. Several simulation illustrations are provided highlighting weaknesses of the clustering $R^2$, especially in high-dimensional settings. A functional data example shows that the $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
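As a rough sketch of the statistic discussed (not the authors' code), the clustering $R^2$ is the between-cluster sum of squares over the total sum of squares, and "stretching" a coordinate that separates the clusters inflates it. The two-dimensional data and partition below are made up for illustration:

```python
def cluster_r2(points, labels):
    """R^2 for clustering: between-cluster SS over total SS."""
    n, dim = len(points), len(points[0])
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    groups = {}
    for p, g in zip(points, labels):
        groups.setdefault(g, []).append(p)
    total_ss = sum((p[d] - grand[d]) ** 2 for p in points for d in range(dim))
    between_ss = 0.0
    for members in groups.values():
        m = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between_ss += len(members) * sum((m[d] - grand[d]) ** 2 for d in range(dim))
    return between_ss / total_ss

# Two clusters separated in the first coordinate, noise in the second.
pts = [(0.0, 0.0), (0.2, 1.0), (0.1, -1.0), (2.0, 0.0), (2.2, 1.0), (2.1, -1.0)]
labs = [0, 0, 0, 1, 1, 1]
r2_orig = cluster_r2(pts, labs)
# Linearly "stretch" the separating coordinate: same partition, larger R^2.
r2_stretched = cluster_r2([(5.0 * x, y) for x, y in pts], labs)
```

The partition is unchanged by the stretch, yet the explained-variance share rises, which is the inflation effect the note describes.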
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
Objective: People's decisions are often influenced by rare outcomes, such as buying a lottery ticket and believing they will win, or not buying a product because of a few negative reviews. Previous research has attributed this tendency to cognitive issues such as flaws in probability weighting. In this study we examine an alternative hypothesis: that people's search behavior is biased toward rare outcomes, and that they can adjust their estimates of option value toward the true mean, reflecting cognitive processes that correct for sampling bias.
Methods: We recruited 180 participants through Prolific to take part in an online shopping task. On each trial, participants saw a histogram with five bins, representing the percentage of one- to five-star ratings that previous customers gave a product. They could click on each bin of the histogram to examine an individual review that gave the product the corresponding star rating; each review was represented by a number from 0-100 called the positivity score. The participants' goal was to sample the bins so as to estimate the average positivity score as closely as possible, and they were incentivized based on estimation accuracy. We varied the shape of the histograms within subjects and the number of samples between subjects to examine how rare outcomes in skewed distributions influenced sampling behavior and whether having more samples would help people adjust their estimates toward the true mean.
Results: Binomial tests confirmed sampling biases toward rare outcomes. Compared with the 1% expected under unbiased sampling, participants allocated 11% and 12% of samples to the rarest outcome bin in the negatively and positively skewed conditions, respectively (ps < 0.001). A Bayesian linear mixed-effects analysis examined the effect of skewness and number of samples on estimation adjustment, defined as the difference between experienced/observed means and participants' estimates. In the negatively skewed condition, estimates were on average 7% closer to the true mean than the observed means (10-sample ∆ = −0.07, 95% CI [−0.08, −0.06]; 20-sample ∆ = −0.07, 95% CI [−0.08, −0.06]). In the positively skewed condition, estimates also moved closer to the true mean (10-sample ∆ = 0.02, 95% CI [0.01, 0.04]; 20-sample ∆ = 0.03, 95% CI [0.02, 0.04]). Still, participants' estimates deviated from the true mean by about 9.3% on average, underscoring the persistent influence of sampling bias.
Conclusion: These findings demonstrate how search biases systematically affect distributional judgments and how cognitive processes interact with biased sampling. The results have implications for human-algorithm interactions in areas such as e-commerce, social media, and politically sensitive decision-making contexts.
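The core of the sampling-bias logic described above can be sketched in a few lines: if the rare low-rating bin is oversampled, the plain mean of the sampled reviews underestimates the distribution mean, leaving a gap that participants must adjust for. The bin values and frequencies below are hypothetical, chosen only to mirror the negatively skewed condition:

```python
# Hypothetical negatively skewed ratings: mostly high positivity scores.
bin_values = [10, 30, 50, 70, 90]            # positivity score per star bin
true_freqs = [0.01, 0.04, 0.10, 0.25, 0.60]  # displayed histogram frequencies
true_mean = sum(v * f for v, f in zip(bin_values, true_freqs))

# Biased sampling: the 1%-frequency bin draws ~11% of samples, as observed.
sample_freqs = [0.11, 0.06, 0.10, 0.23, 0.50]
experienced_mean = sum(v * f for v, f in zip(bin_values, sample_freqs))

# The gap an accurate estimator must close relative to what was experienced.
bias = true_mean - experienced_mean
```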
Median and trimmed-mean inflation rates tend to be useful estimates of trend inflation over long periods, but they can exhibit persistent departures from the underlying trend over shorter horizons. In this Commentary, we document that the extent of this bias is related to the degree of skewness in the distribution of price changes. The shift in the skewness of the cross-sectional price-change distribution during the pandemic means that median PCE and trimmed-mean PCE inflation rates have recently been understating the trend in PCE inflation by about 15 and 35 basis points, respectively.
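The two trend measures compared above can be sketched simply; the trim share and example price changes below are illustrative and are not the actual component-weighted construction of median or trimmed-mean PCE. With a right-skewed cross-section, both robust measures sit below the simple mean, which is the kind of gap the Commentary quantifies:

```python
import statistics

def trimmed_mean(values, trim=0.25):
    """Mean after dropping the lowest and highest `trim` share of values."""
    s = sorted(values)
    k = int(len(s) * trim)
    core = s[k:len(s) - k] if k else s
    return sum(core) / len(core)

# A right-skewed set of monthly price changes (percent): the long upper
# tail pulls the simple mean above the median and the trimmed mean.
changes = [-0.2, 0.0, 0.1, 0.1, 0.2, 0.3, 0.4, 1.5, 2.8, 6.0]
simple_mean = sum(changes) / len(changes)
med = statistics.median(changes)
tmean = trimmed_mean(changes, trim=0.2)
```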
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
Observed phenotypic responses to selection in the wild often differ from predictions based on measurements of selection and genetic variance. An overlooked hypothesis to explain this paradox of stasis is that a skewed phenotypic distribution affects natural selection and evolution. We show through mathematical modelling that, when a trait selected for an optimum phenotype has a skewed distribution, directional selection is detected even at evolutionary equilibrium, where it causes no change in the mean phenotype. When environmental effects are skewed, Lande and Arnold's (1983) directional gradient points in the direction opposite to the skew. In contrast, skewed breeding values can displace the mean phenotype from the optimum, causing directional selection in the direction of the skew. These effects can be partitioned out using alternative selection estimates based on average derivatives of individual relative fitness, or on additive genetic covariances between relative fitness and the trait (the Robertson-Price identity). We assess the validity of these predictions using simulations of selection estimation under moderate sample sizes. Ecologically relevant traits may commonly have skewed distributions, as we exemplify here with avian laying date (repeatedly described as more evolutionarily stable than expected), so this skewness should be accounted for when investigating evolutionary dynamics in the wild.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This database includes simulated data showing the accuracy of estimated probability distributions of project durations when limited data are available for the project activities. The base project networks are taken from PSPLIB. Then, various stochastic project networks are synthesized by changing the variability and skewness of project activity durations.
Number of variables: 20
Number of cases/rows: 114240
Variable List:
• Experiment ID: The ID of the experiment
• Experiment for network: The ID of the experiment for each of the synthesized networks
• Network ID: ID of the synthesized network
• #Activities: Number of activities in the network, including start and finish activities
• Variability: Variance of the activities in the network (this value can be either high, low, medium or rand, where rand shows a random combination of low, high and medium variance in the network activities.)
• Skewness: Skewness of the activities in the network (Skewness can be either right, left, None or rand, where rand shows a random combination of right, left, and none skewed in the network activities)
• Fitted distribution type: Distribution type used to fit on sampled data
• Sample size: Number of sampled data used for the experiment resembling limited data condition
• Benchmark 10th percentile: 10th percentile of project duration in the benchmark stochastic project network
• Benchmark 50th percentile: 50th percentile of project duration in the benchmark stochastic project network
• Benchmark 90th percentile: 90th percentile of project duration in the benchmark stochastic project network
• Benchmark mean: Mean project duration in the benchmark stochastic project network
• Benchmark variance: Variance project duration in the benchmark stochastic project network
• Experiment 10th percentile: 10th percentile of project duration distribution for the experiment
• Experiment 50th percentile: 50th percentile of project duration distribution for the experiment
• Experiment 90th percentile: 90th percentile of project duration distribution for the experiment
• Experiment mean: Mean of project duration distribution for the experiment
• Experiment variance: Variance of project duration distribution for the experiment
• K-S: Kolmogorov–Smirnov statistic comparing the benchmark distribution and the project duration distribution of the experiment
• P_value: the p-value based on the distance calculated in the K-S test
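The K-S field described above is the two-sample Kolmogorov–Smirnov distance: the largest vertical gap between the empirical CDFs of the benchmark and experiment durations. A minimal stdlib sketch, with made-up duration samples (not taken from the database):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S distance: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_vals, x):
        return bisect.bisect_right(sorted_vals, x) / len(sorted_vals)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Hypothetical benchmark vs. experiment project durations (days).
benchmark = [40, 42, 45, 47, 50, 52, 55, 60]
experiment = [44, 46, 48, 51, 53, 56, 58, 62]
d = ks_statistic(benchmark, experiment)
```

In practice one would convert `d` to a p-value (as the P_value column does), e.g. via the asymptotic K-S distribution.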
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This repository contains raw data files and base codes to analyze them.
A. The 'powerx_y.xlsx' files are the data files with the one-dimensional trajectory of optically trapped probes modulated by an Ornstein-Uhlenbeck noise of a given amplitude 'x'. For the corresponding diffusion amplitude A = 0.1×(0.6×10⁻⁶)² m²/s, x is labelled '1'.
B. The codes are of three types. The skewness codes calculate the skewness of the trajectory. The error_in_fit codes calculate deviations from arcsine behavior. The sigma_exp codes point to the deviation of the mean from 0.5. All the codes are written three times to look at T+, Tlast and Tmax.
C. More information can be found in the manuscript.
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
This paper evaluates the claim that Welch's t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that, empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious about software defaults, such as R's favoring of Welch's test.
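For readers comparing the two tests, the statistics differ only in how the standard error and degrees of freedom are formed; a minimal sketch (illustrative samples, not the paper's simulation code):

```python
import math

def sample_stats(x):
    n = len(x)
    m = sum(x) / n
    v = sum((xi - m) ** 2 for xi in x) / (n - 1)  # unbiased variance
    return n, m, v

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, mx, vx = sample_stats(x)
    ny, my, vy = sample_stats(y)
    se2 = vx / nx + vy / ny
    t = (mx - my) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

def student_t(x, y):
    """Independent-samples t statistic with pooled variance."""
    nx, mx, vx = sample_stats(x)
    ny, my, vy = sample_stats(y)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    t = (mx - my) / math.sqrt(sp2 * (1 / nx + 1 / ny))
    return t, nx + ny - 2

# With equal variances and equal n, the two statistics coincide.
tw, dfw = welch_t([1, 2, 3, 4], [2, 3, 4, 5])
ts, dfs = student_t([1, 2, 3, 4], [2, 3, 4, 5])
```

The tests diverge when variances or sample sizes are unequal, which is exactly the regime the simulations above probe.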
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Species interactions in food webs are usually recognized as dynamic, varying across species, space and time due to biotic and abiotic drivers. Yet food webs also show emergent properties that appear consistent, such as a skewed frequency distribution of interaction strengths (many weak, few strong). Reconciling these two properties requires an understanding of the variation in pairwise interaction strengths and its underlying mechanisms. We estimated stream sculpin feeding rates in three seasons at nine sites in Oregon to examine variation in trophic interaction strengths both across and within predator-prey pairs. Predator and prey densities, prey body mass, and abiotic factors were considered as putative drivers of within-pair variation over space and time. We hypothesized that consistently skewed interaction strength distributions could result if individual interaction strengths show relatively little variation, or alternatively, if interaction strengths vary but shift in ways that conserve their overall frequency distribution. Feeding rate distributions remained consistently and positively skewed across all sites and seasons. The mean coefficient of variation in feeding rates within each of 25 focal species pairs across surveys was less than half the mean coefficient of variation seen across species pairs within a survey. The rank order of feeding rates also remained conserved across streams, seasons and individual surveys. On average, feeding rates on each prey taxon nonetheless varied by a hundredfold, with some feeding rates showing more variation in space and others in time. In general, feeding rates increased with prey density and decreased with high stream flows and low water temperatures, although for nearly half of all species pairs, factors other than prey density explained the most variation. 
Our findings show that although individual interaction strengths exhibit considerable variation in space and time, they can nonetheless remain relatively consistent, and thus predictable, compared to the even larger variation that occurs across species pairs. These results highlight how the ecological scale of inference can strongly shape conclusions about interaction strength consistency and collectively help reconcile how the skewed nature of interaction strength distributions can persist in highly dynamic food webs.
This dataset reflects the recorded times that it took for 72 participants to transcribe an Arabic text, and 78 participants to transcribe an English text, both by paper and by smartphone. (Note that Participant 48 in the English subgroup was identified as an outlier, as times for smartphone entry were over 5 SD away from the mean.) All data points are times (in seconds).
It was hypothesized, based on precursor research, that handwriting would be faster than smartphone entry for participants writing in their second language. This hypothesis was supported by the data. Also, the non-normal distributions of the English subgroups (the second language of the participants) are typical of research based on self-paced actions (in this case, self-paced writing). Both subgroups of the English data were positively skewed.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset contains vibration data collected from a geoscope sensor to analyze human activities (walking, running, and waiting). The data is segmented into 3-second time windows, with each window containing 120 rows of data per person. The dataset consists of 1800 rows of data from five individuals: Furkan, Enes, Yusuf, Alihan and Emir.
Each person’s activity is classified into one of the three categories: walking, running, or standing still. The data includes both statistical and frequency-domain features extracted from the raw vibration signals, detailed below:
Statistical Features:
- Mean: The average value of the signal over the time window.
- Median: The middle value of the signal, dividing the data into two equal halves.
- Standard Deviation: A measure of how much the signal deviates from its mean, indicating the signal's variability.
- Minimum: The smallest value in the signal during the time window.
- Maximum: The largest value in the signal during the time window.
- First Quartile (Q1): The median of the lower half of the data, representing the 25th percentile.
- Third Quartile (Q3): The median of the upper half of the data, representing the 75th percentile.
- Skewness: A measure of the asymmetry of the signal distribution, showing whether the data is skewed to the left or right.
Frequency-Domain Features:
- Dominant Frequency: The frequency with the highest power, providing insights into the primary periodicity of the signal.
- Signal Energy: The total energy of the signal, representing the sum of the squared signal values over the time window.
Dataset Overview:
- Total Rows: 1800
- Number of Individuals: 5 (Furkan, Enes, Yusuf, Alihan, Emir)
- Activity Types: Walking, Running, Waiting (Standing Still)
- Time Frame: 3-second time windows (120 rows per individual for each activity)
- Features: Statistical and frequency-domain features (as described above)
This dataset is suitable for training models on activity recognition, user identification, and other related tasks. It provides rich, detailed features that can be used for various classification and analysis applications.
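The per-window feature extraction described above can be sketched with the standard library alone. This is not the dataset's own pipeline: the 40 Hz sample rate is an assumption (120 samples over a 3-second window), and a naive DFT stands in for an FFT:

```python
import math
import statistics

def window_features(signal, sample_rate_hz=40.0):
    """Statistical and frequency-domain features for one time window."""
    n = len(signal)
    mean = statistics.fmean(signal)
    q1, med, q3 = statistics.quantiles(signal, n=4)
    std = statistics.stdev(signal)
    # Moment-based skewness: third central moment over stdev cubed.
    skew = (sum((x - mean) ** 3 for x in signal) / n) / std ** 3
    energy = sum(x * x for x in signal)  # sum of squared samples
    # Dominant frequency via a naive DFT over the non-DC bins.
    best_k, best_power = 1, -1.0
    for k in range(1, n // 2):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(signal))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(signal))
        power = re * re + im * im
        if power > best_power:
            best_k, best_power = k, power
    return {"mean": mean, "median": med, "std": std,
            "min": min(signal), "max": max(signal), "q1": q1, "q3": q3,
            "skewness": skew, "dominant_freq_hz": best_k * sample_rate_hz / n,
            "energy": energy}

# A 3-second window (120 samples at 40 Hz) of a pure 4 Hz tone.
example = [math.sin(2 * math.pi * 4 * i / 40.0) for i in range(120)]
feats = window_features(example)
```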
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
This dataset is designed for financial risk assessment and loan default prediction using machine learning techniques. It includes 300 records, each representing an individual with financial attributes that influence the likelihood of loan default.
The dataset contains the following columns:
| Column Name | Type | Description |
|---|---|---|
| Retirement_Age | float | Age at which the individual retires (left-skewed distribution). |
| Debt_Amount | float | Total debt held by the individual in dollars (right-skewed distribution). |
| Monthly_Savings | float | Average monthly savings in dollars (normally distributed). |
| Loan_Default_Risk | int (0/1) | Target variable: 1 = Default, 0 = No Default. |
The dataset was synthetically created using statistical distributions that mimic real-world financial behavior:
🔹 Retirement Age (Left-Skewed): Generated using a transformed normal distribution to ensure most values are high (60-85).
🔹 Debt Amount (Right-Skewed): Generated using a log-normal distribution, where most people have low debt, but a few have very high debt.
🔹 Monthly Savings (Symmetric): Normally distributed with mean $2,000 and standard deviation $500, clipped between $500 and $5,000.
🔹 Loan Default Risk (Target Variable): Computed using a logistic function, where:
- Lower retirement age ⬆ default risk
- Higher debt ⬆ default risk
- Higher savings ⬇ default risk
- The probability threshold was adjusted to balance 0s and 1s.
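The generation recipe above can be sketched as follows. This is a hedged reconstruction, not the dataset's actual script: the log-normal parameters, logistic coefficients, and the left-skew trick (reflecting a half-normal draw below the upper bound) are illustrative guesses matching the stated shapes:

```python
import math
import random

random.seed(0)  # reproducible sketch
rows = []
for _ in range(300):
    # Left-skewed retirement age: reflect a half-normal draw below 85,
    # clipped to [60, 85] so most values are high.
    age = max(60.0, min(85.0, 85.0 - abs(random.gauss(0.0, 6.0))))
    # Right-skewed debt: log-normal, mostly low with a long upper tail.
    debt = random.lognormvariate(9.5, 0.8)
    # Symmetric savings: normal(2000, 500), clipped to [500, 5000].
    savings = max(500.0, min(5000.0, random.gauss(2000.0, 500.0)))
    # Logistic default probability: lower age and higher debt raise risk,
    # higher savings lowers it. Coefficients are illustrative only.
    z = -0.08 * (age - 70.0) + 0.00005 * (debt - 15000.0) - 0.001 * (savings - 2000.0)
    p = 1.0 / (1.0 + math.exp(-z))
    rows.append((age, debt, savings, int(random.random() < p)))
```

In the real dataset the probability threshold was additionally tuned to balance the two classes.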
License: Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
License information was derived automatically
COVID-19 prediction has been essential in aiding the prevention and control of the disease. The motivation of this case study is to develop predictive models for COVID-19 cases and deaths based on a cross-sectional data set with a total of 28,955 observations and 18 variables, compiled from 5 data sources on Kaggle. A two-part modeling framework, in which the first part is a logistic classifier and the second part comprises machine learning or statistical smoothing methods, is introduced to model the highly skewed distribution of COVID-19 cases and deaths. We also aim to understand which factors are most relevant to COVID-19's occurrence and fatality. Evaluation criteria such as root mean squared error (RMSE) and mean absolute error (MAE) are used. We find that the two-part XGBoost model performs best at predicting the entire distribution of COVID-19 cases and deaths. The most important factors relevant to either COVID-19 cases or deaths include population and the rate of primary care physicians.
License: CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
License information was derived automatically
Missing data is a frequent occurrence in both small and large datasets. Among other things, missingness may be the result of coding or computer error, participant absences, or it may be intentional, as in a planned missing design. Whatever the cause, the problem of how to approach a dataset with holes is of much relevance in scientific research. First, missingness is approached as a theoretical construct, and its impacts on data analysis are examined. I discuss missingness as it relates to structural equation modeling and model fit indices, specifically its interaction with the Root Mean Square Error of Approximation (RMSEA). Data simulation is used to show that RMSEA has a downward bias with missing data, yielding skewed fit indices. Two alternative formulas for RMSEA calculation are proposed: one correcting the degrees of freedom and one using Kullback-Leibler divergence, which results in an RMSEA calculation that is relatively independent of missingness. Simulations conducted in Java indicate that the Kullback-Leibler divergence provides the better correction for RMSEA calculation. Next, I approach missingness in an applied manner with an existing large dataset examining ideology measures. The researchers assessed ideology using a planned missingness design, resulting in high proportions of missing data. Factor analysis was performed to gauge the uniqueness of ideology measures.
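For reference, the standard RMSEA point estimate the text builds on (not the corrected versions the paper proposes) is sqrt(max(chi2 − df, 0) / (df · (N − 1))); a minimal sketch:

```python
import math

def rmsea(chi2, df, n):
    """Standard RMSEA point estimate from the model chi-square,
    degrees of freedom, and sample size (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

perfect = rmsea(54.0, 54, 500)    # chi-square equal to df -> 0.0
moderate = rmsea(120.0, 54, 500)  # some misfit, small positive value
```

With missing data the chi-square itself shrinks, which is how the downward bias described above enters this formula.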
License: U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Calls For Service are the events captured in an agency’s Computer-Aided Dispatch (CAD) system used to facilitate incident response.
This dataset includes both proactive and reactive police incident data.
The source of this data is the City of Cincinnati's computer-aided dispatch (CAD) database.
This data is updated daily.
DISCLAIMER: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.
License: Open Government Licence 3.0 (http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/)
License information was derived automatically
Annual descriptive price statistics for each calendar year 2005–2024 for 11 Local Government Districts in Northern Ireland. The statistics include:
• Minimum sale price
• Lower quartile sale price
• Median sale price
• Simple mean sale price
• Upper quartile sale price
• Maximum sale price
• Number of verified sales
Prices are available where at least 30 sales were recorded in the area within the calendar year which could be included in the regression model, i.e. the following sales are excluded:
• Non arms-length sales
• Sales of properties where the habitable space is less than 30m2 or greater than 1000m2
• Sales less than £20,000
Annual median or simple mean prices should not be used to calculate property price change over time. The quality of the properties that are sold (where quality refers to the combination of all characteristics of a residential property, both physical and locational) may differ from one time period to another. For example, sales in one quarter could be disproportionately skewed towards low-quality properties, producing a biased estimate of average price. The median and simple mean prices are not 'standardised', so the varying mix of properties sold in each quarter could give a false impression of the actual change in prices. To calculate the pure property price change over time it is necessary to compare like with like, which can only be achieved if the 'characteristics-mix' of properties traded is standardised. To calculate pure price change over time, please use the standardised prices in the NI House Price Index Detailed Statistics file.
License: CC0 1.0 (https://spdx.org/licenses/CC0-1.0.html)
As global climate continues to change, so too will the phenology of a wide range of insects. Changes in flight season are usually characterised as shifts to earlier dates or means, with attention less often paid to flight season breadth or whether seasons are now skewed. We amassed flight season data for the insect order Odonata, the dragonflies and damselflies, for Norway over the past century and a half to examine the form of flight season change. By means of Bayesian analyses that incorporated uncertainty relative to annual variability in survey effort, we estimated shifts in flight season mean, breadth, and skew. We focussed on flight season breadth, positing that it will track documented growing season expansion. A specific mechanism explored was shifts in voltinism, the number of generations per year, which tends to increase with warming. We found strong evidence for an increase in flight season breadth but much less for a shift in mean, with any shift of the latter tending toward a later mean. Skew has become rightward for the suborder Zygoptera, the damselflies, but not for the Anisoptera, the dragonflies, or for the Odonata as a whole. We found weak support for voltinism as a predictor of a broader flight season; instead, voltinism acted interactively with the use of human-modified habitats, including a decrease in shading (e.g., from timber extraction). Other potential mechanisms that link warming with broadening of flight season include protracted emergence and cohort splitting, both of which have been documented in the Odonata. It is likely that warming-induced broadening of the flight seasons of these widespread insect predators will have wide-ranging consequences for freshwater ecosystems.
Methods: Data were extracted from Artsdatabanken, a public database for Norway. Data were cleaned, and usable records served as the basis for analyses.
License: U.S. Government Works (https://www.usa.gov/government-works)
License information was derived automatically
Data Description: Fire Incident data includes all fire incident responses. This includes emergency medical services (EMS) calls, fires, rescue incidents, and all other services handled by the Fire Department. All runs are coded according to classification: for EMS, this includes ALS (advanced life support); BLS (basic life support); etc.
Data Creation: This data is created when a run is entered into the City of Cincinnati’s computer-aided dispatch (CAD) database.
Data Created By: The source of this data is the City of Cincinnati's computer aided dispatch (CAD) database.
Refresh Frequency: This data is updated daily.
CincyInsights: The City of Cincinnati maintains an interactive dashboard portal, CincyInsights, in addition to our Open Data, in an effort to increase access to and usage of city data. This data set has an associated dashboard available here: https://insights.cincinnati-oh.gov/stories/s/6jrc-cmn5
Data Dictionary: A data dictionary providing definitions of columns and attributes is available as an attachment to this dataset.
Processing: The City of Cincinnati is committed to providing the most granular and accurate data possible. In that pursuit, the Office of Performance and Data Analytics applies standard processing to most raw data prior to publication. Processing includes, but is not limited to: address verification, geocoding, decoding attributes, and addition of administrative areas (i.e., Census, neighborhoods, police districts, etc.).
Data Usage: For directions on downloading and using open data please visit our How-to Guide: https://data.cincinnati-oh.gov/dataset/Open-Data-How-To-Guide/gdr9-g3ad
Disclaimer: In compliance with privacy laws, all Public Safety datasets are anonymized and appropriately redacted prior to publication on the City of Cincinnati’s Open Data Portal. This means that for all public safety datasets: (1) the last two digits of all addresses have been replaced with “XX,” and in cases where there is a single digit street address, the entire address number is replaced with "X"; and (2) Latitude and Longitude have been randomly skewed to represent values within the same block area (but not the exact location) of the incident.