3 datasets found

f
Dataset for: Some Remarks on the R2 for Clustering
wiley.figshare.com
txt
Updated Jun 1, 2023
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1
Explore at:
txtAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6124508.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Wiley
Authors
Nicola Loperfido; Thaddeus Tarpey
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by ``stretching'' and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how that $R^2$ for clustering can vary dramatically depending on how the curves are estimated.
u
Results and analysis using the Lean Six-Sigma define, measure, analyze,...
researchdata.up.ac.za
docx
Updated Mar 12, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Modiehi Mophethe (2024). Results and analysis using the Lean Six-Sigma define, measure, analyze, improve, and control (DMAIC) Framework [Dataset]. http://doi.org/10.25403/UPresearchdata.25370374.v1
Explore at:
docxAvailable download formats
Unique identifier
https://doi.org/10.25403/UPresearchdata.25370374.v1
Dataset updated
Mar 12, 2024
Dataset provided by
University of Pretoria
Authors
Modiehi Mophethe
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This section presents a discussion of the research data. The data was received as secondary data however, it was originally collected using the time study techniques. Data validation is a crucial step in the data analysis process to ensure that the data is accurate, complete, and reliable. Descriptive statistics was used to validate the data. The mean, mode, standard deviation, variance and range determined provides a summary of the data distribution and assists in identifying outliers or unusual patterns. The data presented in the dataset show the measures of central tendency which includes the mean, median and the mode. The mean signifies the average value of each of the factors presented in the tables. This is the balance point of the dataset, the typical value and behaviour of the dataset. The median is the middle value of the dataset for each of the factors presented. This is the point where the dataset is divided into two parts, half of the values lie below this value and the other half lie above this value. This is important for skewed distributions. The mode shows the most common value in the dataset. It was used to describe the most typical observation. These values are important as they describe the central value around which the data is distributed. The mean, mode and median give an indication of a skewed distribution as they are not similar nor are they close to one another. In the dataset, the results and discussion of the results is also presented. This section focuses on the customisation of the DMAIC (Define, Measure, Analyse, Improve, Control) framework to address the specific concerns outlined in the problem statement. To gain a comprehensive understanding of the current process, value stream mapping was employed, which is further enhanced by measuring the factors that contribute to inefficiencies. These factors are then analysed and ranked based on their impact, utilising factor analysis. To mitigate the impact of the most influential factor on project inefficiencies, a solution is proposed using the EOQ (Economic Order Quantity) model. The implementation of the 'CiteOps' software facilitates improved scheduling, monitoring, and task delegation in the construction project through digitalisation. Furthermore, project progress and efficiency are monitored remotely and in real time. In summary, the DMAIC framework was tailored to suit the requirements of the specific project, incorporating techniques from inventory management, project management, and statistics to effectively minimise inefficiencies within the construction project.
4
Supplementary data for the paper "Why psychologists should not default to...
data.4tu.nl
zip
Updated Apr 28, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Joost de Winter (2025). Supplementary data for the paper "Why psychologists should not default to Welch’s t-test instead of Student’s t-test (and why the Anderson–Darling test is an underused alternative)" [Dataset]. http://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.4121/e8e6861a-7ab0-4b6d-bd67-5f95029322c5.v3
Dataset updated
Apr 28, 2025
Dataset provided by
4TU.ResearchData
Authors
Joost de Winter
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
This paper evaluates the claim that Welch’s t-test (WT) should replace the independent-samples t-test (IT) as the default approach for comparing sample means. Simulations involving unequal and equal variances, skewed distributions, and different sample sizes were performed. For normal distributions, we confirm that the WT maintains the false positive rate close to the nominal level of 0.05 when sample sizes and standard deviations are unequal. However, the WT was found to yield inflated false positive rates under skewed distributions, even with relatively large sample sizes, whereas the IT avoids such inflation. A complementary empirical study based on gender differences in two psychological scales corroborates these findings. Finally, we contend that the null hypothesis of unequal variances together with equal means lacks plausibility, and that empirically, a difference in means typically coincides with differences in variance and skewness. An additional analysis using the Kolmogorov-Smirnov and Anderson-Darling tests demonstrates that examining entire distributions, rather than just their means, can provide a more suitable alternative when facing unequal variances or skewed distributions. Given these results, researchers should remain cautious with software defaults, such as R favoring Welch’s test.
Not seeing a result you expected?
Learn how you can add new datasets to our index.

Facebook

Twitter

Click to copy link

Link copied

Cite

Nicola Loperfido; Thaddeus Tarpey (2023). Dataset for: Some Remarks on the R2 for Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.6124508.v1

Dataset for: Some Remarks on the R2 for Clustering

Explore at:

txtAvailable download formats

Unique identifier

https://doi.org/10.6084/m9.figshare.6124508.v1

Dataset updated

Jun 1, 2023

Dataset provided by

Wiley

Authors

Nicola Loperfido; Thaddeus Tarpey

License

CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically

Description

A common descriptive statistic in cluster analysis is the $R^2$ that measures the overall proportion of variance explained by the cluster means. This note highlights properties of the $R^2$ for clustering. In particular, we show that generally the $R^2$ can be artificially inflated by linearly transforming the data by ``stretching'' and by projecting. Also, the $R^2$ for clustering will often be a poor measure of clustering quality in high-dimensional settings. We also investigate the $R^2$ for clustering for misspecified models. Several simulation illustrations are provided highlighting weaknesses in the clustering $R^2$, especially in high-dimensional settings. A functional data example is given showing how that $R^2$ for clustering can vary dramatically depending on how the curves are estimated.

Clear search

Close search

Google apps

Main menu

Dataset for: Some Remarks on the R2 for Clustering

Results and analysis using the Lean Six-Sigma define, measure, analyze,...

Supplementary data for the paper "Why psychologists should not default to...

Dataset for: Some Remarks on the R2 for Clustering