8 datasets found
  1. Additional file 2 of Thresher: determining the number of clusters while...

    • springernature.figshare.com
    zip
    Updated Jun 3, 2023
    Cite
    Min Wang; Zachary B. Abrams; Steven M. Kornblau; Kevin R. Coombes (2023). Additional file 2 of Thresher: determining the number of clusters while removing outliers [Dataset]. http://doi.org/10.6084/m9.figshare.5768622.v1
    Available download formats: zip
    Dataset updated
    Jun 3, 2023
    Dataset provided by
    figshare
    Authors
    Min Wang; Zachary B. Abrams; Steven M. Kornblau; Kevin R. Coombes
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    R Code for Analyses. This is a zip file containing all of the R code used to perform simulations and to analyze the breast cancer data. (ZIP 407 kb)

  2. Blind method for discovering number of clusters in multidimensional datasets...

    • plos.figshare.com
    docx
    Updated Jun 4, 2023
    Cite
    Osbert C. Zalay (2023). Blind method for discovering number of clusters in multidimensional datasets by regression on linkage hierarchies generated from random data [Dataset]. http://doi.org/10.1371/journal.pone.0227788
    Available download formats: docx
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Osbert C. Zalay
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.

  3. Data from: Dynamic Tensor Clustering

    • tandf.figshare.com
    pdf
    Updated Oct 10, 2024
    Cite
    Will Wei Sun; Lexin Li (2024). Dynamic Tensor Clustering [Dataset]. http://doi.org/10.6084/m9.figshare.7433114.v3
    Available download formats: pdf
    Dataset updated
    Oct 10, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Will Wei Sun; Lexin Li
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data, or are inapplicable to a general-order tensor. There is also a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we propose a new dynamic tensor clustering method that works for a general-order dynamic tensor, and enjoys both strong statistical guarantee and high computational efficiency. Our proposal is based on a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. Theoretically, we first establish a nonasymptotic error bound for the estimator from the structured tensor factorization. Built upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with high probability. Moreover, our proposed method can be naturally extended to co-clustering of multiple modes of the tensor data. The efficacy of our method is illustrated through simulations and a brain dynamic functional connectivity analysis from an autism spectrum disorder study. Supplementary materials for this article are available online.

  4. Nepal Multiple Indicator Cluster Survey 2010 - Nepal

    • microdata.nsonepal.gov.np
    Updated Mar 2, 2016
    Cite
    Central Bureau of Statistics (2016). Nepal Multiple Indicator Cluster Survey 2010 - Nepal [Dataset]. https://microdata.nsonepal.gov.np/index.php/catalog/36
    Dataset updated
    Mar 2, 2016
    Dataset authored and provided by
    Central Bureau of Statistics http://cbs.gov.np/
    Time period covered
    2010
    Area covered
    Nepal
    Description

    Abstract

    The Nepal Multiple Indicator Cluster Survey (NMICS) was conducted in 2010 by the Central Bureau of Statistics (CBS) with the primary objective of filling the data gap on children and women that existed particularly in the Mid-Western and Far-Western regions of Nepal. The NMICS 2010 was implemented as part of the fourth round of the global MICS household survey programme with technical and financial support from UNICEF Nepal.

    NMICS 2010 has generated a wealth of information on children and women, which is of immense importance for monitoring and evaluating plans and programmes related to children and women in these regions. These data will help to monitor progress towards the goals and targets of international agreements such as the Millennium Development Goals (MDGs) and World Fit for Children (WFFC). The NMICS 2010 covers topics related to child health, water and sanitation, reproductive health, child development, education and literacy, child protection, HIV and AIDS, mass media and use of information and communication technology, attitudes towards domestic violence, tobacco and alcohol use, and life satisfaction.

    Geographic coverage

    National; Mid-Western and Far-Western regions; urban and rural areas

    Analysis unit

    household, woman aged 15-49 years, child aged 0-4 years

    Universe

    The survey covered all de jure household members (usual residents), all women aged 15-49 years resident in the household, and all children aged 0-4 years (under age 5) resident in the household.

    Kind of data

    Sample survey data [ssd]

    Sampling procedure

    The sample for the Nepal Multiple Indicator Cluster Survey (MICS) was designed to provide estimates for a large number of indicators on the situation of children and women at the national level, for urban and rural areas, and for six domains in the Mid- and Far-Western regions:

    a. Mid-Western Mountains
    b. Mid-Western Hills
    c. Mid-Western Terai
    d. Far-Western Mountains
    e. Far-Western Hills
    f. Far-Western Terai

    The urban and rural areas within each region were identified as the main sampling strata and the sample was selected in two stages. Within each domain, 40 clusters (wards) were selected systematically with probability proportional to size, to yield a total of 240 wards. After a household listing was carried out within the selected wards, a systematic sample of 25 households was drawn from each ward. Smaller wards, where the total number of households was less than 25, were grouped with adjoining wards to bring the number of households to at least 25. Two adjoining wards were grouped together in nine clusters: one rural cluster each in Achham, Dolpa and Kailali and two rural clusters each in Baitadi, Bajhang and Humla.

    Similarly, in the case of large wards, especially in urban areas or municipalities, census enumeration blocks were used. Enumeration blocks were created by the GIS section at CBS, which segmented large wards for the purposes of the 2011 population census. Out of 50 urban clusters, enumeration blocks were used in 22 clusters of large urban municipalities in the five districts of Banke, Dang, Kailali, Kanchanpur and Surkhet. Thus a total of 5,998 households were selected for the interviewing process, out of which 1,250 represented the urban areas (Municipalities) and the remaining 4,750 represented the rural areas (Village Development Committees or VDCs). The sample was stratified by region and is not self-weighting. However, sample weights were applied in the reporting of sub-regional results.

    In the actual survey, 5,899 households were successfully interviewed. Within those households, interviews were completed for 7,372 women aged 15-49 years and for 3,574 children aged 0-4 years.
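
The two-stage selection described in this section (wards drawn systematically with probability proportional to size, then a systematic sample of 25 households per ward) can be illustrated with a minimal Python sketch. This is not the procedure or software CBS used: the function names, the frame of 300 wards, the ward sizes and the seed are all hypothetical, and a single domain of 40 wards is shown.

```python
import random


def systematic_pps(sizes, n, rng):
    """Systematic selection of n units with probability proportional to size.

    sizes: positive size measures, e.g. household counts per ward.
    Returns the indices of the selected units. A unit larger than the
    sampling interval could be hit more than once.
    """
    interval = sum(sizes) / n
    start = rng.random() * interval  # random start in [0, interval)
    selected, cumulative, idx = [], 0.0, 0
    for k in range(n):
        target = start + k * interval
        # Advance to the unit whose cumulative size range contains the target.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= target:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


def systematic_sample(n_listed, n, rng):
    """Systematic sample of n positions from a listing of n_listed households."""
    interval = n_listed / n
    start = rng.random() * interval
    return [min(int(start + k * interval), n_listed - 1) for k in range(n)]


rng = random.Random(2010)  # fixed seed, for illustration only
ward_sizes = [rng.randint(30, 400) for _ in range(300)]  # hypothetical frame
wards = systematic_pps(ward_sizes, 40, rng)  # stage 1: 40 wards in one domain
households = {w: systematic_sample(ward_sizes[w], 25, rng) for w in wards}  # stage 2
print(len(wards), sum(len(h) for h in households.values()))
```

Repeating the first stage for each of the six domains would give the 240 wards mentioned above. The sketch assumes every ward lists at least 25 households, which the survey ensured by grouping small wards with adjoining ones.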

    Mode of data collection

    Face-to-face [f2f]

    Research instrument

    The standard MICS4 questionnaires have been adapted to include several country-specific modules as well as all the standard questions from the questionnaire modules. Three sets of questionnaires were used in the survey:

    1) A household questionnaire, which was used to collect information on all de jure household members (usual residents), the household, and the dwelling;
    2) A women's questionnaire, administered in each household to all women aged 15-49 years; and
    3) An under-5 questionnaire, administered to mothers or caretakers for all children under 5 living in the household.

    The household questionnaire included the following modules:
    - Household Listing Form
    - Education
    - Water and Sanitation
    - Household Characteristics
    - Child Labour
    - De-worming (Nepal Specific Module)
    - Child Discipline
    - Handwashing
    - Salt Iodization

    The Questionnaire for Individual Women was administered to all women aged 15-49 years living in the households, and included the following modules:
    - Women's Background
    - Access to Mass Media and Use of Information Communication Technology
    - Desire for Last Birth
    - Maternal and Newborn Health
    - Illness Symptoms
    - Contraception
    - Unmet Need
    - Attitudes towards Domestic Violence
    - Marriage/Union
    - HIV/AIDS
    - Tobacco and Alcohol Use
    - Life Satisfaction

    The Questionnaire for Children under five was administered to mothers or caretakers of children under 5 years of age living in the households. Normally, the questionnaire was administered to mothers of under-5 children; in cases when the mother was not listed in the household roster, a primary caretaker for the child was identified and interviewed. The questionnaire included the following modules:
    - Age
    - Birth Registration
    - Early Childhood Development
    - Breastfeeding
    - Care of Illness
    - Malaria
    - Immunization
    - Child Grant (Nepal Specific Module)

    The questionnaires are based on the MICS4 model questionnaire. From the MICS4 model English version, the questionnaires were translated into Nepali and two other local dialects, Tharu and Awadhi, which are spoken in the Terai region, and were pre-tested in Jumla (Mountain/Rural), Salyan (Hill/Rural) and Banke (Terai/Urban) during July 2010. Based on the results of the pre-test, modifications were made to the wording and translation of the questionnaires. However, owing to the sensitivity of language issues in the country's present transitional political context, and the implications for other surveys to be conducted by CBS, the Nepali questionnaire was used to record the data.

    In addition to the administration of questionnaires, fieldwork teams tested the salt used for cooking in the households for iodine content and observed the place for handwashing.

    Cleaning operations

    In order to ensure quality control, all questionnaires were double entered and internal consistency checks were performed. Procedures and standard programs developed under the global MICS4 programme and adapted to the Nepal questionnaire were used throughout.

    Response rate

    Of the 6,000 households selected for the sample, 5,917 were found to be occupied. Of these, 5,899 were successfully interviewed, for a household response rate of 99.7 percent. In the interviewed households, 7,674 women (aged 15-49 years) were identified. Of these, 7,372 were successfully interviewed, yielding a response rate of 96.1 percent within interviewed households. In addition, 3,688 children under age five were listed in the household questionnaire. Questionnaires were completed for 3,574 of these children, which corresponds to a response rate of 96.9 percent within interviewed households. Overall response rates of 95.8 and 96.6 percent are calculated for the women's and under-5's interviews respectively.

    The overall response rates for women and children under five are slightly higher in rural areas than in urban areas, but the household response rate was the same in both. The response rates for households, women and under-five children were similar (above 95 percent) across all regions.
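
The response rates quoted in this section can be re-derived from the counts given. The short Python check below assumes that each overall rate is the household response rate multiplied by the corresponding individual response rate; the record does not state that formula, but it reproduces the reported 95.8 and 96.6 percent. The function and variable names are illustrative.

```python
def rate(completed, eligible):
    """Response rate as a fraction of eligible units."""
    return completed / eligible


household = rate(5899, 5917)   # interviewed / occupied households
women = rate(7372, 7674)       # interviewed / identified women aged 15-49
children = rate(3574, 3688)    # completed / listed children under five

# Assumed definition: overall rate = household rate x individual rate.
overall_women = household * women
overall_children = household * children

for label, value in [
    ("household", household),
    ("women", women),
    ("children under five", children),
    ("overall, women", overall_women),
    ("overall, children under five", overall_children),
]:
    print(f"{label}: {100 * value:.1f}%")
```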

    Sampling error estimates

    The sample of respondents selected in the Nepal Multiple Indicator Cluster Survey is only one of the samples that could have been selected from the same population, using the same design and size. Each of these samples would yield results that differ somewhat from the results of the actual sample selected. Sampling errors are a measure of the variability between the estimates from all possible samples. The extent of variability is not known exactly, but can be estimated statistically from the survey data.

    The following sampling error measures were computed for selected indicators:

    1. Standard error (se): Sampling errors are usually measured in terms of standard errors for particular indicators (means, proportions etc.). Standard error is the square root of the variance of the estimate. The Taylor linearization method is used for the estimation of standard errors.

    2. Coefficient of variation (se/r) is the ratio of the standard error to the value of the indicator, and is a measure of the relative sampling error.

    3. Design effect (deff) is the ratio of the actual variance of an indicator, under the sampling method used in the survey, to the variance calculated under the assumption of simple random sampling. The square root of the design effect (deft) is used to show the efficiency of the sample design in relation to the precision. A deft value of 1.0 indicates that the sample design is as efficient as a simple random sample, while a deft value above 1.0 indicates the increase in the standard error due to the use of a more complex sample design.

    4. Confidence limits are calculated to show the interval within which the true value for the population can be reasonably assumed to fall, with a specified level of confidence. For any given statistic calculated from the survey, the value of that statistic will fall within a range of plus or minus two times the standard
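
The relationships among the four measures listed above can be made concrete with a small Python sketch. It takes the design-based standard error as given (it does not implement Taylor linearization), uses r(1 - r)/n as a simplified simple-random-sampling variance for a proportion, and reads the confidence limits as the estimate plus or minus two standard errors. The input values are hypothetical and do not come from the survey.

```python
import math


def sampling_error_measures(r, se, n):
    """Derive relative error, design effect and confidence limits.

    r:  estimated proportion for an indicator
    se: design-based standard error of r (assumed already computed)
    n:  number of sample cases behind the estimate
    """
    srs_variance = r * (1 - r) / n   # simplified SRS variance of a proportion
    deff = se ** 2 / srs_variance    # actual variance / SRS variance
    return {
        "cv": se / r,                # coefficient of variation (se/r)
        "deff": deff,
        "deft": math.sqrt(deff),     # square root of the design effect
        "lower": r - 2 * se,         # assumed limits: r +/- 2 * se
        "upper": r + 2 * se,
    }


# Hypothetical indicator: 50% prevalence, se of 2 points, 1,000 cases.
measures = sampling_error_measures(r=0.5, se=0.02, n=1000)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

A deft above 1.0 here means the standard error is larger than a simple random sample of the same size would give, which is the interpretation offered in item 3.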

  5. Subject exclusion criteria.

    • plos.figshare.com
    • figshare.com
    xls
    Updated May 30, 2023
    Cite
    Samuel G. Thorpe; Corey M. Thibeault; Nicolas Canac; Kian Jalaleddini; Amber Dorn; Seth J. Wilk; Thomas Devlin; Fabien Scalzo; Robert B. Hamilton (2023). Subject exclusion criteria. [Dataset]. http://doi.org/10.1371/journal.pone.0228642.t001
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS http://plos.org/
    Authors
    Samuel G. Thorpe; Corey M. Thibeault; Nicolas Canac; Kian Jalaleddini; Amber Dorn; Seth J. Wilk; Thomas Devlin; Fabien Scalzo; Robert B. Hamilton
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Subject exclusion criteria.

  6. Appendix 1. Statistical Descriptive: Table 1.8 Sampling Data between...

    • figshare.com
    png
    Updated May 15, 2025
    Cite
    Muhammad Andi Abdillah Triono (2025). Appendix 1. Statistical Descriptive: Table 1.8 Sampling Data between Districts and Gender [Dataset]. http://doi.org/10.6084/m9.figshare.28905209.v2
    Available download formats: png
    Dataset updated
    May 15, 2025
    Dataset provided by
    figshare
    Authors
    Muhammad Andi Abdillah Triono
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Disclaimer: The raw data of this crosstabulation originates from the 2023 Medan City MSME survey undertaken by the Medan City administration. All MSMEs in this study possess a business registration or operating permit in Medan City. The researcher has received authorization from the Medan City Government to analyze and publish this data by permit number 000.9/1826 in 2023.

    Interpretation: This table presents a demographic breakdown of sampled entrepreneurial individuals across 21 districts in Medan City. The table categorises participants or business owners by gender (men and women) and provides a total count for each district.

    Key Insights:

    Overall Gender Distribution:
    - Women (962 individuals) make up 67.8% of the sample.
    - Men (458 individuals) account for 32.2%.
    - The total sample size is 1,420 individuals.

    Districts with the Highest Women's Representation:
    - Medan Helvetia: 143 women (77.3%).
    - Medan Sunggal: 92 women (63.0%).
    - Medan Barat: 56 women (73.7%).

    Districts with the Highest Men's Representation:
    - Medan Sunggal: 54 men (37.0%).
    - Medan Denai: 42 men (40.4%).
    - Medan Helvetia: 42 men (22.7%).

    District with Lowest Sample Size:
    - Medan Maimun: Only 20 individuals sampled (7 men and 13 women).

    Potential Research Implications:
    - Gender-based entrepreneurial activity: Certain districts may have higher concentrations of women entrepreneurs, which can influence policy or support programs.
    - Regional demographic disparities: The variation in sample sizes across districts may reflect differences in population density or business registration patterns.
    - Further statistical modelling: Logistic regression or clustering analysis could reveal more profound insights into gender distribution patterns.

  7. Comparison of the application of SHAP values and dendrograms in machine...

    • figshare.com
    xls
    Updated Jul 20, 2023
    Cite
    Alexander A. Huang; Samuel Y. Huang (2023). Comparison of the application of SHAP values and dendrograms in machine learning models. [Dataset]. http://doi.org/10.1371/journal.pone.0288819.t001
    Available download formats: xls
    Dataset updated
    Jul 20, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Alexander A. Huang; Samuel Y. Huang
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison of the application of SHAP values and dendrograms in machine learning models.

  8. Clustering results based on ward’s method.

    • figshare.com
    bin
    Updated Sep 18, 2023
    Cite
    Karine Torosyan; Sicheng Wang; Elizabeth A. Mack; Jenna A. Van Fossen; Nathan Baker (2023). Clustering results based on ward’s method. [Dataset]. http://doi.org/10.1371/journal.pone.0291428.t003
    Available download formats: bin
    Dataset updated
    Sep 18, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Karine Torosyan; Sicheng Wang; Elizabeth A. Mack; Jenna A. Van Fossen; Nathan Baker
    License

    Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Background: The fast-changing labor market highlights the need for an in-depth understanding of occupational mobility impacted by technological change. However, we lack a multidimensional classification scheme that considers similarities of occupations comprehensively, which prevents us from predicting employment trends and mobility across occupations. This study fills the gap by examining employment trends based on similarities between occupations.

    Method: We first demonstrated a new method that clusters 756 occupation titles based on knowledge, skills, abilities, education, experience, training, activities, values, and interests. We used the Principal Component Analysis to categorize occupations in the Standard Occupational Classification, which is grouped into a four-level hierarchy. Then, we paired the occupation clusters with the occupational employment projections provided by the U.S. Bureau of Labor Statistics. We analyzed how employment would change and what factors affect the employment changes within occupation groups. Particularly, we specified factors related to technological changes.

    Results: The results reveal that technological change accounts for significant job losses in some clusters. This poses occupational mobility challenges for workers in these jobs at present. Job losses for nearly 60% of current employment will occur in low-skill, low-wage occupational groups. Meanwhile, many mid-skilled and highly skilled jobs are projected to grow in the next ten years.

    Conclusion: Our results demonstrate the utility of our occupational classification scheme. Furthermore, it suggests a critical need for skills upgrading and workforce development for workers in declining jobs. Special attention should be paid to vulnerable workers, such as older individuals and minorities.
