Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order P·N(N−1)/2. To circumvent this problem, the clustering technique is typically applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data-nugget-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
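To illustrate the downstream step only (a minimal sketch, not the authors' data-nugget construction; the variable names are hypothetical and the weighted-PCA route via a weighted covariance matrix is one standard choice), weighted K-means and weighted PCA can be run directly on nugget centers and weights:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = rng.normal(size=(500, 5))        # nugget centers (one row per nugget)
weights = rng.integers(1, 200, size=500)   # number of raw observations per nugget

# Weighted K-means: scikit-learn accepts per-sample weights directly.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(centers, sample_weight=weights)

# Weighted PCA via the weighted covariance matrix of the nugget centers.
w = weights / weights.sum()
mu = w @ centers
cov = (centers - mu).T @ ((centers - mu) * w[:, None])
eigvals, eigvecs = np.linalg.eigh(cov)
explained = eigvals[::-1] / eigvals.sum()           # variance ratio, descending
scores = (centers - mu) @ eigvecs[:, ::-1][:, :2]   # first two principal components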
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180GB of raw data.
Background
Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which is used to run an assortment of physics analysis and simulation jobs: analysis workloads leverage data generated from the laboratory's electron accelerator, while simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff.
The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network have the capability to detect anomalies that may propagate to compute nodes executing the same job.
Usage Notes
While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the set of affected nodes and the exact start and end times of the abnormal behavior are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster.
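As one possible starting point (a hedged sketch, not part of the dataset or its documentation; the file name, column layout, and the choice of IsolationForest are assumptions), the per-node metrics could be screened with an unsupervised detector trained on the normal window:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Hypothetical layout: one row per (timestamp, node), one column per scraped metric.
df = pd.read_csv("jlab_node_metrics.csv", parse_dates=["timestamp"])
metric_cols = [c for c in df.columns if c not in ("timestamp", "node")]

X = StandardScaler().fit_transform(df[metric_cols].fillna(0.0))

# Fit on the May 19-22 window (assumed normal), then score the May 23 window.
normal = df["timestamp"] < "2023-05-23"
iso = IsolationForest(contamination="auto", random_state=0).fit(X[normal.to_numpy()])
df["anomaly_score"] = -iso.score_samples(X)    # higher = more anomalous
print(df.loc[~normal].nlargest(10, "anomaly_score")[["timestamp", "node", "anomaly_score"]])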
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Clustering using distance normally requires all-against-all matching. The new algorithm can cluster 7 million proteins in under one hour using approximate clustering.
cat: contains the hierarchical sequence. protein_names: list of proteins in the group. Original data can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz
Researchers can use the data to find relationships between proteins more easily.
The data set has two files. The protein_groupings file contains the clustered data; this file has only names. Sequences for the names can be found in the protein_name_letter file.
The data were downloaded from the NCBI site and the FASTA format was converted into full-length sequences. The sequences were then fed into the clustering algorithm.
Because this is hierarchical clustering, the relationship between sequences can be found by comparing the values in gn_list.
All groups start with cluster_id:0, split:0 and progress into matched splits. The difference between splits indicates how closely two sequences match. Comparing the cluster_id shows whether two sequences belong to the same group or different groups.
cluster_id = unique id for the cluster. split = approximate similarity between the sequences; this is an absolute value, so 63 means 63 letters match between the sequences (the higher the value, the more similar). inner_cluster_id = unique id to compare inner-cluster matches. total clusters = number of clusters after the approximate match is generated.
Due to space restrictions in Kaggle, this data set has only 9093 groups containing 129696 sequences.
One sequence may be in more than one cluster because similarity is calculated as if an all-against-all comparison were used.
Ex: For A, B, C, if A~B = 50, B~C = 50 and A~C = 0, then the clustering will have two groups: [A, B] and [B, C].
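A minimal sketch of how the protein_groupings file might be queried (the file format, delimiter, and column handling are assumptions; adjust to the actual layout):

import pandas as pd

# Hypothetical reading of the protein_groupings file.
groups = pd.read_csv("protein_groupings.csv")

def same_group(name_a, name_b):
    """Check whether two proteins share a cluster_id, i.e. fall in the same group."""
    a = groups[groups["protein_names"].str.contains(name_a, na=False)]
    b = groups[groups["protein_names"].str.contains(name_b, na=False)]
    shared = set(a["cluster_id"]) & set(b["cluster_id"])
    return bool(shared), shared

# The 'split' value is an absolute count of matching letters, so comparing splits
# along the hierarchy gives a rough similarity between two sequences.
in_same, ids = same_group("protein_A", "protein_B")
print(in_same, ids)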
If you need full dataset for your research, contact me.
The previous dataset had issues with similarity comparisons between clusters; inner-cluster comparisons worked. This is fixed in the new version.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is a fundamental tool in data mining, widely used in various fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density, and this parameter varies across datasets, which requires manual tuning and affects the algorithm’s performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making the algorithm dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering (DPC) method, which automatically adjusts parameters like cutoff distance and the number of clusters, based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
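For reference, a minimal sketch of the classic DPC quantities (local density rho and distance delta under a fixed cutoff distance), not the adaptive Delaunay-based variant proposed here:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_scores(X, dc):
    """Classic Density Peak Clustering quantities:
    rho_i  = local density, i.e. number of points within cutoff distance dc;
    delta_i = distance to the nearest point of higher density."""
    D = squareform(pdist(X))
    rho = (D < dc).sum(axis=1) - 1          # exclude the point itself
    delta = np.zeros(len(X))
    order = np.argsort(-rho)                # indices by decreasing density
    delta[order[0]] = D[order[0]].max()     # densest point: use its maximum distance
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = D[i, order[:rank]].min() # nearest higher-density neighbour
    return rho, delta                       # cluster centres: large rho AND large delta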
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code needed to recreate all analyses and figures presented in the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets'.
'all_data_for_paper.nc': model data, 2000-2004 mean of all variables used, provided at all depth levels.
'mesh_mask.nc': domain and depth data file to be used alongside model data.
Tool to classify marine biogeochemical output from numerical models
Written by rmi, dapa & dmof
preprocess_amm7_functions.py
Functions needed to run different preprocessing scripts.
preprocess_all_depths.py
First script to run. Extracts relevant variables and takes the temporal mean for physical, biogeochemical and ecological variables. For physical variables, calculates PAR from qsr.
preprocess_amm7_mean.py
Use for surface biogeochemical and ecological sets (faster)
preprocess_DI_DA.py
Use for depth integrated, depth averaged and bottom biogeochemical and ecological sets. Can use for surface but slower.
preprocess_amm7_mean_one_depth.py
Extracts data at specified depth (numeric). Works for biogeochemical and ecological variables.
preprocess_physics.py
Takes all_depths_physics and calculates physics data at different depths.
silhouette_nvars.py
Calculates silhouette score for inputs with different numbers of variables and clusters
rand_index.py
rand_index_depth.py
remove_one_var.py
Calculates rand index between cluster sets with one variable removed and original set
Modelviz.py
Contains functions for applying clustering to data
kmeans-paper-plots.ipynb
Produces figure 4
kmeans-paper-plots-illustrate-normalisation.ipynb
Produces figure 2
kmeans-paper-plots-depths.ipynb
Produces figures 5-7
plot_silhouette.ipynb
Produces figure 3
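A rough illustration of how the two data files might be combined with K-means outside the provided scripts (the variable selection, the depth dimension name "deptht", and the cluster count are assumptions; the actual workflow lives in the preprocessing scripts and Modelviz.py):

import xarray as xr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 2000-2004 mean model fields at all depth levels (see file description above).
ds = xr.open_dataset("all_data_for_paper.nc")

# Hypothetical choice: take the first four data variables at the top depth level
# and stack the grid cells into samples.
variables = list(ds.data_vars)[:4]
surface = ds[variables].isel(deptht=0)
table = surface.to_dataframe()[variables].dropna()

X = StandardScaler().fit_transform(table.values)
table["cluster"] = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
print(table["cluster"].value_counts())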
This data is from Gaia Data Release 3 (DR3) and includes data on two star clusters: NGC 188 and M67. The data is used in my astronomy class, wherein students are tasked with determining which star cluster is older. (Update, 12-Sep-2023: I'm hoping to add a ML version of the data set that includes more field stars and divides the data into test and train sets. TBA.)
NGC 188 and M67 stars are provided as separate csv files, with each row corresponding to a star. There are two versions for each star cluster:
For more on these quantities, please see https://gea.esac.esa.int/archive/documentation/GDR3/Gaia_archive/chap_datamodel/sec_dm_main_source_catalogue/ssec_dm_gaia_source.html
SELECT gaia_source.parallax,gaia_source.phot_g_mean_mag,gaia_source.bp_rp,gaia_source.pmra,gaia_source.pmdec
FROM gaiadr3.gaia_source
WHERE
gaia_source.l BETWEEN 215 AND 216 AND
gaia_source.b BETWEEN 31.5 AND 32.5 AND
gaia_source.phot_g_mean_mag < 18 AND
gaia_source.parallax_over_error > 4 AND
gaia_source.bp_rp IS NOT NULL
SELECT gaia_source.parallax,gaia_source.phot_g_mean_mag,gaia_source.bp_rp,gaia_source.pmra,gaia_source.pmdec
FROM gaiadr3.gaia_source
WHERE
gaia_source.l BETWEEN 122 AND 123.5 AND
gaia_source.b BETWEEN 21.5 AND 23 AND
gaia_source.phot_g_mean_mag < 18 AND
gaia_source.parallax_over_error > 4 AND
gaia_source.bp_rp IS NOT NULL
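For the classroom comparison, a minimal plotting sketch (the CSV file names are assumptions) that draws a colour-magnitude diagram from the queried columns for each cluster:

import pandas as pd
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, name in zip(axes, ["M67.csv", "NGC188.csv"]):   # assumed file names
    stars = pd.read_csv(name)
    # Colour-magnitude diagram: BP-RP colour vs apparent G magnitude;
    # the older cluster shows an earlier main-sequence turn-off.
    ax.scatter(stars["bp_rp"], stars["phot_g_mean_mag"], s=2)
    ax.invert_yaxis()                                    # brighter stars at the top
    ax.set_xlabel("BP - RP")
    ax.set_ylabel("G magnitude")
    ax.set_title(name.removesuffix(".csv"))
plt.show()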
Please see Gaia Archive's how to cite page for information regarding the use of the data.
The classroom activity and my code are free to use under an MIT License.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of the accompanying categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. In our work we focus on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.
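As a toy illustration of the idea (not the paper's specific interestingness measures), a cluster's categorical metadata histogram can be scored by how strongly it deviates from a uniform distribution and the clusters ranked accordingly:

import numpy as np
import pandas as pd

def interestingness(categories):
    """Score a cluster's categorical metadata by how far its category histogram
    deviates from uniform (1 = one dominant category, 0 = perfectly uniform)."""
    p = pd.Series(categories).value_counts(normalize=True).to_numpy()
    entropy = -(p * np.log(p)).sum()
    return 1.0 - entropy / np.log(len(p)) if len(p) > 1 else 1.0

# Rank clusters by the skewness of their metadata distribution (toy example data).
df = pd.DataFrame({"cluster": [0, 0, 0, 1, 1, 1], "station": ["A", "A", "A", "A", "B", "C"]})
scores = df.groupby("cluster")["station"].apply(interestingness).sort_values(ascending=False)
print(scores)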
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One table and six figures.
Table 1 shows the number of images for each label in the 1μ–2μ data set, adopting the same labelling used in [11, 12, 13], reported here for completeness: 0 = Porous sponges, 1 = Patterned surfaces, 2 = Particles, 3 = Films and coated surfaces, 4 = Powders, 5 = Tips, 6 = Nanowires, 7 = Biological, 8 = MEMS devices and electrodes, 9 = Fibres.
Figure 1 shows test accuracy as a function of the number of training epochs obtained by training from scratch Inception-v3 (magenta), Inception-v4 (orange), Inception-Resnet (green), and AlexNet (black) on the SEM data set. All the models were trained with the best combination of hyperparameters, according to the memory capability of the available hardware.
Figure 2, Main: test accuracy as a function of the number of training epochs obtained when fine-tuning on the SEM data set Inception-v3 (magenta) and Inception-v4 (orange) starting from the ImageNet checkpoint, and Inception-v3 (blue) from the SEM checkpoint that, as expected, converges very rapidly. Inset: test accuracy as a function of the number of training epochs obtained when performing feature extraction with Inception-v3 (magenta), Inception-v4 (orange), and Inception-Resnet (green) on the SEM data set starting from the ImageNet checkpoint. All the models were trained with the best combination of hyperparameters, according to the memory capability of the hardware available.
Figure 3 shows the intrinsic dimension of the 1μ–2μ_1001 data set, varying the sample size, computed before autoencoding (green lines) and after autoencoding (red lines). The three brightness levels for each color correspond to the percentage of points used in the linear fit: 90%, 70%, and 50%.
Figure 4 shows the ddisc heatmap for a manually labelled subset of images.
Figure 5 presents heatmaps of the distances obtained via Inception-v3. The image captions specify the methods used and indicate the correlation index with ddisc.
Figure 6 shows NMI scores of the clustering obtained by the five hierarchical algorithms (solid lines) considered as a function of k, the number of clusters. The scores of the artificial scenarios are reported as orange (good case) and green (uniform case) dashed lines.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes: 1. online store customer behavior data (clickstream) from 1.04.-30.11.2023, used to cluster customers and evaluate the effectiveness of implemented modifications (catalog: learning-dataset); 2. clustering results used to verify the effectiveness of implemented changes (catalog: clustering); 3. detailed data for the calculation of macro-conversion indicators (catalog: macro-conversion-indicators); 4. detailed data for the calculation of micro-conversion indicators (catalog: micro-conversion-indicators).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
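The following is only a toy illustration of the general idea (not the paper's implementation; the Ward linkage, the use of the top normalized merge heights as features, and the random-forest regressor are all assumed choices): fit a regressor on linkage-derived features from synthetic data with known cluster counts, then apply it to new data.

import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestRegressor

def linkage_features(X, n_feats=20):
    # Hierarchical (Ward) linkage; keep only the relative ranking of the largest
    # merge heights, which does not depend on the data's scale.
    Z = linkage(X, method="ward")
    top = np.sort(Z[:, 2])[::-1][:n_feats]
    return top / top[0]

# Train purely on synthetic data with known cluster counts (no empirical data needed).
rng = np.random.default_rng(0)
feats, targets = [], []
for _ in range(300):
    k = int(rng.integers(1, 11))
    X, _ = make_blobs(n_samples=300, centers=k, n_features=4,
                      cluster_std=float(rng.uniform(0.5, 2.0)),
                      random_state=int(rng.integers(1_000_000)))
    feats.append(linkage_features(X))
    targets.append(k)

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(feats, targets)

# Infer the cluster number of an unseen dataset (here, 5 true clusters).
X_new, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=42)
print(round(float(reg.predict([linkage_features(X_new)])[0])))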
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level.
The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
Sample survey data [ssd]
The Household Expenditure and Income survey sample for 2010, was designed to serve the basic objectives of the survey through providing a relatively large sample in each sub-district to enable drawing a poverty map in Jordan. The General Census of Population and Housing in 2004 provided a detailed framework for housing and households for different administrative levels in the country. Jordan is administratively divided into 12 governorates, each governorate is composed of a number of districts, each district (Liwa) includes one or more sub-district (Qada). In each sub-district, there are a number of communities (cities and villages). Each community was divided into a number of blocks. Where in each block, the number of houses ranged between 60 and 100 houses. Nomads, persons living in collective dwellings such as hotels, hospitals and prison were excluded from the survey framework.
A two stage stratified cluster sampling technique was used. In the first stage, a cluster sample proportional to the size was uniformly selected, where the number of households in each cluster was considered the weight of the cluster. At the second stage, a sample of 8 households was selected from each cluster, in addition to another 4 households selected as a backup for the basic sample, using a systematic sampling technique. Those 4 households were sampled to be used during the first visit to the block in case the visit to the original household selected is not possible for any reason. For the purposes of this survey, each sub-district was considered a separate stratum to ensure the possibility of producing results on the sub-district level. In this respect, the survey framework adopted that provided by the General Census of Population and Housing Census in dividing the sample strata. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable provided in the Household Expenditure and Income Survey for the year 2008 was calculated for each sub-district. These results were used to estimate the sample size on the sub-district level so that the coefficient of variation for the expenditure variable in each sub-district is less than 10%, at a minimum, of the number of clusters in the same sub-district (6 clusters). This is to ensure adequate presentation of clusters in different administrative areas to enable drawing an indicative poverty map.
It should be noted that, in addition to the standard non-response rate assumed, higher rates were expected in areas where poor households are concentrated in major cities. Therefore, these were taken into consideration during the sampling design phase, and a higher number of households was selected from those areas, aiming to cover well all regions where poverty is widespread.
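To make the two-stage design concrete, here is a schematic sketch (illustrative only; the helper names and example numbers are assumptions, not the Statistical Office's actual selection procedure) of systematic PPS selection of clusters followed by a systematic sample of 8 main and 4 backup households:

import numpy as np

rng = np.random.default_rng(2010)

def pps_systematic(cluster_sizes, n_clusters):
    """First stage: select clusters with probability proportional to size
    (number of households), using systematic PPS sampling."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    cum = np.cumsum(sizes)
    step = cum[-1] / n_clusters
    start = rng.uniform(0, step)
    points = start + step * np.arange(n_clusters)
    return np.searchsorted(cum, points)          # indices of selected clusters

def systematic_households(n_households, n_main=8, n_backup=4):
    """Second stage: systematic sample of 8 main plus 4 backup households."""
    n_total = n_main + n_backup
    step = n_households / n_total
    start = rng.uniform(0, step)
    picks = (start + step * np.arange(n_total)).astype(int) % n_households
    return picks[:n_main], picks[n_main:]

# Example: one stratum (sub-district) with 40 blocks of 60-100 households each.
block_sizes = rng.integers(60, 101, size=40)
selected_blocks = pps_systematic(block_sizes, n_clusters=6)
main, backup = systematic_households(block_sizes[selected_blocks[0]])
print(selected_blocks, main, backup)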
Face-to-face [f2f]
Raw Data:
- Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to different rounds throughout the year. A registry was prepared to indicate different stages of the process of data checking, coding and entry till forms were back to the archive system.
- Data office checking: This phase was achieved concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
- Data coding: A team was trained to work on the data coding phase, which in this survey is only limited to education specialization, profession and economic activity. In this respect, international classifications were used, while for the rest of the questions, coding was predefined during the design phase.
- Data entry/validation: A team consisting of system analysts, programmers and data entry personnel were working on the data at this stage. System analysts and programmers started by identifying the survey framework and questionnaire fields to help build computerized data entry forms. A set of validation rules were added to the entry form to ensure accuracy of data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms are correctly extracted and put back in the archive system. A data validation process was run on the data to ensure the data entered is free of errors.
- Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey final results. Those results were further checked using similar outputs from SPSS to ensure that tabulations produced were correct. A check was also run on each table to guarantee consistency of figures presented, together with required editing for tables' titles and report formatting.
Harmonized Data:
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
- The harmonization process started with cleaning all raw data files received from the Statistical Office.
- Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
- A post-harmonization cleaning process was run on the data.
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.
The dataset includes: 1. learning data containing e-commerce user sessions (DATASET-X-session_visit.csv files); 2. clustering results (including metrics values and customer clusters), per algorithm tested; 3. calculations (xlsx file).
Dataset to segment RGB colour space into 100 colour names. Each point in colour space can be assigned to a colour name by finding the nearest neighbour.
Data contains 100 colour names which correspond to well-distributed coordinates in RGB-colour space. The data were obtained by clustering more than 1000 colours from joined data sets from xkcd (https://xkcd.com/color/rgb/, https://xkcd.com/color/satfaces.txt) and the webcolors package (https://github.com/ubernostrum/webcolors) to 100 clusters using KMeans.
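Assigning an arbitrary colour to the nearest of the 100 named colours can be done with a k-d tree; a minimal sketch (the example rows are placeholders, not the dataset's actual values):

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical layout: one row per named colour with its RGB coordinates.
names = ["red", "forest green", "navy blue"]                  # ... 100 entries in the dataset
rgb = np.array([[255, 0, 0], [34, 139, 34], [0, 0, 128]])     # matching RGB coordinates

tree = cKDTree(rgb)

def colour_name(r, g, b):
    """Assign an arbitrary RGB point to the nearest named colour."""
    _, idx = tree.query([r, g, b])
    return names[idx]

print(colour_name(250, 10, 10))   # -> "red"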
Statistical criteria to determine the optimal number of clusters.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The F2 Driver Stats vs F1 Graduation dataset aggregates per-driver performance metrics from the FIA Formula 2 Championship (2018–2019) and labels whether each driver eventually reached Formula 1.
Data Source: This dataset was created by processing and aggregating data originally sourced from the Formula 2 Dataset (2018-2019) by alarchemn on Kaggle. Modifications include calculating aggregate statistics per driver and adding the REACHED_F1 and cluster columns.
Each row represents one driver and includes the following columns:
1 signifies the driver competed in at least one official Formula 1 Grand Prix race after their F2 stint, and 0 signifies they did not.
Potential Uses:
You can use this dataset to:
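For example (a minimal sketch; the file name is an assumption, while REACHED_F1 and cluster are the columns described above), the behavioural clusters can be cross-tabulated against F1 graduation:

import pandas as pd

# Assumed file name; REACHED_F1 and cluster are the columns named above.
drivers = pd.read_csv("f2_driver_stats.csv")

# How do the driver clusters line up with eventual F1 graduation?
print(pd.crosstab(drivers["cluster"], drivers["REACHED_F1"], normalize="index"))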
Although they are the main constituents of the Galactic disk population, for half of the open clusters in the Milky Way reported in the literature nothing is known except the raw position and an approximate size. The main goal of this study is to determine a full set of uniform spatial, structural, kinematic, and astrophysical parameters for as many known open clusters as possible. On the basis of stellar data from PPMXL and 2MASS, the authors used a dedicated data-processing pipeline to determine kinematic and photometric membership probabilities for stars in a cluster region. For an input list of 3,784 targets from the literature, they confirm that 3,006 are real objects; the vast majority of them are open clusters, but associations and globular clusters are also present. For each confirmed object, the authors determined the exact position of the cluster center, the apparent size, proper motion, distance, color excess, and age. For about 1,500 clusters, these basic astrophysical parameters have been determined for the first time. For the bulk of the clusters the authors also derived the tidal radii. In addition, they estimated average radial velocities for more than 30% of the confirmed clusters. The present sample (called MWSC) reaches both the central parts of the Milky Way and its outer regions. It is almost complete up to 1.8 kpc from the Sun and also covers the neighboring spiral arms. However, for a small subset of the oldest open clusters (ages more than ~ 1 Gyr), the authors found some evidence of incompleteness within about 1 kpc from the Sun. This table contains the list of 3,006 Milky Way stellar clusters (MWSC) found in the 2MAst (2MASS with Astrometry) catalog presented in Paper II of this series (these clusters have source numbers below 4000), together with an additional 139 new open clusters (these clusters have source numbers between 5000 and 6000) found by the authors at high Galactic latitudes (|b_II_| > 18.5 degrees) which were presented in Paper III of the series, and an additional 63 new open clusters (these clusters have source numbers between 4000 and 5000) which were presented in Paper IV of the series. The target list in Paper II, from which the 3,006 open clusters were drawn, was compiled on the basis of present-day lists of open, globular and candidate clusters. The list of new high-latitude open clusters in Paper III was obtained from a target list of 714 density enhancements found using the 2MASS Catalog. The list of new open clusters in Paper IV was obtained from an initial list of 692 compact cluster candidates which were found by the authors by conducting an almost global search of the sky (they excluded the portions of the sky with |b_II_| < 5 degrees) in the PPMXL and the UCAC4 proper-motion catalogs. For confirmed clusters, the authors determined a homogeneous set of astrophysical parameters such as membership, angular radii of the main morphological parts, mean cluster proper motions, distances, reddenings, ages, tidal parameters, and sometimes radial velocities. This table was created by the HEASARC in February 2014 based on the list of open clusters given in CDS Catalog J/A+A/558/A53 files catalog.dat and notes.dat. It was updated in September 2014 with 139 additional star clusters from CDS Catalog J/A+A/568/A51 files catalog.dat and notes.dat. It was further updated in October 2015 with 63 additional star clusters from CDS Catalog J/A+A/581/A39 files catalog.dat and notes.dat.
Note that this table does not include the information, also contained in these catalogs, on candidates which turned out not to be open clusters. This is a service provided by NASA HEASARC.
This study tests the efficacy of an intervention -- Safe Public Spaces (SPS) -- focused on improving the safety of public spaces in schools, such as hallways, cafeterias, and stairwells. Twenty-four schools with middle grades in a large urban area were recruited for participation and were pair-matched and then assigned to either treatment or control. The study comprises four components: an implementation evaluation, a cost study, an impact study, and a community crime study.
Community crime study: The community crime study used the arrests of juveniles from the NYPD (New York Police Department) data. The data can be found at (https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u). Data include all arrests for juvenile crime during the life of the intervention. The 12 matched schools were identified and geo-mapped using Quantum GIS (QGIS) 3.8 software. Block groups in the 2010 US Census in which the schools reside and neighboring block groups were mapped into micro-areas. This resulted in twelve experimental school blocks and 11 control blocks in which the schools reside (two of the control schools existed in the same census block group). Additionally, neighboring blocks were geo-mapped into 70 experimental and 77 control adjacent block groups (see map). Finally, juvenile arrests were mapped into experimental and control areas. Using the ARIMA time-series method in the Stata 15 statistical software package, arrest data were analyzed to compare the change in juvenile arrests in the experimental and control sites.
Cost study: For the cost study, information from the implementing organization (Engaging Schools) was combined with data from phone conversations and follow-up communications with staff in school sites to populate a Resource Cost Model. The Resource Cost Model Excel file will be provided for archiving. This file contains details on the staff time and materials allocated to the intervention, as well as the NYC prices in 2018 US dollars associated with each element. Prices were gathered from multiple sources, including actual NYC DOE data on salaries for position types for which these data were available and district salary schedules for the other staff types. Census data were used to calculate benefits.
Impact evaluation: The impact evaluation was conducted using data from the Research Alliance for New York City Schools. Among the core functions of the Research Alliance is maintaining a unique archive of longitudinal data on NYC schools to support ongoing research. Their agreement with the New York City Department of Education (NYC DOE) outlines the data they receive, the process they use to obtain it, and the security measures to keep it safe.
Implementation study: The implementation study comprises the baseline survey and observation data. Interview transcripts are not archived.
https://creativecommons.org/publicdomain/zero/1.0/
Did We Solve the Problem?
The objective of this analysis was to predict high streaming counts on Spotify and perform a detailed cluster analysis to understand user behavior. Here’s a summary of how we addressed each part of the objective:
Prediction of High Streaming Counts:
- Implemented Multiple Models: We utilized several machine learning models including Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN).
- Comparison and Evaluation: These models were evaluated based on classification metrics like accuracy, precision, recall, and F1-score. The Gradient Boosting and Random Forest models were found to be the most effective in predicting high streaming counts.
Cluster Analysis:
- K-means Clustering: We applied K-means clustering to segment users into three clusters based on their listening behavior.
- Detailed Characterization: Each cluster was analyzed to understand its distinct characteristics, such as average playtime, skip rate, offline usage, and shuffle usage.
- Visualizations: Histograms and scatter plots were used to visualize the distributions and relationships within each cluster.
Results and Insights
- Effective Models: The Gradient Boosting and Random Forest models provided the highest accuracy and balanced performance for predicting high streaming counts.
- User Segmentation: The cluster analysis revealed three distinct user segments:
  Cluster 1: Users with longer playtimes and lower skip rates.
  Cluster 2: Users with moderate playtimes and skip rates.
  Cluster 3: Users with shorter playtimes and higher skip rates.
These insights can be leveraged for targeted marketing, personalized recommendations, and improving user engagement on Spotify.
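A condensed sketch of the two steps summarized above (the file name, feature column names, and high_stream label are assumptions, not the actual analysis code):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Assumed file and column names based on the features discussed above.
df = pd.read_csv("spotify_streaming.csv")
features = ["avg_playtime", "skip_rate", "offline_usage", "shuffle_usage"]
X = StandardScaler().fit_transform(df[features])

# Three-cluster segmentation of listening behaviour.
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("cluster")[features].mean())

# Gradient boosting to predict the (assumed) high-streaming label.
X_tr, X_te, y_tr, y_te = train_test_split(X, df["high_stream"], test_size=0.2,
                                          random_state=0, stratify=df["high_stream"])
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))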
Conclusion
Yes, we solved the problem. We successfully predicted high streaming counts using effective machine learning models and provided a detailed cluster analysis to understand user behavior. The analysis offers valuable insights for enhancing Spotify’s recommendation system and user experience.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We would like to inform you that the updated GlobPOP dataset (2021-2022) is now available in version 2.0. The GlobPOP dataset (2021-2022) in the current version is not recommended for your work. The GlobPOP dataset (1990-2020) in the current version is the same as in version 1.0.
Thank you for your continued support of the GlobPOP.
If you encounter any issues, please contact us via email at lulingliu@mail.bnu.edu.cn.
Continuously monitoring global population spatial dynamics is essential for applications such as epidemiology, urban planning, and the study of global inequality, and for implementing effective policies related to sustainable development.
Here, we present GlobPOP, a new continuous global gridded population product with a high-precision spatial resolution of 30 arcseconds from 1990 to 2020. Our data-fusion framework is based on cluster analysis and statistical learning approaches, which fuse five existing products (Global Human Settlements Layer Population (GHS-POP), Global Rural Urban Mapping Project (GRUMP), Gridded Population of the World Version 4 (GPWv4), LandScan, and WorldPop) into a new continuous global gridded population product (GlobPOP). The spatial validation results demonstrate that the GlobPOP dataset is highly accurate. To validate the temporal accuracy of GlobPOP at the country level, we have developed an interactive web application, accessible at https://globpop.shinyapps.io/GlobPOP/, where data users can explore the country-level population time-series curves of interest and compare them with census data.
With the availability of GlobPOP dataset in both population count and population density formats, researchers and policymakers can leverage our dataset to conduct time-series analysis of population and explore the spatial patterns of population development at various scales, ranging from national to city level.
The product is produced at 30 arc-second resolution (approximately 1 km at the equator) and is made available in GeoTIFF format. There are two population formats: 'Count' (population count per grid cell) and 'Density' (population count per square kilometer in each grid cell).
Each GeoTIFF filename has 5 fields that are separated by an underscore "_". A filename extension follows these fields. The fields are described below with the example filename:
GlobPOP_Count_30arc_1990_I32
Field 1: GlobPOP(Global gridded population)
Field 2: Pixel unit is population "Count" or population "Density"
Field 3: Spatial resolution is 30 arc seconds
Field 4: Year "1990"
Field 5: Data type is I32(Int 32) or F32(Float32)
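A minimal reading sketch (the ".tif" extension and file location are assumptions) that parses the naming convention and loads one annual layer with rasterio:

import rasterio

# Parse the five underscore-separated fields described above.
path = "GlobPOP_Count_30arc_1990_I32.tif"
product, unit, resolution, year, dtype = path.rsplit(".", 1)[0].split("_")

with rasterio.open(path) as src:
    counts = src.read(1)          # population count per 30 arc-second grid cell
    nodata = src.nodata

valid = counts if nodata is None else counts[counts != nodata]
print(f"{product} {unit} {year}: total population = {valid[valid > 0].sum():,.0f}")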
Please refer to the paper for detailed information:
Liu, L., Cao, X., Li, S. et al. A 31-year (1990–2020) global gridded population dataset generated by cluster analysis and statistical learning. Sci Data 11, 124 (2024). https://doi.org/10.1038/s41597-024-02913-0.
The fully reproducible codes are publicly available at GitHub: https://github.com/lulingliu/GlobPOP.
This table, the Archive of Chandra Cluster Entropy Profile Tables (ACCEPT) Catalog, contains the radial entropy profiles of the intracluster medium (ICM) for a collection of 239 clusters taken from the Chandra X-ray Observatory's Data Archive. Entropy is of great interest because it controls ICM global properties and records the thermal history of a cluster. The authors find that most ICM entropy profiles are well fitted by a model which is a power law at large radii and approaches a constant value at small radii: K(r) = K0 + K100 (r/100 kpc)^alpha, where K0 quantifies the typical excess of core entropy above the best-fitting power law found at larger radii. The authors also show that the K0 distributions of both the full archival sample and the primary Highest X-Ray Flux Galaxy Cluster Sample of Reiprich (2001, Ph.D. thesis) are bimodal with a distinct gap between K0 ~ 30 - 50 keV cm^2 and population peaks at K0 ~ 15 keV cm^2 and K0 ~ 150 keV cm^2. The effects of point-spread function smearing and angular resolution on best-fit K0 values are investigated using mock Chandra observations and degraded entropy profiles, respectively. The authors find that neither of these effects is sufficient to explain the entropy-profile flattening they measure at small radii. The influence of profile curvature and the number of radial bins on the best-fit K0 is also considered, and they find no indication that K0 is significantly impacted by either. All data and results associated with this work are publicly available via the project web site http://www.pa.msu.edu/astro/MC2/accept/. The sample is collected from observations taken with the Chandra X-ray Observatory which were publicly available in the CDA (Chandra Data Archive) as of 2008 August. This table was created by the HEASARC in January 2012 based on CDS Catalog J/ApJS/182/12 files table1.dat and table5.dat. This is a service provided by NASA HEASARC.
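The quoted profile model can be written as a one-line function (the parameter values K0, K100, and alpha come from the per-cluster fits in the catalog; none are hard-coded here):

def icm_entropy(r_kpc, K0, K100, alpha):
    """ACCEPT entropy profile model: K(r) = K0 + K100 * (r / 100 kpc)^alpha,
    with K in keV cm^2 and r in kpc."""
    return K0 + K100 * (r_kpc / 100.0) ** alpha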