Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order P·N(N−1)/2. To circumvent this problem, the clustering technique is typically applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data-nugget-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
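To illustrate the downstream step only (a minimal sketch, not the authors' data-nugget construction; the variable names are hypothetical and the weighted-PCA route via a weighted covariance matrix is one standard choice), weighted K-means and weighted PCA can be run directly on nugget centers and weights:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = rng.normal(size=(500, 5))        # nugget centers (one row per nugget)
weights = rng.integers(1, 200, size=500)   # number of raw observations per nugget

# Weighted K-means: scikit-learn accepts per-sample weights directly.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(centers, sample_weight=weights)

# Weighted PCA via the weighted covariance matrix of the nugget centers.
w = weights / weights.sum()
mu = w @ centers
cov = (centers - mu).T @ ((centers - mu) * w[:, None])
eigvals, eigvecs = np.linalg.eigh(cov)
explained = eigvals[::-1] / eigvals.sum()           # variance ratio, descending
scores = (centers - mu) @ eigvecs[:, ::-1][:, :2]   # first two principal components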
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Abstract
The dataset was collected for 332 compute nodes throughout May 19 - 23, 2023. May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes eight CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180GB of raw data.
Background
Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster, which is used to run an assortment of physics analysis and simulation jobs: analysis workloads leverage data generated from the laboratory's electron accelerator, while simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster throughout May 19 - 23, 2023. Data from May 19 to May 22 primarily reflected normal system behavior, while May 23, 2023, recorded a notable anomaly. This anomaly was severe enough to necessitate intervention by JLab IT Operations staff.
The metrics were collected from CPU, disk, memory, and Slurm. Metrics related to CPU, disk, and memory provide insights into the status of individual compute nodes. Furthermore, Slurm metrics collected from the network have the capability to detect anomalies that may propagate to compute nodes executing the same job.
Usage Notes
While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data: the set of affected nodes and the exact start and end times of the abnormal behavior are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster.
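As one possible starting point (a hedged sketch, not part of the dataset or its documentation; the file name, column layout, and the choice of IsolationForest are assumptions), the per-node metrics could be screened with an unsupervised detector trained on the normal window:

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Hypothetical layout: one row per (timestamp, node), one column per scraped metric.
df = pd.read_csv("jlab_node_metrics.csv", parse_dates=["timestamp"])
metric_cols = [c for c in df.columns if c not in ("timestamp", "node")]

X = StandardScaler().fit_transform(df[metric_cols].fillna(0.0))

# Fit on the May 19-22 window (assumed normal), then score the May 23 window.
normal = df["timestamp"] < "2023-05-23"
iso = IsolationForest(contamination="auto", random_state=0).fit(X[normal.to_numpy()])
df["anomaly_score"] = -iso.score_samples(X)    # higher = more anomalous
print(df.loc[~normal].nlargest(10, "anomaly_score")[["timestamp", "node", "anomaly_score"]])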
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Clustering using distance normally requires all-against-all matching. The new algorithm can cluster 7 million proteins in under one hour using approximate clustering.
cat: contains the hierarchical sequence. protein_names: list of proteins in the group. Original data can be downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/env_nr.gz
Researchers can use the data to find relationships between proteins more easily.
The data set has two files. The protein_groupings file contains the clustered data; this file has only names. Sequences for the names can be found in the protein_name_letter file.
The data were downloaded from the NCBI site and the FASTA format was converted into full-length sequences. The sequences were then fed into the clustering algorithm.
Because this is hierarchical clustering, the relationship between sequences can be found by comparing the values in gn_list.
All groups start with cluster_id:0, split:0 and progress into matched splits. The difference between splits indicates how closely two sequences match. Comparing the cluster_id shows whether two sequences belong to the same group or different groups.
cluster_id = unique id for the cluster. split = approximate similarity between the sequences; this is an absolute value, so 63 means 63 letters match between the sequences (the higher the value, the more similar). inner_cluster_id = unique id to compare inner-cluster matches. total clusters = number of clusters after the approximate match is generated.
Due to space restrictions in Kaggle, this data set has only 9093 groups containing 129696 sequences.
One sequence may be in more than one cluster because similarity is calculated as if an all-against-all comparison were used.
Ex: For A, B, C, if A~B = 50, B~C = 50 and A~C = 0, then the clustering will have two groups: [A, B] and [B, C].
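A minimal sketch of how the protein_groupings file might be queried (the file format, delimiter, and column handling are assumptions; adjust to the actual layout):

import pandas as pd

# Hypothetical reading of the protein_groupings file.
groups = pd.read_csv("protein_groupings.csv")

def same_group(name_a, name_b):
    """Check whether two proteins share a cluster_id, i.e. fall in the same group."""
    a = groups[groups["protein_names"].str.contains(name_a, na=False)]
    b = groups[groups["protein_names"].str.contains(name_b, na=False)]
    shared = set(a["cluster_id"]) & set(b["cluster_id"])
    return bool(shared), shared

# The 'split' value is an absolute count of matching letters, so comparing splits
# along the hierarchy gives a rough similarity between two sequences.
in_same, ids = same_group("protein_A", "protein_B")
print(in_same, ids)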
If you need full dataset for your research, contact me.
The previous dataset had issues with similarity comparisons between clusters; inner-cluster comparisons worked. This is fixed in the new version.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Clustering is a fundamental tool in data mining, widely used in various fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density, and this parameter varies across datasets, which requires manual tuning and affects the algorithm’s performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making the algorithm dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering (DPC) method, which automatically adjusts parameters like cutoff distance and the number of clusters, based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
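For reference, a minimal sketch of the classic DPC quantities (local density rho and distance delta under a fixed cutoff distance), not the adaptive Delaunay-based variant proposed here:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def dpc_scores(X, dc):
    """Classic Density Peak Clustering quantities:
    rho_i  = local density, i.e. number of points within cutoff distance dc;
    delta_i = distance to the nearest point of higher density."""
    D = squareform(pdist(X))
    rho = (D < dc).sum(axis=1) - 1          # exclude the point itself
    delta = np.zeros(len(X))
    order = np.argsort(-rho)                # indices by decreasing density
    delta[order[0]] = D[order[0]].max()     # densest point: use its maximum distance
    for rank, i in enumerate(order[1:], start=1):
        delta[i] = D[i, order[:rank]].min() # nearest higher-density neighbour
    return rho, delta                       # cluster centres: large rho AND large delta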
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Data and code needed to recreate all analyses and figures presented in the manuscript 'Consistency of clustering analysis of complex 3D ocean datasets'.
'all_data_for_paper.nc': model data, 2000-2004 mean of all variables used, provided at all depth levels.
'mesh_mask.nc': domain and depth data file to be used alongside model data.
Tool to classify marine biogeochemical output from numerical models
Written by rmi, dapa & dmof
preprocess_amm7_functions.py
Functions needed to run different preprocessing scripts.
preprocess_all_depths.py
First script to run. Extracts relevant variables and takes the temporal mean for physical, biogeochemical and ecological variables. For physical variables, calculates PAR from qsr.
preprocess_amm7_mean.py
Use for surface biogeochemical and ecological sets (faster)
preprocess_DI_DA.py
Use for depth integrated, depth averaged and bottom biogeochemical and ecological sets. Can use for surface but slower.
preprocess_amm7_mean_one_depth.py
Extracts data at specified depth (numeric). Works for biogeochemical and ecological variables.
preprocess_physics.py
Takes all_depths_physics and calculates physics data at different depths.
silhouette_nvars.py
Calculates silhouette score for inputs with different numbers of variables and clusters
rand_index.py
rand_index_depth.py
remove_one_var.py
Calculates rand index between cluster sets with one variable removed and original set
Modelviz.py
Contains functions for applying clustering to data
kmeans-paper-plots.ipynb
Produces figure 4
kmeans-paper-plots-illustrate-normalisation.ipynb
Produces figure 2
kmeans-paper-plots-depths.ipynb
Produces figures 5-7
plot_silhouette.ipynb
Produces figure 3
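A rough illustration of how the two data files might be combined with K-means outside the provided scripts (the variable selection, the depth dimension name "deptht", and the cluster count are assumptions; the actual workflow lives in the preprocessing scripts and Modelviz.py):

import xarray as xr
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 2000-2004 mean model fields at all depth levels (see file description above).
ds = xr.open_dataset("all_data_for_paper.nc")

# Hypothetical choice: take the first four data variables at the top depth level
# and stack the grid cells into samples.
variables = list(ds.data_vars)[:4]
surface = ds[variables].isel(deptht=0)
table = surface.to_dataframe()[variables].dropna()

X = StandardScaler().fit_transform(table.values)
table["cluster"] = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
print(table["cluster"].value_counts())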
This data is from Gaia Data Release 3 (DR3) and includes data on two star clusters: NGC 188 and M67. The data is used in my astronomy class, wherein students are tasked with determining which star cluster is older. (Update, 12-Sep-2023: I'm hoping to add a ML version of the data set that includes more field stars and divides the data into test and train sets. TBA.)
NGC 188 and M67 stars are provided as separate csv files, with each row corresponding to a star. There are two versions for each star cluster:
For more on these quantities, please see https://gea.esac.esa.int/archive/documentation/GDR3/Gaia_archive/chap_datamodel/sec_dm_main_source_catalogue/ssec_dm_gaia_source.html
SELECT gaia_source.parallax,gaia_source.phot_g_mean_mag,gaia_source.bp_rp,gaia_source.pmra,gaia_source.pmdec
FROM gaiadr3.gaia_source
WHERE
gaia_source.l BETWEEN 215 AND 216 AND
gaia_source.b BETWEEN 31.5 AND 32.5 AND
gaia_source.phot_g_mean_mag < 18 AND
gaia_source.parallax_over_error > 4 AND
gaia_source.bp_rp IS NOT NULL
SELECT gaia_source.parallax,gaia_source.phot_g_mean_mag,gaia_source.bp_rp,gaia_source.pmra,gaia_source.pmdec
FROM gaiadr3.gaia_source
WHERE
gaia_source.l BETWEEN 122 AND 123.5 AND
gaia_source.b BETWEEN 21.5 AND 23 AND
gaia_source.phot_g_mean_mag < 18 AND
gaia_source.parallax_over_error > 4 AND
gaia_source.bp_rp IS NOT NULL
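For the classroom comparison, a minimal plotting sketch (the CSV file names are assumptions) that draws a colour-magnitude diagram from the queried columns for each cluster:

import pandas as pd
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, name in zip(axes, ["M67.csv", "NGC188.csv"]):   # assumed file names
    stars = pd.read_csv(name)
    # Colour-magnitude diagram: BP-RP colour vs apparent G magnitude;
    # the older cluster shows an earlier main-sequence turn-off.
    ax.scatter(stars["bp_rp"], stars["phot_g_mean_mag"], s=2)
    ax.invert_yaxis()                                    # brighter stars at the top
    ax.set_xlabel("BP - RP")
    ax.set_ylabel("G magnitude")
    ax.set_title(name.removesuffix(".csv"))
plt.show()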
Please see Gaia Archive's how to cite page for information regarding the use of the data.
The classroom activity and my code are free to use under an MIT License.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of the accompanying categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. In our work we focus on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.
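As a toy illustration of the idea (not the paper's specific interestingness measures), a cluster's categorical metadata histogram can be scored by how strongly it deviates from a uniform distribution and the clusters ranked accordingly:

import numpy as np
import pandas as pd

def interestingness(categories):
    """Score a cluster's categorical metadata by how far its category histogram
    deviates from uniform (1 = one dominant category, 0 = perfectly uniform)."""
    p = pd.Series(categories).value_counts(normalize=True).to_numpy()
    entropy = -(p * np.log(p)).sum()
    return 1.0 - entropy / np.log(len(p)) if len(p) > 1 else 1.0

# Rank clusters by the skewness of their metadata distribution (toy example data).
df = pd.DataFrame({"cluster": [0, 0, 0, 1, 1, 1], "station": ["A", "A", "A", "A", "B", "C"]})
scores = df.groupby("cluster")["station"].apply(interestingness).sort_values(ascending=False)
print(scores)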
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
One table and six figures.
Table 1 shows the number of images for each label in the 1μ–2μ data set, adopting the same labelling used in [11, 12, 13], reported here for completeness: 0 = Porous sponges, 1 = Patterned surfaces, 2 = Particles, 3 = Films and coated surfaces, 4 = Powders, 5 = Tips, 6 = Nanowires, 7 = Biological, 8 = MEMS devices and electrodes, 9 = Fibres.
Figure 1 shows test accuracy as a function of the number of training epochs obtained by training from scratch Inception-v3 (magenta), Inception-v4 (orange), Inception-Resnet (green), and AlexNet (black) on the SEM data set. All the models were trained with the best combination of hyperparameters, according to the memory capability of the available hardware.
Figure 2, Main: test accuracy as a function of the number of training epochs obtained when fine-tuning on the SEM data set Inception-v3 (magenta) and Inception-v4 (orange) starting from the ImageNet checkpoint, and Inception-v3 (blue) from the SEM checkpoint that, as expected, converges very rapidly. Inset: test accuracy as a function of the number of training epochs obtained when performing feature extraction with Inception-v3 (magenta), Inception-v4 (orange), and Inception-Resnet (green) on the SEM data set starting from the ImageNet checkpoint. All the models were trained with the best combination of hyperparameters, according to the memory capability of the hardware available.
Figure 3 shows the intrinsic dimension of the 1μ–2μ_1001 data set, varying the sample size, computed before autoencoding (green lines) and after autoencoding (red lines). The three brightness levels for each color correspond to the percentage of points used in the linear fit: 90%, 70%, and 50%.
Figure 4 shows the ddisc heatmap for a manually labelled subset of images.
Figure 5 presents heatmaps of the distances obtained via Inception-v3. The image captions specify the methods used and indicate the correlation index with ddisc.
Figure 6 shows NMI scores of the clustering obtained by the five hierarchical algorithms (solid lines) considered as a function of k, the number of clusters. The scores of the artificial scenarios are reported as orange (good case) and green (uniform case) dashed lines.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset includes: 1. online store customer behavior data (clickstream) from 1.04.-30.11.2023, used to cluster customers and evaluate the effectiveness of implemented modifications (catalog: learning-dataset); 2. clustering results used to verify the effectiveness of implemented changes (catalog: clustering); 3. detailed data for the calculation of macro-conversion indicators (catalog: macro-conversion-indicators); 4. detailed data for the calculation of micro-conversion indicators (catalog: micro-conversion-indicators).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Determining intrinsic number of clusters in a multidimensional dataset is a commonly encountered problem in exploratory data analysis. Unsupervised clustering algorithms often rely on specification of cluster number as an input parameter. However, this is typically not known a priori. Many methods have been proposed to estimate cluster number, including statistical and information-theoretic approaches such as the gap statistic, but these methods are not always reliable when applied to non-normally distributed datasets containing outliers or noise. In this study, I propose a novel method called hierarchical linkage regression, which uses regression to estimate the intrinsic number of clusters in a multidimensional dataset. The method operates on the hypothesis that the organization of data into clusters can be inferred from the hierarchy generated by partitioning the dataset, and therefore does not directly depend on the specific values of the data or their distribution, but on their relative ranking within the partitioned set. Moreover, the technique does not require empirical data to train on, but can use synthetic data generated from random distributions to fit regression coefficients. The trained hierarchical linkage regression model is able to infer cluster number in test datasets of varying complexity and differing distributions, for image, text and numeric data, using the same regression model without retraining. The method performs favourably against other cluster number estimation techniques, and is also robust to parameter changes, as demonstrated by sensitivity analysis. The apparent robustness and generalizability of hierarchical linkage regression make it a promising tool for unsupervised exploratory data analysis and discovery.
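The following is only a toy illustration of the general idea (not the paper's implementation; the Ward linkage, the use of the top normalized merge heights as features, and the random-forest regressor are all assumed choices): fit a regressor on linkage-derived features from synthetic data with known cluster counts, then apply it to new data.

import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestRegressor

def linkage_features(X, n_feats=20):
    # Hierarchical (Ward) linkage; keep only the relative ranking of the largest
    # merge heights, which does not depend on the data's scale.
    Z = linkage(X, method="ward")
    top = np.sort(Z[:, 2])[::-1][:n_feats]
    return top / top[0]

# Train purely on synthetic data with known cluster counts (no empirical data needed).
rng = np.random.default_rng(0)
feats, targets = [], []
for _ in range(300):
    k = int(rng.integers(1, 11))
    X, _ = make_blobs(n_samples=300, centers=k, n_features=4,
                      cluster_std=float(rng.uniform(0.5, 2.0)),
                      random_state=int(rng.integers(1_000_000)))
    feats.append(linkage_features(X))
    targets.append(k)

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(feats, targets)

# Infer the cluster number of an unseen dataset (here, 5 true clusters).
X_new, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=42)
print(round(float(reg.predict([linkage_features(X_new)])[0])))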
The main objective of the HEIS survey is to obtain detailed data on household expenditure and income, linked to various demographic and socio-economic variables, to enable computation of poverty indices and determine the characteristics of the poor and prepare poverty maps. Therefore, to achieve these goals, the sample had to be representative on the sub-district level.
The raw survey data provided by the Statistical Office was cleaned and harmonized by the Economic Research Forum, in the context of a major research project to develop and expand knowledge on equity and inequality in the Arab region. The main focus of the project is to measure the magnitude and direction of change in inequality and to understand the complex contributing social, political and economic forces influencing its levels. However, the measurement and analysis of the magnitude and direction of change in this inequality cannot be consistently carried out without harmonized and comparable micro-level data on income and expenditures. Therefore, one important component of this research project is securing and harmonizing household surveys from as many countries in the region as possible, adhering to international statistics on household living standards distribution. Once the dataset has been compiled, the Economic Research Forum makes it available, subject to confidentiality agreements, to all researchers and institutions concerned with data collection and issues of inequality.
Data collected through the survey helped in achieving the following objectives:
1. Provide data weights that reflect the relative importance of consumer expenditure items used in the preparation of the consumer price index
2. Study the consumer expenditure pattern prevailing in the society and the impact of demographic and socio-economic variables on those patterns
3. Calculate the average annual income of the household and the individual, and assess the relationship between income and different economic and social factors, such as profession and educational level of the head of the household and other indicators
4. Study the distribution of individuals and households by income and expenditure categories and analyze the factors associated with it
5. Provide the necessary data for the national accounts related to overall consumption and income of the household sector
6. Provide the necessary income data to serve in calculating poverty indices and identifying the poor characteristics as well as drawing poverty maps
7. Provide the data necessary for the formulation, follow-up and evaluation of economic and social development programs, including those addressed to eradicate poverty
National
Sample survey data [ssd]
The Household Expenditure and Income survey sample for 2010, was designed to serve the basic objectives of the survey through providing a relatively large sample in each sub-district to enable drawing a poverty map in Jordan. The General Census of Population and Housing in 2004 provided a detailed framework for housing and households for different administrative levels in the country. Jordan is administratively divided into 12 governorates, each governorate is composed of a number of districts, each district (Liwa) includes one or more sub-district (Qada). In each sub-district, there are a number of communities (cities and villages). Each community was divided into a number of blocks. Where in each block, the number of houses ranged between 60 and 100 houses. Nomads, persons living in collective dwellings such as hotels, hospitals and prison were excluded from the survey framework.
A two stage stratified cluster sampling technique was used. In the first stage, a cluster sample proportional to the size was uniformly selected, where the number of households in each cluster was considered the weight of the cluster. At the second stage, a sample of 8 households was selected from each cluster, in addition to another 4 households selected as a backup for the basic sample, using a systematic sampling technique. Those 4 households were sampled to be used during the first visit to the block in case the visit to the original household selected is not possible for any reason. For the purposes of this survey, each sub-district was considered a separate stratum to ensure the possibility of producing results on the sub-district level. In this respect, the survey framework adopted that provided by the General Census of Population and Housing Census in dividing the sample strata. To estimate the sample size, the coefficient of variation and the design effect of the expenditure variable provided in the Household Expenditure and Income Survey for the year 2008 was calculated for each sub-district. These results were used to estimate the sample size on the sub-district level so that the coefficient of variation for the expenditure variable in each sub-district is less than 10%, at a minimum, of the number of clusters in the same sub-district (6 clusters). This is to ensure adequate presentation of clusters in different administrative areas to enable drawing an indicative poverty map.
It should be noted that, in addition to the standard non-response rate assumed, higher rates were expected in areas where poor households are concentrated in major cities. Therefore, these were taken into consideration during the sampling design phase, and a higher number of households was selected from those areas, aiming to cover well all regions where poverty is widespread.
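To make the two-stage design concrete, here is a schematic sketch (illustrative only; the helper names and example numbers are assumptions, not the Statistical Office's actual selection procedure) of systematic PPS selection of clusters followed by a systematic sample of 8 main and 4 backup households:

import numpy as np

rng = np.random.default_rng(2010)

def pps_systematic(cluster_sizes, n_clusters):
    """First stage: select clusters with probability proportional to size
    (number of households), using systematic PPS sampling."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    cum = np.cumsum(sizes)
    step = cum[-1] / n_clusters
    start = rng.uniform(0, step)
    points = start + step * np.arange(n_clusters)
    return np.searchsorted(cum, points)          # indices of selected clusters

def systematic_households(n_households, n_main=8, n_backup=4):
    """Second stage: systematic sample of 8 main plus 4 backup households."""
    n_total = n_main + n_backup
    step = n_households / n_total
    start = rng.uniform(0, step)
    picks = (start + step * np.arange(n_total)).astype(int) % n_households
    return picks[:n_main], picks[n_main:]

# Example: one stratum (sub-district) with 40 blocks of 60-100 households each.
block_sizes = rng.integers(60, 101, size=40)
selected_blocks = pps_systematic(block_sizes, n_clusters=6)
main, backup = systematic_households(block_sizes[selected_blocks[0]])
print(selected_blocks, main, backup)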
Face-to-face [f2f]
Raw Data:
- Organizing forms/questionnaires: A compatible archive system was used to classify the forms according to different rounds throughout the year. A registry was prepared to indicate different stages of the process of data checking, coding and entry till forms were back to the archive system.
- Data office checking: This phase was achieved concurrently with the data collection phase in the field, where questionnaires completed in the field were immediately sent to the data office checking phase.
- Data coding: A team was trained to work on the data coding phase, which in this survey is only limited to education specialization, profession and economic activity. In this respect, international classifications were used, while for the rest of the questions, coding was predefined during the design phase.
- Data entry/validation: A team consisting of system analysts, programmers and data entry personnel were working on the data at this stage. System analysts and programmers started by identifying the survey framework and questionnaire fields to help build computerized data entry forms. A set of validation rules were added to the entry form to ensure accuracy of data entered. A team was then trained to complete the data entry process. Forms prepared for data entry were provided by the archive department to ensure forms are correctly extracted and put back in the archive system. A data validation process was run on the data to ensure the data entered is free of errors.
- Results tabulation and dissemination: After the completion of all data processing operations, ORACLE was used to tabulate the survey final results. Those results were further checked using similar outputs from SPSS to ensure that tabulations produced were correct. A check was also run on each table to guarantee consistency of figures presented, together with required editing for tables' titles and report formatting.
Harmonized Data:
- The Statistical Package for Social Science (SPSS) was used to clean and harmonize the datasets.
- The harmonization process started with cleaning all raw data files received from the Statistical Office.
- Cleaned data files were then merged to produce one data file on the individual level containing all variables subject to harmonization.
- A country-specific program was generated for each dataset to generate/compute/recode/rename/format/label harmonized variables.
- A post-harmonization cleaning process was run on the data.
- Harmonized data was saved on the household as well as the individual level, in SPSS and converted to STATA format.
The dataset includes: 1. learning data containing e-commerce user sessions (DATASET-X-session_visit.csv files); 2. clustering results (including metrics values and customer clusters), per algorithm tested; 3. calculations (xlsx file).
Dataset to segment RGB colour space into 100 colour names. Each point in colour space can be assigned to a colour name by finding the nearest neighbour.
Data contains 100 colour names which correspond to well-distributed coordinates in RGB-colour space. The data were obtained by clustering more than 1000 colours from joined data sets from xkcd (https://xkcd.com/color/rgb/, https://xkcd.com/color/satfaces.txt) and the webcolors package (https://github.com/ubernostrum/webcolors) to 100 clusters using KMeans.
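Assigning an arbitrary colour to the nearest of the 100 named colours can be done with a k-d tree; a minimal sketch (the example rows are placeholders, not the dataset's actual values):

import numpy as np
from scipy.spatial import cKDTree

# Hypothetical layout: one row per named colour with its RGB coordinates.
names = ["red", "forest green", "navy blue"]                  # ... 100 entries in the dataset
rgb = np.array([[255, 0, 0], [34, 139, 34], [0, 0, 128]])     # matching RGB coordinates

tree = cKDTree(rgb)

def colour_name(r, g, b):
    """Assign an arbitrary RGB point to the nearest named colour."""
    _, idx = tree.query([r, g, b])
    return names[idx]

print(colour_name(250, 10, 10))   # -> "red"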
Statistical criteria to determine the optimal number of clusters.
MIT License https://opensource.org/licenses/MIT
License information was derived automatically
The F2 Driver Stats vs F1 Graduation dataset aggregates per-driver performance metrics from the FIA Formula 2 Championship (2018–2019) and labels whether each driver eventually reached Formula 1.
Data Source: This dataset was created by processing and aggregating data originally sourced from the Formula 2 Dataset (2018-2019) by alarchemn on Kaggle. Modifications include calculating aggregate statistics per driver and adding the REACHED_F1 and cluster columns.
Each row represents one driver and includes the following columns:
1 signifies the driver competed in at least one official Formula 1 Grand Prix race after their F2 stint, and 0 signifies they did not.
Potential Uses:
You can use this dataset to:
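For example (a minimal sketch; the file name is an assumption, while REACHED_F1 and cluster are the columns described above), the behavioural clusters can be cross-tabulated against F1 graduation:

import pandas as pd

# Assumed file name; REACHED_F1 and cluster are the columns named above.
drivers = pd.read_csv("f2_driver_stats.csv")

# How do the driver clusters line up with eventual F1 graduation?
print(pd.crosstab(drivers["cluster"], drivers["REACHED_F1"], normalize="index"))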
Although they are the main constituents of the Galactic disk population, for half of the open clusters in the Milky Way reported in the literature nothing is known except the raw position and an approximate size. The main goal of this study is to determine a full set of uniform spatial, structural, kinematic, and astrophysical parameters for as many known open clusters as possible. On the basis of stellar data from PPMXL and 2MASS, the authors used a dedicated data-processing pipeline to determine kinematic and photometric membership probabilities for stars in a cluster region. For an input list of 3,784 targets from the literature, they confirm that 3,006 are real objects; the vast majority of them are open clusters, but associations and globular clusters are also present. For each confirmed object, the authors determined the exact position of the cluster center, the apparent size, proper motion, distance, color excess, and age. For about 1,500 clusters, these basic astrophysical parameters have been determined for the first time. For the bulk of the clusters the authors also derived the tidal radii. In addition, they estimated average radial velocities for more than 30% of the confirmed clusters. The present sample (called MWSC) reaches both the central parts of the Milky Way and its outer regions. It is almost complete up to 1.8 kpc from the Sun and also covers the neighboring spiral arms. However, for a small subset of the oldest open clusters (ages more than ~ 1 Gyr), the authors found some evidence of incompleteness within about 1 kpc from the Sun. This table contains the list of 3,006 Milky Way stellar clusters (MWSC) found in the 2MAst (2MASS with Astrometry) catalog presented in Paper II of this series (these clusters have source numbers below 4000), together with an additional 139 new open clusters (these clusters have source numbers between 5000 and 6000) found by the authors at high Galactic latitudes (|b_II_| > 18.5 degrees) which were presented in Paper III of the series, and an additional 63 new open clusters (these clusters have source numbers between 4000 and 5000) which were presented in Paper IV of the series. The target list in Paper II, from which the 3,006 open clusters were drawn, was compiled on the basis of present-day lists of open, globular and candidate clusters. The list of new high-latitude open clusters in Paper III was obtained from a target list of 714 density enhancements found using the 2MASS Catalog. The list of new open clusters in Paper IV was obtained from an initial list of 692 compact cluster candidates which were found by the authors by conducting an almost global search of the sky (they excluded the portions of the sky with |b_II_| < 5 degrees) in the PPMXL and the UCAC4 proper-motion catalogs. For confirmed clusters, the authors determined a homogeneous set of astrophysical parameters such as membership, angular radii of the main morphological parts, mean cluster proper motions, distances, reddenings, ages, tidal parameters, and sometimes radial velocities. This table was created by the HEASARC in February 2014 based on the list of open clusters given in CDS Catalog J/A+A/558/A53 files catalog.dat and notes.dat. It was updated in September 2014 with 139 additional star clusters from CDS Catalog J/A+A/568/A51 files catalog.dat and notes.dat. It was further updated in October 2015 with 63 additional star clusters from CDS Catalog J/A+A/581/A39 files catalog.dat and notes.dat.
Note that this table does not include the information, also contained in these catalogs, on candidates which turned out not to be open clusters. This is a service provided by NASA HEASARC.
This study tests the efficacy of an intervention -- Safe Public Spaces (SPS) -- focused on improving the safety of public spaces in schools, such as hallways, cafeterias, and stairwells. Twenty-four schools with middle grades in a large urban area were recruited for participation and were pair-matched and then assigned to either treatment or control. The study comprises four components: an implementation evaluation, a cost study, an impact study, and a community crime study.
Community crime study: The community crime study used the arrests of juveniles from the NYPD (New York Police Department) data. The data can be found at (https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u). Data include all arrests for juvenile crime during the life of the intervention. The 12 matched schools were identified and geo-mapped using Quantum GIS (QGIS) 3.8 software. Block groups in the 2010 US Census in which the schools reside and neighboring block groups were mapped into micro-areas. This resulted in twelve experimental school blocks and 11 control blocks in which the schools reside (two of the control schools existed in the same census block group). Additionally, neighboring blocks were geo-mapped into 70 experimental and 77 control adjacent block groups (see map). Finally, juvenile arrests were mapped into experimental and control areas. Using the ARIMA time-series method in the Stata 15 statistical software package, arrest data were analyzed to compare the change in juvenile arrests in the experimental and control sites.
Cost study: For the cost study, information from the implementing organization (Engaging Schools) was combined with data from phone conversations and follow-up communications with staff in school sites to populate a Resource Cost Model. The Resource Cost Model Excel file will be provided for archiving. This file contains details on the staff time and materials allocated to the intervention, as well as the NYC prices in 2018 US dollars associated with each element. Prices were gathered from multiple sources, including actual NYC DOE data on salaries for position types for which these data were available and district salary schedules for the other staff types. Census data were used to calculate benefits.
Impact evaluation: The impact evaluation was conducted using data from the Research Alliance for New York City Schools. Among the core functions of the Research Alliance is maintaining a unique archive of longitudinal data on NYC schools to support ongoing research. Their agreement with the New York City Department of Education (NYC DOE) outlines the data they receive, the process they use to obtain it, and the security measures to keep it safe.
Implementation study: The implementation study comprises the baseline survey and observation data. Interview transcripts are not archived.
https://creativecommons.org/publicdomain/zero/1.0/
Did We Solve the Problem?
The objective of this analysis was to predict high streaming counts on Spotify and perform a detailed cluster analysis to understand user behavior. Here’s a summary of how we addressed each part of the objective:
Prediction of High Streaming Counts:
- Implemented Multiple Models: We utilized several machine learning models including Decision Tree, Random Forest, Gradient Boosting, Support Vector Machine (SVM), and k-Nearest Neighbors (k-NN).
- Comparison and Evaluation: These models were evaluated based on classification metrics like accuracy, precision, recall, and F1-score. The Gradient Boosting and Random Forest models were found to be the most effective in predicting high streaming counts.
Cluster Analysis:
- K-means Clustering: We applied K-means clustering to segment users into three clusters based on their listening behavior.
- Detailed Characterization: Each cluster was analyzed to understand its distinct characteristics, such as average playtime, skip rate, offline usage, and shuffle usage.
- Visualizations: Histograms and scatter plots were used to visualize the distributions and relationships within each cluster.
Results and Insights
- Effective Models: The Gradient Boosting and Random Forest models provided the highest accuracy and balanced performance for predicting high streaming counts.
- User Segmentation: The cluster analysis revealed three distinct user segments:
  Cluster 1: Users with longer playtimes and lower skip rates.
  Cluster 2: Users with moderate playtimes and skip rates.
  Cluster 3: Users with shorter playtimes and higher skip rates.
These insights can be leveraged for targeted marketing, personalized recommendations, and improving user engagement on Spotify.
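A condensed sketch of the two steps summarized above (the file name, feature column names, and high_stream label are assumptions, not the actual analysis code):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

# Assumed file and column names based on the features discussed above.
df = pd.read_csv("spotify_streaming.csv")
features = ["avg_playtime", "skip_rate", "offline_usage", "shuffle_usage"]
X = StandardScaler().fit_transform(df[features])

# Three-cluster segmentation of listening behaviour.
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("cluster")[features].mean())

# Gradient boosting to predict the (assumed) high-streaming label.
X_tr, X_te, y_tr, y_te = train_test_split(X, df["high_stream"], test_size=0.2,
                                          random_state=0, stratify=df["high_stream"])
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))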
Conclusion
Yes, we solved the problem. We successfully predicted high streaming counts using effective machine learning models and provided a detailed cluster analysis to understand user behavior. The analysis offers valuable insights for enhancing Spotify’s recommendation system and user experience.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We would like to inform you that the updated GlobPOP dataset (2021-2022) is now available in version 2.0. The GlobPOP dataset (2021-2022) in the current version is not recommended for your work. The GlobPOP dataset (1990-2020) in the current version is the same as in version 1.0.
Thank you for your continued support of the GlobPOP.
If you encounter any issues, please contact us via email at lulingliu@mail.bnu.edu.cn.
Continuously monitoring global population spatial dynamics is essential for applications such as epidemiology, urban planning, and the study of global inequality, and for implementing effective policies related to sustainable development.
Here, we present GlobPOP, a new continuous global gridded population product with a high-precision spatial resolution of 30 arcseconds from 1990 to 2020. Our data-fusion framework is based on cluster analysis and statistical learning approaches, which fuse five existing products (Global Human Settlements Layer Population (GHS-POP), Global Rural Urban Mapping Project (GRUMP), Gridded Population of the World Version 4 (GPWv4), LandScan, and WorldPop) into a new continuous global gridded population product (GlobPOP). The spatial validation results demonstrate that the GlobPOP dataset is highly accurate. To validate the temporal accuracy of GlobPOP at the country level, we have developed an interactive web application, accessible at https://globpop.shinyapps.io/GlobPOP/, where data users can explore the country-level population time-series curves of interest and compare them with census data.
With the availability of GlobPOP dataset in both population count and population density formats, researchers and policymakers can leverage our dataset to conduct time-series analysis of population and explore the spatial patterns of population development at various scales, ranging from national to city level.
The product is produced at 30 arc-second resolution (approximately 1 km at the equator) and is made available in GeoTIFF format. There are two population formats: 'Count' (population count per grid cell) and 'Density' (population count per square kilometer in each grid cell).
Each GeoTIFF filename has 5 fields that are separated by an underscore "_". A filename extension follows these fields. The fields are described below with the example filename:
GlobPOP_Count_30arc_1990_I32
Field 1: GlobPOP(Global gridded population)
Field 2: Pixel unit is population "Count" or population "Density"
Field 3: Spatial resolution is 30 arc seconds
Field 4: Year "1990"
Field 5: Data type is I32(Int 32) or F32(Float32)
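A minimal reading sketch (the ".tif" extension and file location are assumptions) that parses the naming convention and loads one annual layer with rasterio:

import rasterio

# Parse the five underscore-separated fields described above.
path = "GlobPOP_Count_30arc_1990_I32.tif"
product, unit, resolution, year, dtype = path.rsplit(".", 1)[0].split("_")

with rasterio.open(path) as src:
    counts = src.read(1)          # population count per 30 arc-second grid cell
    nodata = src.nodata

valid = counts if nodata is None else counts[counts != nodata]
print(f"{product} {unit} {year}: total population = {valid[valid > 0].sum():,.0f}")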
Please refer to the paper for detailed information:
Liu, L., Cao, X., Li, S. et al. A 31-year (1990–2020) global gridded population dataset generated by cluster analysis and statistical learning. Sci Data 11, 124 (2024). https://doi.org/10.1038/s41597-024-02913-0.
The fully reproducible codes are publicly available at GitHub: https://github.com/lulingliu/GlobPOP.
This table, the Archive of Chandra Cluster Entropy Profile Tables (ACCEPT) Catalog, contains the radial entropy profiles of the intracluster medium (ICM) for a collection of 239 clusters taken from the Chandra X-ray Observatory's Data Archive. Entropy is of great interest because it controls ICM global properties and records the thermal history of a cluster. The authors find that most ICM entropy profiles are well fitted by a model which is a power law at large radii and approaches a constant value at small radii: K(r) = K0 + K100 (r/100 kpc)^alpha, where K0 quantifies the typical excess of core entropy above the best-fitting power law found at larger radii. The authors also show that the K0 distributions of both the full archival sample and the primary Highest X-Ray Flux Galaxy Cluster Sample of Reiprich (2001, Ph.D. thesis) are bimodal with a distinct gap between K0 ~ 30 - 50 keV cm^2 and population peaks at K0 ~ 15 keV cm^2 and K0 ~ 150 keV cm^2. The effects of point-spread function smearing and angular resolution on best-fit K0 values are investigated using mock Chandra observations and degraded entropy profiles, respectively. The authors find that neither of these effects is sufficient to explain the entropy-profile flattening they measure at small radii. The influence of profile curvature and the number of radial bins on the best-fit K0 is also considered, and they find no indication that K0 is significantly impacted by either. All data and results associated with this work are publicly available via the project web site http://www.pa.msu.edu/astro/MC2/accept/. The sample is collected from observations taken with the Chandra X-ray Observatory which were publicly available in the CDA (Chandra Data Archive) as of 2008 August. This table was created by the HEASARC in January 2012 based on CDS Catalog J/ApJS/182/12 files table1.dat and table5.dat. This is a service provided by NASA HEASARC.
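The quoted profile model can be written as a one-line function (the parameter values K0, K100, and alpha come from the per-cluster fits in the catalog; none are hard-coded here):

def icm_entropy(r_kpc, K0, K100, alpha):
    """ACCEPT entropy profile model: K(r) = K0 + K100 * (r / 100 kpc)^alpha,
    with K in keV cm^2 and r in kpc."""
    return K0 + K100 * (r_kpc / 100.0) ** alpha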