100+ datasets found

Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...
tandf.figshare.com
tar
Updated Jun 11, 2024
Cite
Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
Available download formats: tar
Unique identifier
https://doi.org/10.6084/m9.figshare.25594361.v1
Dataset updated
Jun 11, 2024
Dataset provided by
Taylor & Francis
Authors
Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods cannot always be applied to big data because of memory or time constraints generated by calculations of order P × N(N−1)/2. To circumvent this problem, the clustering technique is typically applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data-nuggets-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
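The idea of running K-means on weighted observations can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the nuggets have already been computed, uses only each nugget's center and weight (the scale parameter is ignored), and the names `weighted_kmeans`, `nugget_centers`, and `nugget_weights` as well as the sample values are hypothetical.

```python
import math

def weighted_kmeans(centers, weights, init, n_iter=50):
    """Lloyd's algorithm in which each point counts `weight` times."""
    means = [tuple(m) for m in init]
    labels = [0] * len(centers)
    for _ in range(n_iter):
        # Assignment step: each nugget goes to its nearest mean.
        labels = [
            min(range(len(means)), key=lambda j: math.dist(c, means[j]))
            for c in centers
        ]
        # Update step: each mean becomes the weighted average of its nuggets.
        new_means = []
        for j, old in enumerate(means):
            members = [(c, w) for c, w, l in zip(centers, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total == 0:
                new_means.append(old)  # leave an empty cluster's mean unchanged
                continue
            new_means.append(tuple(
                sum(w * c[d] for c, w in members) / total
                for d in range(len(old))
            ))
        if new_means == means:
            break  # converged
        means = new_means
    return means, labels

# Hypothetical nuggets: four centers standing in for 600 raw observations.
nugget_centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
nugget_weights = [300, 100, 50, 150]

means, labels = weighted_kmeans(
    nugget_centers, nugget_weights, init=[(0.0, 0.0), (10.0, 10.0)]
)
print(means)   # [(0.25, 0.0), (10.75, 10.0)]
print(labels)  # [0, 0, 1, 1]
```

Because the weights enter only the update step, the cost per iteration depends on the number of nuggets rather than on N.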
Cluster Analysis Software Report
archivemarketresearch.com
doc, pdf, ppt
Updated Mar 15, 2025
Cite
Archive Market Research (2025). Cluster Analysis Software Report [Dataset]. https://www.archivemarketresearch.com/reports/cluster-analysis-software-59553
Available download formats: pdf, doc, ppt
Dataset updated
Mar 15, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global market for Cluster Analysis Software is experiencing robust growth, driven by the increasing adoption of big data analytics and the need for advanced data interpretation across diverse sectors. While precise market sizing data is unavailable, considering the growth observed in related fields like data analytics and AI, a reasonable estimate for the 2025 market size could be placed between $2.5 billion and $3 billion. This estimate assumes a moderate growth trajectory reflecting the maturation of the cluster analysis market and the ongoing integration of these tools into broader business intelligence platforms. Assuming a Compound Annual Growth Rate (CAGR) of 15% for the forecast period (2025-2033), the market is projected to reach a substantial size within the next decade. This growth is fueled by several key drivers, including the expanding availability of large datasets, the growing demand for data-driven decision-making across industries like BFSI (Banking, Financial Services, and Insurance), government, and commercial sectors, and the continuous development of more sophisticated algorithms and user-friendly interfaces for cluster analysis software. The cloud-based segment is expected to dominate, given its scalability and accessibility benefits, although web-based applications will continue to hold a significant market share. Geographic growth will be diverse, with North America and Europe maintaining strong positions due to advanced analytics adoption, but significant expansion is also expected in the Asia-Pacific region as technological advancement and data infrastructure improve. However, challenges like data privacy concerns, the need for skilled professionals, and the high cost of advanced software solutions could act as market restraints in certain regions. 
The competitive landscape is marked by a mix of established players such as IBM, Microsoft, and TIBCO Software, along with a growing number of specialized vendors and emerging technology companies. The market is characterized by ongoing innovation in areas like algorithm development, enhanced visualization capabilities, and the integration of cluster analysis with other advanced analytics tools. This continuous innovation will be a key driver in sustaining the market's high CAGR and ensuring its continued growth in the coming years. Increased focus on providing tailored solutions for specific industry verticals will likely be a strategic advantage for vendors seeking a competitive edge. The market's future hinges on its ability to effectively address the challenges of data complexity, security, and user-friendliness while continuing to deliver accurate and actionable insights.
Bigdata with Ground Truth 4 K-Means Clustering
kaggle.com
zip
Updated Nov 7, 2024
Cite
Eihab SaatiAlSoruji (2024). Bigdata with Ground Truth 4 K-Means Clustering [Dataset]. https://www.kaggle.com/datasets/eihabsaatialsoruji/bigdata-with-ground-truth-4-k-means-clustering/suggestions
Available download formats: zip (600981775 bytes)
Dataset updated
Nov 7, 2024
Authors
Eihab SaatiAlSoruji
Description
This dataset is used in the research entitled "Review on Designing High-Performance K-Means Clustering for Big Data Processing," which investigates big data clustering using various parallel K-means techniques. The dataset includes four sub-datasets, each representing a different scenario. Each scenario demonstrates a distinct distribution of data points within a 2-dimensional feature space, including the ground truth. Furthermore, each scenario contains four data files with varying sizes of data points that follow the same distribution: 100K, 1M, 4M, and 32M data points (where M = million, K = thousand). The figures provided in the scenarios illustrate sample data point distributions.

Using this dataset is permitted when citing the previously mentioned paper after publication.
MOESM1 of Limited random walk algorithm for big graph data clustering
springernature.figshare.com
zip
Updated Jun 1, 2023
Cite
Honglei Zhang; Jenni Raitoharju; Serkan Kiranyaz; Moncef Gabbouj (2023). MOESM1 of Limited random walk algorithm for big graph data clustering [Dataset]. http://doi.org/10.6084/m9.figshare.c.3696874_D1.v1
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.c.3696874_D1.v1
Dataset updated
Jun 1, 2023
Dataset provided by
Figshare (http://figshare.com/)
Authors
Honglei Zhang; Jenni Raitoharju; Serkan Kiranyaz; Moncef Gabbouj
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Additional file 1. Clustering results on graphs used in the experiments of various methods.
Customer Clustering
kaggle.com
zip
Updated May 7, 2021
Cite
Dev Sharma (2021). Customer Clustering [Dataset]. https://www.kaggle.com/datasets/dev0914sharma/customer-clustering/data
Available download formats: zip (26543 bytes)
Dataset updated
May 7, 2021
Authors
Dev Sharma
Description
Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means of identifying unsatisfied customer needs. Using such data, companies can then outperform the competition by developing uniquely appealing products and services. You own a supermarket mall and, through membership cards, you have some basic data about your customers: customer ID, age, gender, annual income, and spending score. You want to understand your customers, for example who the target customers are, so that the insight can be passed to the marketing team, which can plan its strategy accordingly.
Citation Trends for "Computer Network Information Security Threat...
shibatadb.com
Updated Dec 2, 2022
Cite
Yubetsu (2022). Citation Trends for "Computer Network Information Security Threat Identification Technology Based on Big Data Clustering Algorithm" [Dataset]. https://www.shibatadb.com/article/k3tvBrme
Dataset updated
Dec 2, 2022
Dataset authored and provided by
Yubetsu
License
https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt
Time period covered
2024
Variables measured
New Citations per Year
Description
Yearly citation counts for the publication titled "Computer Network Information Security Threat Identification Technology Based on Big Data Clustering Algorithm".
Data_Sheet_3_Qluster: An easy-to-implement generic workflow for robust...
frontiersin.figshare.com
docx
Updated May 31, 2023
+ more versions
Cite
Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker (2023). Data_Sheet_3_Qluster: An easy-to-implement generic workflow for robust clustering of health data.docx [Dataset]. http://doi.org/10.3389/frai.2022.1055294.s003
Available download formats: docx
Unique identifier
https://doi.org/10.3389/frai.2022.1055294.s003
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The exploration of health data by clustering algorithms makes it possible to better describe the populations of interest by seeking the sub-profiles that compose them. This therefore reinforces medical knowledge, whether it is about a disease or a targeted population in real life. Nevertheless, contrary to the so-called conventional biostatistical methods where numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in a significant variability in the execution of data science projects, whether in terms of algorithms used, reliability and credibility of the designed approach. Taking the path of parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. Indeed, this workflow makes a compromise between (1) genericity of applications (e.g. usable on small or big data, on continuous, categorical or mixed variables, on database of high-dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). This workflow can be easily automated and/or routinely applied on a wide range of clustering projects. It can be useful both for data scientists with little experience in the field to make data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists.
An implementation on the Dataiku platform is available upon request to the authors.
Clustering Software Report
datainsightsmarket.com
doc, pdf, ppt
Updated Sep 23, 2025
Cite
Data Insights Market (2025). Clustering Software Report [Dataset]. https://www.datainsightsmarket.com/reports/clustering-software-1976567
Available download formats: doc, ppt, pdf
Dataset updated
Sep 23, 2025
Dataset authored and provided by
Data Insights Market
License
https://www.datainsightsmarket.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The global Clustering Software market is poised for substantial growth, projected to reach approximately $15,000 million by 2025, with a robust Compound Annual Growth Rate (CAGR) of 15% anticipated from 2025 to 2033. This expansion is primarily fueled by the increasing demand for enhanced performance, reliability, and scalability across diverse enterprise IT infrastructures. Businesses are increasingly leveraging clustering solutions to achieve high availability for critical applications, optimize resource utilization, and enable seamless disaster recovery capabilities. The proliferation of big data analytics, AI/ML workloads, and the growing adoption of cloud-native architectures further amplify the need for sophisticated clustering software that can manage complex, distributed environments effectively. Small and medium-sized businesses, in particular, are recognizing the value proposition of clustering in democratizing access to enterprise-grade performance and resilience, thus driving adoption beyond large enterprises. The market dynamics are characterized by a strong upward trend in the adoption of Windows-based clustering solutions, driven by Microsoft's continued innovation in its server operating systems and clustering technologies. However, Linux and Unix-based solutions are also witnessing significant traction, especially within high-performance computing (HPC) environments and organizations with a strong open-source leaning. Restraints for the market include the complexity of initial setup and ongoing management for some advanced clustering configurations, as well as the upfront investment costs associated with robust hardware and software. Nevertheless, ongoing advancements in automated management tools, containerization technologies like Docker and Kubernetes, and the increasing availability of cloud-based managed clustering services are mitigating these challenges. 
Key players like IBM, Microsoft, Oracle, and Red Hat are continuously innovating, introducing advanced features, and expanding their partner ecosystems to capitalize on this burgeoning market. This report delves into the dynamic landscape of the global Clustering Software market, projecting a robust expansion from an estimated $15.5 billion in 2025 to a substantial $32.7 billion by 2033. The study meticulously analyzes the Historical Period (2019-2024), providing a foundation for understanding current market dynamics, with a focus on the Base Year (2025) and an extensive Forecast Period (2025-2033). Through rigorous analysis of industry developments, technological advancements, and evolving market needs, this report offers unparalleled insights for stakeholders seeking to navigate and capitalize on this critical technology segment.
Data for: 3652350
data.mendeley.com
Updated Jul 15, 2020
Cite
Krishnakanth Allika (2020). Data for: 3652350 [Dataset]. http://doi.org/10.17632/9yj9d4dsnf.1
Unique identifier
https://doi.org/10.17632/9yj9d4dsnf.1
Dataset updated
Jul 15, 2020
Authors
Krishnakanth Allika
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Geospatial_Coordinates.csv: Postal code, latitude and longitude of data points in Toronto
FourSquareCategories.json: Categories and category IDs of the FourSquare API
Processed_data_for_analysis.csv: Data file after data preparation, available for analysis
Data from: A Generic Local Algorithm for Mining Data Streams in Large...
catalog.data.gov
datasets.ai
+2 more
Updated Apr 10, 2025
+ more versions
Cite
Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...
technavio.com
pdf
Updated Feb 8, 2025
Cite
Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
Available download formats: pdf
Dataset updated
Feb 8, 2025
Dataset provided by
TechNavio
Authors
Technavio
License
https://www.technavio.com/content/privacy-notice
Time period covered
2025 - 2029
Area covered
United States
Description

Data Science Platform Market Size 2025-2029

The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive the data science platform market.

Major Market Trends & Insights

North America dominated the market and accounted for a 48% growth during the forecast period.
By Deployment - On-premises segment was valued at USD 38.70 million in 2023
By Component - Platform segment accounted for the largest market revenue share in 2023

Market Size & Forecast

Market Opportunities: USD 1.00 million
Market Future Opportunities: USD 763.90 million
CAGR: 40.2%
North America: Largest market in 2023

Market Summary

The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations. According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.

What will be the Size of the Data Science Platform Market during the forecast period?


How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?

The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

Deployment: On-premises, Cloud
Component: Platform, Services
End-user: BFSI, Retail and e-commerce, Manufacturing, Media and entertainment, Others
Sector: Large enterprises, SMEs
Application: Data Preparation, Data Visualization, Machine Learning, Predictive Analytics, Data Governance, Others
Geography: North America (US, Canada); Europe (France, Germany, UK); Middle East and Africa (UAE); APAC (China, India, Japan); South America (Brazil); Rest of World (ROW)

By Deployment Insights

The on-premises segment is estimated to witness significant growth during the forecast period.

In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling techniques and machine learning algorithms drive business intelligence and decision-making.

Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.

API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.


The On-premises segment was valued at USD 38.70 million in 2019 and showed

Global Clustering Software Market Research Report: By Application (Data...

wiseguyreports.com

Updated Oct 14, 2025

Cite

(2025). Global Clustering Software Market Research Report: By Application (Data Mining, Machine Learning, Image Processing, Natural Language Processing), By Deployment Type (On-Premises, Cloud-Based, Hybrid), By End User (BFSI, Healthcare, Retail, Telecommunications), By Organization Size (Small Enterprises, Medium Enterprises, Large Enterprises) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/clustering-software-market

Dataset updated

Oct 14, 2025

License

https://www.wiseguyreports.com/pages/privacy-policy

Time period covered

Oct 25, 2025

Area covered

Global

Description

BASE YEAR	2024
HISTORICAL DATA	2019 - 2023
REGIONS COVERED	North America, Europe, APAC, South America, MEA
REPORT COVERAGE	Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
MARKET SIZE 2024	2397.5(USD Million)
MARKET SIZE 2025	2538.9(USD Million)
MARKET SIZE 2035	4500.0(USD Million)
SEGMENTS COVERED	Application, Deployment Type, End User, Organization Size, Regional
COUNTRIES COVERED	US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
KEY MARKET DYNAMICS	increasing big data adoption, rising demand for advanced analytics, growing need for real-time insights, expansion of cloud computing, integration of AI technologies
MARKET FORECAST UNITS	USD Million
KEY COMPANIES PROFILED	Tableau, Qlik, SAS Institute, MathWorks, SAP, Google Cloud, Knime, TIBCO Software, Microsoft, H2O.ai, Alteryx, IBM, AWS, databricks, Oracle, RapidMiner
MARKET FORECAST PERIOD	2025 - 2035
KEY MARKET OPPORTUNITIES	AI-driven data analysis, Cloud-based clustering solutions, Integration with IoT devices, Real-time data processing, Enhanced cybersecurity features
COMPOUND ANNUAL GROWTH RATE (CAGR)	5.9% (2025 - 2035)
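
The compound annual growth rate in the table can be cross-checked against the listed 2025 and 2035 market sizes. The sketch below uses the standard CAGR formula, (end / start)^(1 / years) − 1; the function name `cagr` is chosen here for illustration and is not taken from the report.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1

# Market sizes in USD million, as listed in the table above.
rate = cagr(2538.9, 4500.0, 2035 - 2025)
print(f"{rate:.1%}")  # 5.9%
```

The result agrees with the 5.9% stated for the 2025 - 2035 forecast period.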

Introduction to Clustering | Cluster Analysis
kaggle.com
zip
Updated Jul 2, 2018
Cite
Data Science (2018). Introduction to Clustering | Cluster Analysis [Dataset]. https://www.kaggle.com/ravali566/introduction-to-clustering-cluster-analysis
Available download formats: zip (16419686 bytes)
Dataset updated
Jul 2, 2018
Authors
Data Science
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Data Science

Released under CC0: Public Domain

Working with new clusters vs. usage of same cluster globally 2023
statista.com
Updated Jul 8, 2025
Cite
Statista (2025). Working with new clusters vs. usage of same cluster globally 2023 [Dataset]. https://www.statista.com/statistics/1451639/creation-of-new-clusters/
Dataset updated
Jul 8, 2025
Dataset authored and provided by
Statista (http://statista.com/)
Time period covered
2023
Area covered
Worldwide
Description
In 2023, the majority of respondents worldwide reported that they work without a dedicated cluster, with a share of almost ** percent of those surveyed reporting the same. Only ** percent reported that they create a new cluster for each development task.
Reference list of 265 sources used for the discovery of relationships...
doi.pangaea.de
search.dataone.org
Updated Jul 8, 2012
Cite
Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer (2012). Reference list of 265 sources used for the discovery of relationships between data clusters and metadata properties [Dataset]. http://doi.org/10.1594/PANGAEA.785666
Unique identifier
https://doi.org/10.1594/PANGAEA.785666
Dataset updated
Jul 8, 2012
Dataset provided by
PANGAEA
Authors
Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer
License
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Time period covered
Jan 1, 2006 - Dec 31, 2006
Description
Visual cluster analysis provides valuable tools that help analysts understand large data sets in terms of representative clusters and the relationships among them. Often, the clusters found are to be understood in the context of associated categorical, numerical or textual metadata given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views make it possible to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata.
Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.
Lisbon, Portugal, hotel’s customer dataset with three years of personal,...
data.mendeley.com
Updated Nov 18, 2020
+ more versions
Cite
Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
Unique identifier
https://doi.org/10.17632/j83f5fsh6c.1
Dataset updated
Nov 18, 2020
Authors
Nuno Antonio
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Portugal, Lisbon
Description
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems within the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
K-Mean Clustering Algorithm | Cluster Analysis
kaggle.com
zip
Updated Jun 4, 2018
Data Science (2018). K-Mean Clustering Algorithm | Cluster Analysis [Dataset]. https://www.kaggle.com/ravali566/kmean-clustering-algorithm-cluster-analysis
Explore at:
Available download formats: zip (9030596 bytes)
Dataset updated
Jun 4, 2018
Authors
Data Science
License
https://creativecommons.org/publicdomain/zero/1.0/
Description
Dataset

This dataset was created by Data Science

Released under CC0: Public Domain

Clustering results of real datasets.
plos.figshare.com
xls
Updated Jun 5, 2025
+ more versions
Wei Xingqiong; Li Kang (2025). Clustering results of real datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0325161.t004
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0325161.t004
Dataset updated
Jun 5, 2025
Dataset provided by
PLOS (http://plos.org/)
Authors
Wei Xingqiong; Li Kang
License
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Clustering is a fundamental tool in data mining, widely used in various fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density, and this parameter varies across datasets, which requires manual tuning and affects the algorithm’s performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making the algorithm dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering (DPC) method, which automatically adjusts parameters like cutoff distance and the number of clusters, based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
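To make the two manual parameters mentioned above concrete, here is a minimal sketch of baseline Density Peak Clustering in Python. It is not the adaptive Delaunay-graph method proposed in this entry. It assumes the common formulation in which local density is the number of neighbors within the cutoff distance, and centers are the points with the largest product of density and distance to the nearest denser point. The function name, parameter names, and example points are hypothetical.

```python
import math


def density_peak_clustering(points, cutoff, n_centers):
    """Baseline DPC sketch: both cutoff and n_centers are chosen by hand."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density: number of other points closer than the cutoff distance.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff) for i in range(n)]

    # Visit points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))

    delta = [0.0] * n    # distance to the nearest point that is visited earlier
    parent = [-1] * n    # index of that nearest earlier (denser) point
    for rank, i in enumerate(order):
        if rank == 0:
            # The densest point has no denser neighbor; use its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # Centers: largest rho * delta (the densest point always ranks first).
    centers = sorted(range(n), key=lambda i: (-rho[i] * delta[i], i))[:n_centers]

    labels = [-1] * n
    for cluster_id, center in enumerate(centers):
        labels[center] = cluster_id
    # Each remaining point inherits the label of its nearest denser point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two hypothetical, well-separated groups of five points each.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
]
labels, centers = density_peak_clustering(points, cutoff=1.2, n_centers=2)
print(labels, centers)
```

In this sketch, changing `cutoff` or `n_centers` changes the result directly, which is the dependence on manual tuning that the adaptive method aims to remove.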
Model-based cluster analysis of microarray gene-expression data
catalog.data.gov
data.virginia.gov
+1 more
Updated Sep 7, 2025
National Institutes of Health (2025). Model-based cluster analysis of microarray gene-expression data [Dataset]. https://catalog.data.gov/dataset/model-based-cluster-analysis-of-microarray-gene-expression-data
Explore at:
Dataset updated
Sep 7, 2025
Dataset provided by
National Institutes of Health
Description
Background: Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic. Results: The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% of the genes with almost no altered gene-expression levels, whereas the third one has 30 genes with more or less differential gene-expression levels. Conclusions: Our results indicate that model-based clustering of t-statistics (and possibly other summary statistics) can be a useful statistical tool to exploit differential gene expression for microarray data.
Unsupervised Learning Report
archivemarketresearch.com
doc, pdf, ppt
Updated Mar 13, 2025
Archive Market Research (2025). Unsupervised Learning Report [Dataset]. https://www.archivemarketresearch.com/reports/unsupervised-learning-56632
Explore at:
Available download formats: doc, ppt, pdf
Dataset updated
Mar 13, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
Discover the booming unsupervised learning market! Projected at $15 billion in 2025 and growing at a 25% CAGR, this report analyzes market drivers, trends, and key players like Microsoft & Google. Explore regional breakdowns and future forecasts (2025-2033).

