100+ datasets found
  1. Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...

    • tandf.figshare.com
    tar
    Updated Jun 11, 2024
    Cite
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
    Available download formats: tar
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, it is not always possible to apply these methods to big data due to memory or time constraints generated by calculations of order P·N(N−1)/2. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data-nuggets-based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
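
To make the pairwise-distance cost cited in the description concrete, the sketch below counts the operations in P·N(N−1)/2 (one P-dimensional distance per unordered pair of observations) before and after reduction. The value P = 10 and the figure of 2,000 nuggets are assumptions chosen only for illustration; the description states only that the flow cytometry dataset has over one million observations.

```python
def pairwise_distance_cost(n_obs: int, n_vars: int) -> int:
    """Operations of order P * N(N-1)/2: one P-dimensional distance per unordered pair."""
    return n_vars * n_obs * (n_obs - 1) // 2


def pairwise_storage_gib(n_obs: int, bytes_per_value: int = 8) -> float:
    """GiB needed to hold the N(N-1)/2 distinct distances (8-byte floats assumed)."""
    return n_obs * (n_obs - 1) // 2 * bytes_per_value / 2**30


# Hypothetical sizes: one million raw observations vs. 2,000 nuggets, P = 10.
full_cost = pairwise_distance_cost(1_000_000, 10)
reduced_cost = pairwise_distance_cost(2_000, 10)
print(full_cost, reduced_cost, full_cost // reduced_cost)
print(round(pairwise_storage_gib(1_000_000), 1), "GiB for the full distance set")
```

Because the cost grows quadratically in N, shrinking N by a factor of 500 cuts the pairwise work by roughly 500² under these assumptions.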

  2. Cluster Analysis Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 15, 2025
    Cite
    Archive Market Research (2025). Cluster Analysis Software Report [Dataset]. https://www.archivemarketresearch.com/reports/cluster-analysis-software-59553
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Cluster Analysis Software is experiencing robust growth, driven by the increasing adoption of big data analytics and the need for advanced data interpretation across diverse sectors. While precise market sizing data is unavailable, considering the growth observed in related fields like data analytics and AI, a reasonable estimate for the 2025 market size could be placed between $2.5 billion and $3 billion. This estimate assumes a moderate growth trajectory reflecting the maturation of the cluster analysis market and the ongoing integration of these tools into broader business intelligence platforms. Assuming a Compound Annual Growth Rate (CAGR) of 15% for the forecast period (2025-2033), the market is projected to reach a substantial size within the next decade. This growth is fueled by several key drivers, including the expanding availability of large datasets, the growing demand for data-driven decision-making across industries like BFSI (Banking, Financial Services, and Insurance), government, and commercial sectors, and the continuous development of more sophisticated algorithms and user-friendly interfaces for cluster analysis software. The cloud-based segment is expected to dominate, given its scalability and accessibility benefits, although web-based applications will continue to hold a significant market share. Geographic growth will be diverse, with North America and Europe maintaining strong positions due to advanced analytics adoption, but significant expansion is also expected in the Asia-Pacific region as technological advancement and data infrastructure improve. However, challenges like data privacy concerns, the need for skilled professionals, and the high cost of advanced software solutions could act as market restraints in certain regions. 
The competitive landscape is marked by a mix of established players such as IBM, Microsoft, and TIBCO Software, along with a growing number of specialized vendors and emerging technology companies. The market is characterized by ongoing innovation in areas like algorithm development, enhanced visualization capabilities, and the integration of cluster analysis with other advanced analytics tools. This continuous innovation will be a key driver in sustaining the market's high CAGR and ensuring its continued growth in the coming years. Increased focus on providing tailored solutions for specific industry verticals will likely be a strategic advantage for vendors seeking a competitive edge. The market's future hinges on its ability to effectively address the challenges of data complexity, security, and user-friendliness while continuing to deliver accurate and actionable insights.

  3. Bigdata with Ground Truth 4 K-Means Clustering

    • kaggle.com
    zip
    Updated Nov 7, 2024
    Cite
    Eihab SaatiAlSoruji (2024). Bigdata with Ground Truth 4 K-Means Clustering [Dataset]. https://www.kaggle.com/datasets/eihabsaatialsoruji/bigdata-with-ground-truth-4-k-means-clustering/suggestions
    Available download formats: zip (600981775 bytes)
    Dataset updated
    Nov 7, 2024
    Authors
    Eihab SaatiAlSoruji
    Description

    This dataset is used in the research entitled "Review on Designing High-Performance K-Means Clustering for Big Data Processing," which investigates big data clustering using various parallel K-means techniques. The dataset includes four sub-datasets, each representing a different scenario. Each scenario demonstrates a distinct distribution of data points within a 2-dimensional feature space, including the ground truth. Furthermore, each scenario contains four data files with varying sizes of data points that follow the same distribution: 100K, 1M, 4M, and 32M data points (where M = million, K = thousand). The figures provided in the scenarios illustrate sample data point distributions.

    Using this dataset is permitted when citing the previously mentioned paper after publication.
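
As a rough illustration of how a 2-dimensional dataset with ground truth can be used to check a K-means result, the sketch below runs plain (serial) Lloyd's algorithm and scores it with cluster purity. It is not the parallel technique studied in the paper, and the points are synthetic stand-ins: the file layout of the actual dataset is not described here, so `kmeans_2d`, `purity`, and the generated blobs are all assumptions.

```python
import random
from collections import Counter


def kmeans_2d(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm for 2-D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: (x - centers[j][0]) ** 2 + (y - centers[j][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers, labels


def purity(labels, truth):
    """Share of points that belong to the majority ground-truth class of their cluster."""
    by_cluster = {}
    for lab, t in zip(labels, truth):
        by_cluster.setdefault(lab, Counter())[t] += 1
    return sum(c.most_common(1)[0][1] for c in by_cluster.values()) / len(labels)


# Synthetic stand-in data: two well-separated blobs with known labels.
data_rng = random.Random(42)
truth = [i % 2 for i in range(200)]
points = [(data_rng.gauss(10 * t, 1.0), data_rng.gauss(10 * t, 1.0)) for t in truth]
centers, labels = kmeans_2d(points, k=2)
score = purity(labels, truth)
print(f"purity = {score:.3f}")
```

Purity is only one of several possible agreement measures; it is used here because it needs nothing beyond the cluster labels and the ground-truth labels.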

  4. MOESM1 of Limited random walk algorithm for big graph data clustering

    • springernature.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Honglei Zhang; Jenni Raitoharju; Serkan Kiranyaz; Moncef Gabbouj (2023). MOESM1 of Limited random walk algorithm for big graph data clustering [Dataset]. http://doi.org/10.6084/m9.figshare.c.3696874_D1.v1
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Honglei Zhang; Jenni Raitoharju; Serkan Kiranyaz; Moncef Gabbouj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Clustering results on graphs used in the experiments of various methods.

  5. Customer Clustering

    • kaggle.com
    zip
    Updated May 7, 2021
    Cite
    Dev Sharma (2021). Customer Clustering [Dataset]. https://www.kaggle.com/datasets/dev0914sharma/customer-clustering/data
    Available download formats: zip (26543 bytes)
    Dataset updated
    May 7, 2021
    Authors
    Dev Sharma
    Description

    Customer segmentation is the subdivision of a market into discrete customer groups that share similar characteristics. It can be a powerful means of identifying unsatisfied customer needs. Using this data, companies can then outperform the competition by developing uniquely appealing products and services. You own a supermarket mall and, through membership cards, have some basic data about your customers, such as customer ID, age, gender, annual income, and spending score. You want to understand your customers, for example who the target customers are, so that the marketing team can be informed and plan its strategy accordingly.

  6. Citation Trends for "Computer Network Information Security Threat...

    • shibatadb.com
    Updated Dec 2, 2022
    Cite
    Yubetsu (2022). Citation Trends for "Computer Network Information Security Threat Identification Technology Based on Big Data Clustering Algorithm" [Dataset]. https://www.shibatadb.com/article/k3tvBrme
    Dataset updated
    Dec 2, 2022
    Dataset authored and provided by
    Yubetsu
    License

    https://www.shibatadb.com/license/data/proprietary/v1.0/license.txt

    Time period covered
    2024
    Variables measured
    New Citations per Year
    Description

    Yearly citation counts for the publication titled "Computer Network Information Security Threat Identification Technology Based on Big Data Clustering Algorithm".

  7. Data_Sheet_3_Qluster: An easy-to-implement generic workflow for robust...

    • frontiersin.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker (2023). Data_Sheet_3_Qluster: An easy-to-implement generic workflow for robust clustering of health data.docx [Dataset]. http://doi.org/10.3389/frai.2022.1055294.s003
    Available download formats: docx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The exploration of health data by clustering algorithms makes it possible to better describe the populations of interest by seeking the sub-profiles that compose them. This in turn reinforces medical knowledge, whether it is about a disease or a targeted population in real life. Nevertheless, contrary to the so-called conventional biostatistical methods where numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant variability in the execution of data science projects, whether in terms of the algorithms used or the reliability and credibility of the designed approach. Taking the path of parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. This workflow makes a compromise between (1) genericity of applications (e.g. usable on small or big data, on continuous, categorical or mixed variables, on databases of high dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). This workflow can be easily automated and/or routinely applied on a wide range of clustering projects. It can be useful both for data scientists with little experience in the field, to make data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists.
    An implementation on the Dataiku platform is available upon request to the authors.

  8. Clustering Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Sep 23, 2025
    Cite
    Data Insights Market (2025). Clustering Software Report [Dataset]. https://www.datainsightsmarket.com/reports/clustering-software-1976567
    Available download formats: doc, ppt, pdf
    Dataset updated
    Sep 23, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global Clustering Software market is poised for substantial growth, projected to reach approximately $15,000 million by 2025, with a robust Compound Annual Growth Rate (CAGR) of 15% anticipated from 2025 to 2033. This expansion is primarily fueled by the increasing demand for enhanced performance, reliability, and scalability across diverse enterprise IT infrastructures. Businesses are increasingly leveraging clustering solutions to achieve high availability for critical applications, optimize resource utilization, and enable seamless disaster recovery capabilities. The proliferation of big data analytics, AI/ML workloads, and the growing adoption of cloud-native architectures further amplify the need for sophisticated clustering software that can manage complex, distributed environments effectively. Small and medium-sized businesses, in particular, are recognizing the value proposition of clustering in democratizing access to enterprise-grade performance and resilience, thus driving adoption beyond large enterprises. The market dynamics are characterized by a strong upward trend in the adoption of Windows-based clustering solutions, driven by Microsoft's continued innovation in its server operating systems and clustering technologies. However, Linux and Unix-based solutions are also witnessing significant traction, especially within high-performance computing (HPC) environments and organizations with a strong open-source leaning. Restraints for the market include the complexity of initial setup and ongoing management for some advanced clustering configurations, as well as the upfront investment costs associated with robust hardware and software. Nevertheless, ongoing advancements in automated management tools, containerization technologies like Docker and Kubernetes, and the increasing availability of cloud-based managed clustering services are mitigating these challenges. 
Key players like IBM, Microsoft, Oracle, and Red Hat are continuously innovating, introducing advanced features, and expanding their partner ecosystems to capitalize on this burgeoning market. This report delves into the dynamic landscape of the global Clustering Software market, projecting a robust expansion from an estimated $15.5 billion in 2025 to a substantial $32.7 billion by 2033. The study meticulously analyzes the Historical Period (2019-2024), providing a foundation for understanding current market dynamics, with a focus on the Base Year (2025) and an extensive Forecast Period (2025-2033). Through rigorous analysis of industry developments, technological advancements, and evolving market needs, this report offers unparalleled insights for stakeholders seeking to navigate and capitalize on this critical technology segment.

  9. Data for: 3652350

    • data.mendeley.com
    Updated Jul 15, 2020
    Cite
    Krishnakanth Allika (2020). Data for: 3652350 [Dataset]. http://doi.org/10.17632/9yj9d4dsnf.1
    Dataset updated
    Jul 15, 2020
    Authors
    Krishnakanth Allika
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Geospatial_Coordinates.csv [Postal code, latitude & longitude of data points in Toronto]
    FourSquareCategories.json [Categories and category IDs of FourSquare API]
    Processed_data_for_analysis.csv [Data file post data preparation and available for analysis]

  10. Data from: A Generic Local Algorithm for Mining Data Streams in Large...

    • catalog.data.gov
    • datasets.ai
    • +2 more
    Updated Apr 10, 2025
    Cite
    Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

  11. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    pdf
    Updated Feb 8, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Available download formats: pdf
    Dataset updated
    Feb 8, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    License

    https://www.technavio.com/content/privacy-notice

    Time period covered
    2025 - 2029
    Area covered
    United States
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% from 2024 to 2029. Integration of AI and ML technologies with data science platforms will drive the data science platform market.

    Major Market Trends & Insights

    North America dominated the market and accounted for a 48% growth during the forecast period.
    By Deployment - On-premises segment was valued at USD 38.70 million in 2023
    By Component - Platform segment accounted for the largest market revenue share in 2023
    

    Market Size & Forecast

    Market Opportunities: USD 1.00 million
    Market Future Opportunities: USD 763.90 million
    CAGR : 40.2%
    North America: Largest market in 2023
    

    Market Summary

    The market represents a dynamic and continually evolving landscape, underpinned by advancements in core technologies and applications. Key technologies, such as machine learning and artificial intelligence, are increasingly integrated into data science platforms to enhance predictive analytics and automate data processing. Additionally, the emergence of containerization and microservices in data science platforms enables greater flexibility and scalability. However, the market also faces challenges, including data privacy and security risks, which necessitate robust compliance with regulations.
    According to recent estimates, the market is expected to account for over 30% of the overall big data analytics market by 2025, underscoring its growing importance in the data-driven business landscape.
    

    What will be the Size of the Data Science Platform Market during the forecast period?


    How is the Data Science Platform Market Segmented and what are the key trends of market segmentation?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment
      On-premises
      Cloud
    Component
      Platform
      Services
    End-user
      BFSI
      Retail and e-commerce
      Manufacturing
      Media and entertainment
      Others
    Sector
      Large enterprises
      SMEs
    Application
      Data Preparation
      Data Visualization
      Machine Learning
      Predictive Analytics
      Data Governance
      Others
    Geography
      North America
        US
        Canada
      Europe
        France
        Germany
        UK
      Middle East and Africa
        UAE
      APAC
        China
        India
        Japan
      South America
        Brazil
      Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period.

    In this dynamic and evolving market, big data processing is a key focus, enabling advanced model accuracy metrics through various data mining methods. Distributed computing and algorithm optimization are integral components, ensuring efficient handling of large datasets. Data governance policies are crucial for managing data security protocols and ensuring data lineage tracking. Software development kits, model versioning, and anomaly detection systems facilitate seamless development, deployment, and monitoring of predictive modeling techniques, including machine learning algorithms, regression analysis, and statistical modeling. Real-time data streaming and parallelized algorithms enable real-time insights, while predictive modeling techniques and machine learning algorithms drive business intelligence and decision-making.

    Cloud computing infrastructure, data visualization tools, high-performance computing, and database management systems support scalable data solutions and efficient data warehousing. ETL processes and data integration pipelines ensure data quality assessment and feature engineering techniques. Clustering techniques and natural language processing are essential for advanced data analysis. The market is witnessing significant growth, with adoption increasing by 18.7% in the past year, and industry experts anticipate a further expansion of 21.6% in the upcoming period. Companies across various sectors are recognizing the potential of data science platforms, leading to a surge in demand for scalable, secure, and efficient solutions.

    API integration services and deep learning frameworks are gaining traction, offering advanced capabilities and seamless integration with existing systems. Data security protocols and model explainability methods are becoming increasingly important, ensuring transparency and trust in data-driven decision-making. The market is expected to continue unfolding, with ongoing advancements in technology and evolving business needs shaping its future trajectory.


    The On-premises segment was valued at USD 38.70 million in 2019 and showed

  12. Global Clustering Software Market Research Report: By Application (Data...

    • wiseguyreports.com
    Updated Oct 14, 2025
    Cite
    (2025). Global Clustering Software Market Research Report: By Application (Data Mining, Machine Learning, Image Processing, Natural Language Processing), By Deployment Type (On-Premises, Cloud-Based, Hybrid), By End User (BFSI, Healthcare, Retail, Telecommunications), By Organization Size (Small Enterprises, Medium Enterprises, Large Enterprises) and By Regional (North America, Europe, South America, Asia Pacific, Middle East and Africa) - Forecast to 2035 [Dataset]. https://www.wiseguyreports.com/reports/clustering-software-market
    Dataset updated
    Oct 14, 2025
    License

    https://www.wiseguyreports.com/pages/privacy-policy

    Time period covered
    Oct 25, 2025
    Area covered
    Global
    Description
    BASE YEAR: 2024
    HISTORICAL DATA: 2019 - 2023
    REGIONS COVERED: North America, Europe, APAC, South America, MEA
    REPORT COVERAGE: Revenue Forecast, Competitive Landscape, Growth Factors, and Trends
    MARKET SIZE 2024: 2397.5 (USD Million)
    MARKET SIZE 2025: 2538.9 (USD Million)
    MARKET SIZE 2035: 4500.0 (USD Million)
    SEGMENTS COVERED: Application, Deployment Type, End User, Organization Size, Regional
    COUNTRIES COVERED: US, Canada, Germany, UK, France, Russia, Italy, Spain, Rest of Europe, China, India, Japan, South Korea, Malaysia, Thailand, Indonesia, Rest of APAC, Brazil, Mexico, Argentina, Rest of South America, GCC, South Africa, Rest of MEA
    KEY MARKET DYNAMICS: increasing big data adoption, rising demand for advanced analytics, growing need for real-time insights, expansion of cloud computing, integration of AI technologies
    MARKET FORECAST UNITS: USD Million
    KEY COMPANIES PROFILED: Tableau, Qlik, SAS Institute, MathWorks, SAP, Google Cloud, Knime, TIBCO Software, Microsoft, H2O.ai, Alteryx, IBM, AWS, databricks, Oracle, RapidMiner
    MARKET FORECAST PERIOD: 2025 - 2035
    KEY MARKET OPPORTUNITIES: AI-driven data analysis, Cloud-based clustering solutions, Integration with IoT devices, Real-time data processing, Enhanced cybersecurity features
    COMPOUND ANNUAL GROWTH RATE (CAGR): 5.9% (2025 - 2035)

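
The growth rate listed above can be cross-checked against the listed market sizes with the standard compound-growth formula, CAGR = (end / start)^(1 / years) − 1. The sketch below applies it to the 2025 and 2035 figures; the function name `implied_cagr` is a hypothetical helper, and treating 2025 to 2035 as ten compounding periods is an assumption about how the report counts years.

```python
def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate implied by a start value, an end value, and a span in years."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size 2025 and 2035 as listed, in USD million; ten periods assumed.
rate = implied_cagr(2538.9, 4500.0, 2035 - 2025)
print(f"{rate:.1%}")  # 5.9%
```

Under that assumption the implied rate agrees with the listed 5.9% to one decimal place.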
  13. Introduction to Clustering | Cluster Analysis

    • kaggle.com
    zip
    Updated Jul 2, 2018
    Cite
    Data Science (2018). Introduction to Clustering | Cluster Analysis [Dataset]. https://www.kaggle.com/ravali566/introduction-to-clustering-cluster-analysis
    Available download formats: zip (16419686 bytes)
    Dataset updated
    Jul 2, 2018
    Authors
    Data Science
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Data Science

    Released under CC0: Public Domain

    Contents

  14. Working with new clusters vs. usage of same cluster globally 2023

    • statista.com
    Updated Jul 8, 2025
    Cite
    Statista (2025). Working with new clusters vs. usage of same cluster globally 2023 [Dataset]. https://www.statista.com/statistics/1451639/creation-of-new-clusters/
    Dataset updated
    Jul 8, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    In 2023, the majority of respondents worldwide reported that they work without a dedicated cluster, with a share of almost ** percent of those surveyed reporting the same. Only ** percent reported that they create a new cluster for each development task.

  15. Reference list of 265 sources used for the discovery of relationships...

    • doi.pangaea.de
    • search.dataone.org
    Updated Jul 8, 2012
    Cite
    Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer (2012). Reference list of 265 sources used for the discovery of relationships between data clusters and metadata properties [Dataset]. http://doi.org/10.1594/PANGAEA.785666
    Dataset updated
    Jul 8, 2012
    Dataset provided by
    PANGAEA
    Authors
    Jürgen Bernard; Tobias Ruppert; Tobias Schreck; Maximilian Scherer; Jörn Kohlhammer
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Time period covered
    Jan 1, 2006 - Dec 31, 2006
    Description

    Visual cluster analysis provides valuable tools that help analysts understand large data sets in terms of representative clusters and the relationships among them. Often, the clusters found must be understood in the context of associated categorical, numerical, or textual metadata given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during interactive cluster exploration. Traditionally, linked views make it possible to relate (or, loosely speaking, correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships can be a cumbersome and lengthy process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus on categorical metadata, which can be summarized for a cluster in the form of a histogram. We start from a given visual cluster representation and compute measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata. Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst in discovering interesting and visually understandable relationships.
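
The description above does not say which interestingness measures the system uses. As a rough illustration of the idea only, the sketch below scores each cluster by how concentrated its categorical metadata histogram is (one minus the normalized Shannon entropy) and ranks clusters by that score. The score itself, the function names, and the toy labels are all assumptions made for this example, not the authors' method.

```python
import math
from collections import Counter

def category_histogram(labels):
    """Relative frequency of each metadata category within one cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def purity_score(labels, n_categories):
    """Hypothetical interestingness score: 1 - normalized Shannon entropy.

    Close to 1 when one category dominates the cluster,
    close to 0 when the categories are spread uniformly.
    """
    if not labels or n_categories < 2:
        return 0.0
    hist = category_histogram(labels)
    entropy = -sum(p * math.log(p) for p in hist.values() if p > 0)
    return 1.0 - entropy / math.log(n_categories)

def rank_clusters(cluster_metadata):
    """Return (cluster_id, score) pairs, highest score first."""
    all_categories = {c for labels in cluster_metadata.values() for c in labels}
    scores = {cid: purity_score(labels, len(all_categories))
              for cid, labels in cluster_metadata.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy input (made up): cluster id -> one categorical label per cluster member.
clusters = {
    "A": ["ocean", "ocean", "ocean", "ocean"],
    "B": ["ocean", "land", "ocean", "land"],
    "C": ["ocean", "ocean", "ocean", "land"],
}
ranking = rank_clusters(clusters)
```

A ranking like this could drive the highlighting step the description mentions; other scores, such as divergence from the data set's overall category distribution, would slot into `purity_score` in the same way.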

  16. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    + more versions
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprises three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems within the scope of data science. Because its unit of analysis is the customer, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
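
For readers unfamiliar with RFM, the sketch below shows how the three values are typically derived per customer from transaction-style rows. The row layout (customer id, booking date, revenue), the function name, and the sample values are hypothetical and are not taken from this dataset's actual 31 variables.

```python
from datetime import date

def rfm_table(bookings, reference_day):
    """Aggregate (customer_id, booking_date, revenue) rows into RFM values.

    recency:   days between the customer's latest booking and reference_day
    frequency: number of bookings
    monetary:  total revenue
    """
    table = {}
    for customer_id, booking_date, revenue in bookings:
        days_ago = (reference_day - booking_date).days
        row = table.setdefault(
            customer_id,
            {"recency": days_ago, "frequency": 0, "monetary": 0.0},
        )
        row["recency"] = min(row["recency"], days_ago)
        row["frequency"] += 1
        row["monetary"] += revenue
    return table

# Made-up rows for illustration only.
bookings = [
    ("c1", date(2018, 1, 10), 120.0),
    ("c2", date(2018, 6, 1), 80.0),
    ("c1", date(2018, 12, 1), 200.0),
]
table = rfm_table(bookings, reference_day=date(2018, 12, 31))
```

The resulting three-column table is what a segmentation step (for example, quantile binning or K-means) would then take as input.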

  17. K-Mean Clustering Algorithm | Cluster Analysis

    • kaggle.com
    zip
    Updated Jun 4, 2018
    Cite
    Data Science (2018). K-Mean Clustering Algorithm | Cluster Analysis [Dataset]. https://www.kaggle.com/ravali566/kmean-clustering-algorithm-cluster-analysis
    Available download formats: zip (9030596 bytes)
    Dataset updated
    Jun 4, 2018
    Authors
    Data Science
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset

    This dataset was created by Data Science

    Released under CC0: Public Domain


  18. Clustering results of real datasets.

    • plos.figshare.com
    xls
    Updated Jun 5, 2025
    + more versions
    Cite
    Wei Xingqiong; Li Kang (2025). Clustering results of real datasets. [Dataset]. http://doi.org/10.1371/journal.pone.0325161.t004
    Available download formats: xls
    Dataset updated
    Jun 5, 2025
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Wei Xingqiong; Li Kang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Clustering is a fundamental tool in data mining, widely used in various fields such as image segmentation, data science, pattern recognition, and bioinformatics. Density Peak Clustering (DPC) is a density-based method that identifies clusters by calculating the local density of data points and selecting cluster centers based on these densities. However, DPC has several limitations. First, it requires a cutoff distance to calculate local density, and this parameter varies across datasets, which requires manual tuning and affects the algorithm’s performance. Second, the number of cluster centers must be manually specified, as the algorithm cannot automatically determine the optimal number of clusters, making the algorithm dependent on human intervention. To address these issues, we propose an adaptive Density Peak Clustering (DPC) method, which automatically adjusts parameters like cutoff distance and the number of clusters, based on the Delaunay graph. This approach uses the Delaunay graph to calculate the connectivity between data points and prunes the points based on these connections, automatically determining the number of cluster centers. Additionally, by optimizing clustering indices, the algorithm automatically adjusts its parameters, enabling clustering without any manual input. Experimental results on both synthetic and real-world datasets demonstrate that the proposed algorithm outperforms similar methods in terms of both efficiency and clustering accuracy.
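
To make the two manual inputs named above concrete, here is a minimal sketch of the baseline DPC quantities, assuming the commonly described formulation: a cutoff-based local density and the distance to the nearest denser point. It does not implement the adaptive, Delaunay-graph variant this entry proposes. The function names and the toy points are illustrative assumptions.

```python
import math

def density_peaks(points, cutoff):
    """Compute the two baseline DPC quantities for each point.

    rho[i]:   number of other points closer than `cutoff` (local density).
    delta[i]: distance to the nearest point with strictly higher density;
              for a point with no denser neighbor, its largest distance
              to any point.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < cutoff)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

def pick_centers(rho, delta, k):
    """Pick the k points with the largest rho * delta as cluster centers."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i],
                   reverse=True)
    return sorted(order[:k])

# Two made-up square blobs, each with one point in the middle.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
rho, delta = density_peaks(points, cutoff=1.0)   # cutoff chosen by hand
centers = pick_centers(rho, delta, k=2)          # k chosen by hand
```

Both `cutoff` and `k` are supplied by the caller here, which is exactly the dependence on manual tuning that the description says the adaptive method is meant to remove.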

  19. Model-based cluster analysis of microarray gene-expression data

    • catalog.data.gov
    • data.virginia.gov
    • +1 more
    Updated Sep 7, 2025
    Cite
    National Institutes of Health (2025). Model-based cluster analysis of microarray gene-expression data [Dataset]. https://catalog.data.gov/dataset/model-based-cluster-analysis-of-microarray-gene-expression-data
    Dataset updated
    Sep 7, 2025
    Dataset provided by
    National Institutes of Health
    Description

    Background: Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray gene-expression data. However, normal mixture model-based cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster gene-expression patterns but a summary statistic, the t-statistic. Results: The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middle-ear infection. Three clusters were found, two of which contain more than 95% of the genes with almost no altered gene-expression levels, whereas the third one has 30 genes with more or less differential gene-expression levels. Conclusions: Our results indicate that model-based clustering of t-statistics (and possibly other summary statistics) can be a useful statistical tool to exploit differential gene expression for microarray data.
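
As a hedged illustration of clustering a one-dimensional summary statistic with a normal mixture, the sketch below fits a mixture by plain EM and assigns each value to its most probable component. It is a generic textbook EM, not the study's procedure. It uses two components for brevity, whereas the study reports three clusters, and the t-statistics and starting means are made up.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_mixture(values, means, n_iter=100, min_sd=1e-3):
    """Fit a 1-D normal mixture by EM, starting from the given means.

    Returns weights, means, standard deviations, and the responsibilities
    computed in the final E-step.
    """
    k = len(means)
    means = list(means)
    sds = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for x in values:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sds)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            weights[j] = nj / len(values)
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, values)) / nj
            sds[j] = max(math.sqrt(var), min_sd)  # avoid a collapsed component
    return weights, means, sds, resp

# Made-up t-statistics: most genes near zero, a few clearly shifted.
t_stats = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.5, 5.8, 6.1, 6.4]
weights, means, sds, resp = fit_mixture(t_stats, means=[0.0, 5.0])
labels = [max(range(len(means)), key=lambda j: r[j]) for r in resp]
```

In this toy run the component centered away from zero plays the role of the "differentially expressed" cluster; choosing the number of components is a separate model-selection step that the sketch leaves out.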

  20. Unsupervised Learning Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 13, 2025
    Cite
    Archive Market Research (2025). Unsupervised Learning Report [Dataset]. https://www.archivemarketresearch.com/reports/unsupervised-learning-56632
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    Discover the booming unsupervised learning market! Projected at $15 billion in 2025 and growing at a 25% CAGR, this report analyzes market drivers, trends, and key players like Microsoft & Google. Explore regional breakdowns and future forecasts (2025-2033).

