99 datasets found
  1. Data from: Data Nuggets: A Method for Reducing Big Data While Preserving...

    • tandf.figshare.com
    tar
    Updated Jun 11, 2024
    Cite
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler (2024). Data Nuggets: A Method for Reducing Big Data While Preserving Data Structure [Dataset]. http://doi.org/10.6084/m9.figshare.25594361.v1
    Available download formats: tar
    Dataset updated
    Jun 11, 2024
    Dataset provided by
    Taylor & Francis
    Authors
    Traymon E. Beavers; Ge Cheng; Yajie Duan; Javier Cabrera; Mariusz Lubomirski; Dhammika Amaratunga; Jeffrey E. Teigler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Big data, with N × P dimension where N is extremely large, has created new challenges for data analysis, particularly in the realm of creating meaningful clusters of data. Clustering techniques, such as K-means or hierarchical clustering, are popular methods for performing exploratory analysis on large datasets. Unfortunately, these methods are not always possible to apply to big data due to memory or time constraints generated by calculations of order P·N(N−1)/2. To circumvent this problem, typically the clustering technique is applied to a random sample drawn from the dataset; however, a weakness is that the structure of the dataset, particularly at the edges, is not necessarily maintained. We propose a new solution through the concept of “data nuggets”, which reduces a large dataset into a small collection of nuggets of data, each containing a center, weight, and scale parameter. The data nuggets are then input into algorithms that compute methods such as principal components analysis and clustering in a more computationally efficient manner. We show the consistency of the data nuggets based covariance estimator and apply the methodology of data nuggets to perform exploratory analysis of a flow cytometry dataset containing over one million observations using PCA and K-means clustering for weighted observations. Supplementary materials for this article are available online.
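    The idea of summarizing many observations as weighted centers and then clustering the summaries can be illustrated with a minimal sketch. This is not the authors' algorithm: the nugget construction below (nearest randomly sampled seed) and the names `make_nuggets` and `weighted_kmeans` are assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def make_nuggets(X, n_nuggets, rng):
    """Reduce X (N x P) to at most n_nuggets summaries: center, weight, scale."""
    # Assumed simplification: seed with a random sample of rows, assign every
    # observation to its nearest seed, then summarize each group.
    seeds = X[rng.choice(len(X), size=n_nuggets, replace=False)]
    d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
    owner = d2.argmin(axis=1)
    centers, weights, scales = [], [], []
    for j in range(n_nuggets):
        members = X[owner == j]
        if len(members) == 0:
            continue  # skip seeds that attracted no observations
        c = members.mean(axis=0)
        centers.append(c)                      # center: group mean
        weights.append(len(members))           # weight: group size
        scales.append(((members - c) ** 2).sum(axis=1).mean())  # mean sq. distance
    return np.array(centers), np.array(weights, dtype=float), np.array(scales)


def weighted_kmeans(centers, weights, k, rng, n_iter=50):
    """Lloyd-style K-means where each nugget counts in proportion to its weight."""
    means = centers[rng.choice(len(centers), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((centers[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                means[j] = np.average(centers[mask], axis=0, weights=weights[mask])
    return means, labels


# Synthetic demo: three Gaussian blobs in two dimensions.
rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([m + rng.normal(size=(2000, 2)) for m in true_means])

centers, weights, scales = make_nuggets(X, n_nuggets=100, rng=rng)
means, labels = weighted_kmeans(centers, weights, k=3, rng=rng)
print(len(X), "points ->", len(centers), "nuggets; total weight", weights.sum())
```

    The clustering step here works on about 100 weighted centers rather than 6,000 raw points, which is the kind of saving the description refers to; how well edge structure is preserved depends on how the nuggets are built, which this sketch does not attempt to reproduce.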

  2. Cluster Analysis Software Market Report | Global Forecast From 2025 To 2033

    • dataintelo.com
    csv, pdf, pptx
    Updated Jan 7, 2025
    Cite
    Dataintelo (2025). Cluster Analysis Software Market Report | Global Forecast From 2025 To 2033 [Dataset]. https://dataintelo.com/report/cluster-analysis-software-market
    Available download formats: pptx, csv, pdf
    Dataset updated
    Jan 7, 2025
    Dataset authored and provided by
    Dataintelo
    License

    https://dataintelo.com/privacy-and-policy

    Time period covered
    2024 - 2032
    Area covered
    Global
    Description

    Cluster Analysis Software Market Outlook



    The global Cluster Analysis Software market size was estimated to be USD 1.5 billion in 2023 and is projected to reach USD 3.8 billion by 2032, growing at a CAGR of 11.2% during the forecast period. The rapid adoption of data-driven decision-making processes, increasing volumes of data, and the necessity for advanced analytical tools are significantly driving this growth in market size.
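    The relationship between a start value, an end value, and a compound annual growth rate can be checked with a few lines of arithmetic. The sketch below uses the figures quoted above; treating 2023 as the base year and counting nine compounding periods to 2032 is an assumption, since the listing does not state how the rate was derived.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value after compounding `rate` annually for `years` periods."""
    return start_value * (1.0 + rate) ** years


# Figures from the listing (USD billion); the period count is an assumption.
periods = 2032 - 2023
implied_rate = cagr(1.5, 3.8, periods)
projected_value = project(1.5, 0.112, periods)

print(f"implied CAGR over {periods} periods: {implied_rate:.2%}")
print(f"1.5 compounded at 11.2% for {periods} periods: {projected_value:.2f}")
```

    Counting a different number of periods (for example, starting from 2024) changes the implied rate, which is one reason quoted CAGR figures and endpoint values in such reports do not always reconcile exactly.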



    One of the primary growth factors for the Cluster Analysis Software market is the exponential increase in data generation across various industries. Businesses are increasingly recognizing the value of data analytics in extracting actionable insights to drive strategic decisions. This reliance on data has led to the growing adoption of cluster analysis software, which helps organizations categorize and interpret complex datasets efficiently. With the proliferation of IoT devices, social media interactions, and digital transactions, the volume of data is expected to continue its upward trajectory, thereby boosting the demand for such advanced analytical tools.



    Another key driver is the technological advancements in artificial intelligence and machine learning. These technologies have enhanced the capabilities of cluster analysis software, making them more efficient, accurate, and user-friendly. The integration of AI and ML algorithms allows for more sophisticated data clustering, enabling businesses to identify patterns and trends that were previously undetectable. As these technologies continue to evolve, the software is expected to become even more powerful, further fueling market growth.



    The increasing need for personalized customer experiences is also contributing to the market expansion. Retail and e-commerce sectors, in particular, are leveraging cluster analysis software to understand consumer behavior, preferences, and purchasing patterns. This enables them to tailor their marketing strategies, improve customer engagement, and boost sales. Similarly, the healthcare industry is utilizing these tools to enhance patient care by identifying disease patterns, predicting outbreaks, and optimizing treatment plans.



    In the realm of data analytics, High Availability Cluster Software plays a pivotal role in ensuring that critical applications remain operational and accessible, even in the event of hardware failures or other disruptions. This type of software is designed to manage a group of interconnected computers that work together to maintain high levels of uptime and reliability. By distributing workloads across multiple servers, High Availability Cluster Software minimizes the risk of downtime, which is crucial for businesses that rely heavily on real-time data processing and analysis. As organizations increasingly depend on data-driven insights to make strategic decisions, the demand for robust and resilient cluster solutions is on the rise. This trend is particularly evident in industries such as finance, healthcare, and e-commerce, where uninterrupted access to data is essential for maintaining competitive advantage.



    Regionally, North America holds the largest share of the Cluster Analysis Software market, driven by the presence of major technology companies and extensive adoption of advanced analytics across various industries. However, the Asia Pacific region is expected to witness the highest growth rate during the forecast period. The rapid digital transformation, increasing investments in big data and analytics, and the growing number of SMEs adopting these solutions are key factors contributing to this growth. Europe, Latin America, and the Middle East & Africa also show promising potential, albeit at a comparatively moderate growth pace.



    Component Analysis



    The Cluster Analysis Software market can be segmented into Software and Services. The Software segment encompasses various types of cluster analysis tools and platforms that organizations use to analyze large datasets. This segment is expected to dominate the market during the forecast period due to the increasing need for advanced analytics and data-driven decision-making processes. The software solutions are continuously evolving, offering more sophisticated features such as real-time data processing, AI integration, and improved user interfaces. As businesses strive to harness the full potential of their data, the demand for these advanced software solutions is projected to grow significantly.




  3. Bigdata with Ground Truth 4 K-Means Clustering

    • kaggle.com
    Updated Nov 7, 2024
    Cite
    Eihab SaatiAlSoruji (2024). Bigdata with Ground Truth 4 K-Means Clustering [Dataset]. https://www.kaggle.com/datasets/eihabsaatialsoruji/bigdata-with-ground-truth-4-k-means-clustering
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 7, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Eihab SaatiAlSoruji
    Description

    This dataset is used in the research entitled "Review on Designing High-Performance K-Means Clustering for Big Data Processing," which investigates big data clustering using various parallel K-means techniques. The dataset includes four sub-datasets, each representing a different scenario. Each scenario demonstrates a distinct distribution of data points within a 2-dimensional feature space, including the ground truth. Furthermore, each scenario contains four data files with varying sizes of data points that follow the same distribution: 100K, 1M, 4M, and 32M data points (where M = million, K = thousand). The figures provided in the scenarios illustrate sample data point distributions.
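    A dataset that ships ground-truth labels lets a clustering result be scored directly. The sketch below shows one common score, purity, on synthetic 2-dimensional points; it does not read the actual files (their layout is not described here), and the plain serial `kmeans` helper is an assumption standing in for the parallel variants the paper reviews.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=100):
    """Plain serial Lloyd iteration; returns one cluster label per row of X."""
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_means = means.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():
                new_means[j] = X[mask].mean(axis=0)
        if np.allclose(new_means, means):
            break  # assignments have stabilized
        means = new_means
    return labels


def purity(labels, truth):
    """Fraction of points that fall in the majority true class of their cluster."""
    matched = 0
    for j in np.unique(labels):
        matched += np.bincount(truth[labels == j]).max()
    return matched / len(truth)


# Synthetic stand-in for one scenario: 2-D points with known class labels.
rng = np.random.default_rng(1)
true_means = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
truth = np.repeat(np.arange(3), 1000)
X = true_means[truth] + rng.normal(size=(len(truth), 2))

labels = kmeans(X, k=3, rng=rng)
score = purity(labels, truth)
print(f"purity against ground truth: {score:.3f}")
```

    Purity is invariant to how cluster indices are numbered, so no label matching step is needed before comparing against the ground truth.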

    Using this dataset is permitted when citing the previously mentioned paper after publication.

  4. Clustering Software Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated May 15, 2025
    Cite
    Data Insights Market (2025). Clustering Software Report [Dataset]. https://www.datainsightsmarket.com/reports/clustering-software-1978471
    Available download formats: doc, pdf, ppt
    Dataset updated
    May 15, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global clustering software market is experiencing robust growth, driven by the increasing adoption of big data analytics and the need for efficient data management across various industries. The market, estimated at $10 billion in 2025, is projected to maintain a healthy Compound Annual Growth Rate (CAGR) of 15% throughout the forecast period (2025-2033). This expansion is fueled by several key factors. The rising volume and complexity of data necessitate sophisticated clustering algorithms to extract meaningful insights, a crucial requirement for businesses aiming to improve operational efficiency, enhance customer experiences, and gain a competitive edge. Furthermore, the increasing adoption of cloud-based solutions and advancements in machine learning algorithms are accelerating the market's growth. Small and medium-sized businesses (SMBs) are increasingly adopting clustering software to streamline operations and leverage data-driven decision-making, while large enterprises are deploying it for complex analytics initiatives like fraud detection and customer segmentation. The prevalence of Windows operating systems in the enterprise sector continues to drive demand, although Linux and Unix-based solutions are gaining traction due to their scalability and cost-effectiveness. However, the market faces certain restraints, including the high initial investment costs associated with implementing and maintaining clustering software and the need for specialized technical expertise. Despite these challenges, the long-term outlook for the clustering software market remains highly promising, with continuous innovation in algorithm development and software integration expected to drive sustained growth. The competitive landscape is characterized by a mix of established players and emerging technology firms. 
Key players like HP, IBM, Microsoft, Oracle, and VMware dominate the market, leveraging their existing infrastructure and expertise to offer comprehensive clustering solutions. However, specialized companies and open-source initiatives are also contributing significantly to innovation and providing cost-effective alternatives. Regional variations exist, with North America and Europe currently holding the largest market share due to high technological adoption rates and established IT infrastructure. However, rapid digitalization in the Asia-Pacific region, particularly in countries like China and India, is expected to fuel significant market growth in the coming years. The market segmentation by application (SMBs, Enterprises) and operating system (Windows, Linux, Unix) allows for targeted product development and marketing strategies, facilitating sustained growth within specific niches. Future growth will depend on the successful integration of clustering software with other advanced analytics technologies, such as artificial intelligence and deep learning.

  5. MOESM1 of Limited random walk algorithm for big graph data clustering

    • springernature.figshare.com
    zip
    Updated Jun 1, 2023
    Cite
    Honglei Zhang; Jenni Raitoharju; Serkan Kiranyaz; Moncef Gabbouj (2023). MOESM1 of Limited random walk algorithm for big graph data clustering [Dataset]. http://doi.org/10.6084/m9.figshare.c.3696874_D1.v1
    Available download formats: zip
    Dataset updated
    Jun 1, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Honglei Zhang; Jenni Raitoharju; Serkan Kiranyaz; Moncef Gabbouj
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Additional file 1. Clustering results on graphs used in the experiments of various methods.

  6. Cluster Analysis Software Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 15, 2025
    Cite
    Archive Market Research (2025). Cluster Analysis Software Report [Dataset]. https://www.archivemarketresearch.com/reports/cluster-analysis-software-59553
    Available download formats: pdf, doc, ppt
    Dataset updated
    Mar 15, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The global market for Cluster Analysis Software is experiencing robust growth, driven by the increasing adoption of big data analytics and the need for advanced data interpretation across diverse sectors. While precise market sizing data is unavailable, considering the growth observed in related fields like data analytics and AI, a reasonable estimate for the 2025 market size could be placed between $2.5 billion and $3 billion. This estimate assumes a moderate growth trajectory reflecting the maturation of the cluster analysis market and the ongoing integration of these tools into broader business intelligence platforms. Assuming a Compound Annual Growth Rate (CAGR) of 15% for the forecast period (2025-2033), the market is projected to reach a substantial size within the next decade. This growth is fueled by several key drivers, including the expanding availability of large datasets, the growing demand for data-driven decision-making across industries like BFSI (Banking, Financial Services, and Insurance), government, and commercial sectors, and the continuous development of more sophisticated algorithms and user-friendly interfaces for cluster analysis software. The cloud-based segment is expected to dominate, given its scalability and accessibility benefits, although web-based applications will continue to hold a significant market share. Geographic growth will be diverse, with North America and Europe maintaining strong positions due to advanced analytics adoption, but significant expansion is also expected in the Asia-Pacific region as technological advancement and data infrastructure improve. However, challenges like data privacy concerns, the need for skilled professionals, and the high cost of advanced software solutions could act as market restraints in certain regions. 
The competitive landscape is marked by a mix of established players such as IBM, Microsoft, and TIBCO Software, along with a growing number of specialized vendors and emerging technology companies. The market is characterized by ongoing innovation in areas like algorithm development, enhanced visualization capabilities, and the integration of cluster analysis with other advanced analytics tools. This continuous innovation will be a key driver in sustaining the market's high CAGR and ensuring its continued growth in the coming years. Increased focus on providing tailored solutions for specific industry verticals will likely be a strategic advantage for vendors seeking a competitive edge. The market's future hinges on its ability to effectively address the challenges of data complexity, security, and user-friendliness while continuing to deliver accurate and actionable insights.

  7. Data_Sheet_1_Qluster: An easy-to-implement generic workflow for robust...

    • figshare.com
    xlsx
    Updated May 31, 2023
    Cite
    Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker (2023). Data_Sheet_1_Qluster: An easy-to-implement generic workflow for robust clustering of health data.xlsx [Dataset]. http://doi.org/10.3389/frai.2022.1055294.s001
    Available download formats: xlsx
    Dataset updated
    May 31, 2023
    Dataset provided by
    Frontiers
    Authors
    Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The exploration of health data by clustering algorithms makes it possible to better describe the populations of interest by seeking the sub-profiles that compose them. This therefore reinforces medical knowledge, whether it is about a disease or a targeted population in real life. Nevertheless, contrary to the so-called conventional biostatistical methods where numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in a significant variability in the execution of data science projects, whether in terms of algorithms used, reliability and credibility of the designed approach. Taking the path of parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. Indeed, this workflow makes a compromise between (1) genericity of applications (e.g. usable on small or big data, on continuous, categorical or mixed variables, on database of high-dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). This workflow can be easily automated and/or routinely applied on a wide range of clustering projects. It can be useful both for data scientists with little experience in the field to make data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists.
An implementation on the Dataiku platform is available upon request to the authors.

  8. Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North...

    • technavio.com
    Updated Feb 15, 2025
    Cite
    Technavio (2025). Data Science Platform Market Analysis, Size, and Forecast 2025-2029: North America (US and Canada), Europe (France, Germany, UK), APAC (China, India, Japan), South America (Brazil), and Middle East and Africa (UAE) [Dataset]. https://www.technavio.com/report/data-science-platform-market-industry-analysis
    Dataset updated
    Feb 15, 2025
    Dataset provided by
    TechNavio
    Authors
    Technavio
    Time period covered
    2021 - 2025
    Area covered
    Canada, United States, Global
    Description


    Data Science Platform Market Size 2025-2029

    The data science platform market size is forecast to increase by USD 763.9 million, at a CAGR of 40.2% between 2024 and 2029.

    The market is experiencing significant growth, driven by the increasing integration of Artificial Intelligence (AI) and Machine Learning (ML) technologies. This fusion enables organizations to derive deeper insights from their data, fueling business innovation and decision-making. Another trend shaping the market is the emergence of containerization and microservices in data science platforms. This approach offers enhanced flexibility, scalability, and efficiency, making it an attractive choice for businesses seeking to streamline their data science operations. However, the market also faces challenges. Data privacy and security remain critical concerns, with the increasing volume and complexity of data posing significant risks. Ensuring robust data security and privacy measures is essential for companies to maintain customer trust and comply with regulatory requirements. Additionally, managing the complexity of data science platforms and ensuring seamless integration with existing systems can be a daunting task, requiring significant investment in resources and expertise. Companies must navigate these challenges effectively to capitalize on the market's opportunities and stay competitive in the rapidly evolving data landscape.

    What will be the Size of the Data Science Platform Market during the forecast period?

    Explore in-depth regional segment analysis with market size data - historical 2019-2023 and forecasts 2025-2029 - in the full report.
    The market continues to evolve, driven by the increasing demand for advanced analytics and artificial intelligence solutions across various sectors. Real-time analytics and classification models are at the forefront of this evolution, with API integrations enabling seamless implementation. Deep learning and model deployment are crucial components, powering applications such as fraud detection and customer segmentation. Data science platforms provide essential tools for data cleaning and data transformation, ensuring data integrity for big data analytics. Feature engineering and data visualization facilitate model training and evaluation, while data security and data governance ensure data privacy and compliance. Machine learning algorithms, including regression models and clustering models, are integral to predictive modeling and anomaly detection. Statistical analysis and time series analysis provide valuable insights, while ETL processes streamline data integration. Cloud computing enables scalability and cost savings, while risk management and algorithm selection optimize model performance. Natural language processing and sentiment analysis offer new opportunities for data storytelling and computer vision. Supply chain optimization and recommendation engines are among the latest applications of data science platforms, demonstrating their versatility and continuous value proposition. Data mining and data warehousing provide the foundation for these advanced analytics capabilities.

    How is this Data Science Platform Industry segmented?

    The data science platform industry research report provides comprehensive data (region-wise segment analysis), with forecasts and estimates in 'USD million' for the period 2025-2029, as well as historical data from 2019-2023 for the following segments.

    Deployment: On-premises; Cloud
    Component: Platform; Services
    End-user: BFSI; Retail and e-commerce; Manufacturing; Media and entertainment; Others
    Sector: Large enterprises; SMEs
    Application: Data Preparation; Data Visualization; Machine Learning; Predictive Analytics; Data Governance; Others
    Geography: North America (US, Canada); Europe (France, Germany, UK); Middle East and Africa (UAE); APAC (China, India, Japan); South America (Brazil); Rest of World (ROW)

    By Deployment Insights

    The on-premises segment is estimated to witness significant growth during the forecast period. In this dynamic market, businesses increasingly adopt solutions to gain real-time insights from their data, enabling them to make informed decisions. Classification models and deep learning algorithms are integral parts of these platforms, providing capabilities for fraud detection, customer segmentation, and predictive modeling. API integrations facilitate seamless data exchange between systems, while data security measures ensure the protection of valuable business information. Big data analytics and feature engineering are essential for deriving meaningful insights from vast datasets. Data transformation, data mining, and statistical analysis are crucial processes in data preparation and discovery. Machine learning models, including regression and clustering, are employed for model training and evaluation. Time series analysis and natural language processing are valuable tools for understanding trends and customer sen

  9. Unsupervised Learning Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 13, 2025
    Cite
    Archive Market Research (2025). Unsupervised Learning Report [Dataset]. https://www.archivemarketresearch.com/reports/unsupervised-learning-56632
    Available download formats: doc, ppt, pdf
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The unsupervised learning market is experiencing robust growth, driven by the increasing need for businesses to extract meaningful insights from large, unstructured datasets. This market is projected to be valued at approximately $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of big data and the need for efficient data analysis are primary drivers. Businesses across various sectors, including finance, healthcare, and retail, are increasingly adopting unsupervised learning techniques like clustering and anomaly detection to identify patterns, predict customer behavior, and optimize operational efficiency. Furthermore, advancements in machine learning algorithms, improved computational power, and the availability of cloud-based solutions are further accelerating market growth. The segment dominated by cloud-based solutions is growing faster than the on-premise segment, reflecting a broader industry shift toward cloud computing and its scalability advantages. Large enterprises represent a significant portion of the market, owing to their greater resources and willingness to invest in sophisticated analytics capabilities. However, challenges remain, including the complexity of implementing and interpreting unsupervised learning models, the need for specialized expertise, and concerns regarding data privacy and security. Despite these challenges, the long-term outlook for the unsupervised learning market remains positive. The continuous evolution of machine learning algorithms and the increasing availability of user-friendly tools are expected to lower the barrier to entry for businesses of all sizes. Furthermore, the growing adoption of artificial intelligence (AI) across various industries will further fuel demand for unsupervised learning solutions. 
The market is witnessing considerable geographic expansion, with North America currently holding a significant market share due to the presence of major technology companies and a well-established IT infrastructure. However, other regions, particularly Asia-Pacific, are also witnessing substantial growth, driven by rapid digitalization and increasing investment in data analytics. Competition in the market is intense, with established players like Microsoft, IBM, and Google vying for market share alongside specialized vendors like RapidMiner and H2o.ai. The continued innovation and development of advanced algorithms and platforms will shape the competitive landscape in the coming years.

  10. Working with new clusters vs. usage of same cluster globally 2023

    • statista.com
    Updated Feb 11, 2025
    Cite
    Statista (2025). Working with new clusters vs. usage of same cluster globally 2023 [Dataset]. https://www.statista.com/statistics/1451639/creation-of-new-clusters/
    Dataset updated
    Feb 11, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Time period covered
    2023
    Area covered
    Worldwide
    Description

    In 2023, the largest share of respondents worldwide, almost 50 percent, reported that they work without a dedicated cluster. Only 30 percent reported that they create a new cluster for each development task.

  11. Data Buffalo Toraja

    • data.mendeley.com
    Updated May 16, 2025
    Cite
    Abdul Rachman Manga (2025). Data Buffalo Toraja [Dataset]. http://doi.org/10.17632/kbft73pdkw.2
    Dataset updated
    May 16, 2025
    Authors
    Abdul Rachman Manga
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    This data was collected on site in the Toraja area using a digital camera, shooting video from a minimum distance of 3 m; the resulting footage is divided into frames.

  12. Reference list of 265 sources used for the discovery of relationships...

    • search.dataone.org
    • doi.pangaea.de
    Updated Feb 28, 2018
    Cite
    Bernard, Jürgen; Ruppert, Tobias; Scherer, Maximilian; Schreck, Tobias; Kohlhammer, Jörn (2018). Reference list of 265 sources used for the discovery of relationships between data clusters and metadata properties [Dataset]. http://doi.org/10.1594/PANGAEA.785666
    Dataset updated
    Feb 28, 2018
    Dataset provided by
    PANGAEA Data Publisher for Earth and Environmental Science
    Authors
    Bernard, Jürgen; Ruppert, Tobias; Scherer, Maximilian; Schreck, Tobias; Kohlhammer, Jörn
    Time period covered
    Jan 1, 1992 - Jun 30, 2016
    Description

    Visual cluster analysis provides valuable tools that help analysts to understand large data sets in terms of representative clusters and relationships thereof. Often, the found clusters are to be understood in the context of associated categorical, numerical or textual metadata which are given for the data elements. While often not part of the clustering process, such metadata play an important role and need to be considered during the interactive cluster exploration process. Traditionally, linked views allow analysts to relate (or loosely speaking: correlate) clusters with metadata or other properties of the underlying cluster data. Manually inspecting the distribution of metadata for each cluster in a linked-view approach is tedious, especially for large data sets, where a large search problem arises. Fully interactive search for potentially useful or interesting cluster-to-metadata relationships may constitute a cumbersome and long process. To remedy this problem, we propose a novel approach for guiding users in discovering interesting relationships between clusters and associated metadata. Its goal is to guide the analyst through the potentially huge search space. We focus in our work on metadata of categorical type, which can be summarized for a cluster in form of a histogram. We start from a given visual cluster representation, and compute certain measures of interestingness defined on the distribution of metadata categories for the clusters. These measures are used to automatically score and rank the clusters for potential interestingness regarding the distribution of categorical metadata. Identified interesting relationships are highlighted in the visual cluster representation for easy inspection by the user. We present a system implementing an encompassing, yet extensible, set of interestingness scores for categorical metadata, which can also be extended to numerical metadata.
Appropriate visual representations are provided for showing the visual correlations, as well as the calculated ranking scores. Focusing on clusters of time series data, we test our approach on a large real-world data set of time-oriented scientific research data, demonstrating how specific interesting views are automatically identified, supporting the analyst discovering interesting and visually understandable relationships.

  13. c

    Data from: A Generic Local Algorithm for Mining Data Streams in Large...

    • s.cnmilf.com
    • datasets.ai
    • +3more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://s.cnmilf.com/user74170196/https/catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval, and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up to date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

  14. A

    Advanced Analytics Enablement Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Apr 30, 2025
    Cite
    Data Insights Market (2025). Advanced Analytics Enablement Report [Dataset]. https://www.datainsightsmarket.com/reports/advanced-analytics-enablement-505993
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Apr 30, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Advanced Analytics Enablement market is experiencing robust growth, driven by the increasing adoption of data-driven decision-making across industries. The market, estimated at $150 billion in 2025, is projected to exhibit a Compound Annual Growth Rate (CAGR) of 15% from 2025 to 2033, reaching a significant market value by 2033. This expansion is fueled by several key factors. The proliferation of big data and the need for organizations, both SMEs and large enterprises, to extract actionable insights are major drivers. Furthermore, advancements in predictive analytics, clustering algorithms, and sophisticated statistical features are enhancing the capabilities of advanced analytics platforms, making them more accessible and effective. Growing demand for improved operational efficiency, risk mitigation, and enhanced customer experiences are further bolstering market growth. While data security concerns and the need for skilled professionals represent potential restraints, the overall market outlook remains positive. Segmentation analysis reveals a significant demand across diverse applications, with both SMEs and large enterprises actively adopting advanced analytics solutions. Predictive analytics currently holds the largest segment share, reflecting its critical role in forecasting and strategic planning. However, the adoption of other segments like clustering, calculations and statistical features is rapidly growing as organizations seek more comprehensive data analysis capabilities. Geographically, North America currently dominates the market due to early adoption and a well-established technological infrastructure. However, Asia-Pacific is expected to witness the fastest growth rate during the forecast period driven by increasing digitalization and economic growth in countries like China and India. 
The competitive landscape comprises a mix of established players like IBM, Amazon Web Services, and Deloitte, along with specialized analytics firms and emerging technology providers. This competitive dynamic will likely fuel innovation and drive further market expansion.

  15. C

    Clustering Software Market Report

    • promarketreports.com
    doc, pdf, ppt
    Updated Feb 21, 2025
    Cite
    Pro Market Reports (2025). Clustering Software Market Report [Dataset]. https://www.promarketreports.com/reports/clustering-software-market-18406
    Explore at:
    Available download formats: ppt, doc, pdf
    Dataset updated
    Feb 21, 2025
    Dataset authored and provided by
    Pro Market Reports
    License

    https://www.promarketreports.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The clustering software market is projected to grow from USD 4.62 billion in 2025 to USD 13.42 billion by 2033, at a CAGR of 14.39% from 2025 to 2033. The growth of the market is attributed to the increasing adoption of big data analytics, the need for effective data management, and the growing demand for personalized marketing and customer segmentation. Key drivers of the market include the increasing adoption of big data analytics, the need for effective data management, and the growing demand for personalized marketing and customer segmentation. Key trends in the market include the rise of self-service clustering solutions, the increasing popularity of cloud-based deployment models, and the growing adoption of clustering software in various industry verticals. Key restraints in the market include the lack of skilled professionals, the high cost of implementation, and the complexity of data integration. Key segments of the market include solution type, deployment type, and industry vertical. Key companies in the market include Informatica Corporation, Splunk Inc., Oracle Corporation, Google LLC, SAP SE, SAS Institute Inc., Micro Focus International plc, Alteryx Inc., Tibco Software Inc., RapidMiner Inc., Amazon Web Services Inc., Microsoft Corporation, IBM Corporation, Qubole Inc., and Teradata Corporation. The global clustering software market is poised to witness significant growth in the coming years, driven by the increasing adoption of advanced analytics and data-driven decision-making. The market was valued at USD 2.5 billion in 2022 and is projected to reach USD 7.2 billion by 2029, exhibiting a CAGR of 15.2% during the forecast period. Key drivers for this market are: growth in big data analytics; increasing demand for customer segmentation; rise in cloud computing; advancements in artificial intelligence; adoption in the healthcare sector. Potential restraints include: rising adoption of cloud-based analytics; growing demand for personalized recommendations; advances in machine learning and AI; increasing adoption of data science techniques; growing focus on data security and compliance.

  16. e

    Clustering

    • paper.erudition.co.in
    html
    Updated Nov 11, 2023
    Cite
    Einetic (2025). Clustering [Dataset]. https://paper.erudition.co.in/makaut/master-of-computer-applications-2-years/3/basic-data-science
    Explore at:
    Available download formats: html
    Dataset updated
    Nov 11, 2023
    Dataset authored and provided by
    Einetic
    License

    https://paper.erudition.co.in/terms

    Description

    Question paper solutions for the chapter "Clustering" of Basic Data Science, 3rd Semester, Master of Computer Applications (2 Years)

  17. Data from: Adaptive weighted multi-view subspace clustering method for...

    • figshare.com
    zip
    Updated Apr 10, 2024
    Cite
    Qiliang Liu; Zexin Lu; Weihua Huan; Chong Fan (2024). Adaptive weighted multi-view subspace clustering method for recognizing urban functions from multi-source social sensing data [Dataset]. http://doi.org/10.6084/m9.figshare.24115734.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Apr 10, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Qiliang Liu; Zexin Lu; Weihua Huan; Chong Fan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Although multi-view clustering has been successfully used to fuse multi-source social sensing data, the adaptive determination of fusion weights for high-dimensional and noisy multi-source social sensing data remains challenging. Therefore, we propose an adaptive weighted multi-view subspace clustering (AWMSC) method. Firstly, we use two neural networks to map multi-source data into a common latent representation and multiple specific latent representations, which serve as the query vector and input vectors of the attention mechanism, respectively. Then, the weight of each type of data is calculated based on the attention mechanism. Finally, the specific latent representations of the multi-source data are weighted and fused into a shared subspace representation, which is used as the input of the spectral clustering algorithm to obtain clustering results. AWMSC is applied to identify urban functional zones in Beijing using bus transactions, taxi trajectories, and points of interest datasets. The results show that AWMSC outperforms the typical single-view, weighted-average, and representative multi-view methods. AWMSC can obtain a comprehensive understanding of urban functional zones which may help government departments make more accurate strategic decisions.

  18. m

    Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the shortage of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems within the scope of data science. Because its unit of analysis is the customer, the dataset is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
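    As a rough illustration of the RFM idea mentioned in the description, the sketch below derives recency, frequency, and monetary value per customer. The record layout, the sample bookings, and the `rfm_table` helper are all hypothetical; the dataset's actual variable names are not given here, so this is not the author's method.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (customer_id, booking_date, amount).
# These values are invented for illustration and do not come from the dataset.
BOOKINGS = [
    ("c1", date(2018, 1, 10), 200.0),
    ("c1", date(2018, 6, 5), 150.0),
    ("c2", date(2017, 3, 20), 90.0),
    ("c3", date(2018, 12, 1), 400.0),
]


def rfm_table(bookings, as_of):
    """Map each customer to (recency_days, frequency, monetary)."""
    last_seen = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in bookings:
        # Recency is measured from the most recent booking.
        last_seen[customer] = max(day, last_seen.get(customer, day))
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        customer: ((as_of - last_seen[customer]).days,
                   frequency[customer],
                   monetary[customer])
        for customer in last_seen
    }


table = rfm_table(BOOKINGS, as_of=date(2018, 12, 31))
for customer, (recency, freq, money) in sorted(table.items()):
    print(customer, recency, freq, money)
```

    The three values per customer could then be binned or passed to a clustering method to form segments; that step is omitted here.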

  19. f

    Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means...

    • plos.figshare.com
    tiff
    Updated May 31, 2023
    Cite
    Yaofang Xu; Jiayi Wu; Chang-Cheng Yin; Youdong Mao (2023). Unsupervised Cryo-EM Data Clustering through Adaptively Constrained K-Means Algorithm [Dataset]. http://doi.org/10.1371/journal.pone.0167765
    Explore at:
    Available download formats: tiff
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Yaofang Xu; Jiayi Wu; Chang-Cheng Yin; Youdong Mao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In single-particle cryo-electron microscopy (cryo-EM), the K-means clustering algorithm is widely used in unsupervised 2D classification of projection images of biological macromolecules. 3D ab initio reconstruction requires accurate unsupervised classification in order to separate molecular projections of distinct orientations. Due to background noise in single-particle images and uncertainty of molecular orientations, the traditional K-means clustering algorithm may classify images into wrong classes and produce classes with a large variation in membership. Overcoming these limitations requires further development of clustering algorithms for cryo-EM data analysis. We propose a novel unsupervised data clustering method building upon the traditional K-means algorithm. By introducing an adaptive constraint term in the objective function, our algorithm not only avoids a large variation in class sizes but also produces more accurate data clustering. Applications of this approach to both simulated and experimental cryo-EM data demonstrate that our algorithm is a significantly improved alternative to the traditional K-means algorithm in single-particle cryo-EM analysis.

  20. o

    Data Science Career Opportunities (USA)

    • opendatabay.com
    Updated Jul 3, 2025
    Cite
    Datasimple (2025). Data Science Career Opportunities (USA) [Dataset]. https://www.opendatabay.com/data/ai-ml/6d1c5965-8fb2-4749-a8bd-f1c40861b401
    Explore at:
    Dataset updated
    Jul 3, 2025
    Dataset authored and provided by
    Datasimple
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Area covered
    United States, Data Science and Analytics
    Description

    This dataset provides valuable insights into the US data science job market, containing detailed job listings scraped from the Indeed web portal on 20th November 2022. It is ideal for those seeking to understand job trends, analyse salary expectations, or develop skills in data analysis, machine learning, and natural language processing. The dataset's purpose is to offer a snapshot of available positions across various data science roles, including data scientists, machine learning engineers, and business analysts. It serves as a rich resource for exploratory data analysis, feature engineering, and predictive modelling tasks.

    Columns

    • Title: The job title of the listed position.
    • Company: The hiring company posting the job.
    • Location: The geographic location of the job within the US.
    • Rating: The rating associated with the job or company.
    • Date: Indicates how long the job had been posted prior to 20th November 2022.
    • Salary: The salary information provided in US Dollars ($). Please note that many entries in this column may be missing as salary details are often not disclosed in job listings.
    • Description: A brief summary description of the job.
    • Links: The direct link to the original job posting on the Indeed platform.
    • Descriptions: The full-length description of the job, encompassing all details found in the complete job posting.

    Distribution

    This dataset is provided as a single data file, typically in CSV format. It comprises 1200 rows (records) and 9 distinct columns. The file name is data_science_jobs_indeed_us.csv.

    Usage

    This dataset is perfectly suited for a variety of analytical tasks and applications:

    • Data Cleaning and Preparation: Practise handling missing values, especially in the 'Salary' column.
    • Exploratory Data Analysis (EDA): Discover trends in job titles, company types, and locations.
    • Feature Engineering: Extract new features from the 'Descriptions' column, such as required skills, education levels, or experience.
    • Classification and Clustering: Develop models for salary prediction, or perform skill clustering analysis to guide curriculum development.
    • Text Processing and Natural Language Processing (NLP): Analyse job descriptions to identify common skill demands or industry buzzwords.
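    The data-cleaning task listed under Usage can be sketched as follows. The column names come from the Columns section; the three sample rows and the `missing_share` helper are hypothetical and stand in for the real file (data_science_jobs_indeed_us.csv), which is assumed to have the same header.

```python
import csv
import io

# Hypothetical sample rows; only the header mirrors the documented columns.
SAMPLE_CSV = """Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
Data Scientist,Acme,"New York, NY",4.1,3 days ago,"$120,000 a year",Short text,https://example.com/1,Full text
ML Engineer,Globex,Remote,3.8,10 days ago,,Short text,https://example.com/2,Full text
Business Analyst,Initech,"Austin, TX",,1 day ago,,Short text,https://example.com/3,Full text
"""


def missing_share(rows, column):
    """Return the fraction of rows whose value in `column` is empty or absent."""
    rows = list(rows)
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if not (row.get(column) or "").strip())
    return missing / len(rows)


# To read the real file instead, open it with open(path, newline="")
# and pass the file object to csv.DictReader.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
salary_missing = missing_share(rows, "Salary")
print(f"{salary_missing:.0%} of sample rows have no salary")
```

    Measuring the share of missing values first helps decide whether to drop, impute, or model around the 'Salary' column before any prediction task.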

    Coverage

    The dataset's geographic scope is limited to job postings within the United States. All data was collected on 20th November 2022, with the 'Date' column providing information on how long each job had been active before this date. The dataset covers a wide range of data science positions, including roles such as data scientist, machine learning engineer, data engineer, business analyst, and data science manager. It is important to note the presence of many missing entries in the 'Salary' column, reflecting common data availability challenges in job listings.

    License

    CC0

    Who Can Use It

    This dataset is an excellent resource for:

    • Aspiring Data Scientists and Machine Learning Engineers: To sharpen their data cleaning, EDA, and model deployment skills.
    • Educators and Curriculum Developers: To inform and guide the development of relevant data science and analytics courses based on real-world job market demands.
    • Job Seekers: To understand the current landscape of data science roles, required skills, and potential salary ranges.
    • Researchers and Analysts: To glean insights into labour market trends in the data science domain.
    • Human Resources Professionals: To benchmark job roles, skill requirements, and compensation within the industry.

    Dataset Name Suggestions

    • Indeed US Data Science Job Insights
    • US Data Science Job Market Analysis
    • Data Professional Job Postings (Indeed USA)
    • Data Science Career Opportunities (USA)

    Attributes

    Original Data Source: Data Science Job Postings (Indeed USA)
