81 datasets found
  1. Data from: An Empirical Study of Activity, Popularity, Size, Testing, and...

    • zenodo.org
    csv
    Updated Jan 24, 2020
    Cite
    Aakash Gautam; Saket Vishwasrao; Francisco Servant (2020). An Empirical Study of Activity, Popularity, Size, Testing, and Stability in Continuous Integration [Dataset]. http://doi.org/10.5281/zenodo.439362
    Explore at:
    csv (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Aakash Gautam; Saket Vishwasrao; Francisco Servant
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A good understanding of the practices followed by software development projects can positively impact their success, particularly for attracting talent and onboarding new members. In this paper, we perform a cluster analysis to classify software projects that follow continuous integration in terms of their activity, popularity, size, testing, and stability. Based on this analysis, we identify and discuss four different groups of repositories that have distinct characteristics that separate them from the other groups. With this new understanding, we encourage open source projects to acknowledge and advertise their preferences according to these defining characteristics, so that they can recruit developers who share similar values.

  2. The results of the statistical analysis tests.

    • plos.figshare.com
    xls
    Updated Jul 5, 2023
    + more versions
    Cite
    Sinan Q. Salih; AbdulRahman A. Alsewari; H. A. Wahab; Mustafa K. A. Mohammed; Tarik A. Rashid; Debashish Das; Shadi S. Basurra (2023). The results of the statistical analysis tests. [Dataset]. http://doi.org/10.1371/journal.pone.0288044.t007
    Explore at:
    xls (available download formats)
    Dataset updated
    Jul 5, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Sinan Q. Salih; AbdulRahman A. Alsewari; H. A. Wahab; Mustafa K. A. Mohammed; Tarik A. Rashid; Debashish Das; Shadi S. Basurra
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The retrieval of important information from a dataset requires applying a special data mining technique known as data clustering (DC). DC classifies similar objects into groups with similar characteristics. Clustering involves grouping the data around k cluster centres that are typically selected randomly. The issues behind DC have recently called for a search for alternative solutions. One such alternative is the Black Hole Algorithm (BHA), a nature-based optimization algorithm developed to address several well-known optimization problems. The BHA is a population-based metaheuristic that mimics the natural phenomenon of black holes, whereby individual stars represent potential solutions revolving around the solution space. The original BHA showed better performance than other algorithms when applied to a benchmark dataset, despite its poor exploration capability. Hence, this paper presents a multi-population version of BHA, called MBHA, as a generalization of the BHA, wherein the performance of the algorithm depends not on a single best-found solution but on a set of generated best solutions. The formulated method was tested on a set of nine widespread and popular benchmark test functions. The experimental outcomes indicated highly precise results compared to BHA and the comparable algorithms in the study, as well as excellent robustness. Furthermore, the proposed MBHA achieved a high rate of convergence on six real datasets (collected from the UCL machine learning lab), making it suitable for DC problems. Lastly, the evaluations conclusively indicated the appropriateness of the proposed algorithm for resolving DC issues.
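
    The dataset itself holds only the result tables, but as orientation for the description above, the core star-update step of the single-population BHA that MBHA generalizes can be sketched roughly as follows. This is an illustrative reading of the published algorithm, not code from the paper: the array layout, bounds, and fitness function are assumptions, and the multi-population logic of MBHA is omitted.

    import numpy as np

    def bha_step(stars, fitness, bounds, rng):
        """One illustrative Black Hole Algorithm iteration: the best star acts
        as the black hole, the others drift toward it, and stars crossing the
        event horizon are re-initialized at random positions."""
        f = np.array([fitness(s) for s in stars])
        bh_idx = int(np.argmin(f))              # best (lowest-cost) star = black hole
        bh = stars[bh_idx].copy()

        # Move every star toward the black hole by a random fraction.
        r = rng.random((len(stars), 1))
        stars = stars + r * (bh - stars)
        stars[bh_idx] = bh                      # the black hole itself does not move

        # Event horizon radius: black-hole fitness over total fitness.
        radius = f[bh_idx] / np.sum(f)
        dist = np.linalg.norm(stars - bh, axis=1)
        absorbed = dist < radius
        absorbed[bh_idx] = False                # never re-seed the black hole
        low, high = bounds                      # assumed scalar search-space bounds
        stars[absorbed] = rng.uniform(low, high, size=(int(absorbed.sum()), stars.shape[1]))
        return stars

    For data clustering, each star would encode a set of k candidate centroids and the fitness could be the summed distance of every point to its nearest centroid; MBHA, as described above, generalizes this by maintaining multiple populations so that performance does not hinge on a single best-found solution.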

  3. Application Research of Clustering on kmeans

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
    Explore at:
    zip (34,507 bytes; available download formats)
    Dataset updated
    Feb 27, 2021
    Authors
    ddpr raju
    Description

    Dataset

    This dataset was created by ddpr raju

    Contents

    It contains the following files:

  4. Artificial dataset for clustering algorithms

    • figshare.com
    zip
    Updated Jun 4, 2023
    + more versions
    Cite
    Mayra Zegarra Rodriguez; Cesar H. Comin; Dalcimar Casanova; Odemir M; Diego R. Amancio; Francisco A. Rodrigues; Luciano da F. Costa (2023). Artificial dataset for clustering algorithms [Dataset]. http://doi.org/10.6084/m9.figshare.5412091.v1
    Explore at:
    zip (available download formats)
    Dataset updated
    Jun 4, 2023
    Dataset provided by
    figshare
    Authors
    Mayra Zegarra Rodriguez; Cesar H. Comin; Dalcimar Casanova; Odemir M; Diego R. Amancio; Francisco A. Rodrigues; Luciano da F. Costa
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This file contains a number of randomly generated datasets. The properties of each dataset are indicated in the name of each respective file: 'C' indicates the number of classes, 'F' indicates the number of features, 'Ne' indicates the number of objects contained in each class, 'A' is related to the average separation between classes and 'R' is an index used to differentiate distinct random trials. So, for instance, the file C2F10N2Ne5A1.2R0 is a dataset containing 2 classes, 10 features, 5 objects for each class and having a typical separation between classes of 1.2. The methodology used for generating the datasets is described in the accompanying reference.
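
    For convenience, a small hypothetical helper along these lines can recover the properties encoded in a file name. It is not part of the dataset; the optional N<digit> token visible in the example name is skipped, since the description does not define it.

    import re

    def parse_dataset_name(name: str) -> dict:
        """Parse the property codes described above: C = classes, F = features,
        Ne = objects per class, A = average class separation, R = trial index."""
        pattern = re.compile(
            r"C(?P<classes>\d+)"
            r"F(?P<features>\d+)"
            r"(?:N\d+)?"                 # undocumented N<digit> token in some names
            r"Ne(?P<per_class>\d+)"
            r"A(?P<separation>[\d.]+)"
            r"R(?P<trial>\d+)"
        )
        m = pattern.search(name)
        if m is None:
            raise ValueError(f"unrecognized dataset name: {name}")
        return {
            "classes": int(m["classes"]),
            "features": int(m["features"]),
            "objects_per_class": int(m["per_class"]),
            "separation": float(m["separation"]),
            "trial": int(m["trial"]),
        }

    print(parse_dataset_name("C2F10N2Ne5A1.2R0"))
    # {'classes': 2, 'features': 10, 'objects_per_class': 5, 'separation': 1.2, 'trial': 0}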

  5. Synthetic datasets for evaluation of dynamic data streams

    • researchdata.edu.au
    Updated Apr 4, 2013
    Cite
    Curtin University (2013). Synthetic datasets for evaluation of dynamic data streams [Dataset]. https://researchdata.edu.au/synthetic-datasets-evaluation-dynamic-streams/3631?source=suggested_datasets
    Explore at:
    Dataset updated
    Apr 4, 2013
    Dataset provided by
    Curtin University
    Time period covered
    2007 - Jan 1, 2008
    Description

    This dataset package includes two synthetic datasets with challenging features including varying density, local density differences, shared boundaries and irregular shapes.

  6. PHD Thesis: Graph Set Data Mining - Clustering and Pattern Mining in the...

    • zenodo.org
    application/gzip
    Updated Apr 24, 2025
    Cite
    Till Schäfer (2025). PHD Thesis: Graph Set Data Mining - Clustering and Pattern Mining in the Context of Cheminformatics - Evaluation Data [Dataset]. http://doi.org/10.5281/zenodo.8298921
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Till Schäfer
    Description

    Evaluation data for the PHD thesis:

    Graph Set Data Mining
    Clustering and Pattern Mining in the Context of Cheminformatics

    submitted in fulfillment of the requirements for the degree of
    Doktor der Naturwissenschaften (Doctor of Natural Sciences)
    of Technische Universität Dortmund
    at the Faculty of Computer Science (Fakultät für Informatik)

    by

    Till Schäfer

  7. Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

    • data.mendeley.com
    Updated Nov 18, 2020
    + more versions
    Cite
    Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
    Explore at:
    Dataset updated
    Nov 18, 2020
    Authors
    Nuno Antonio
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Portugal, Lisbon
    Description

    Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprises three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographic information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems in the scope of data science. Due to its unit of analysis, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
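
    For the segmentation use case mentioned above, a minimal RFM-plus-k-means sketch could look like the following. The file name and column names are assumptions (check the data dictionary distributed with the dataset); pandas and scikit-learn are used purely for illustration.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans

    # Hypothetical file and column names -- adjust to the actual data dictionary.
    df = pd.read_csv("HotelCustomersDataset.csv")

    rfm = pd.DataFrame({
        "recency": df["DaysSinceLastStay"],        # lower = more recent
        "frequency": df["BookingsCheckedIn"],      # completed stays
        "monetary": df["LodgingRevenue"] + df["OtherRevenue"],
    })

    # Standardize the three RFM features, then cluster customers into segments.
    X = StandardScaler().fit_transform(rfm)
    rfm["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(rfm.groupby("segment").mean())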

  8. Data from: IJEE Educational Data Mining

    • produccioncientifica.uca.es
    Updated 2016
    + more versions
    Cite
    Palomo-Duarte, Manuel (2016). IJEE Educational Data Mining [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d1?lang=de
    Explore at:
    Dataset updated
    2016
    Authors
    Palomo-Duarte, Manuel
    Description

    Histograms and results of k-means and Ward's clustering for IJEE special issue

    The fileset contains information from three sources:

    1. Histograms files:
    * Lexical_histogram.png (histogram of lexical error ratios)
    * Grammatical_histogram.png (histogram of grammatical error ratios)

    2. K-means clustering files:
    * elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    * cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying the k-means method)
    * Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    * cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying the k-means method)
    * Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    * Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying k-means clustering to grammatical error ratios.
    * Lexical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying k-means clustering to lexical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward’s Agglomerative Hierarchical Clustering files:
    * Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
    * Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    * Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
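
    For orientation, elbow curves and Ward dendrograms of the kind listed above can be reproduced on any error-ratio table with a short script along these lines; the synthetic error_ratios array below is a placeholder, not the published data.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Placeholder for per-learner lexical/grammatical error ratios.
    rng = np.random.default_rng(0)
    error_ratios = rng.random((120, 2))

    # Elbow curve (as in the elbow-*.png files): within-cluster SSE vs. k.
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                .fit(error_ratios).inertia_ for k in range(1, 9)]
    plt.plot(range(1, 9), inertias, marker="o")
    plt.xlabel("k"); plt.ylabel("within-cluster SSE")
    plt.savefig("elbow.png")

    # Ward's agglomerative clustering (as in the *_ward.png dendrograms).
    plt.figure()
    dendrogram(linkage(error_ratios, method="ward"))
    plt.savefig("dendrogram_ward.png")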

  9. Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset -...

    • data.nasa.gov
    • data.staging.idas-ds1.appdat.jsc.nasa.gov
    Updated Mar 31, 2025
    Cite
    data.nasa.gov (2025). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/local-l2-thresholding-based-data-mining-in-peer-to-peer-systems
    Explore at:
    Dataset updated
    Mar 31, 2025
    Dataset provided by
    NASA (http://nasa.gov/)
    Description

    In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale and, in some cases (e.g., sensor networks), because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that as long as the clustering results represent the data (i.e., the data is stationary), few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and "best-effort" methods for constructing the model; if an ill-fitting model is built, the feedback loop indicates so, and the model is rebuilt.
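
    As a rough illustration of the feedback-loop idea only (not the paper's local, in-network L2 algorithm), a centralized sketch of the trigger logic might look like this; the threshold, the batch format, and the use of scikit-learn's KMeans are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    def monitor_stream(batches, threshold, k=3):
        """Rebuild the k-means model only when the L2 norm of the mean
        residual (data minus assigned centroids) exceeds the threshold."""
        model = None
        for batch in batches:                       # each batch: (n, d) numpy array
            if model is not None:
                residual = batch - model.cluster_centers_[model.predict(batch)]
                if np.linalg.norm(residual.mean(axis=0)) <= threshold:
                    continue                        # model still fits; do nothing
            # Threshold crossed (or no model yet): rebuild the clustering model.
            model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(batch)
            yield model

    # Example use (with an iterable of (n, d) numpy batches named `batches`):
    #   for model in monitor_stream(batches, threshold=0.5):
    #       ...  # react to each rebuilt model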

  10. Data Mining For Business

    • kaggle.com
    Updated May 7, 2022
    Cite
    Balal H (2022). Data Mining For Business [Dataset]. https://www.kaggle.com/datasets/balalh/data-mining-for-business
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 7, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Balal H
    Description

    Dataset

    This dataset was created by Balal H

    Contents

  11. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • zenodo.org
    • elki-project.github.io
    application/gzip
    Updated May 2, 2024
    Cite
    Erich Schubert; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Explore at:
    application/gzip (available download formats)
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erich Schubert; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2022
    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
    Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
    In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek
    Evaluation of Multiple Clustering Solutions
    In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
    On Evaluation of Outlier Rankings and Outlier Scores
    In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    • Object number: sparse 1000-dimensional vectors that give the true object assignment (objs.arff.gz)
    • RGB color histograms: standard RGB color histograms, uniform binning (aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz)
    • HSV color histograms: standard HSV/HSB color histograms in various binnings (aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz)
    • Color similarity: average similarity to 77 reference colors (not histograms): 18 colors x 2 saturations x 2 brightnesses + 5 grey values (incl. white, black) (aloi-colorsim77.arff.gz; feature subsets are meaningful here, as these features are computed independently of each other)
    • Haralick features: first 13 Haralick features, radius 1 pixel (aloi-haralick-1.csv.gz)
    • Front to back: vectors representing front faces vs. back faces of individual objects (front.arff.gz)
    • Basic light: vectors indicating basic light situations (light.arff.gz)
    • Manual annotations: manually annotated object groups of semantically related objects such as cups (manual1.arff.gz)

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    RGB histograms, downsampled:
    • Downsampled to 100,000 objects (553 outliers): aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
    • Downsampled to 75,000 objects (717 outliers): aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
    • Downsampled to 50,000 objects (1,508 outliers): aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz
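
    The RGB color histogram views listed above use uniform per-channel binning (for example, 3 bins per channel yield the 27-dimensional view). A minimal sketch of how such a feature vector could be computed from an image follows; it is illustrative only and not the preprocessing code used to build the ALOI views, and the demo image is random rather than an ALOI image.

    import numpy as np

    def rgb_histogram(image: np.ndarray, bins_per_channel: int = 3) -> np.ndarray:
        """Uniformly binned, normalized RGB histogram of an HxWx3 uint8 image."""
        pixels = image.reshape(-1, 3)
        edges = [np.linspace(0, 256, bins_per_channel + 1)] * 3
        hist, _ = np.histogramdd(pixels, bins=edges)
        hist = hist.ravel()
        return hist / hist.sum()                  # normalize to a distribution

    demo = rgb_histogram(np.random.randint(0, 256, (192, 144, 3), dtype=np.uint8))
    print(demo.shape)                             # (27,) for 3 bins per channel
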
  12. Data Buffalo Toraja

    • data.mendeley.com
    Updated Nov 21, 2024
    Cite
    Abdul Rachman Manga (2024). Data Buffalo Toraja [Dataset]. http://doi.org/10.17632/kbft73pdkw.1
    Explore at:
    Dataset updated
    Nov 21, 2024
    Authors
    Abdul Rachman Manga
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Area covered
    Buffalo
    Description

    This data was captured directly in the Toraja area using a digital camera at a minimum shooting distance of 3 m. The footage was recorded as video, and the recordings were then split into frames.
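
    As an aside on the frame-extraction step described above, a minimal OpenCV sketch is given below; the file name, output folder, and sampling rate are assumptions, not details of this dataset.

    import os
    import cv2  # opencv-python

    os.makedirs("frames", exist_ok=True)
    cap = cv2.VideoCapture("buffalo_toraja.mp4")   # hypothetical input file
    frame_idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % 30 == 0:                    # keep ~1 frame per second at 30 fps
            cv2.imwrite(f"frames/frame_{saved:05d}.jpg", frame)
            saved += 1
        frame_idx += 1
    cap.release()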

  13. Data from: Hidden Room game data clustering in University of Cadiz (Spain)...

    • figshare.com
    • produccioncientifica.uca.es
    png
    Updated Apr 30, 2018
    + more versions
    Cite
    Manuel Palomo-duarte; Anke Berns (2018). Hidden Room game data clustering in University of Cadiz (Spain) by DeutschUCA [Dataset]. http://doi.org/10.6084/m9.figshare.6194573.v2
    Explore at:
    png (available download formats)
    Dataset updated
    Apr 30, 2018
    Dataset provided by
    figshare
    Authors
    Manuel Palomo-duarte; Anke Berns
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    Histograms and results of k-means and Ward's clustering for the Hidden Room game at the University of Cadiz (Spain) by DeutschUCA

    The fileset contains information from three sources:

    1. Histograms files:
    * Lexical_histogram.png (histogram of lexical error ratios)
    * Grammatical_histogram.png (histogram of grammatical error ratios)

    2. K-means clustering files:
    * elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    * cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying the k-means method)
    * Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    * cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying the k-means method)
    * Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
    * Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
    * Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying k-means clustering to grammatical error ratios.
    * Lexical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying k-means clustering to lexical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward's Agglomerative Hierarchical Clustering files:
    * Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method)
    * Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
    * Lexical_Grammatical_clusters (table) ward.xls: centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    * Grammatical_clusters (table) ward.xls: centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_clusters (table) ward.xls: centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    * Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    * Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained for each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.

  14. Data from: A Generic Local Algorithm for Mining Data Streams in Large...

    • catalog.data.gov
    • datasets.ai
    • +3more
    Updated Apr 10, 2025
    + more versions
    Cite
    Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
    Explore at:
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Dashlink
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees or k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.

  15. Data Mining Software Market Size, Share & Industry Trends Analysis 2033

    • marketresearchintellect.com
    Updated Jul 13, 2020
    Cite
    Market Research Intellect (2020). Data Mining Software Market Size, Share & Industry Trends Analysis 2033 [Dataset]. https://www.marketresearchintellect.com/product/global-data-mining-software-market-size-and-forecast/
    Explore at:
    Dataset updated
    Jul 13, 2020
    Dataset authored and provided by
    Market Research Intellect
    License

    https://www.marketresearchintellect.com/privacy-policy

    Area covered
    Global
    Description

    The size and share of this market is categorized based on Type (Data extraction tools, Predictive analytics software, Text mining tools, Web mining tools, Data clustering tools) and Application (Customer insights, Market research, Trend analysis, Risk management, Pattern recognition) and geographical regions (North America, Europe, Asia-Pacific, South America, Middle-East and Africa).

  16. Ocean Carbon States Database and Toolbox

    • zenodo.org
    • data.niaid.nih.gov
    zip
    Updated Jan 24, 2020
    Cite
    Anastasia Romanou; Rebecca Latto (2020). Ocean Carbon States Database and Toolbox [Dataset]. http://doi.org/10.5281/zenodo.996892
    Explore at:
    zip (available download formats)
    Dataset updated
    Jan 24, 2020
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anastasia Romanou; Rebecca Latto
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The "Ocean Carbon States Database and Toolbox" includes observational and climate model datasets and matlab scripts to compute regimes of the ocean carbon cycle.

  17. SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

    • researchdata.tuwien.ac.at
    • researchdata.tuwien.at
    zip
    Updated Sep 17, 2024
    Cite
    Felix Iglesias Vazquez (2024). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests [Dataset]. http://doi.org/10.48436/xh0w2-q5x18
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 17, 2024
    Dataset provided by
    TU Wien
    Authors
    Felix Iglesias Vazquez
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDOstreamclust Evaluation Tests

    conducted for the paper: Stream Clustering Robust to Concept Drift

    Context and methodology

    SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust holds the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

    In this repository, SDOclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, StreamKMeans.

    This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    Docker

    A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust

    Technical details

    Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:

    • [algorithms] contains a script with functions related to algorithm configurations.
    • [data] contains datasets in ARFF format.
    • [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
    • "dependencies.sh" lists and installs Python dependencies.
    • "pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
    • "README.md" shows details and instructions for using this repository.
    • "run.sh" runs the complete experiments.
    • "run_comp.py" runs experiments specified by arguments.
    • "TSindex.py" implements functions for the Temporal Silhouette index.

    Note: if the code in SDOstreamclust is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust reinstalled with pip.

    License

    The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.

  18. Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 7, 2021
    Cite
    Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4904638
    Explore at:
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Mariana O. Silva
    Laís Mota
    Mirella M. Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019; the data was then updated and enhanced in June 2019.

    The attractive features of MusicOSet include:

    Integration and centralization of different musical data sources

    Calculation of popularity scores and classification of musical elements as hits or non-hits, covering the years 1962 to 2018

    Enriched metadata for music, artists, and albums from the US popular music industry

    Availability of acoustic and lyrical resources

    Unrestricted access in two formats: SQL database and compressed .csv files

    Data: number of records
    • Songs: 20,405
    • Artists: 11,518
    • Albums: 26,522
    • Lyrics: 19,664
    • Acoustic Features: 20,405
    • Genres: 1,561

  19. Unsupervised Learning Report

    • archivemarketresearch.com
    doc, pdf, ppt
    Updated Mar 13, 2025
    Cite
    Archive Market Research (2025). Unsupervised Learning Report [Dataset]. https://www.archivemarketresearch.com/reports/unsupervised-learning-56632
    Explore at:
    doc, ppt, pdf (available download formats)
    Dataset updated
    Mar 13, 2025
    Dataset authored and provided by
    Archive Market Research
    License

    https://www.archivemarketresearch.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The unsupervised learning market is experiencing robust growth, driven by the increasing need for businesses to extract meaningful insights from large, unstructured datasets. This market is projected to be valued at approximately $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of big data and the need for efficient data analysis are primary drivers. Businesses across various sectors, including finance, healthcare, and retail, are increasingly adopting unsupervised learning techniques like clustering and anomaly detection to identify patterns, predict customer behavior, and optimize operational efficiency. Furthermore, advancements in machine learning algorithms, improved computational power, and the availability of cloud-based solutions are further accelerating market growth. The cloud-based segment is growing faster than the on-premise segment, reflecting a broader industry shift toward cloud computing and its scalability advantages. Large enterprises represent a significant portion of the market, owing to their greater resources and willingness to invest in sophisticated analytics capabilities. However, challenges remain, including the complexity of implementing and interpreting unsupervised learning models, the need for specialized expertise, and concerns regarding data privacy and security.

    Despite these challenges, the long-term outlook for the unsupervised learning market remains positive. The continuous evolution of machine learning algorithms and the increasing availability of user-friendly tools are expected to lower the barrier to entry for businesses of all sizes. Furthermore, the growing adoption of artificial intelligence (AI) across various industries will further fuel demand for unsupervised learning solutions. The market is witnessing considerable geographic expansion, with North America currently holding a significant market share due to the presence of major technology companies and a well-established IT infrastructure. However, other regions, particularly Asia-Pacific, are also witnessing substantial growth, driven by rapid digitalization and increasing investment in data analytics. Competition in the market is intense, with established players like Microsoft, IBM, and Google vying for market share alongside specialized vendors like RapidMiner and H2O.ai. The continued innovation and development of advanced algorithms and platforms will shape the competitive landscape in the coming years.

  20. E-Commerce Products Dataset For Record Linkage

    • kaggle.com
    Updated May 7, 2025
    Cite
    Furkan Gözükara (2025). E-Commerce Products Dataset For Record Linkage [Dataset]. https://www.kaggle.com/datasets/furkangozukara/ecommerce-products-dataset-for-record-linkage/versions/1035
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    May 7, 2025
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Furkan Gözükara
    Description

    -> If you use Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset, please cite: https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxab179/6425234

    @article{10.1093/comjnl/bxab179, author = {Gözükara, Furkan and Özel, Selma Ayşe}, title = "{An Incremental Hierarchical Clustering Based System For Record Linkage In E-Commerce Domain}", journal = {The Computer Journal}, year = {2021}, month = {11}, abstract = "{In this study, a novel record linkage system for E-commerce products is presented. Our system aims to cluster the same products that are crawled from different E-commerce websites into the same cluster. The proposed system achieves a very high success rate by combining both semi-supervised and unsupervised approaches. Unlike the previously proposed systems in the literature, neither a training set nor structured corpora are necessary. The core of the system is based on Hierarchical Agglomerative Clustering (HAC); however, the HAC algorithm is modified to be dynamic such that it can efficiently cluster a stream of incoming new data. Since the proposed system does not depend on any prior data, it can cluster new products. The system uses bag-of-words representation of the product titles, employs a single distance metric, exploits multiple domain-based attributes and does not depend on the characteristics of the natural language used in the product records. To our knowledge, there is no commonly used tool or technique to measure the quality of a clustering task. Therefore in this study, we use ELKI (Environment for Developing KDD-Applications Supported by Index-Structures), an open-source data mining software, for performance measurement of the clustering methods; and show how to use ELKI for this purpose. To evaluate our system, we collect our own dataset and make it publicly available to researchers who study E-commerce product clustering. Our proposed system achieves 96.25\% F-Measure according to our experimental analysis. The other state-of-the-art clustering systems obtain the best 89.12\% F-Measure.}", issn = {0010-4620}, doi = {10.1093/comjnl/bxab179}, url = {https://doi.org/10.1093/comjnl/bxab179}, note = {bxab179}, eprint = {https://academic.oup.com/comjnl/advance-article-pdf/doi/10.1093/comjnl/bxab179/41133297/bxab179.pdf}, }

    -> elki-bundle-0.7.2-SNAPSHOT.jar Is the ELKI bundle that we have compiled from the github source code of ELKI. The date of the source code is 6 June 2016. The compile command is as below: ->-> mvn -DskipTests -Dmaven.javadoc.skip=true -P svg,bundle package ->-> Github repository of ELKI: https://github.com/elki-project/elki ->-> This bundle file is used for all of the experiments that are presented in the article

    -> Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset is composed as below: ->-> Top 50 E-commerce websites that operate in Turkey are crawled, and their attributes are extracted. ->-> The crawling was performed between 2015-01-13 15:12:46 and 2015-01-17 19:07:53. ->-> Then 250 product offers from Vatanbilgisayar are randomly selected. ->-> Then the entire dataset is manually scanned to find which other products sold on different E-commerce websites are the same as the selected ones. ->-> Then each product is classified accordingly. ->-> This dataset contains these products along with their price (if available), title, categories (if available), free text description (if available), wrapped features (if available), and crawled URL (the URL might have expired) attributes

    -> The dataset files are provided as used in the study. -> For All_Products and Only_Price_Having_Products, ARFF files are generated with raw frequency of terms rather than the weighting schemes used in the article. The reason is that we tested these datasets with only our system, and since our system does incremental clustering, even if we provided TF-IDF weightings, they would not be the same as those used in the article. More information is provided in the article. ->-> For Macro_Average_Datasets we provide both raw frequency and TF-IDF scheme weightings as used in the experiments

    -> There are 3 main folders -> All_Products: This folder contains 1800 products. ->-> This is the entire collection that is manually labeled. ->-> They are from 250 different classes. -> Only_Price_Having_Products: This folder contains all of the products that have the price feature set. ->-> The collection has 1721 products from 250 classes. ->-> This is the dataset that we have experimented. -> Macro_Average_Datasets: This folder contains 100 datasets that we have used to conduct more reliable experiments. ->-> Each dataset is composed of selecting 1000 different products from the price having products dataset and then randomly ordering them...
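
    For orientation only, a plain batch hierarchical agglomerative clustering over bag-of-words product titles (the building block the article's incremental system modifies) can be sketched with scikit-learn as follows; the titles are invented and the parameters are illustrative, not those of the article.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    # Made-up product titles standing in for crawled offers of the same items.
    titles = [
        "Samsung Galaxy S6 32GB Gold",
        "Samsung Galaxy S6 32 GB Altin",
        "Apple iPhone 6 64GB Space Gray",
        "iPhone 6 64 GB Uzay Grisi",
    ]

    # Bag-of-words (TF-IDF) vectors, then average-linkage HAC with cosine distance.
    # Note: scikit-learn >= 1.2 uses `metric=`; older versions call it `affinity=`.
    X = TfidfVectorizer().fit_transform(titles).toarray()
    labels = AgglomerativeClustering(
        n_clusters=2, metric="cosine", linkage="average"
    ).fit_predict(X)
    print(labels)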
