81 datasets found

Data from: An Empirical Study of Activity, Popularity, Size, Testing, and...
zenodo.org
csv
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Aakash Gautam; Saket Vishwasrao; Francisco Servant; Aakash Gautam; Saket Vishwasrao; Francisco Servant (2020). An Empirical Study of Activity, Popularity, Size, Testing, and Stability in Continuous Integration [Dataset]. http://doi.org/10.5281/zenodo.439362
Explore at:
csvAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.439362
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Aakash Gautam; Saket Vishwasrao; Francisco Servant; Aakash Gautam; Saket Vishwasrao; Francisco Servant
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
A good understanding of the practices followed by software development projects can positively impact their success --- particularly for attracting talent and on-boarding new members. In this paper, we perform a cluster analysis to classify software projects that follow continuous integration in terms of their activity, popularity, size, testing, and stability. Based on this analysis, we identify and discuss four different groups of repositories that have distinct characteristics that separates them from the other groups. With this new understanding, we encourage open source projects to acknowledge and advertise their preferences according to these defining characteristics, so that they can recruit developers who share similar values.
f
The results of the statistical analysis tests.
plos.figshare.com
xls
Updated Jul 5, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Sinan Q. Salih; AbdulRahman A. Alsewari; H. A. Wahab; Mustafa K. A. Mohammed; Tarik A. Rashid; Debashish Das; Shadi S. Basurra (2023). The results of the statistical analysis tests. [Dataset]. http://doi.org/10.1371/journal.pone.0288044.t007
Explore at:
xlsAvailable download formats
Unique identifier
https://doi.org/10.1371/journal.pone.0288044.t007
Dataset updated
Jul 5, 2023
Dataset provided by
PLOS ONE
Authors
Sinan Q. Salih; AbdulRahman A. Alsewari; H. A. Wahab; Mustafa K. A. Mohammed; Tarik A. Rashid; Debashish Das; Shadi S. Basurra
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The retrieval of important information from a dataset requires applying a special data mining technique known as data clustering (DC). DC classifies similar objects into a groups of similar characteristics. Clustering involves grouping the data around k-cluster centres that typically are selected randomly. Recently, the issues behind DC have called for a search for an alternative solution. Recently, a nature-based optimization algorithm named Black Hole Algorithm (BHA) was developed to address the several well-known optimization problems. The BHA is a metaheuristic (population-based) that mimics the event around the natural phenomena of black holes, whereby an individual star represents the potential solutions revolving around the solution space. The original BHA algorithm showed better performance compared to other algorithms when applied to a benchmark dataset, despite its poor exploration capability. Hence, this paper presents a multi-population version of BHA as a generalization of the BHA called MBHA wherein the performance of the algorithm is not dependent on the best-found solution but a set of generated best solutions. The method formulated was subjected to testing using a set of nine widespread and popular benchmark test functions. The ensuing experimental outcomes indicated the highly precise results generated by the method compared to BHA and comparable algorithms in the study, as well as excellent robustness. Furthermore, the proposed MBHA achieved a high rate of convergence on six real datasets (collected from the UCL machine learning lab), making it suitable for DC problems. Lastly, the evaluations conclusively indicated the appropriateness of the proposed algorithm to resolve DC issues.
Application Research of Clustering on kmeans
kaggle.com
zip
Updated Feb 27, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
Explore at:
zip(34507 bytes)Available download formats
Dataset updated
Feb 27, 2021
Authors
ddpr raju
Description
Dataset

This dataset was created by ddpr raju

Contents

It contains the following files:
Artificial dataset for clustering algorithms
figshare.com
zip
Updated Jun 4, 2023
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mayra Zegarra Rodriguez; Cesar H. Comin; Dalcimar Casanova; Odemir M; Diego R. Amancio; Francisco A. Rodrigues; Luciano da F. Costa (2023). Artificial dataset for clustering algorithms [Dataset]. http://doi.org/10.6084/m9.figshare.5412091.v1
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.5412091.v1
Dataset updated
Jun 4, 2023
Dataset provided by
figshare
Authors
Mayra Zegarra Rodriguez; Cesar H. Comin; Dalcimar Casanova; Odemir M; Diego R. Amancio; Francisco A. Rodrigues; Luciano da F. Costa
License
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This file contains a number of randomly generated datasets. The properties of each dataset are indicated in the name of each respective file: 'C' indicates the number of classes, 'F' indicates the number of features, 'Ne' indicates the number of objects contained in each class, 'A' is related to the average separation between classes and 'R' is an index used to differentiate distinct random trials. So, for instance, the file C2F10N2Ne5A1.2R0 is a dataset containing 2 classes, 10 features, 5 objects for each class and having a typical separation between classes of 1.2. The methodology used for generating the datasets is described in the accompanying reference.
r
Synthetic datasets for evaluation of dynamic data streams
researchdata.edu.au
Updated Apr 4, 2013
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Curtin University (2013). Synthetic datasets for evaluation of dynamic data streams [Dataset]. https://researchdata.edu.au/synthetic-datasets-evaluation-dynamic-streams/3631?source=suggested_datasets
Explore at:
Dataset updated
Apr 4, 2013
Dataset provided by
Curtin University
Time period covered
2007 - Jan 1, 2008
Description
This dataset package includes two synthetic datasets with challenging features including varying density, local density differences, shared boundaries and irregular shapes.
PHD Thesis: Graph Set Data Mining - Clustering and Pattern Mining in the...
zenodo.org
application/gzip
Updated Apr 24, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Till Schäfer; Till Schäfer (2025). PHD Thesis: Graph Set Data Mining - Clustering and Pattern Mining in the Context of Cheminformatics - Evaluation Data [Dataset]. http://doi.org/10.5281/zenodo.8298921
Explore at:
application/gzipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.8298921
Dataset updated
Apr 24, 2025
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Till Schäfer; Till Schäfer
Description
Evaluation data for the PHD thesis:

Graph Set Data Mining
Clustering and Pattern Mining in the Context of Cheminformatics

zur Erlangung des Grades eines
Doktors der Naturwissenschaften
der Technischen Universität Dortmund
an der Fakultät für Informatik

von

TIll Schäfer
m
Lisbon, Portugal, hotel’s customer dataset with three years of personal,...
data.mendeley.com
Updated Nov 18, 2020
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
Explore at:
Unique identifier
https://doi.org/10.17632/j83f5fsh6c.1
Dataset updated
Nov 18, 2020
Authors
Nuno Antonio
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Portugal, Lisbon
Description
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It comprehends three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset contributes to reducing the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical field problems in the scope of data science. Due to its unit of analysis, it is a dataset especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but also be used in classification and regression problems.
u
Data from: IJEE Educational Data Mining
produccioncientifica.uca.es
Updated 2016
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Palomo-Duarte, Manuel; Palomo-Duarte, Manuel (2016). IJEE Educational Data Mining [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d1?lang=de
Explore at:
Dataset updated
2016
Authors
Palomo-Duarte, Manuel; Palomo-Duarte, Manuel
Description
Histograms and results of k-means and Ward's clustering for IJEE special issue

The fileset contains information from three sources:

1. Histograms files:
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)

2. K-means clustering files:
* elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
* cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
* cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained for applying elbow method to determinate the optimal number of clusters)
* Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

3. Ward’s Agglomerative Hierarchical Clustering files:
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset -...
data.nasa.gov
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Mar 31, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
data.nasa.gov (2025). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset - NASA Open Data Portal [Dataset]. https://data.nasa.gov/dataset/local-l2-thresholding-based-data-mining-in-peer-to-peer-systems
Explore at:
Dataset updated
Mar 31, 2025
Dataset provided by
NASAhttp://nasa.gov/
Description
In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly. Mainly because of their scale, and in some cases (e.g., sensor networks) also because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detect when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data – such as the data’s k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary) few resources are required. When the data undergoes an epoch change – a change in the underlying distribution – and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and “best-effort ” methods for constructing the model; if an ill-fit model is built the feedback loop would indicate so, and the model would be rebuilt.
Data Mining For Business
kaggle.com
Updated May 7, 2022
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Balal H (2022). Data Mining For Business [Dataset]. https://www.kaggle.com/datasets/balalh/data-mining-for-business
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 7, 2022
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Balal H
Description
Dataset

This dataset was created by Balal H

Contents

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

zenodo.org
elki-project.github.io

application/gzip

Updated May 2, 2024

Facebook

Twitter

Click to copy link

Link copied

Cite

Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684

Explore at:

application/gzipAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.6355684

Dataset updated

May 2, 2024

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Time period covered

2022

Description

These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
On Evaluation of Outlier Rankings and Outlier Scores
In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Feature type	Description	Files
Object number	Sparse 1000 dimensional vectors that give the true object assignment	objs.arff.gz
RGB color histograms	Standard RGB color histograms (uniform binning)	aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz
HSV color histograms	Standard HSV/HSB color histograms in various binnings	aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz
Color similiarity	Average similarity to 77 reference colors (not histograms) 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black)	aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
Haralick features	First 13 Haralick features (radius 1 pixel)	aloi-haralick-1.csv.gz
Front to back	Vectors representing front face vs. back faces of individual objects	front.arff.gz
Basic light	Vectors indicating basic light situations	light.arff.gz
Manual annotations	Manually annotated object groups of semantically related objects such as cups	manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type	Description	Files
RGB Histograms	Downsampled to 100000 objects (553 outliers)	aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz
	Downsampled to 75000 objects (717 outliers)	aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz
	Downsampled to 50000 objects (1508 outliers)	aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz

m
Data Buffalo Toraja
data.mendeley.com
Updated Nov 21, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Abdul Rachman Manga (2024). Data Buffalo Toraja [Dataset]. http://doi.org/10.17632/kbft73pdkw.1
Explore at:
Unique identifier
https://doi.org/10.17632/kbft73pdkw.1
Dataset updated
Nov 21, 2024
Authors
Abdul Rachman Manga
License
Attribution-NonCommercial 3.0 (CC BY-NC 3.0)https://creativecommons.org/licenses/by-nc/3.0/
License information was derived automatically
Area covered
Buffalo
Description
This data was taken directly in the Toraja area using a digital camera, a minimum shooting distance of 3 m in video form, the results of the shooting are divided into frames
f
Data from: Hidden Room game data clustering in University of Cadiz (Spain)...
figshare.com
produccioncientifica.uca.es
png
Updated Apr 30, 2018
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Manuel Palomo-duarte; Anke Berns (2018). Hidden Room game data clustering in University of Cadiz (Spain) by DeutschUCA [Dataset]. http://doi.org/10.6084/m9.figshare.6194573.v2
Explore at:
pngAvailable download formats
Unique identifier
https://doi.org/10.6084/m9.figshare.6194573.v2
Dataset updated
Apr 30, 2018
Dataset provided by
figshare
Authors
Manuel Palomo-duarte; Anke Berns
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description

Histograms and results of k-means and Ward's clustering for Hidden Room game in University of Cadiz (Spain) by DeutschUCAThe fileset contains information from three sources:1. Histograms files:* Lexical_histogram.png (histogram of lexical error ratios)* Grammatical_histogram.png (histogram of grammatical error ratios)2. K-means clustering files:*

elbow-lex kmeans.png (clustering by lexical aspects: error curves

obtained for applying elbow method to determinate the optimal number of

clusters)* cube-lex kmeans.png (clustering by lexical aspects: a

three-dimensional representation of clusters obtained after applying

k-means method)* Lexical_clusters (table) kmeans.xls (clustering by

lexical aspects: centroids, standard deviations and number of instances

assigned to each cluster)* elbow-gram kmeans.png (clustering by

grammatical aspects: error curves obtained for applying elbow method to

determinate the optimal number of clusters)* cube-gramm kmeans.png

(clustering by grammatical aspects: a three-dimensional representation

of clusters obtained after applying k-means method)*

Grammatical_clusters (table) kmeans.xls (clustering by grammatical

aspects: centroids, standard deviations and number of instances assigned

to each cluster)* elbow-lexgram kmeans.png (clustering by lexical

and grammatical aspects: error curves obtained for applying elbow method

to determinate the optimal number of clusters)*

Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical

and grammatical aspects: centroids, standard deviations and number of

instances assigned to each cluster)*

Grammatical_clusters_number_of_words (table) kmeans.xls

number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.* Lexical_clusters_number_of_words (table) kmeans.xls

number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls

number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.3. Ward’s Agglomerative Hierarchical Clustering files:* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)* Lexical_Grammatical_clusters (table) ward.xls:
Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
d
Data from: A Generic Local Algorithm for Mining Data Streams in Large...
catalog.data.gov
datasets.ai
+3more
Updated Apr 10, 2025
+ more versions
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Dashlink (2025). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://catalog.data.gov/dataset/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
Explore at:
Dataset updated
Apr 10, 2025
Dataset provided by
Dashlink
Description
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality such as message routing, information retrieval and load sharing relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the emph{model} of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models e.g. decision trees, k-means clustering in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two step approach for dealing with these costs. First, we describe a highly efficient emph{local} algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
m
Data Mining Software Market Size, Share & Industry Trends Analysis 2033
marketresearchintellect.com
Updated Jul 13, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Market Research Intellect (2020). Data Mining Software Market Size, Share & Industry Trends Analysis 2033 [Dataset]. https://www.marketresearchintellect.com/product/global-data-mining-software-market-size-and-forecast/
Explore at:
Dataset updated
Jul 13, 2020
Dataset authored and provided by
Market Research Intellect
License
https://www.marketresearchintellect.com/privacy-policyhttps://www.marketresearchintellect.com/privacy-policy
Area covered
Global
Description
The size and share of this market is categorized based on Type (Data extraction tools, Predictive analytics software, Text mining tools, Web mining tools, Data clustering tools) and Application (Customer insights, Market research, Trend analysis, Risk management, Pattern recognition) and geographical regions (North America, Europe, Asia-Pacific, South America, Middle-East and Africa).
Ocean Carbon States Database and Toolbox
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 24, 2020
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Anastasia Romanou; Rebecca Latto; Anastasia Romanou; Rebecca Latto (2020). Ocean Carbon States Database and Toolbox [Dataset]. http://doi.org/10.5281/zenodo.996892
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.5281/zenodo.996892
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodohttp://zenodo.org/
Authors
Anastasia Romanou; Rebecca Latto; Anastasia Romanou; Rebecca Latto
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Ocean Carbon States Database and Toolbox" includes observational and climate model datasets and matlab scripts to compute regimes of the ocean carbon cycle.
t
SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...
researchdata.tuwien.ac.at
researchdata.tuwien.at
zip
Updated Sep 17, 2024
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Felix Iglesias Vazquez; Felix Iglesias Vazquez (2024). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests [Dataset]. http://doi.org/10.48436/xh0w2-q5x18
Explore at:
zipAvailable download formats
Unique identifier
https://doi.org/10.48436/xh0w2-q5x18
Dataset updated
Sep 17, 2024
Dataset provided by
TU Wien
Authors
Felix Iglesias Vazquez; Felix Iglesias Vazquez
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SDOstreamclust Evaluation Tests

conducted for the paper: Stream Clustering Robust to Concept Drift

Context and methodology

SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust holds the characteristics of SDO algoritmhs: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift

In this repository, SDOclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, StreamKMeans.

This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

Docker

A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust

Technical details

Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:- [algorithms] contains a script with functions related to algorithm configurations.

[data] contains datasets in ARFF format.

[results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).

"dependencies.sh" lists and installs python dependencies.

"pysdoclust-stream-main.zip" contains the SDOstreamclust python package.

"README.md" shows details and intructions to use this repository.

"run.sh" runs the complete experiments.

"run_comp.py"for running experiments specified by arguments.

"TSindex.py" implements functions for the Temporal Silhouette index.

Note: if codes in SDOstreamclust are modified, SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust consequently reinstalled with pip.

License

The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.
Z
Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining
data.niaid.nih.gov
zenodo.org
Updated Jun 7, 2021
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4904638
Explore at:
Dataset updated
Jun 7, 2021
Dataset provided by
Mariana O. Silva
Laís Mota
Mirella M. Moro
License
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. Provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019. Nevertheless, the update and enhancement of the data happened in June 2019.

The attractive features of MusicOSet include:

Integration and centralization of different musical data sources

Calculation of popularity scores and classification of hits and non-hits musical elements, varying from 1962 to 2018

Enriched metadata for music, artists, and albums from the US popular music industry

Availability of acoustic and lyrical resources

Unrestricted access in two formats: SQL database and compressed .csv files

Data # Records
Songs 20,405
Artists 11,518
Albums 26,522
Lyrics 19,664
Acoustic Features 20,405
Genres 1,561
U
Unsupervised Learning Report
archivemarketresearch.com
doc, pdf, ppt
Updated Mar 13, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Archive Market Research (2025). Unsupervised Learning Report [Dataset]. https://www.archivemarketresearch.com/reports/unsupervised-learning-56632
Explore at:
doc, ppt, pdfAvailable download formats
Dataset updated
Mar 13, 2025
Dataset authored and provided by
Archive Market Research
License
https://www.archivemarketresearch.com/privacy-policyhttps://www.archivemarketresearch.com/privacy-policy
Time period covered
2025 - 2033
Area covered
Global
Variables measured
Market Size
Description
The unsupervised learning market is experiencing robust growth, driven by the increasing need for businesses to extract meaningful insights from large, unstructured datasets. This market is projected to be valued at approximately $15 billion in 2025, exhibiting a Compound Annual Growth Rate (CAGR) of 25% from 2025 to 2033. This significant expansion is fueled by several key factors. The proliferation of big data and the need for efficient data analysis are primary drivers. Businesses across various sectors, including finance, healthcare, and retail, are increasingly adopting unsupervised learning techniques like clustering and anomaly detection to identify patterns, predict customer behavior, and optimize operational efficiency. Furthermore, advancements in machine learning algorithms, improved computational power, and the availability of cloud-based solutions are further accelerating market growth. The segment dominated by cloud-based solutions is growing faster than the on-premise segment, reflecting a broader industry shift toward cloud computing and its scalability advantages. Large enterprises represent a significant portion of the market, owing to their greater resources and willingness to invest in sophisticated analytics capabilities. However, challenges remain, including the complexity of implementing and interpreting unsupervised learning models, the need for specialized expertise, and concerns regarding data privacy and security. Despite these challenges, the long-term outlook for the unsupervised learning market remains positive. The continuous evolution of machine learning algorithms and the increasing availability of user-friendly tools are expected to lower the barrier to entry for businesses of all sizes. Furthermore, the growing adoption of artificial intelligence (AI) across various industries will further fuel demand for unsupervised learning solutions. The market is witnessing considerable geographic expansion, with North America currently holding a significant market share due to the presence of major technology companies and a well-established IT infrastructure. However, other regions, particularly Asia-Pacific, are also witnessing substantial growth, driven by rapid digitalization and increasing investment in data analytics. Competition in the market is intense, with established players like Microsoft, IBM, and Google vying for market share alongside specialized vendors like RapidMiner and H2o.ai. The continued innovation and development of advanced algorithms and platforms will shape the competitive landscape in the coming years.
E-Commerce Products Dataset For Record Linkage
kaggle.com
Updated May 7, 2025
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Furkan Gözükara (2025). E-Commerce Products Dataset For Record Linkage [Dataset]. https://www.kaggle.com/datasets/furkangozukara/ecommerce-products-dataset-for-record-linkage/versions/1035
Explore at:
CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
Dataset updated
May 7, 2025
Dataset provided by
Kagglehttp://kaggle.com/
Authors
Furkan Gözükara
Description
-> If you use Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset, please cite: https://academic.oup.com/comjnl/advance-article-abstract/doi/10.1093/comjnl/bxab179/6425234

@article{10.1093/comjnl/bxab179, author = {Gözükara, Furkan and Özel, Selma Ayşe}, title = "{An Incremental Hierarchical Clustering Based System For Record Linkage In E-Commerce Domain}", journal = {The Computer Journal}, year = {2021}, month = {11}, abstract = "{In this study, a novel record linkage system for E-commerce products is presented. Our system aims to cluster the same products that are crawled from different E-commerce websites into the same cluster. The proposed system achieves a very high success rate by combining both semi-supervised and unsupervised approaches. Unlike the previously proposed systems in the literature, neither a training set nor structured corpora are necessary. The core of the system is based on Hierarchical Agglomerative Clustering (HAC); however, the HAC algorithm is modified to be dynamic such that it can efficiently cluster a stream of incoming new data. Since the proposed system does not depend on any prior data, it can cluster new products. The system uses bag-of-words representation of the product titles, employs a single distance metric, exploits multiple domain-based attributes and does not depend on the characteristics of the natural language used in the product records. To our knowledge, there is no commonly used tool or technique to measure the quality of a clustering task. Therefore in this study, we use ELKI (Environment for Developing KDD-Applications Supported by Index-Structures), an open-source data mining software, for performance measurement of the clustering methods; and show how to use ELKI for this purpose. To evaluate our system, we collect our own dataset and make it publicly available to researchers who study E-commerce product clustering. Our proposed system achieves 96.25\% F-Measure according to our experimental analysis. The other state-of-the-art clustering systems obtain the best 89.12\% F-Measure.}", issn = {0010-4620}, doi = {10.1093/comjnl/bxab179}, url = {https://doi.org/10.1093/comjnl/bxab179}, note = {bxab179}, eprint = {https://academic.oup.com/comjnl/advance-article-pdf/doi/10.1093/comjnl/bxab179/41133297/bxab179.pdf}, }

-> elki-bundle-0.7.2-SNAPSHOT.jar Is the ELKI bundle that we have compiled from the github source code of ELKI. The date of the source code is 6 June 2016. The compile command is as below: ->-> mvn -DskipTests -Dmaven.javadoc.skip=true -P svg,bundle package ->-> Github repository of ELKI: https://github.com/elki-project/elki ->-> This bundle file is used for all of the experiments that are presented in the article

-> Turkish_Ecommerce_Products_by_Gozukara_and_Ozel_2016 dataset is composed as below: ->-> Top 50 E-commerce websites that operate in Turkey are crawled, and their attributes are extracted. ->-> The crawling is made between 2015-01-13 15:12:46 ---- 2015-01-17 19:07:53 dates. ->-> Then 250 product offers from Vatanbilgisayar are randomly selected. ->-> Then the entire dataset is manually scanned to find which other products that are sold in different E-commerce websites are same as the selected ones. ->-> Then each product is classified respectively. ->-> This dataset contains these products along with their price (if available), title, categories (if available), free text description (if available), wrapped features (if available), crawled URL (the URL might have expired) attributes

-> The dataset files are provided as used in the study. -> ARFF files are generated with Raw Frequency of terms rather than used Weighting Schemes for All_Products and Only_Price_Having_Products. The reason is, we have tested these datasets with only our system and since our system does incremental clustering, even if provide TF-IDF weightings, they wouldn't be same as used in the article. More information provided in the article. ->-> For Macro_Average_Datasets we provide both Raw frequency and TF-IDF scheme weightings as used in the experiments

-> There are 3 main folders -> All_Products: This folder contains 1800 products. ->-> This is the entire collection that is manually labeled. ->-> They are from 250 different classes. -> Only_Price_Having_Products: This folder contains all of the products that have the price feature set. ->-> The collection has 1721 products from 250 classes. ->-> This is the dataset that we have experimented. -> Macro_Average_Datasets: This folder contains 100 datasets that we have used to conduct more reliable experiments. ->-> Each dataset is composed of selecting 1000 different products from the price having products dataset and then randomly ordering them...

Data	# Records
Songs	20,405
Artists	11,518
Albums	26,522
Lyrics	19,664
Acoustic Features	20,405
Genres	1,561

Facebook

Twitter

Click to copy link

Link copied

Cite

Aakash Gautam; Saket Vishwasrao; Francisco Servant; Aakash Gautam; Saket Vishwasrao; Francisco Servant (2020). An Empirical Study of Activity, Popularity, Size, Testing, and Stability in Continuous Integration [Dataset]. http://doi.org/10.5281/zenodo.439362

Data from: An Empirical Study of Activity, Popularity, Size, Testing, and Stability in Continuous Integration

Explore at:

csvAvailable download formats

Unique identifier

https://doi.org/10.5281/zenodo.439362

Dataset updated

Jan 24, 2020

Dataset provided by

Zenodohttp://zenodo.org/

Authors

Aakash Gautam; Saket Vishwasrao; Francisco Servant; Aakash Gautam; Saket Vishwasrao; Francisco Servant

License

Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically

Description

A good understanding of the practices followed by software development projects can positively impact their success --- particularly for attracting talent and on-boarding new members. In this paper, we perform a cluster analysis to classify software projects that follow continuous integration in terms of their activity, popularity, size, testing, and stability. Based on this analysis, we identify and discuss four different groups of repositories that have distinct characteristics that separates them from the other groups. With this new understanding, we encourage open source projects to acknowledge and advertise their preferences according to these defining characteristics, so that they can recruit developers who share similar values.

Clear search

Close search

Google apps

Main menu

Data from: An Empirical Study of Activity, Popularity, Size, Testing, and...

The results of the statistical analysis tests.

Application Research of Clustering on kmeans

Dataset

Contents

Artificial dataset for clustering algorithms

Synthetic datasets for evaluation of dynamic data streams

PHD Thesis: Graph Set Data Mining - Clustering and Pattern Mining in the...

Lisbon, Portugal, hotel’s customer dataset with three years of personal,...

Data from: IJEE Educational Data Mining

Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems - Dataset -...

Data Mining For Business

Dataset

Contents

ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

Data Buffalo Toraja

Data from: Hidden Room game data clustering in University of Cadiz (Spain)...

Data from: A Generic Local Algorithm for Mining Data Streams in Large...

Data Mining Software Market Size, Share & Industry Trends Analysis 2033

Ocean Carbon States Database and Toolbox

SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

SDOstreamclust Evaluation Tests

Context and methodology

Technical details

License

Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining

Unsupervised Learning Report

E-Commerce Products Dataset For Record Linkage

Data from: An Empirical Study of Activity, Popularity, Size, Testing, and Stability in Continuous Integration