61 datasets found

Hidden Room Educational Data Mining Analysis
produccioncientifica.uca.es
figshare.com
Updated 2016
Cite
Palomo-Duarte, Manuel; Berns, Anke (2016). Hidden Room Educational Data Mining Analysis [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d4
Dataset updated
2016
Authors
Palomo-Duarte, Manuel; Berns, Anke
Description
Histograms and results of k-means and Ward's clustering for Hidden Room game

The fileset contains information from three sources:

1. Histogram files:
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)

2. K-means clustering files:
* elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

3. Ward’s Agglomerative Hierarchical Clustering files:
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
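
The elbow curves listed above plot a clustering error against the number of clusters. As a rough illustration only, the sketch below computes such a curve with a plain Lloyd's k-means written in standard-library Python. The data are synthetic three-dimensional points standing in for error ratios, and the function names, seed, and blob parameters are assumptions made for this example; they are not taken from the fileset or from the authors' tooling.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, seed=0, iters=100):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def elbow_curve(points, max_k):
    """Within-cluster sum of squared errors for k = 1..max_k."""
    curve = []
    for k in range(1, max_k + 1):
        centroids, labels = kmeans(points, k)
        curve.append(sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels)))
    return curve

# Synthetic stand-in data (hypothetical): three blobs of 3-D "error ratios".
rng = random.Random(42)
centers = [(0.1, 0.1, 0.1), (0.5, 0.5, 0.5), (0.9, 0.9, 0.9)]
points = [tuple(rng.gauss(m, 0.03) for m in center)
          for center in centers for _ in range(30)]

curve = elbow_curve(points, 6)
for k, err in enumerate(curve, start=1):
    print(k, round(err, 4))
```

The "elbow" is the value of k after which the error stops dropping sharply. Because Lloyd's algorithm depends on its starting centroids, a real analysis would usually repeat each k with several seeds.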
Hidden Room game in University of Cadiz data clustering by DeutschUCA
figshare.com
png
Updated Apr 30, 2018
Cite
Manuel Palomo-duarte; Anke Berns (2018). Hidden Room game in University of Cadiz data clustering by DeutschUCA [Dataset]. http://doi.org/10.6084/m9.figshare.6194597.v1
Available download formats: png
Unique identifier
https://doi.org/10.6084/m9.figshare.6194597.v1
Dataset updated
Apr 30, 2018
Dataset provided by
Figshare (http://figshare.com/)
Authors
Manuel Palomo-duarte; Anke Berns
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Cádiz
Description

Histograms and results of k-means and Ward's clustering for the Hidden Room game (Open Simulator) at the University of Cadiz (Spain), by DeutschUCA

The fileset contains information from three sources:

1. Histogram files:
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)

2. K-means clustering files:
* elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

3. Ward’s Agglomerative Hierarchical Clustering files:
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
Educational Attainment in North Carolina Public Schools: Use of statistical...
data.mendeley.com
Updated Nov 14, 2018
Cite
Scott Herford (2018). Educational Attainment in North Carolina Public Schools: Use of statistical modeling, data mining techniques, and machine learning algorithms to explore 2014-2017 North Carolina Public School datasets. [Dataset]. http://doi.org/10.17632/6cm9wyd5g5.1
Unique identifier
https://doi.org/10.17632/6cm9wyd5g5.1
Dataset updated
Nov 14, 2018
Authors
Scott Herford
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
North Carolina
Description
The purpose of data mining analysis is to find patterns in the data using techniques such as classification or regression. It is not always feasible to apply classification algorithms directly to a dataset. Before any work is done on the data, it has to be pre-processed, and this process normally involves feature selection and dimensionality reduction. We tried to use clustering as a way to reduce the dimension of the data and create new features. In our project, using clustering prior to classification did not improve performance much. One possible reason is that the features we selected for clustering are not well suited to it. Because of the nature of the data, classification tasks are going to provide more information to work with in terms of improving knowledge and overall performance metrics. From the dimensionality reduction perspective: clustering differs from Principal Component Analysis, which guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Using clusters to reduce the data dimension loses a lot of information, since clustering techniques are based on a metric of 'distance'. In high dimensions, Euclidean distance loses pretty much all meaning. Therefore, "reducing" dimensionality by mapping data points to cluster numbers is not always good, since you may lose almost all the information. From the creating new features perspective: clustering analysis creates labels based on the patterns of the data, which brings uncertainty into the data. When clustering is used prior to classification, the choice of the number of clusters strongly affects the performance of the clustering, and in turn the performance of the classification. If the subset of features we cluster on is well suited to it, clustering might increase the overall classification performance.
For example, if the features we use k-means on are numerical and the dimension is small, the overall classification performance may be better. We did not fix the clustering outputs with a random_state, in order to see whether they were stable. Our assumption was that if the results vary highly from run to run, which they definitely did, the data may simply not cluster well with the methods selected. In short, our results were not much better than random when clustering was applied in data preprocessing. Finally, it is important to ensure a feedback loop is in place to continuously collect the same data in the same format as that from which the models were created. This feedback loop can be used to measure the models' real-world effectiveness and to revise the models from time to time as things change.
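
The description reports that cluster assignments varied from run to run when no random_state was fixed. One common way to put a number on that kind of instability is the Rand index between the label vectors of two runs. This is a sketch under that assumption and is not necessarily the measure the author used. The function name and the toy label vectors are made up for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two points
    in the same cluster, or both put them in different clusters. Cluster
    numbering does not matter, so relabelled but identical partitions score 1.0.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two points to form a pair")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Toy label vectors standing in for three clustering runs (hypothetical).
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]   # same partition as run_1, clusters renumbered
run_3 = [0, 1, 0, 1, 0, 1]   # a genuinely different partition

same_partition = rand_index(run_1, run_2)
different_partition = rand_index(run_1, run_3)
print(same_partition, different_partition)
```

Values near 1 across many unseeded runs suggest a stable clustering, while low or widely scattered values are consistent with the instability described above. The plain Rand index is not corrected for chance agreement, so an adjusted variant is often preferred in practice.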
Data from: IJEE Educational Data Mining
produccioncientifica.uca.es
Updated 2016
Cite
Palomo-Duarte, Manuel (2016). IJEE Educational Data Mining [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde195
Dataset updated
2016
Authors
Palomo-Duarte, Manuel
Description
Histograms and results of k-means and Ward's clustering for IJEE special issue

The fileset contains information from three sources:

1. Histograms (two files):
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)

2. K-means clustering (eight files):
* elbow-lex.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-lex.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Lexical_clusters (table).xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-gramm.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying k-means method)
* Grammatical_clusters (table).xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* Lexical_Grammatical_clusters (table).xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.

3. Ward’s Agglomerative Hierarchical Clustering (three files):
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table).xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table).xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table).xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
Artificial dataset for clustering algorithms(Complete)
figshare.com
zip
Updated Sep 27, 2018
Cite
Mayra Zegarra Rodriguez; Dalcimar Casanova; Cesar Henrique Comin; Odemir M. Bruno; Diego Raphael Amancio; Luciano da Fontoura Costa; Francisco Aparecido Rodrigues (2018). Artificial dataset for clustering algorithms(Complete) [Dataset]. http://doi.org/10.6084/m9.figshare.7139510.v2
Available download formats: zip
Unique identifier
https://doi.org/10.6084/m9.figshare.7139510.v2
Dataset updated
Sep 27, 2018
Dataset provided by
figshare
Authors
Mayra Zegarra Rodriguez; Dalcimar Casanova; Cesar Henrique Comin; Odemir M. Bruno; Diego Raphael Amancio; Luciano da Fontoura Costa; Francisco Aparecido Rodrigues
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This file contains a number of randomly generated datasets. The properties of each dataset are indicated in the name of each respective file: 'C' indicates the number of classes, 'F' indicates the number of features, 'Ne' indicates the number of objects contained in each class, 'A' is related to the average separation between classes and 'R' is an index used to differentiate distinct random trials. So, for instance, the file C2F10N2Ne5A1.2R0 is a dataset containing 2 classes, 10 features, 5 objects for each class and having a typical separation between classes of 1.2. The methodology used for generating the datasets is described in the accompanying reference.
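
The naming scheme described above can be decoded mechanically. The sketch below is one possible parser, assuming names follow the pattern of the quoted example C2F10N2Ne5A1.2R0 with no file extension. The example name contains an 'N' field that the description does not explain, so the parser keeps it as an optional, uninterpreted value. The class, field, and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class DatasetSpec(NamedTuple):
    classes: int              # 'C'
    features: int             # 'F'
    n_field: Optional[int]    # 'N' appears in the example but is not explained
    objects_per_class: int    # 'Ne'
    separation: float         # 'A'
    trial: int                # 'R'

# Assumed layout, inferred only from the single example in the description.
NAME_PATTERN = re.compile(
    r"^C(\d+)F(\d+)(?:N(\d+))?Ne(\d+)A(\d+(?:\.\d+)?)R(\d+)$"
)

def parse_dataset_name(name: str) -> DatasetSpec:
    """Decode a dataset name such as 'C2F10N2Ne5A1.2R0' into its fields."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised dataset name: {name!r}")
    c, f, n, ne, a, r = match.groups()
    return DatasetSpec(
        classes=int(c),
        features=int(f),
        n_field=int(n) if n is not None else None,
        objects_per_class=int(ne),
        separation=float(a),
        trial=int(r),
    )

spec = parse_dataset_name("C2F10N2Ne5A1.2R0")
print(spec)
```

If the actual files carry an extension or a different field order, the pattern would need to be adjusted accordingly.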
Data Mining Software Market Size and Projections
marketresearchintellect.com
Updated Mar 15, 2025
Cite
Market Research Intellect (2025). Data Mining Software Market Size and Projections [Dataset]. https://www.marketresearchintellect.com/product/global-data-mining-software-market-size-and-forecast/
Dataset updated
Mar 15, 2025
Dataset authored and provided by
Market Research Intellect
License
https://www.marketresearchintellect.com/privacy-policy
Area covered
Global
Description
The size and share of the market is categorized based on Type (Data extraction tools, Predictive analytics software, Text mining tools, Web mining tools, Data clustering tools) and Application (Customer insights, Market research, Trend analysis, Risk management, Pattern recognition) and geographical regions (North America, Europe, Asia-Pacific, South America, and Middle-East and Africa).
Lisbon, Portugal, hotel’s customer dataset with three years of personal,...
data.mendeley.com
Updated Nov 18, 2020
Cite
Nuno Antonio (2020). Lisbon, Portugal, hotel’s customer dataset with three years of personal, behavioral, demographic, and geographic information [Dataset]. http://doi.org/10.17632/j83f5fsh6c.1
Unique identifier
https://doi.org/10.17632/j83f5fsh6c.1
Dataset updated
Nov 18, 2020
Authors
Nuno Antonio
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Portugal, Lisbon
Description
Hotel customer dataset with 31 variables describing a total of 83,590 instances (customers). It covers three full years of customer behavioral data. In addition to personal and behavioral information, the dataset also contains demographic and geographical information. This dataset helps address the lack of real-world business data that can be used for educational and research purposes. The dataset can be used in data mining, machine learning, and other analytical problems within the scope of data science. Because its unit of analysis is the customer, it is especially suitable for building customer segmentation models, including clustering and RFM (Recency, Frequency, and Monetary value) models, but it can also be used in classification and regression problems.
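
To make the RFM idea concrete, the sketch below derives recency, frequency, and monetary value per customer from a list of transactions. The record layout (customer id, date, amount), the reference date, and the sample rows are all invented for illustration; they do not reflect the dataset's actual 31 variables.

```python
from datetime import date

def rfm_table(transactions, today):
    """Aggregate (customer_id, day, amount) rows into per-customer RFM values."""
    summary = {}
    for customer, day, amount in transactions:
        row = summary.setdefault(
            customer, {"last": day, "frequency": 0, "monetary": 0.0}
        )
        row["last"] = max(row["last"], day)   # most recent transaction so far
        row["frequency"] += 1                 # number of transactions
        row["monetary"] += amount             # total amount spent
    return {
        customer: {
            "recency_days": (today - row["last"]).days,
            "frequency": row["frequency"],
            "monetary": row["monetary"],
        }
        for customer, row in summary.items()
    }

# Hypothetical bookings: (customer id, booking date, revenue).
bookings = [
    ("A", date(2018, 1, 10), 120.0),
    ("A", date(2018, 6, 1), 80.0),
    ("B", date(2017, 12, 24), 300.0),
]

scores = rfm_table(bookings, today=date(2018, 12, 31))
for customer, values in sorted(scores.items()):
    print(customer, values)
```

The three resulting values per customer can then be binned into scores or passed to a clustering algorithm for segmentation.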
Ocean Carbon States Database and Toolbox
zenodo.org
data.niaid.nih.gov
zip
Updated Jan 24, 2020
Cite
Anastasia Romanou; Rebecca Latto (2020). Ocean Carbon States Database and Toolbox [Dataset]. http://doi.org/10.5281/zenodo.996892
Available download formats: zip
Unique identifier
https://doi.org/10.5281/zenodo.996892
Dataset updated
Jan 24, 2020
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Anastasia Romanou; Rebecca Latto
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The "Ocean Carbon States Database and Toolbox" includes observational and climate model datasets and MATLAB scripts to compute regimes of the ocean carbon cycle.
Data from: Multi-objective optimization based privacy preserving distributed...
catalog.data.gov
data.nasa.gov
+1more
Updated Dec 7, 2023
Cite
Dashlink (2023). Multi-objective optimization based privacy preserving distributed data mining in Peer-to-Peer networks [Dataset]. https://catalog.data.gov/dataset/multi-objective-optimization-based-privacy-preserving-distributed-data-mining-in-peer-to-p
Dataset updated
Dec 7, 2023
Dataset provided by
Dashlink
Description
This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining
zenodo.org
data.niaid.nih.gov
bin, zip
Updated Jun 7, 2021
Cite
Mariana O. Silva; Laís Mota; Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. http://doi.org/10.5281/zenodo.4904639
Available download formats: zip, bin
Unique identifier
https://doi.org/10.5281/zenodo.4904639
Dataset updated
Jun 7, 2021
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Mariana O. Silva; Laís Mota; Mirella M. Moro
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019; the data were then updated and enhanced in June 2019.

The attractive features of MusicOSet include:

Integration and centralization of different musical data sources

Calculation of popularity scores and classification of hits and non-hits musical elements, varying from 1962 to 2018

Enriched metadata for music, artists, and albums from the US popular music industry

Availability of acoustic and lyrical resources

Unrestricted access in two formats: SQL database and compressed .csv files

| Data              | # Records |
|:-----------------:|:---------:|
| Songs             | 20,405    |
| Artists           | 11,518    |
| Albums            | 26,522    |
| Lyrics            | 19,664    |
| Acoustic Features | 20,405    |
| Genres            | 1,561     |
Data from: Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems
data.nasa.gov
datasets.ai
+1more
application/rdfxml +5
Updated Jun 26, 2018
Cite
(2018). Local L2 Thresholding Based Data Mining in Peer-to-Peer Systems [Dataset]. https://data.nasa.gov/w/exfs-d3uz/default?cur=Q-fHQ7jiCL7
Available download formats: application/rssxml, application/rdfxml, tsv, csv, json, xml
Dataset updated
Jun 26, 2018
License
U.S. Government Works: https://www.usa.gov/government-works
License information was derived automatically
Description
In a large network of computers, wireless sensors, or mobile devices, each of the components (hence, peers) has some data about the global status of the system. Many of the functions of the system, such as routing decisions, search strategies, data cleansing, and the assignment of mutual trust, depend on the global status. Therefore, it is essential that the system be able to detect, and react to, changes in its global status. Computing global predicates in such systems is usually very costly, mainly because of their scale and, in some cases (e.g., sensor networks), because of the high cost of communication. The cost further increases when the data changes rapidly (due to state changes, node failure, etc.) and computation has to follow these changes. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which detects when the L2 norm of the average data surpasses a threshold. Then, we use this algorithm as a feedback loop for the monitoring of complex predicates on the data, such as the data's k-means clustering. The efficiency of the L2 algorithm guarantees that so long as the clustering results represent the data (i.e., the data is stationary) few resources are required. When the data undergoes an epoch change (a change in the underlying distribution) and the model no longer represents it, the feedback loop indicates this and the model is rebuilt. Furthermore, the existence of a feedback loop allows using approximate and "best-effort" methods for constructing the model; if an ill-fit model is built, the feedback loop would indicate so, and the model would be rebuilt.
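
The predicate being monitored, whether the L2 norm of the average data exceeds a threshold, is simple to state. The sketch below evaluates it centrally over a list of per-peer vectors, purely to show what is being tested. It is not the paper's local, distributed algorithm, which is designed precisely to avoid gathering all data in one place. The function names and sample vectors are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise average of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def l2_norm(vector):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))

def average_exceeds_threshold(peer_vectors, threshold):
    """True when the L2 norm of the average peer vector is above the threshold.

    Centralised check for illustration only: every peer's vector is assumed
    to be available in one place, unlike in the distributed setting.
    """
    return l2_norm(mean_vector(peer_vectors)) > threshold

# Hypothetical per-peer data, e.g. each peer's deviation from a model centroid.
peers = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]

alarm_low = average_exceeds_threshold(peers, threshold=4.0)
alarm_high = average_exceeds_threshold(peers, threshold=5.0)
print(alarm_low, alarm_high)
```

In a feedback loop of the kind described, a positive result would signal that the current model no longer fits the data and should be rebuilt.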
fdata-02-00012_Identifying Travel Regions Using Location-Based Social...
figshare.com
frontiersin.figshare.com
pdf
Updated May 31, 2023
Cite
Avradip Sen; Linus W. Dietz (2023). fdata-02-00012_Identifying Travel Regions Using Location-Based Social Network Check-in Data.pdf [Dataset]. http://doi.org/10.3389/fdata.2019.00012.s001
Available download formats: pdf
Unique identifier
https://doi.org/10.3389/fdata.2019.00012.s001
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Avradip Sen; Linus W. Dietz
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Travel regions are not necessarily defined by political or administrative boundaries. For example, in the Schengen region of Europe, tourists can travel freely, irrespective of national borders. Identifying transboundary travel regions is an interesting problem which we aim to solve using mobility analysis of Twitter users. Our proposed solution comprises collecting geotagged tweets, combining them into trajectories and, thus, mining thousands of trips undertaken by Twitter users. After aggregating these trips into a mobility graph, we apply a community detection algorithm to find coherent regions throughout the world. The discovered regions provide insights into international travel and can reveal both domestic and transnational travel regions.
Data from: A Generic Local Algorithm for Mining Data Streams in Large...
datasets.ai
data.nasa.gov
+2more
33
Updated Sep 21, 2024
Cite
National Aeronautics and Space Administration (2024). A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems [Dataset]. https://datasets.ai/datasets/a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems
Available download formats: 33
Dataset updated
Sep 21, 2024
Dataset authored and provided by
National Aeronautics and Space Administration
Description
In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and due to communication cost, which may be high. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
Data_Sheet_2_Qluster: An easy-to-implement generic workflow for robust...
figshare.com
docx
Updated May 31, 2023
Cite
Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker (2023). Data_Sheet_2_Qluster: An easy-to-implement generic workflow for robust clustering of health data.docx [Dataset]. http://doi.org/10.3389/frai.2022.1055294.s002
Available download formats: docx
Unique identifier
https://doi.org/10.3389/frai.2022.1055294.s002
Dataset updated
May 31, 2023
Dataset provided by
Frontiers
Authors
Cyril Esnault; Melissa Rollot; Pauline Guilmin; Jean-Daniel Zucker
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The exploration of health data by clustering algorithms makes it possible to better describe the populations of interest by seeking the sub-profiles that compose them. This therefore reinforces medical knowledge, whether it is about a disease or a targeted population in real life. Nevertheless, contrary to the so-called conventional biostatistical methods where numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant variability in the execution of data science projects, whether in terms of algorithms used or the reliability and credibility of the designed approach. Taking the path of parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. This workflow makes a compromise between (1) genericity of applications (e.g. usable on small or big data, on continuous, categorical or mixed variables, on databases of high dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). This workflow can be easily automated and/or routinely applied on a wide range of clustering projects. It can be useful both for data scientists with little experience in the field to make data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering as well as the scientific rationale supporting the proposed workflow is also provided. Finally, a detailed application of the workflow on a concrete use case is provided, along with a practical discussion for data scientists.
An implementation on the Dataiku platform is available upon request to the authors.
SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...
b2find.dkrz.de
Updated Sep 17, 2024
(2024). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/7e9eb5b9-f166-567e-a521-f3b3be884bf2
Dataset updated
Sep 17, 2024
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
SDOstreamclust Evaluation Tests conducted for the paper: Stream Clustering Robust to Concept Drift

Context and methodology

SDOstreamclust is a stream clustering algorithm able to process data incrementally or per batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust holds the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

In this repository, SDOclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, StreamKMeans. This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

Docker

A Docker version is also available at: https://hub.docker.com/r/fiv5/sdostreamclust

Technical details

Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:
* [algorithms] contains a script with functions related to algorithm configurations.
* [data] contains datasets in ARFF format.
* [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
* "dependencies.sh" lists and installs Python dependencies.
* "pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
* "README.md" shows details and instructions to use this repository.
* "run.sh" runs the complete experiments.
* "run_comp.py" runs the experiments specified by arguments.
* "TSindex.py" implements functions for the Temporal Silhouette index.

Note: if the code in SDOstreamclust is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust consequently reinstalled with pip.
Improved DBSCAN clustering algorithm.
plos.figshare.com
xls
Updated Jun 6, 2023
Xinhuan Zhang; Les Lauber; Hongjie Liu; Junqing Shi; Jinhong Wu; Yuran Pan (2023). Improved DBSCAN clustering algorithm. [Dataset]. http://doi.org/10.1371/journal.pone.0259472.t001
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0259472.t001
Dataset updated
Jun 6, 2023
Dataset provided by
PLOS ONE
Authors
Xinhuan Zhang; Les Lauber; Hongjie Liu; Junqing Shi; Jinhong Wu; Yuran Pan
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Improved DBSCAN clustering algorithm.
Multi-objective optimization based privacy preserving distributed data...
data.staging.idas-ds1.appdat.jsc.nasa.gov
Updated Feb 19, 2025
data.staging.idas-ds1.appdat.jsc.nasa.gov (2025). Multi-objective optimization based privacy preserving distributed data mining in Peer-to-Peer networks - Dataset - NASA Open Data Portal [Dataset]. https://data.staging.idas-ds1.appdat.jsc.nasa.gov/dataset/multi-objective-optimization-based-privacy-preserving-distributed-data-mining-in-peer-to-p
Dataset updated
Feb 19, 2025
Dataset provided by
NASA (http://nasa.gov/)
Description
This paper proposes a scalable, local privacy preserving algorithm for distributed Peer-to-Peer (P2P) data aggregation useful for many advanced data mining/analysis tasks such as average/sum computation, decision tree induction, feature selection, and more. Unlike most multi-party privacy-preserving data mining algorithms, this approach works in an asynchronous manner through local interactions and it is highly scalable. It particularly deals with the distributed computation of the sum of a set of numbers stored at different peers in a P2P network in the context of a P2P web mining application. The proposed optimization based privacy-preserving technique for computing the sum allows different peers to specify different privacy requirements without having to adhere to a global set of parameters for the chosen privacy model. Since distributed sum computation is a frequently used primitive, the proposed approach is likely to have significant impact on many data mining tasks such as multi-party privacy-preserving clustering, frequent itemset mining, and statistical aggregate computation.
ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...
data.niaid.nih.gov
elki-project.github.io
Updated May 2, 2024
Schubert, Erich (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6355683
Dataset updated
May 2, 2024
Dataset provided by
Zimek, Arthur
Schubert, Erich
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
These data sets were originally created for the following publications:

M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

H.-P. Kriegel, E. Schubert, A. Zimek. Evaluation of Multiple Clustering Solutions. In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011, Athens, Greece, 2011.

The outlier data set versions were introduced in:

E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

They are derived from the original image data available at https://aloi.science.uva.nl/

The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

Additional information is available at: https://elki-project.github.io/datasets/multi_view

The following views are currently available:

Each view is listed as feature type, description, and files:

* Object number: Sparse 1000 dimensional vectors that give the true object assignment. Files: objs.arff.gz
* RGB color histograms: Standard RGB color histograms (uniform binning). Files: aloi-8d.csv.gz, aloi-27d.csv.gz, aloi-64d.csv.gz, aloi-125d.csv.gz, aloi-216d.csv.gz, aloi-343d.csv.gz, aloi-512d.csv.gz, aloi-729d.csv.gz, aloi-1000d.csv.gz
* HSV color histograms: Standard HSV/HSB color histograms in various binnings. Files: aloi-hsb-2x2x2.csv.gz, aloi-hsb-3x3x3.csv.gz, aloi-hsb-4x4x4.csv.gz, aloi-hsb-5x5x5.csv.gz, aloi-hsb-6x6x6.csv.gz, aloi-hsb-7x7x7.csv.gz, aloi-hsb-7x2x2.csv.gz, aloi-hsb-7x3x3.csv.gz, aloi-hsb-14x3x3.csv.gz, aloi-hsb-8x4x4.csv.gz, aloi-hsb-9x5x5.csv.gz, aloi-hsb-13x4x4.csv.gz, aloi-hsb-14x5x5.csv.gz, aloi-hsb-10x6x6.csv.gz, aloi-hsb-14x6x6.csv.gz
* Color similarity: Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black). Files: aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other)
* Haralick features: First 13 Haralick features (radius 1 pixel). Files: aloi-haralick-1.csv.gz
* Front to back: Vectors representing front face vs. back faces of individual objects. Files: front.arff.gz
* Basic light: Vectors indicating basic light situations. Files: light.arff.gz
* Manual annotations: Manually annotated object groups of semantically related objects such as cups. Files: manual1.arff.gz

Outlier Detection Versions

Additionally, we generated a number of subsets for outlier detection:

Feature type: RGB Histograms
* Downsampled to 100000 objects (553 outliers). Files: aloi-27d-100000-max10-tot553.csv.gz, aloi-64d-100000-max10-tot553.csv.gz
* Downsampled to 75000 objects (717 outliers). Files: aloi-27d-75000-max4-tot717.csv.gz, aloi-64d-75000-max4-tot717.csv.gz
* Downsampled to 50000 objects (1508 outliers). Files: aloi-27d-50000-max5-tot1508.csv.gz, aloi-64d-50000-max5-tot1508.csv.gz
Data from: Graph Regionalization with Clustering and Partitioning: an...
dataverse.harvard.edu
search.dataone.org
Updated Sep 23, 2015
BENASSI FEDERICO (2015). Graph Regionalization with Clustering and Partitioning: an Application for Daily Commuting Flows in Albania [Dataset]. http://doi.org/10.7910/DVN/3AVOGY
Unique identifier
https://doi.org/10.7910/DVN/3AVOGY
Dataset updated
Sep 23, 2015
Dataset provided by
Harvard Dataverse
Authors
BENASSI FEDERICO
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Area covered
Albania
Description
The paper presents an original application of the recently proposed spatial data mining method named GraphRECAP on daily commuting flows using 2011 Albanian census data. Its aim is to identify clusters of Albanian municipalities/communes and to propose a classification of the Albanian territory based on daily commuting flows among municipalities/communes. Starting from 373 local units, we first applied a spatial clustering technique without imposing any constraining strategy. Based on the input variables, we obtained 16 clusters. In the second step of our analysis, we imposed a set of constraining parameters to identify intermediate areas between the local level (municipality/commune) and the national one. We defined 12 derived regions (the same number as the actual Albanian prefectures but with different geographies). These derived regions are quite different from the traditional ones in terms of both geographical dimensions and boundaries.
Clustering using both numeric and categorical variables using data from...
plos.figshare.com
xls
Updated Jun 16, 2023
Barbara Kachigunda; Kerrie Mengersen; Devindri I. Perera; Grey T. Coupland; Johann van der Merwe; Simon McKirdy (2023). Clustering using both numeric and categorical variables using data from Barrow Island between 2009 and 2015. [Dataset]. http://doi.org/10.1371/journal.pone.0272413.t005
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0272413.t005
Dataset updated
Jun 16, 2023
Dataset provided by
PLOS ONE
Authors
Barbara Kachigunda; Kerrie Mengersen; Devindri I. Perera; Grey T. Coupland; Johann van der Merwe; Simon McKirdy
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Area covered
Barrow Island
Description
Clustering using both numeric and categorical variables using data from Barrow Island between 2009 and 2015.

Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke (2016). Hidden Room Educational Data Mining Analysis [Dataset]. https://produccioncientifica.uca.es/documentos/668fc475b9e7c03b01bde1d4

Hidden Room Educational Data Mining Analysis

Explore at:

Dataset updated

2016

Authors

Palomo-Duarte, Manuel; Berns, Anke; Palomo-Duarte, Manuel; Berns, Anke

Description

Histograms and results of k-means and Ward's clustering for Hidden Room game

The fileset contains information from three sources:

1. Histograms files:
* Lexical_histogram.png (histogram of lexical error ratios)
* Grammatical_histogram.png (histogram of grammatical error ratios)

2. K-means clustering files:
* elbow-lex kmeans.png (clustering by lexical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-lex kmeans.png (clustering by lexical aspects: a three-dimensional representation of clusters obtained after applying the k-means method)
* Lexical_clusters (table) kmeans.xls (clustering by lexical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-gram kmeans.png (clustering by grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* cube-gramm kmeans.png (clustering by grammatical aspects: a three-dimensional representation of clusters obtained after applying the k-means method)
* Grammatical_clusters (table) kmeans.xls (clustering by grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* elbow-lexgram kmeans.png (clustering by lexical and grammatical aspects: error curves obtained by applying the elbow method to determine the optimal number of clusters)
* Lexical_Grammatical_clusters (table) kmeans.xls (clustering by lexical and grammatical aspects: centroids, standard deviations and number of instances assigned to each cluster)
* Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to grammatical error ratios.
* Lexical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls : number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying k-means clustering to lexical and grammatical error ratios.
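
The elbow curves listed above come from running k-means for several cluster counts and plotting the resulting error against k. The following is only a minimal sketch of that idea in plain Python, not the code used to produce this fileset: the function names (kmeans_sse, elbow_curve), the random initialisation, and the use of the within-cluster sum of squared errors as the error measure are all assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(points, k, iters=50, seed=0):
    """Run Lloyd's k-means (hypothetical helper) and return the final
    within-cluster sum of squared errors (SSE)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_curve(points, max_k):
    """Return the SSE for k = 1 .. max_k; the 'elbow' is where the drop flattens."""
    return [kmeans_sse(points, k) for k in range(1, max_k + 1)]
```

Calling elbow_curve(points, 6) on a list of coordinate tuples returns one error value per k, which can then be plotted; the k at which the curve stops dropping sharply is a common heuristic choice for the number of clusters.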

3. Ward’s Agglomerative Hierarchical Clustering files:
* Lexical_Cluster_Dendrogram_ward.png (clustering by lexical aspects: dendrogram obtained after applying Ward's clustering method).
* Grammatical_Cluster_Dendrogram_ward.png (clustering by grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_Cluster_Dendrogram_ward.png (clustering by lexical and grammatical aspects: dendrogram obtained after applying Ward's clustering method)
* Lexical_Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
* Grammatical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_clusters (table) ward.xls: Centroids (from column 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Lexical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
* Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
* Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (from column 2 to 4) and sizes (last column) obtained per each cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
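
For readers unfamiliar with the method behind the dendrograms and tables above, Ward's agglomerative clustering starts with every instance in its own cluster and repeatedly merges the pair of clusters whose union increases the within-cluster variance the least. The sketch below is a naive plain-Python illustration under that definition, not the authors' implementation; the function name ward_clustering and the stopping rule (a target number of clusters) are assumptions.

```python
def ward_clustering(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion (illustrative only).

    Returns a list of clusters, each a list of point tuples.
    """
    clusters = [[tuple(p)] for p in points]

    def centroid(members):
        return tuple(sum(c) / len(members) for c in zip(*members))

    def merge_cost(a, b):
        # Increase in within-cluster sum of squares caused by merging a and b:
        # |a||b| / (|a| + |b|) * squared distance between the two centroids.
        ca, cb = centroid(a), centroid(b)
        d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
        return len(a) * len(b) / (len(a) + len(b)) * d2

    while len(clusters) > n_clusters:
        # Find the pair (i < j) whose merge is cheapest under Ward's criterion.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the merge cost at each step, rather than stopping at a fixed count, would give the heights needed to draw a dendrogram. This pairwise search is cubic in the number of points, so it is only suitable for small inputs.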
