34 datasets found
  1. A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems

    • gimi9.com
    Cite
    A Generic Local Algorithm for Mining Data Streams in Large Distributed Systems | gimi9.com [Dataset]. https://gimi9.com/dataset/data-gov_a-generic-local-algorithm-for-mining-data-streams-in-large-distributed-systems/
    Description

    In a large network of computers or wireless sensors, each of the components (henceforth, peers) has some data about the global state of the system. Much of the system's functionality, such as message routing, information retrieval and load sharing, relies on modeling the global state. We refer to the outcome of the function (e.g., the load experienced by each peer) as the model of the system. Since the state of the system is constantly changing, it is necessary to keep the models up-to-date. Computing global data mining models (e.g., decision trees, k-means clustering) in large distributed systems may be very costly due to the scale of the system and the potentially high communication cost. The cost further increases in a dynamic scenario when the data changes rapidly. In this paper we describe a two-step approach for dealing with these costs. First, we describe a highly efficient local algorithm which can be used to monitor a wide class of data mining models. Then, we use this algorithm as a feedback loop for the monitoring of complex functions of the data such as its k-means clustering. The theoretical claims are corroborated with a thorough experimental analysis.
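
    The two-step idea above — a cheap local check that triggers a full model recomputation only when recent data has drifted — can be sketched as follows. This is an illustrative reading of the abstract, not the paper's algorithm; all names and thresholds are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def mean(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means on 2-D points (the costly global computation)."""
    random.seed(0)  # deterministic init for the sketch
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

def monitor(stream, centroids, threshold):
    """Cheap local check: recompute k-means only when the average squared
    distance of recent points to their nearest centroid exceeds `threshold`."""
    window = []
    for p in stream:
        window.append(p)
        avg = sum(min(dist2(q, c) for c in centroids) for q in window) / len(window)
        if avg > threshold:
            centroids = kmeans(window, len(centroids))
            window = []
    return centroids
```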

  2. Data Mining For Business

    • kaggle.com
    zip
    Updated May 7, 2022
    Cite
    Balal H (2022). Data Mining For Business [Dataset]. https://www.kaggle.com/datasets/balalh/data-mining-for-business
    Explore at:
    zip (142534 bytes)
    Dataset updated
    May 7, 2022
    Authors
    Balal H
    Description

    Dataset

    This dataset was created by Balal H


  3. Data from: Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications

    • acs.figshare.com
    • datasetcatalog.nlm.nih.gov
    zip
    Updated May 30, 2023
    Cite
    Olga A. Tarasova; Nadezhda Yu. Biziukova; Dmitry A. Filimonov; Vladimir V. Poroikov; Marc C. Nicklaus (2023). Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications [Dataset]. http://doi.org/10.1021/acs.jcim.9b00164.s001
    Explore at:
    zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    ACS Publications
    Authors
    Olga A. Tarasova; Nadezhda Yu. Biziukova; Dmitry A. Filimonov; Vladimir V. Poroikov; Marc C. Nicklaus
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0), https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Large amounts of high-quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure–activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.
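
    The abstract-level relevance filtering described above (sorting papers into relevant and irrelevant from their abstracts) can be illustrated with a toy bag-of-words naive Bayes classifier. The authors' actual features and model are not specified here, so this is only a generic sketch with made-up training examples.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label) pairs. Builds per-class word counts and
    class priors for a multinomial naive Bayes with add-one smoothing."""
    counts, totals, priors = {}, Counter(), Counter()
    for text, label in docs:
        priors[label] += 1
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        totals[label] += len(words)
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, priors, vocab

def classify(text, model):
    """Return the label with the highest posterior log-probability."""
    counts, totals, priors, vocab = model
    n = sum(priors.values())
    best, best_lp = None, -math.inf
    for label in priors:
        lp = math.log(priors[label] / n)
        for w in text.lower().split():
            lp += math.log((counts[label][w] + 1) / (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```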

  4. Application Research of Clustering on kmeans

    • kaggle.com
    zip
    Updated Feb 27, 2021
    Cite
    ddpr raju (2021). Application Research of Clustering on kmeans [Dataset]. https://www.kaggle.com/ddprraju/tirupati-compus-school
    Explore at:
    zip (34507 bytes)
    Dataset updated
    Feb 27, 2021
    Authors
    ddpr raju
    Description

    Dataset

    This dataset was created by ddpr raju


  5. Online Retail-xlsx

    • kaggle.com
    zip
    Updated Sep 10, 2023
    Cite
    samira Qasemi (2023). Online Retail-xlsx [Dataset]. https://www.kaggle.com/datasets/samantas2020/online-retail-xlsx/code
    Explore at:
    zip (22875837 bytes)
    Dataset updated
    Sep 10, 2023
    Authors
    samira Qasemi
    Description

    Context

    This Online Retail II data set contains all the transactions occurring for a UK-based and registered non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

    Content

    Attribute Information:

    InvoiceNo:

    Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.

    StockCode:

    Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.

    Description:

    Product (item) name. Nominal.

    Quantity:

    The quantities of each product (item) per transaction. Numeric.

    InvoiceDate:

    Invoice date and time. Numeric. The day and time when a transaction was generated.

    UnitPrice:

    Unit price. Numeric. Product price per unit in sterling.

    CustomerID:

    Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.

    Country:

    Country name. Nominal. The name of the country where a customer resides.
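
    The attribute layout above parses directly with the standard library. A minimal sketch (the real file is an .xlsx workbook; the two rows below are invented for illustration) flags cancellations by the 'C' prefix on InvoiceNo and computes a per-line total:

```python
import csv
import io

# Two made-up rows in the dataset's column layout (the real file is an
# .xlsx workbook; a CSV snippet stands in here for illustration).
sample = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123,WHITE HANGING HEART,6,2010-12-01 08:26,2.55,17850,United Kingdom
C536379,71053,WHITE METAL LANTERN,-1,2010-12-01 09:41,3.39,17850,United Kingdom
"""

def load(text):
    """Parse rows, flag cancellations (InvoiceNo starting with 'C') and
    compute a per-line total of Quantity x UnitPrice."""
    rows = []
    for r in csv.DictReader(io.StringIO(text)):
        r["is_cancellation"] = r["InvoiceNo"].upper().startswith("C")
        r["line_total"] = int(r["Quantity"]) * float(r["UnitPrice"])
        rows.append(r)
    return rows
```

Cancelled invoices carry negative quantities, so their line totals come out negative — useful when aggregating revenue per customer for RFM-style analyses.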

    Acknowledgements

    • Chen, D., Sain, S.L., and Guo, K. (2012), Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208. doi: [Web Link].
    • Chen, D., Guo, K. and Ubakanma, G. (2015), Predicting customer profitability over time based on RFM time series, International Journal of Business Forecasting and Marketing Intelligence, Vol. 2, No. 1, pp. 1-18. doi: [Web Link].
    • Chen, D., Guo, K., and Li, Bo (2019), Predicting Customer Profitability Dynamically over Time: An Experimental Comparative Study, 24th Iberoamerican Congress on Pattern Recognition (CIARP 2019), Havana, Cuba, 28-31 Oct, 2019.
    • Laha Ale, Ning Zhang, Huici Wu, Dajiang Chen, and Tao Han, Online Proactive Caching in Mobile Edge Computing Using Bidirectional Deep Recurrent Neural Network, IEEE Internet of Things Journal, Vol. 6, Issue 3, pp. 5520-5530, 2019.
    • Rina Singh, Jeffrey A. Graves, Douglas A. Talbert, William Eberle, Prefix and Suffix Sequential Pattern Mining, Industrial Conference on Data Mining 2018: Advances in Data Mining. Applications and Theoretical Aspects, pp. 309-324, 2018.

  6. Forecast revenue big data market worldwide 2011-2027

    • statista.com
    Updated Mar 15, 2018
    Cite
    Statista (2018). Forecast revenue big data market worldwide 2011-2027 [Dataset]. https://www.statista.com/statistics/254266/global-big-data-market-forecast/
    Dataset updated
    Mar 15, 2018
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    The global big data market is forecast to grow to 103 billion U.S. dollars by 2027, more than double its expected market size in 2018. With a share of 45 percent, the software segment would become the largest big data market segment by 2027.

    What is big data?

    Big data is a term that refers to data sets that are too large or too complex for traditional data processing applications. It is defined as having one or more of the following characteristics: high volume, high velocity or high variety. Fast-growing mobile data traffic, cloud computing traffic, as well as the rapid development of technologies such as artificial intelligence (AI) and the Internet of Things (IoT) all contribute to the increasing volume and complexity of data sets.

    Big data analytics

    Advanced analytics tools, such as predictive analytics and data mining, help to extract value from the data and generate new business insights. The global big data and business analytics market was valued at 169 billion U.S. dollars in 2018 and is expected to grow to 274 billion U.S. dollars in 2022. As of November 2018, 45 percent of professionals in the market research industry reportedly used big data analytics as a research method.

  7. SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests

    • researchdata.tuwien.ac.at
    zip
    Updated Nov 25, 2025
    Cite
    Felix Iglesias Vazquez (2025). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests [Dataset]. http://doi.org/10.48436/xh0w2-q5x18
    Explore at:
    zip
    Dataset updated
    Nov 25, 2025
    Dataset provided by
    TU Wien
    Authors
    Felix Iglesias Vazquez
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDOstreamclust Evaluation Tests

    Tests conducted for the paper Stream Clustering Robust to Concept Drift. Please refer to:

    Iglesias Vazquez, F., Konzett, S., Zseby, T., & Bifet, A. (2025). Stream Clustering Robust to Concept Drift. In 2025 International Joint Conference on Neural Networks (IJCNN) (pp. 1–10). IEEE. https://doi.org/10.1109/IJCNN64981.2025.11227664

    Context and methodology

    SDOstreamclust is a stream clustering algorithm able to process data incrementally or in batches. It combines the earlier SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and constructed upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

    In this repository, SDOstreamclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, and StreamKMeans.

    This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.
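
    Batch-wise stream clustering of the kind evaluated here can be sketched with a minimal sequential k-means update — a generic stand-in for SDOstreamclust and the baselines, not SDOstreamclust's actual API:

```python
def online_kmeans_step(centroids, counts, point):
    """Assign `point` to its nearest centroid and move that centroid toward
    the point by a 1/count step (standard sequential k-means update)."""
    j = min(range(len(centroids)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], point)))
    counts[j] += 1
    eta = 1.0 / counts[j]
    centroids[j] = tuple(c + eta * (p - c) for c, p in zip(centroids[j], point))
    return j

def cluster_stream(batches, centroids):
    """Process the stream batch by batch, labeling each point as it arrives."""
    counts = [1] * len(centroids)
    labels = []
    for batch in batches:
        labels.append([online_kmeans_step(centroids, counts, p) for p in batch])
    return labels, centroids
```

Because the model updates after every point, the centroids can track gradual drift; algorithms like SDOstreamclust add mechanisms on top of this basic idea to cope with noise and abrupt concept changes.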

    Docker

    A Docker version is also available in: https://hub.docker.com/r/fiv5/sdostreamclust

    Technical details

    Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:

    • [algorithms] contains a script with functions related to algorithm configurations.

    • [data] contains datasets in ARFF format.
    • [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
    • "dependencies.sh" lists and installs Python dependencies.
    • "pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
    • "README.md" shows details and instructions for using this repository.
    • "run.sh" runs the complete experiments.
    • "run_comp.py" runs experiments specified by arguments.
    • "TSindex.py" implements functions for the Temporal Silhouette index.

    Note: if the code in SDOstreamclust is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust reinstalled with pip.

    License

    The CC-BY license applies to all data generated with MDCgen. All distributed code is under the GPLv3+ license.

  8. OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal

    • data.amerigeoss.org
    • data.wu.ac.at
    html
    Updated Jul 25, 2019
    Cite
    United States[old] (2019). OceanXtremes: Oceanographic Data-Intensive Anomaly Detection and Analysis Portal [Dataset]. https://data.amerigeoss.org/pl/dataset/0f24d562-556c-4895-955a-74fec4cc9993
    Explore at:
    html
    Dataset updated
    Jul 25, 2019
    Dataset provided by
    United States[old]
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Description

    Anomaly detection is a process of identifying items, events or observations which do not conform to an expected pattern in a dataset or time series. Current and future missions and our research communities challenge us to rapidly identify features and anomalies in complex and voluminous observations to further science and improve decision support. Given this data-intensive reality, we propose to develop an anomaly detection system, called OceanXtremes, powered by an intelligent, elastic cloud-based analytic service backend that enables execution of domain-specific, multi-scale anomaly and feature detection algorithms across the entire archive of ocean science datasets.

    A parallel analytics engine will be developed as the key computational and data-mining core of OceanXtremes' backend processing. This analytic engine will demonstrate three new technology ideas to provide rapid turnaround on climatology computation and anomaly detection:

    1. An adaptation of the Hadoop/MapReduce framework for parallel data mining of science datasets, typically large 3- or 4-dimensional arrays packaged in NetCDF and HDF.
    2. An algorithm profiling service to efficiently and cost-effectively scale up hybrid cloud computing resources based on the needs of scheduled jobs (CPU, memory, network, and bursting from a private cloud computing cluster to a public cloud provider like Amazon Cloud services).
    3. An extension to industry-standard search solutions (OpenSearch and faceted search) to provide support for shared discovery and exploration of ocean phenomena and anomalies, along with unexpected correlations between key measured variables.

    We will use a hybrid cloud compute cluster (private Eucalyptus on-premise at JPL with bursting to Amazon Web Services) as the operational backend. The key idea is that the parallel data-mining operations will be run 'near' the ocean data archives (a local 'network' hop) so that we can efficiently access the thousands of (say, daily) files making up a three-decade time series, and then cache key variables and pre-computed climatologies in a high-performance parallel database.

    OceanXtremes will be equipped with both web portal and web service interfaces for users and applications/systems to register and retrieve oceanographic anomaly data. By leveraging technology such as Datacasting (Bingham et al., 2007), users can also subscribe to anomaly or 'event' types of their interest and have newly computed anomaly metrics and other information delivered to them by metadata feeds packaged in standard Rich Site Summary (RSS) format. Upon receiving new feed entries, users can examine the metrics and download relevant variables, by simply clicking on a link, to begin further analyzing the event.

    The OceanXtremes web portal will allow users to define their own anomaly or feature types, where continuous backend processing will be scheduled to populate the new user-defined anomaly type by executing the chosen data mining algorithm (i.e. differences from climatology or gradients above a specified threshold). Metadata on the identified anomalies will be cataloged, including temporal and geospatial profiles, key physical metrics, related observational artifacts and other relevant metadata, to facilitate discovery, extraction, and visualization. Products created by the anomaly detection algorithm will be made explorable and subsettable using Webification (Huang et al., 2014) and OPeNDAP (http://opendap.org) technologies. Using this platform scientists can efficiently search for anomalies or ocean phenomena, compute data metrics for events or over time series of ocean variables, and efficiently find and access all of the data relevant to their study (and then download only that data).
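
    The "differences from climatology" detection mentioned above can be sketched in a few lines: compute a per-cycle-position mean, then flag values whose deviation exceeds a threshold. This is a deliberately simplified, hypothetical stand-in for the proposed engine:

```python
def climatology(series, period=12):
    """Mean value per position in the cycle (e.g. per calendar month)."""
    sums, counts = [0.0] * period, [0] * period
    for i, v in enumerate(series):
        sums[i % period] += v
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]

def anomalies(series, period=12, threshold=2.0):
    """Flag (index, deviation) pairs whose departure from the climatology
    exceeds `threshold`."""
    clim = climatology(series, period)
    return [(i, v - clim[i % period]) for i, v in enumerate(series)
            if abs(v - clim[i % period]) > threshold]
```

In the real system the climatology would be pre-computed over decades of files and cached, so that only the comparison step runs per query.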

  9. Albania Enterprises: Mining and Quarrying: Investment: Means of Transport

    • ceicdata.com
    Updated Mar 15, 2021
    Cite
    CEICdata.com (2021). Albania Enterprises: Mining and Quarrying: Investment: Means of Transport [Dataset]. https://www.ceicdata.com/en/albania/enterprises-income-and-investment-by-industry-nace-2/enterprises-mining-and-quarrying-investment-means-of-transport
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2012 - Dec 1, 2022
    Area covered
    Albania
    Variables measured
    Enterprises Statistics
    Description

    Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data was reported at 422.000 ALL mn in 2022. This records an increase from the previous number of 236.211 ALL mn for 2021. Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data is updated yearly, averaging 363.000 ALL mn from Dec 2012 (Median) to 2022, with 11 observations. The data reached an all-time high of 1,157.821 ALL mn in 2019 and a record low of 230.000 ALL mn in 2016. Albania Enterprises: Mining and Quarrying: Investment: Means of Transport data remains active status in CEIC and is reported by Institute of Statistics. The data is categorized under Global Database’s Albania – Table AL.O011: Enterprises Income and Investment: by Industry: NACE 2.

  10. Global Green Mining Market Demand and Supply Dynamics 2025-2032

    • statsndata.org
    excel, pdf
    Updated Oct 2025
    Cite
    Stats N Data (2025). Global Green Mining Market Demand and Supply Dynamics 2025-2032 [Dataset]. https://www.statsndata.org/report/green-mining-market-4548
    Explore at:
    pdf, excel
    Dataset updated
    Oct 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Green Mining market is rapidly emerging as a pivotal sector, driven by the global need for sustainable mining practices. As environmental concerns intensify, industries are increasingly recognizing the importance of minimizing their ecological footprint. Green mining refers to the implementation of innovative te

  11. Global Mining Laboratory Automation Market Demand and Supply Dynamics 2025-2032

    • statsndata.org
    excel, pdf
    Updated Oct 2025
    Cite
    Stats N Data (2025). Global Mining Laboratory Automation Market Demand and Supply Dynamics 2025-2032 [Dataset]. https://www.statsndata.org/report/mining-laboratory-automation-market-195284
    Explore at:
    pdf, excel
    Dataset updated
    Oct 2025
    Dataset authored and provided by
    Stats N Data
    License

    https://www.statsndata.org/how-to-order

    Area covered
    Global
    Description

    The Mining Laboratory Automation market is witnessing robust growth, driven by the increasing demand for efficiency, precision, and safety in mineral exploration and production processes. Laboratory automation in mining refers to the use of advanced technologies and automated systems to streamline laboratory operati

  12. FakeNewsNet

    • kaggle.com
    • dataverse.harvard.edu
    zip
    Updated Nov 2, 2018
    Cite
    Deepak Mahudeswaran (2018). FakeNewsNet [Dataset]. https://www.kaggle.com/mdepak/fakenewsnet
    Explore at:
    zip (17409594 bytes)
    Dataset updated
    Nov 2, 2018
    Authors
    Deepak Mahudeswaran
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0), https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    FakeNewsNet

    This is a repository for an ongoing data collection project for fake news research at ASU. We describe and compare FakeNewsNet with other existing datasets in Fake News Detection on Social Media: A Data Mining Perspective. We also perform a detailed analysis of the FakeNewsNet dataset, and build a fake news detection model on it in Exploiting Tri-Relationship for Fake News Detection.

    A JSON version of this dataset is available on GitHub. The new version of the dataset described in FakeNewsNet will be published soon, or you can email the authors for more information.

    News Content

    It includes all the fake news articles, with the news content attributes as follows:

    1. source: It indicates the author or publisher of the news article.
    2. headline: It refers to the short text that aims to catch the attention of readers and relates well to the major topic of the news.
    3. body_text: It elaborates the details of the news story. Usually there is a major claim which shaped the angle of the publisher and is specifically highlighted and elaborated upon.
    4. image_video: It is an important part of the body content of a news article, which provides visual cues to frame the story.

    Social Context

    It includes the social engagements of fake news articles from Twitter. We extract profiles, posts and social network information for all relevant users.

    1. user_profile: It includes a set of profile fields that describe the users' basic information.
    2. user_content: It collects the users' recent posts on Twitter.
    3. user_followers: It includes the follower list of the relevant users.
    4. user_followees: It includes the list of users that are followed by the relevant users.
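
    The two record types described above can be sketched as plain dictionaries. The field values below are invented, and the actual JSON keys and layout in the repository may differ from this sketch:

```python
# Hypothetical field values; the actual JSON keys and layout in the
# repository may differ from this sketch.
article = {
    "source": "example-publisher.com",
    "headline": "Example headline text",
    "body_text": "Full story text containing the major claim...",
    "image_video": ["img_001.jpg"],
}

engagement = {
    "user_profile": {"id": 42, "name": "example_user"},
    "user_content": ["recent tweet text"],
    "user_followers": [7, 8, 9],
    "user_followees": [3, 4],
}

def validate(record, required):
    """Return the list of attributes missing from a record."""
    return [k for k in required if k not in record]
```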

    References

    If you use this dataset, please cite the following papers:

    @article{shu2017fake, title={Fake News Detection on Social Media: A Data Mining Perspective}, author={Shu, Kai and Sliva, Amy and Wang, Suhang and Tang, Jiliang and Liu, Huan}, journal={ACM SIGKDD Explorations Newsletter}, volume={19}, number={1}, pages={22--36}, year={2017}, publisher={ACM} }

    @article{shu2017exploiting, title={Exploiting Tri-Relationship for Fake News Detection}, author={Shu, Kai and Wang, Suhang and Liu, Huan}, journal={arXiv preprint arXiv:1712.07709}, year={2017} }

    @article{shu2018fakenewsnet, title={FakeNewsNet: A Data Repository with News Content, Social Context and Dynamic Information for Studying Fake News on Social Media}, author={Shu, Kai and Mahudeswaran, Deepak and Wang, Suhang and Lee, Dongwon and Liu, Huan}, journal={arXiv preprint arXiv:1809.01286}, year={2018} }

  13. Peru GDP Index: Value: Mining Extraction

    • ceicdata.com
    Updated Jul 8, 2018
    Cite
    CEICdata.com (2018). Peru GDP Index: Value: Mining Extraction [Dataset]. https://www.ceicdata.com/en/peru/1994-reference-gdp-index-by-industry-annual/gdp-index-value-mining-extraction
    Dataset updated
    Jul 8, 2018
    Dataset provided by
    CEICdata.com
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Dec 1, 2000 - Dec 1, 2011
    Area covered
    Peru
    Variables measured
    Gross Domestic Product
    Description

    Peru GDP Index: Value: Mining Extraction data was reported at 491.000 1994=100 in 2011. This records an increase from the previous number of 400.300 1994=100 for 2010. Peru GDP Index: Value: Mining Extraction data is updated yearly, averaging 128.500 1994=100 from Dec 1991 (Median) to 2011, with 21 observations. The data reached an all-time high of 491.000 1994=100 in 2011 and a record low of 31.500 1994=100 in 1991. Peru GDP Index: Value: Mining Extraction data remains active status in CEIC and is reported by National Institute of Statistics and Information Science. The data is categorized under Global Database’s Peru – Table PE.A030: 1994 Reference: GDP Index: by Industry: Annual.

  14. Hidden Room Educational Data Mining Analysis

    • figshare.com
    png
    Updated Sep 27, 2016
    Cite
    Manuel Palomo-duarte; Anke Berns (2016). Hidden Room Educational Data Mining Analysis [Dataset]. http://doi.org/10.6084/m9.figshare.3084319.v11
    Explore at:
    png
    Dataset updated
    Sep 27, 2016
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Manuel Palomo-duarte; Anke Berns
    License

    Attribution 4.0 (CC BY 4.0), https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Histograms and results of k-means and Ward's clustering for the Hidden Room game. The fileset contains information from three sources:

    1. Histogram files:
    • Lexical_histogram.png: histogram of lexical error ratios.
    • Grammatical_histogram.png: histogram of grammatical error ratios.

    2. K-means clustering files:
    • elbow-lex kmeans.png: clustering by lexical aspects; error curves obtained by applying the elbow method to determine the optimal number of clusters.
    • cube-lex kmeans.png: clustering by lexical aspects; a three-dimensional representation of the clusters obtained after applying the k-means method.
    • Lexical_clusters (table) kmeans.xls: clustering by lexical aspects; centroids, standard deviations and number of instances assigned to each cluster.
    • elbow-gram kmeans.png: clustering by grammatical aspects; error curves obtained by applying the elbow method to determine the optimal number of clusters.
    • cube-gramm kmeans.png: clustering by grammatical aspects; a three-dimensional representation of the clusters obtained after applying the k-means method.
    • Grammatical_clusters (table) kmeans.xls: clustering by grammatical aspects; centroids, standard deviations and number of instances assigned to each cluster.
    • elbow-lexgram kmeans.png: clustering by lexical and grammatical aspects; error curves obtained by applying the elbow method to determine the optimal number of clusters.
    • Lexical_Grammatical_clusters (table) kmeans.xls: clustering by lexical and grammatical aspects; centroids, standard deviations and number of instances assigned to each cluster.
    • Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (columns 2 to 4) and sizes (last column) obtained per cluster by applying k-means clustering to grammatical error ratios.
    • Lexical_clusters_number_of_words (table) kmeans.xls: number of words (columns 2 to 4) and sizes (last column) obtained per cluster by applying k-means clustering to lexical error ratios.
    • Lexical_Grammatical_clusters_number_of_words (table) kmeans.xls: number of words (columns 2 to 4) and sizes (last column) obtained per cluster by applying k-means clustering to lexical and grammatical error ratios.

    3. Ward's agglomerative hierarchical clustering files:
    • Lexical_Cluster_Dendrogram_ward.png: clustering by lexical aspects; dendrogram obtained after applying Ward's clustering method.
    • Grammatical_Cluster_Dendrogram_ward.png: clustering by grammatical aspects; dendrogram obtained after applying Ward's clustering method.
    • Lexical_Grammatical_Cluster_Dendrogram_ward.png: clustering by lexical and grammatical aspects; dendrogram obtained after applying Ward's clustering method.
    • Lexical_Grammatical_clusters (table) ward.xls: centroids (columns 2 to 7) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
    • Grammatical_clusters (table) ward.xls: centroids (columns 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    • Lexical_clusters (table) ward.xls: centroids (columns 2 to 4) and cluster sizes (last column) obtained by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    • Lexical_clusters_number_of_words (table) ward.xls: number of words (columns 2 to 4) and sizes (last column) obtained per cluster by applying Ward's agglomerative hierarchical clustering to lexical error ratios.
    • Grammatical_clusters_number_of_words (table) ward.xls: number of words (columns 2 to 4) and sizes (last column) obtained per cluster by applying Ward's agglomerative hierarchical clustering to grammatical error ratios.
    • Lexical_Grammatical_clusters_number_of_words (table) ward.xls: number of words (columns 2 to 4) and sizes (last column) obtained per cluster by applying Ward's agglomerative hierarchical clustering to lexical and grammatical error ratios.
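
    The elbow-method step used throughout the k-means files can be sketched as follows: run k-means for increasing k and look for where the within-cluster error curve flattens. This is a generic 1-D illustration with deterministic initialization, not the fileset's own code:

```python
def kmeans_1d(xs, k, iters=25):
    """Lloyd's k-means on 1-D data with evenly spaced initial centroids;
    returns the centroids and the within-cluster sum of squared errors."""
    srt = sorted(xs)
    cents = [srt[(i * len(xs)) // k] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[min(range(k), key=lambda i: (x - cents[i]) ** 2)].append(x)
        cents = [sum(g) / len(g) if g else cents[i] for i, g in enumerate(groups)]
    sse = sum(min((x - c) ** 2 for c in cents) for x in xs)
    return cents, sse

def elbow_curve(xs, kmax):
    """SSE for k = 1..kmax; the 'elbow' where the curve flattens suggests
    the number of clusters to keep."""
    return [kmeans_1d(xs, k)[1] for k in range(1, kmax + 1)]
```

For data with three well-separated groups, the curve drops sharply up to k = 3 and barely improves afterwards — that bend is the elbow.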
  15. Mining Basin (as defined in the Mining Basin Mission)

    • data.europa.eu
    ogc:wms
    Updated Aug 17, 2025
    Cite
    Géo2France (2025). Mining Basin (as defined in the Mining Basin Mission) [Dataset]. https://data.europa.eu/data/datasets/https-www-geo2france-fr-erbm-bassin_minier/embed
    Explore at:
    ogc:wms
    Dataset updated
    Aug 17, 2025
    Dataset authored and provided by
    Géo2France
    Description

    Contour of the mining basin.

  16. Data from: Data mining the effects of testing conditions and specimen properties on brain biomechanics

    • tandf.figshare.com
    docx
    Updated May 31, 2023
    Cite
    Folly Patterson; Osama AbuOmar; Mike Jones; Keith Tansey; R.K. Prabhu (2023). Data mining the effects of testing conditions and specimen properties on brain biomechanics [Dataset]. http://doi.org/10.6084/m9.figshare.8221103.v1
    Explore at:
    docx
    Available download formats
    Dataset updated
    May 31, 2023
    Dataset provided by
    Taylor & Francishttps://taylorandfrancis.com/
    Authors
    Folly Patterson; Osama AbuOmar; Mike Jones; Keith Tansey; R.K. Prabhu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Traumatic brain injury is highly prevalent in the United States. However, despite its frequency and significance, there is little understanding of how the brain responds during injurious loading. A confounding problem is that because testing conditions vary between assessment methods, brain biomechanics cannot be fully understood. Data mining techniques, which are commonly used to determine patterns in large datasets, were applied to discover how changes in testing conditions affect the mechanical response of the brain. Data at various strain rates were collected from published literature and sorted into datasets based on strain rate and tension vs. compression. Self-organizing maps were used to conduct a sensitivity analysis to rank the testing condition parameters by importance. Fuzzy C-means clustering was applied to determine if there were any patterns in the data. The parameter rankings and clustering for each dataset varied, indicating that the strain rate and type of deformation influence the role of these parameters in the datasets.
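Fuzzy C-means, unlike hard k-means, gives each specimen a degree of membership in every cluster. A minimal NumPy sketch of the algorithm, run on synthetic two-group data standing in for the published measurements (none of the actual data is reproduced here):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100):
    """Minimal fuzzy c-means: returns cluster centers and the membership matrix."""
    centers = X[:: max(len(X) // c, 1)][:c].copy()  # naive init: evenly spaced rows
    for _ in range(iters):
        # Distances from every point to every center (small epsilon avoids /0).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)    # standard FCM membership update
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # fuzzily weighted means
    return centers, U

# Two well-separated synthetic groups standing in for the collected data.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.1, (25, 2)), rng.normal(3.0, 0.1, (25, 2))])
centers, U = fuzzy_c_means(X, c=2)
print(np.round(np.sort(centers[:, 0]), 1))
```

With `m=2` (the usual default), the membership rows sum to one, and the two recovered centers land near the two group means.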

  17. Table_1_Assessing the Multiple Dimensions of Poverty. Data Mining Approaches...

    • frontiersin.figshare.com
    docx
    Updated May 30, 2023
    Cite
    Carina Källestål; Elmer Zelaya Blandón; Rodolfo Peña; Wilton Peréz; Mariela Contreras; Lars-Åke Persson; Oleg Sysoev; Katarina Ekholm Selling (2023). Table_1_Assessing the Multiple Dimensions of Poverty. Data Mining Approaches to the 2004–14 Health and Demographic Surveillance System in Cuatro Santos, Nicaragua.DOCX [Dataset]. http://doi.org/10.3389/fpubh.2019.00409.s002
    Explore at:
    docx
    Available download formats
    Dataset updated
    May 30, 2023
    Dataset provided by
    Frontiers Mediahttp://www.frontiersin.org/
    Authors
    Carina Källestål; Elmer Zelaya Blandón; Rodolfo Peña; Wilton Peréz; Mariela Contreras; Lars-Åke Persson; Oleg Sysoev; Katarina Ekholm Selling
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We identified clusters of multiple dimensions of poverty according to the capability approach theory by applying data mining approaches to the Cuatro Santos Health and Demographic Surveillance database, Nicaragua. Four municipalities in northern Nicaragua constitute the Cuatro Santos area, with 25,893 inhabitants in 5,966 households (2014). A local process analyzing poverty-related problems and prioritizing suggested actions was initiated in 1997 and generated a community action plan for 2002–2015. Interventions were school breakfasts, environmental protection, water and sanitation, preventive healthcare, home gardening, microcredit, technical training, university education stipends, and use of the Internet. In 2004, a survey of basic health and demographic information was performed in the whole population, followed by surveillance updates in 2007, 2009, and 2014 linking households and individuals. Information included the house material (floor, walls) and services (water, sanitation, electricity) as well as demographic data (births, deaths, migration). Data on participation in interventions, food security, household assets, and women's self-rated health were collected in 2014. A K-means algorithm was used to cluster the household data (56 variables) into six clusters. The poverty ranking of household clusters using the unsatisfied basic needs index variables changed when including variables describing basic capabilities. The households in the fairly rich cluster, with assets such as motorbikes and computers, were described as modern. Those in the fairly poor cluster, having different degrees of food insecurity, were labeled vulnerable. The poor and poorest clusters of households were traditional, e.g., in using horses for transport. Results displayed a society transforming from traditional to modern, where the forerunners were not the richest but were educated, had more working members in the household, had fewer children, and were food secure. Those lagging were the poor, traditional, and food insecure. The approach may be useful for an improved understanding of poverty and to direct local policy and interventions.
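A K-means run of the kind described, six clusters over a households-by-variables table, can be sketched with scikit-learn; the data below are random stand-ins, since the surveillance database itself is not public:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Random stand-in for the households-by-variables table (5,966 x 56 in the study).
X = rng.random((500, 56))

# Mixed indicators are typically standardized so no variable dominates the distance.
Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(Xs)
print(np.bincount(km.labels_))  # cluster sizes, as in the study's six-cluster solution
```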

  18. Prediction of Online Orders

    • kaggle.com
    zip
    Updated May 23, 2023
    Cite
    Oscar Aguilar (2023). Prediction of Online Orders [Dataset]. https://www.kaggle.com/datasets/oscarm524/prediction-of-orders/versions/3
    Explore at:
    zip (6680913 bytes)
    Available download formats
    Dataset updated
    May 23, 2023
    Authors
    Oscar Aguilar
    Description

    The visit of an online shop by a possible customer is also called a session. During a session the visitor clicks on products in order to see the corresponding detail pages. Furthermore, they may add products to or remove products from their shopping basket. At the end of a session it is possible that one or several products from the shopping basket will be ordered. The activities of the user are also called transactions. The goal of the analysis is to predict, on the basis of the transaction data collected during the session, whether the visitor will place an order or not.

    Tasks

    In the first task, historical shop data are given, consisting of the session activities together with the associated information on whether an order was placed or not. These data can be used to subsequently make order forecasts for other session activities in the same shop. Of course, the real outcome of the sessions in this second set is not known. Thus, the first task can be understood as a classical data mining problem.

    The second task deals with the online scenario. In this context the participants are to implement an agent that learns on the basis of transactions. That means that the agent successively receives the individual transactions and has to make a forecast for each of them with respect to the outcome of the shopping cart transaction. This task best reflects the practical scenario in which a transaction-based forecast is required and a corresponding algorithm should learn in an adaptive manner.

    The Data

    For the individual tasks anonymised real shop data are provided in the form of structured text files consisting of individual data sets. The data sets represent in each case transactions in the shop and may contain redundant information. For the data, in particular the following applies:

    1. Each data set is in an individual line that is closed by “LF” (“line feed”, 0xA), “CR” (“carriage return”, 0xD), or “CR” and “LF” (“carriage return” and “line feed”, 0xD and 0xA).
    2. The first line is structured analogously to the data sets but contains the names of the respective columns (data arrays).
    3. The header and each data set contain several arrays separated by the symbol “|”.
    4. There is no escape character, and no quoting is used.
    5. ASCII is used as the character set.
    6. There may be missing values. These are marked by the symbol “?”.

    In concrete terms, only the array names of the attached document “*features.pdf*” in their respective sequence will be used as column headings. The corresponding value ranges are listed there, too.

    The training file for task 1 (“*transact_train.txt*”) contains all data arrays of the document, whereas the corresponding classification file (“*transact_class.txt*”) of course does not contain the target attribute “*order*”.

    In task 2 data in the form of a string array are transferred to the implementations of the participants by means of a method. The individual fields of the array contain the same data arrays that are listed in “*features.pdf*”–also without the target attribute “*order*”–and exactly in the sequence used there.
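Given the rules above (the “|” separator and the “?” missing-value marker), a file like *transact_train.txt* could be loaded with pandas roughly as follows; the column names in the snippet are made up for illustration, since the real ones are listed in *features.pdf*:

```python
import io
import pandas as pd

# Tiny stand-in for transact_train.txt; the real column names are in features.pdf,
# so sessionNo/duration/basketSum here are purely illustrative.
raw = "sessionNo|duration|basketSum|order\n1|120|19.99|y\n2|?|5.50|n\n"

df = pd.read_csv(io.StringIO(raw), sep="|", na_values="?")
print(df["duration"].isna().sum())  # the "?" in the second row becomes a missing value
```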

    Acknowledgement

    This dataset is publicly available on the DATA MINING CUP website.

  19. Data from: Wine Quality

    • kaggle.com
    • tensorflow.org
    zip
    Updated Oct 29, 2017
    + more versions
    Cite
    Daniel S. Panizzo (2017). Wine Quality [Dataset]. https://www.kaggle.com/datasets/danielpanizzo/wine-quality
    Explore at:
    zip (111077 bytes)
    Available download formats
    Dataset updated
    Oct 29, 2017
    Authors
    Daniel S. Panizzo
    License

    http://opendatacommons.org/licenses/dbcl/1.0/http://opendatacommons.org/licenses/dbcl/1.0/

    Description

    Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. Please include this citation if you plan to use this database:

    P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

    Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

    1. Title: Wine Quality

    2. Sources Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

    3. Past Usage:

      P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

      In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

    4. Relevant Information:

      The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

      These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.

    5. Number of Instances: red wine - 1599; white wine - 4898.

    6. Number of Attributes: 11 + output attribute

      Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

    7. Attribute information:

      For more information, read [Cortez et al., 2009].

      Input variables (based on physicochemical tests):
      1 - fixed acidity (tartaric acid - g / dm^3)
      2 - volatile acidity (acetic acid - g / dm^3)
      3 - citric acid (g / dm^3)
      4 - residual sugar (g / dm^3)
      5 - chlorides (sodium chloride - g / dm^3)
      6 - free sulfur dioxide (mg / dm^3)
      7 - total sulfur dioxide (mg / dm^3)
      8 - density (g / cm^3)
      9 - pH
      10 - sulphates (potassium sulphate - g / dm^3)
      11 - alcohol (% by volume)

      Output variable (based on sensory data):
      12 - quality (score between 0 and 10)

    8. Missing Attribute Values: None

    9. Description of attributes:

      1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

      2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegary taste

      3 - citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines

      4 - residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

      5 - chlorides: the amount of salt in the wine

      6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

      7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

      8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

      9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

      10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

      11 - alcohol: the percent alcohol content of the wine

      Output variable (based on sensory data): 12 - quality (score between 0 and 10)
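The note in item 6, that several attributes may be correlated, can be checked directly from a correlation matrix before any feature selection. A sketch with synthetic stand-in columns (the construction of the free/total sulfur dioxide pair below is illustrative, not taken from the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in frame: total SO2 is built from free SO2 so the pair is strongly
# correlated, mimicking the note that several attributes may be correlated.
free_so2 = rng.normal(30, 10, 200)
df = pd.DataFrame({
    "free sulfur dioxide": free_so2,
    "total sulfur dioxide": free_so2 * 2 + rng.normal(0, 3, 200),
    "pH": rng.normal(3.3, 0.15, 200),
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is counted once,
# and flag any column correlated above 0.9 as a candidate for dropping.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)
```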

  20. Forecasting Book Sales

    • kaggle.com
    zip
    Updated May 27, 2023
    Cite
    Oscar Aguilar (2023). Forecasting Book Sales [Dataset]. https://www.kaggle.com/datasets/oscarm524/forecasting-book-sales/code
    Explore at:
    zip (2246520 bytes)
    Available download formats
    Dataset updated
    May 27, 2023
    Authors
    Oscar Aguilar
    Description

    Because of the sheer number of products available, the German book market is one of the largest lines of business in trade today. In order to display a highly individual profile to customers and, at the same time, keep the effort involved in selecting and ordering as low as possible, the key to success for the bookshop lies in effective purchasing from a choice of roughly 96,000 new titles each year. The challenge for the bookseller is to buy the right amount of the right books at the right time.

    It is with this in mind that this year’s DATA MINING CUP Competition will be held in cooperation with Libri, Germany’s leading book wholesaler. Among Libri’s many successful support measures for booksellers, purchase recommendations give the bookshop a competitive advantage. Accordingly, the DATA MINING CUP 2009 challenge will be to forecast purchase quantities of a clearly defined title portfolio per location, using simulated data.

    The Task

    The task of the DATA MINING CUP Competition 2009 is to forecast purchase quantities for 8 titles at 2,418 different locations. In order to create the model, simulated purchase data from an additional 2,394 locations will be supplied. All data refer to a fixed period of time. The objective is to forecast the purchase quantities of these 8 titles for the 2,418 locations as exactly as possible.

    The Data

    There are two text files available to assist in solving the problem: dmc2009_train.txt (train data file) and dmc2009_forecast.txt (data of 2,418 locations for whom a prediction is to be made).
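As a starting point for the forecasting task, a naive per-title baseline (each title's mean quantity across the training locations) might look like this; the column names are assumptions, since the real layout of dmc2009_train.txt is not shown here:

```python
import pandas as pd

# Tiny stand-in for dmc2009_train.txt; location/title/quantity are assumed names.
train = pd.DataFrame({
    "location": [1, 1, 2, 2],
    "title":    ["A", "B", "A", "B"],
    "quantity": [4, 1, 6, 3],
})

# Naive baseline: forecast each title's mean purchase quantity over the
# training locations, applied uniformly to the 2,418 forecast locations.
baseline = train.groupby("title")["quantity"].mean()
print(baseline.to_dict())  # -> {'A': 5.0, 'B': 2.0}
```

A competitive entry would of course condition on location features rather than forecasting one number per title, but the baseline gives something to beat.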

    Acknowledgement

    This data is publicly available on the DATA MINING CUP website.
