100+ datasets found
  1. wine-clustering

    • huggingface.co
    Updated Sep 12, 2024
    Cite
    Trevor (2024). wine-clustering [Dataset]. https://huggingface.co/datasets/mltrev23/wine-clustering
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Sep 12, 2024
    Authors
    Trevor
    Description

    Wine Clustering Dataset

      Overview
    

    The Wine Clustering Dataset contains data on various chemical properties of wines, intended for use in clustering tasks. This dataset is ideal for exploring clustering algorithms such as K-Means, hierarchical clustering, and others, to group wines based on their chemical composition.

      Dataset Structure
    

    The dataset is provided as a single CSV file named wine-clustering.csv. It contains 178 entries, each representing a unique wine… See the full description on the dataset page: https://huggingface.co/datasets/mltrev23/wine-clustering.
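    A minimal sketch (not from the dataset page) of grouping the wines with K-Means, assuming the CSV's numeric columns are all chemical measurements and that pandas and scikit-learn are available; k=3 is an arbitrary starting point:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Path is assumed; wine-clustering.csv is the single CSV file described above.
df = pd.read_csv("wine-clustering.csv")

# Standardize the chemical measurements so no single feature dominates the distances.
X = StandardScaler().fit_transform(df.select_dtypes("number"))

# k=3 is an assumption; in practice choose k with the elbow method or silhouette score.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(labels).value_counts())
```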

  2. Patient Dataset for Clustering (Raw Data)

    • kaggle.com
    Updated Aug 10, 2023
    Cite
    Arjunn Sharma (2023). Patient Dataset for Clustering (Raw Data) [Dataset]. https://www.kaggle.com/datasets/arjunnsharma/patient-dataset-for-clustering-raw-data
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Aug 10, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Arjunn Sharma
    Description

    About Dataset: Based on patient symptoms, the goal is to identify patients needing immediate resuscitation, to assign patients to a predesignated patient care area (thereby prioritizing their care), and to initiate diagnostic/therapeutic measures as appropriate. Three individual datasets were used, one for each urgent illness/injury; each dataset has its own features and symptoms for each patient, and they were merged to determine the most severe symptoms for each illness and give them treatment priority.

    PROJECT SUMMARY: Triage refers to the sorting of injured or sick people according to their need for emergency medical attention; it is a method of determining priority for who gets care first.

    BACKGROUND: Triage is the prioritization of patient care (or of victims during a disaster) based on illness/injury, symptoms, severity, prognosis, and resource availability. The purpose of triage is to identify patients needing immediate resuscitation, to assign patients to a predesignated patient care area (thereby prioritizing their care), and to initiate diagnostic/therapeutic measures as appropriate.

    BUSINESS CHALLENGE: Based on patient symptoms, identify patients needing immediate resuscitation, assign patients to a predesignated patient care area (thereby prioritizing their care), and initiate diagnostic/therapeutic measures as appropriate.

  3. Data from: Galaxy clustering

    • kaggle.com
    zip
    Updated Jan 3, 2023
    Cite
    The Devastator (2023). Galaxy clustering [Dataset]. https://www.kaggle.com/datasets/thedevastator/clustering-polygons-utilizing-iris-moon-and-circ
    Explore at:
    zip (6339 bytes). Available download formats
    Dataset updated
    Jan 3, 2023
    Authors
    The Devastator
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Galaxy clustering

    Iris, Moon, and Circles datasets for Galaxy clustering tutorial

    By [source]

    About this dataset

    This dataset contains a wealth of information that can be used to explore the effectiveness of various clustering algorithms. With its mix of numerical measurements (X, Y, Sepal.Length, and Petal.Length) and categorical values (Species), it is possible to investigate the relationship between different types of variables and clustering performance. Additionally, by comparing results across the three datasets provided - moon.csv (x and y coordinates), iris.csv (sepal and petal length measurements), and circles.csv - we can gain insight into how different data distributions affect clustering techniques such as K-Means or hierarchical clustering.


    How to use the dataset

    This dataset can also be a great starting point for exploring more complex clusters, using higher-dimensional variables such as color or texture that may be present in other datasets not included here; such features can help form more accurate groups when running cluster-analysis algorithms. It could also assist in visualization projects where clusters need to be generated, such as plotting mapped data points or examining relationships between two variables within a region drawn on a chart.

    To use this dataset effectively, it is important to understand how your chosen algorithm works, since some require parameters to be specified beforehand while others handle those details automatically; otherwise the interpretation may be invalid, depending on the methods you use alongside clustering. Also familiarize yourself with concepts like the silhouette score and the Rand index - commonly used metrics that measure a clustering's performance against other clustering models - so you know whether your results reach an acceptable level of accuracy. Good luck!
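    As a concrete illustration of those metrics, here is a minimal sketch (assuming the column names listed below under "Columns" and a local copy of iris.csv) that computes the silhouette score and the adjusted Rand index for a K-Means clustering:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# iris.csv is described below as having Sepal.Length, Petal.Length and Species columns (path assumed).
iris = pd.read_csv("iris.csv")
X = iris[["Sepal.Length", "Petal.Length"]]

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal metric: the silhouette score needs no ground truth.
print("silhouette:", silhouette_score(X, labels))
# External metric: the adjusted Rand index compares against the Species labels.
print("adjusted Rand:", adjusted_rand_score(iris["Species"], labels))
```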

    Research Ideas

    • Utilizing the sepal and petal lengths and widths to perform flower recognition, or as part of a larger image recognition pipeline.
    • Classifying the data points in each dataset by their X-Y coordinates, using clustering algorithms to analyze galaxy locations or overall formation patterns for stars, planets, or galaxies.
    • Exploring correlations between flower species in terms of sepal/petal lengths by performing supervised learning tasks such as classification with this dataset.

    Acknowledgements

    If you use this dataset in your research, please credit the original authors.

    License

    License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.

    Columns

    File: moon.csv

    | Column name | Description |
    |:------------|:-------------------------------------------|
    | X           | X coordinate of the data point. (Numeric)  |
    | Y           | Y coordinate of the data point. (Numeric)  |

    File: iris.csv

    | Column name  | Description                                   |
    |:-------------|:----------------------------------------------|
    | Sepal.Length | Length of the sepal of the flower. (Numeric)  |
    | Petal.Length | Length of the petal of the flower. (Numeric)  |
    | Species      | Species of the flower. (Categorical)          |


  4. K Means - Data Blobs

    • figshare.com
    • datasetcatalog.nlm.nih.gov
    txt
    Updated Feb 2, 2022
    Cite
    Jesus Rogel-Salazar (2022). K Means - Data Blobs [Dataset]. http://doi.org/10.6084/m9.figshare.19102187.v3
    Explore at:
    txt. Available download formats
    Dataset updated
    Feb 2, 2022
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Jesus Rogel-Salazar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Example data to understand the implementation of K Means
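    Since the file layout is not described here, the sketch below illustrates the same idea with locally generated blobs (scikit-learn's make_blobs) rather than the downloaded data:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate blob data as a stand-in for the dataset's example blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("inertia:", km.inertia_)
print("cluster centers:\n", km.cluster_centers_)
```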

  5. student clustering

    • kaggle.com
    zip
    Updated Aug 31, 2022
    Cite
    Deepansh Saxena1 (2022). student clustering [Dataset]. https://www.kaggle.com/datasets/deepanshsaxena1/student-clusteringg
    Explore at:
    zip (875 bytes). Available download formats
    Dataset updated
    Aug 31, 2022
    Authors
    Deepansh Saxena1
    Description

    Dataset

    This dataset was created by Deepansh Saxena1


  6. Clustering Exercises

    • kaggle.com
    zip
    Updated Apr 29, 2022
    Cite
    Joonas (2022). Clustering Exercises [Dataset]. https://www.kaggle.com/datasets/joonasyoon/clustering-exercises
    Explore at:
    zip (3602272 bytes). Available download formats
    Dataset updated
    Apr 29, 2022
    Authors
    Joonas
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Overview

    [Overview image: https://i.imgur.com/ZUX61cD.png]

    Context

    Clustering is the task of grouping similar data points. You can create dummy data for clustering with methods from the sklearn package, but that takes some effort.

    This dataset is intended to help users who need hard test cases for clustering.

    Try to select a meaningful number of clusters and divide the data into those clusters. Here are exercises for you.

    Dataset

    All CSV files contain many rows of x, y, and color values, as shown in the figure above.

    If you want to use positions as integers, scale and round the coordinates, e.g. x = round(x * 100).
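    A minimal sketch of loading one exercise file and inspecting it (the file name is hypothetical; every CSV in the collection has x, y and color columns):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("basic1.csv")  # hypothetical file name from this collection

# Optional integer positions, scaled as suggested above.
df["xi"] = (df["x"] * 100).round().astype(int)
df["yi"] = (df["y"] * 100).round().astype(int)

# Quick look at the ground-truth grouping; factorize handles numeric or string labels.
plt.scatter(df["x"], df["y"], c=pd.factorize(df["color"])[0], s=5)
plt.show()
```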

    Furthermore, there is a GUI tool for generating 2D points for clustering; you can build your own dataset with it: https://www.joonas.io/cluster-paint

    Stay tuned for further updates! If you have any ideas, feel free to leave a comment.

  7. 2D Clustering Dataset Collection

    • kaggle.com
    zip
    Updated Jan 21, 2025
    Cite
    SAMOILOV MIKHAIL (2025). 2D Clustering Dataset Collection [Dataset]. https://www.kaggle.com/datasets/samoilovmikhail/2d-clustering-dataset-collection
    Explore at:
    zip (136543 bytes). Available download formats
    Dataset updated
    Jan 21, 2025
    Authors
    SAMOILOV MIKHAIL
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset collection comprises 15 diverse two-dimensional datasets specifically designed for clustering analysis. Each dataset contains three columns: x, y, and target, where x and y represent the coordinates of the data points, and target indicates the cluster label.
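    A minimal sketch (the directory path is an assumption) that loops over the collection and reports the number of points and labelled clusters per file:

```python
import glob
import pandas as pd

# Directory name is assumed; each CSV has x, y and target columns as described above.
for path in sorted(glob.glob("2d-clustering-dataset-collection/*.csv")):
    df = pd.read_csv(path)
    print(path, "-", len(df), "points,", df["target"].nunique(), "clusters")
```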

    [Image: visualisation of the data]

  8. Data from: Customer segmentation in e-commerce: a context-aware quality model for comparing clustering algorithms

    • search.dataone.org
    Updated Dec 16, 2023
    Cite
    Wasilewski, Adam (2023). Customer segmentation in e-commerce: a context-aware quality model for comparing clustering algorithms [Dataset]. http://doi.org/10.7910/DVN/Q1P3JV
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Wasilewski, Adam
    Description

    The dataset includes:
    1. learning data containing e-commerce user sessions (DATASET-X-session_visit.csv files)
    2. clustering results (including metric values and customer clusters), per algorithm tested
    3. calculations (xlsx file)

  9. Replication Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning

    • dataverse.harvard.edu
    • search.dataone.org
    Updated Apr 15, 2018
    Cite
    Hossein Estiri (2018). Replication Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning [Dataset]. http://doi.org/10.7910/DVN/LLIOHM
    Explore at:
    Croissant (a format for machine-learning datasets; learn more about this at mlcommons.org/croissant)
    Dataset updated
    Apr 15, 2018
    Dataset provided by
    Harvard Dataverse
    Authors
    Hossein Estiri
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    182 simulated datasets (the first set contains small datasets and the second set contains large datasets) with different cluster compositions, i.e., different numbers of clusters and separation values, generated using the clusterGeneration package in R. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total) with 3-15 clusters and separation values from 0.1 to 0.7. Separation values can range between (-0.999, 0.999), where a higher separation value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value of the clusters are encoded in the file name, size_X_n_Y_sepval_Z.csv: size of the dataset = X, number of clusters in the dataset = Y, separation value of the clusters in the dataset = Z.
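    A small sketch (the regular expression is an assumption based on the naming convention above) for recovering size, cluster count, and separation value from a file name:

```python
import re

def parse_kluster_filename(name: str):
    """Parse size_X_n_Y_sepval_Z.csv into (size, n_clusters, separation)."""
    m = re.match(r"size_(\d+)_n_(\d+)_sepval_([\d.]+)\.csv$", name)
    if m is None:
        raise ValueError(f"unexpected file name: {name}")
    size, n_clusters, sepval = m.groups()
    return int(size), int(n_clusters), float(sepval)

# Hypothetical file name consistent with the stated ranges (3-15 clusters, 0.1-0.7 separation).
print(parse_kluster_filename("size_1000_n_5_sepval_0.3.csv"))
```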

  10. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • zenodo.org
    • elki-project.github.io
    application/gzip
    Updated May 2, 2024
    Cite
    Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. http://doi.org/10.5281/zenodo.6355684
    Explore at:
    application/gzip. Available download formats
    Dataset updated
    May 2, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Erich Schubert; Erich Schubert; Arthur Zimek; Arthur Zimek
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2022
    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek
    Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?
    In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek
    Evaluation of Multiple Clustering Solutions
    In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel
    On Evaluation of Outlier Rankings and Outlier Scores
    In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    | Feature type | Description | Files |
    |:---|:---|:---|
    | Object number | Sparse 1000-dimensional vectors that give the true object assignment | objs.arff.gz |
    | RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz |
    | HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz |
    | Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other) |
    | Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz |
    | Front to back | Vectors representing front faces vs. back faces of individual objects | front.arff.gz |
    | Basic light | Vectors indicating basic light situations | light.arff.gz |
    | Manual annotations | Manually annotated groups of semantically related objects, such as cups | manual1.arff.gz |

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    | Feature type | Description | Files |
    |:---|:---|:---|
    | RGB histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz |
    | | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz |
    | | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz |

  11. Customer Segmentation : Clustering

    • kaggle.com
    zip
    Updated Jan 13, 2024
    Cite
    Vishakh Patel (2024). Customer Segmentation : Clustering [Dataset]. https://www.kaggle.com/datasets/vishakhdapat/customer-segmentation-clustering
    Explore at:
    zip (63448 bytes). Available download formats
    Dataset updated
    Jan 13, 2024
    Authors
    Vishakh Patel
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Customer Personality Analysis involves a thorough examination of a company's optimal customer profiles. This analysis facilitates a deeper understanding of customers, enabling businesses to tailor products to meet the distinct needs, behaviors, and concerns of various customer types.

    By conducting a Customer Personality Analysis, businesses can refine their products based on the preferences of specific customer segments. Rather than allocating resources to market a new product to the entire customer database, companies can identify the segments most likely to be interested in the product. Subsequently, targeted marketing efforts can be directed toward those particular segments, optimizing resource utilization and increasing the likelihood of successful product adoption.

    Details of Features are as below:

    • Id: Unique identifier for each individual in the dataset.
    • Year_Birth: The birth year of the individual.
    • Education: The highest level of education attained by the individual.
    • Marital_Status: The marital status of the individual.
    • Income: The annual income of the individual.
    • Kidhome: The number of young children in the household.
    • Teenhome: The number of teenagers in the household.
    • Dt_Customer: The date when the customer was first enrolled or became a part of the company's database.
    • Recency: The number of days since the last purchase or interaction.
    • MntWines: The amount spent on wines.
    • MntFruits: The amount spent on fruits.
    • MntMeatProducts: The amount spent on meat products.
    • MntFishProducts: The amount spent on fish products.
    • MntSweetProducts: The amount spent on sweet products.
    • MntGoldProds: The amount spent on gold products.
    • NumDealsPurchases: The number of purchases made with a discount or as part of a deal.
    • NumWebPurchases: The number of purchases made through the company's website.
    • NumCatalogPurchases: The number of purchases made through catalogs.
    • NumStorePurchases: The number of purchases made in physical stores.
    • NumWebVisitsMonth: The number of visits to the company's website in a month.
    • AcceptedCmp3: Binary indicator (1 or 0) whether the individual accepted the third marketing campaign.
    • AcceptedCmp4: Binary indicator (1 or 0) whether the individual accepted the fourth marketing campaign.
    • AcceptedCmp5: Binary indicator (1 or 0) whether the individual accepted the fifth marketing campaign.
    • AcceptedCmp1: Binary indicator (1 or 0) whether the individual accepted the first marketing campaign.
    • AcceptedCmp2: Binary indicator (1 or 0) whether the individual accepted the second marketing campaign.
    • Complain: Binary indicator (1 or 0) whether the individual has made a complaint.
    • Z_CostContact: A constant cost associated with contacting a customer.
    • Z_Revenue: A constant revenue associated with a successful campaign response.
    • Response: Binary indicator (1 or 0) whether the individual responded to the marketing campaign.
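    A minimal segmentation sketch, assuming a hypothetical file name and that the columns match the list above; the chosen features and k=4 are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv("customer_segmentation.csv")  # hypothetical file name

features = ["Income", "Recency", "MntWines", "MntFruits", "MntMeatProducts",
            "MntFishProducts", "MntSweetProducts", "MntGoldProds",
            "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]

# Impute missing values and standardize before clustering.
X = StandardScaler().fit_transform(df[features].fillna(df[features].median()))

# k=4 is an arbitrary starting point; validate with the elbow method or silhouette score.
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(df.groupby("segment")[features].mean().round(1))
```
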
  12. Synthetic Clustering Dataset (K=20)

    • data.mendeley.com
    Updated Jan 18, 2020
    Cite
    Julian Lee (2020). Synthetic Clustering Dataset (K=20) [Dataset]. http://doi.org/10.17632/fgsx9hn8zh.1
    Explore at:
    Dataset updated
    Jan 18, 2020
    Authors
    Julian Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The synthetic data set has 600 points that form 20 clusters of 30 points each in 2 dimensions. The offset between a given point and its true center in each dimension is determined by Rand[0.02, 0.04] * G, where G is a random Gaussian number.
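    A sketch of how such data could be generated with NumPy; the placement of the 20 cluster centers is an assumption, since the description only specifies the per-point offsets:

```python
import numpy as np

rng = np.random.default_rng(0)
K, points_per_cluster, dim = 20, 30, 2

# Assumed: centers drawn uniformly in the unit square (not specified in the description).
centers = rng.uniform(0.0, 1.0, size=(K, dim))

# Offset per point and dimension: Rand[0.02, 0.04] * G, with G a Gaussian random number.
scale = rng.uniform(0.02, 0.04, size=(K, points_per_cluster, dim))
noise = scale * rng.standard_normal((K, points_per_cluster, dim))

data = (centers[:, None, :] + noise).reshape(-1, dim)
labels = np.repeat(np.arange(K), points_per_cluster)
print(data.shape, labels.shape)  # (600, 2) (600,)
```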

  13. ChatGPT API and BERT NLP

    • figshare.com
    application/csv
    Updated Mar 13, 2024
    Cite
    Carmen Atkins (2024). ChatGPT API and BERT NLP [Dataset]. http://doi.org/10.6084/m9.figshare.25403407.v2
    Explore at:
    application/csv. Available download formats
    Dataset updated
    Mar 13, 2024
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Carmen Atkins
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    input_prompts.csv provides the inputs for the ChatGPT API (countries and their respective prompts).
    topic_consolidations.csv contains the 4,018 unique topics listed across all ChatGPT responses to prompts in our study and their corresponding cluster labels after applying K-means++ clustering (n = 50) via natural language processing with Bidirectional Encoder Representations from Transformers (BERT). ChatGPT response topics come from both versions (3.5 and 4) over 10 iterations each (per country).
    ChatGPT_prompt_automation.ipynb is the Jupyter notebook of Python code used to run the API to prompt ChatGPT and gather responses.
    topic_consolidation_BERT.ipynb is the Jupyter notebook of Python code used to process the 4,018 unique topics gathered through BERT NLP. This code was adapted from Vimal Pillar on Kaggle (https://www.kaggle.com/code/vimalpillai/text-clustering-with-sentence-bert).
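    A rough sketch of the embedding-plus-clustering step, assuming a hypothetical topic column name and using the sentence-transformers library as a stand-in for the BERT pipeline used by the authors:

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Column name and embedding model are assumptions; the study clusters 4,018 unique topics into 50 groups.
topics = pd.read_csv("topic_consolidations.csv")["topic"].astype(str).tolist()

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(topics)

labels = KMeans(n_clusters=50, init="k-means++", n_init=10, random_state=0).fit_predict(embeddings)
print(pd.Series(labels).value_counts().head())
```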

  14. Dataset for "Validation of the Astro dataset clustering solutions with external data"

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Nov 16, 2020
    Cite
    Donner, Paul (2020). Dataset for "Validation of the Astro dataset clustering solutions with external data" [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_4061693
    Explore at:
    Dataset updated
    Nov 16, 2020
    Dataset provided by
    Deutsches Zentrum für Hochschul- und Wissenschaftsforschung
    Authors
    Donner, Paul
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Validation data for the Astro scientific publication clustering benchmark dataset

    This is the dataset used in the publication Donner, P. "Validation of the Astro dataset clustering solutions with external data", Scientometrics, DOI 10.1007/s11192-020-03780-3

    Certain data included herein are derived from Clarivate Web of Science. © Copyright Clarivate 2020. All rights reserved.

    Published with permission from Clarivate.

    The original Astro dataset is not contained in this data. It can be obtained from http://topic-challenge.info/ and requires permission from Clarivate Analytics for use.

    This dataset collection consists of four files. Each file contains an independent dataset that relates to the Astro dataset via Web of Science (WoS) record identifiers. These identifiers are called UTs. All files are tabular data in CSV format. In each, at least one column contains UT data. This should be used to link to the Astro dataset or other WoS data. The datasets are discussed in detail in the journal publication.

  15. 8 years of dayside Magnetospheric Multiscale (MMS) unsupervised clustering plasma regions classifications

    • data.niaid.nih.gov
    Updated Jan 11, 2024
    Cite
    Toy-Edens, Vicki (2024). 8 years of dayside Magnetospheric Multiscale (MMS) unsupervised clustering plasma regions classifications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10491877
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset provided by
    Johns Hopkins University Applied Physics Laboratory
    Authors
    Toy-Edens, Vicki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files contain the 1-minute resolution dataset (“labeled_sunside_data.csv”) and 15 minute or longer region list (“_region_list.csv”) for Toy-Edens et al.'s Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning. The 1-minute resolution file contains the rolled up 1-minute epoch, probe name (mms1, mms2, mms3, mms4), features that go into clustering and post-cleansing methods, spacecraft positions (in GSE, GSM, and magnetic latitude/local time), raw and cleansed clustering labels, and transition name. The 15+ minute region lists contain the name of the plasma region type, the probe name (mms1, mms2, mms3, mms4), and the start and stop epoch of >= 15 minute epoch where the probe is solidly within that region.

    We ask that if you use any parts of the dataset that you cite Toy-Edens et al.'s Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning (Submitted for review and publication to Journal of Geophysical Research: Space Research 1/2024 - DOI to be created upon acceptance).

    This work was funded by grant 2225463 from the NSF GEM program.

    The following tables detail the contents of the described files:

    labeled_sunside_data.csv description

    | Column name | Description |
    |:---|:---|
    | Epoch | Epoch in datetime |
    | probe | MMS probe name |
    | ratio_max_width | Ratio of the width of the most prominent ion spectra peak (in number of energy channels) to the max number of energy channels. See paper for more information |
    | ratio_high_low | Ratio of the mean of the log intensity of high energies in the ion spectra to the mean of the log intensity of low energies in the ion spectra. See paper for more information |
    | norm_Btot | Magnitude of the total magnetic field normalized to 50 nT. See paper for more information |
    | small_energy_mean | The denominator in ratio_high_low |
    | large_energy_mean | The numerator in ratio_high_low |
    | temp_total | Total temperature from the DIS moments. See paper for more information |
    | r_gse_x | x position of the spacecraft in GSE |
    | r_gse_y | y position of the spacecraft in GSE |
    | r_gse_z | z position of the spacecraft in GSE |
    | r_gsm_x | x position of the spacecraft in GSM |
    | r_gsm_y | y position of the spacecraft in GSM |
    | r_gsm_z | z position of the spacecraft in GSM |
    | mlat | magnetic latitude of spacecraft |
    | mlt | magnetic local time of spacecraft |
    | raw_named_label | Raw cluster-assigned plasma region label (allowed values: magnetosheath, magnetosphere, solar wind, ion foreshock) |
    | modified_named_label | Cleansed cluster-assigned plasma region label (use these unless you have a specific reason to use raw labels). See paper for more information |
    | transition_name | Transition names (e.g. quasi-perpendicular bow shock, magnetopause). See paper for more information |

    _region_list.csv description

    | Column name | Description |
    |:---|:---|
    | start | Starting Epoch in datetime |
    | stop | Stopping Epoch in datetime |
    | probe | MMS probe name |
    | region | Cleansed cluster name associated with the 1-minute resolution "modified_named_label" |

  16. Text Document Classification Dataset

    • kaggle.com
    zip
    Updated Dec 4, 2023
    Cite
    sunil thite (2023). Text Document Classification Dataset [Dataset]. https://www.kaggle.com/datasets/sunilthite/text-document-classification-dataset
    Explore at:
    zip (1941393 bytes). Available download formats
    Dataset updated
    Dec 4, 2023
    Authors
    sunil thite
    Description

    This is a text document classification dataset containing 2,225 text documents across five categories: politics, sport, tech, entertainment, and business. It can be used for document classification and document clustering.

    About Dataset
    - The dataset contains two features: text and label.
    - No. of rows: 2225
    - No. of columns: 2

    Text: contains the text of documents from the different categories. Label: contains the label for the five categories: 0, 1, 2, 3, 4.

    1. Politics = 0
    2. Sport = 1
    3. Technology = 2
    4. Entertainment = 3
    5. Business = 4
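    A minimal clustering sketch for this dataset, assuming a hypothetical file name and the Text/Label column names implied above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("text_documents.csv")  # hypothetical file name

# TF-IDF representation of the documents, then K-Means with one cluster per category.
X = TfidfVectorizer(stop_words="english", max_features=20000).fit_transform(df["Text"])
pred = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# The Label column (0-4) provides a ground truth to sanity-check the clustering.
print("adjusted Rand:", adjusted_rand_score(df["Label"], pred))
```
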
  17. Student Performance and Learning Behavior Dataset for Educational Analytics

    • zenodo.org
    bin, csv
    Updated Aug 13, 2025
    Cite
    Kamal NAJEM; Kamal NAJEM (2025). Student Performance and Learning Behavior Dataset for Educational Analytics [Dataset]. http://doi.org/10.5281/zenodo.16459132
    Explore at:
    bin, csv. Available download formats
    Dataset updated
    Aug 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Kamal NAJEM; Kamal NAJEM
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Jul 26, 2025
    Description

    The dataset used in this study integrates quantitative data on student learning behaviors, engagement patterns, demographics, and academic performance. It was compiled by merging two publicly available Kaggle datasets, resulting in a combined file (“merged_dataset.csv”) containing 14,003 student records with 16 attributes. All records are anonymized and contain no personally identifiable information.

    The dataset covers the following categories of variables:

    • Study behaviors and engagement: StudyHours, Attendance, Extracurricular, AssignmentCompletion, OnlineCourses, Discussions
    • Resource access and learning environment: Resources, Internet, EduTech

    • Motivation and psychological factors: Motivation, StressLevel

    • Demographic information: Gender, Age (ranging from 18 to 30 years)

    • Learning preference classification: LearningStyle

    • Academic performance indicators: ExamScore, FinalGrade

    In this study, “ExamScore” and “FinalGrade” served as the primary performance indicators. The remaining variables were used to derive behavioral and contextual profiles, which were clustered using unsupervised machine learning techniques.

    The analysis and modeling were implemented in Python through a structured Jupyter Notebook (“Project.ipynb”), which included the following main steps:

    1. Environment Setup – Import of essential libraries (NumPy, pandas, Matplotlib, Seaborn, SciPy, StatsModels, scikit-learn, imbalanced-learn) and visualization configuration.

    2. Data Import and Integration – Loading the two source CSV files, harmonizing columns, removing irrelevant attributes, aligning formats, handling missing values, and merging them into a unified dataset (merged_dataset.csv).

    3. Data Preprocessing

      • Encoding categorical variables using LabelEncoder.

      • Scaling features using both z-score standardization (for statistical tests and PCA) and Min–Max normalization (for clustering).

      • Detecting and removing duplicates.

    4. Clustering Analysis

      • Applying K-Means clustering to segment learners into distinct profiles.

      • Determining the optimal number of clusters using the Elbow Method and Silhouette Score.

      • Evaluating cluster quality with internal metrics (Silhouette Score, Davies–Bouldin Index).

    5. Dimensionality Reduction & Visualization – Using PCA for 2D/3D cluster visualization and feature importance exploration.

    6. Mapping Clusters to Learning Styles – Associating each identified cluster with the most relevant learning style model based on feature patterns and alignment scores.

    7. Statistical Analysis – Conducting ANOVA and regression to test for significant differences in performance between clusters.

    8. Interpretation & Practical Recommendations – Analyzing cluster-specific characteristics and providing implications for adaptive and mobile learning integration.
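    A condensed sketch of steps 3-5 above, assuming merged_dataset.csv and the column names listed earlier; the authors' actual notebook (Project.ipynb) may differ in details:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA

df = pd.read_csv("merged_dataset.csv").drop_duplicates()

# Step 3: encode categorical attributes and scale features.
for col in df.select_dtypes("object"):
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

features = df.drop(columns=["ExamScore", "FinalGrade"])  # keep performance indicators aside
X = MinMaxScaler().fit_transform(features)

# Step 4: pick k by silhouette score, then check the Davies-Bouldin index.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 8)}
best_k = max(scores, key=scores.get)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
print("best k:", best_k, "Davies-Bouldin:", davies_bouldin_score(X, labels))

# Step 5: PCA on standardized features for a 2D view of the clusters.
coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))
```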

  18. Data for A method for assessment of the general circulation model quality using k-means clustering algorithm

    • data.taltech.ee
    • data.niaid.nih.gov
    Updated Mar 11, 2025
    Cite
    Ilja Maljutenko; Ilja Maljutenko; Urmas Raudsepp; Urmas Raudsepp (2025). Data for A method for assessment of the general circulation model quality using k-means clustering algorithm [Dataset]. http://doi.org/10.5281/zenodo.4588510
    Explore at:
    Dataset updated
    Mar 11, 2025
    Dataset provided by
    TalTech Data Repository
    Authors
    Ilja Maljutenko; Ilja Maljutenko; Urmas Raudsepp; Urmas Raudsepp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    2021
    Description

    The dataset consists of simulated and observed salinity/temperature data which were used in the manuscript "A method for assessment of the general circulation model quality using k-means clustering algorithm" submitted to Geoscientific Model Development.
    The model simulation dataset is from long-term 3D circulation model simulation (Maljutenko and Raudsepp 2014, 2019). The observations are from the "Baltic Sea - Eutrophication and Acidity aggregated datasets 1902/2017 v2018" SMHI (2018).

    The files are in simple comma separated table format without headers.
    The Dout-t_z_lat_lon_Smod_Sobs_Tmod_Tobs.csv file contains columns with the following variables [units]:
    Time [Matlab datenum units], vertical coordinate [m], latitude [°N], longitude [°E], model salinity [g/kg], observed salinity [g/kg], model temperature [°C], observed temperature [°C].

    The Dout-t_z_lat_lon_dS_dT_K1_K2_K3_K4_K5_K6_K7_K8_K9.csv file contains columns with the following variables [units]:
    the first 4 columns are the same as in the previous file, followed by salinity error [g/kg] and temperature error [°C]; columns 7-8 are integers showing the cluster to which the error pair is assigned.

    do_clust_valid_DataFig.m is a Matlab script which reads the two CSV files (and optionally the mask file Model_mask.mat), performs the clustering analysis, and creates the plots used in the manuscript. The script is organized into %% blocks which can be executed separately (default: ctrl+enter).

    k-means function is used from the Matlab Statistics and Machine Learning Toolbox.

    Additional software used in the do_clust_valid_DataFig.m:

    The author's auxiliary formatting scripts (in script/):
    datetick_cst.m
    do_fitfig.m
    do_skipticks.m
    do_skipticks_y.m

    Colormaps are generated using cbrewer.m (Charles, 2021).
    Moving average smoothing is performed using nanmoving_average.m (Aguilera, 2021).

  19. Dataset of mHealth event logs

    • figshare.com
    pdf
    Updated May 1, 2022
    Cite
    Raoul Nuijten; Pieter Van Gorp (2022). Dataset of mHealth event logs [Dataset]. http://doi.org/10.6084/m9.figshare.19688730.v2
    Explore at:
    pdf. Available download formats
    Dataset updated
    May 1, 2022
    Dataset provided by
    figshare
    Authors
    Raoul Nuijten; Pieter Van Gorp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How does Facebook always seem to know what the next funny video should be to sustain your attention on the platform? Facebook has not asked you whether you like videos of cats doing something funny: they just seem to know. In fact, Facebook learns through your behavior on the platform (e.g., how long you have engaged with similar videos, what posts you have previously liked or commented on, etc.). As a result, Facebook is able to sustain the attention of its users for a long time. Typical mHealth apps, on the other hand, suffer from rapidly collapsing user engagement levels. To sustain engagement levels, mHealth apps nowadays employ all sorts of intervention strategies. Of course, it would be powerful to know, like Facebook knows, which strategy should be presented to which individual to sustain their engagement. A first step toward that could be the ability to cluster similar users (and then derive intervention strategies from there). This dataset was collected through a single mHealth app over 8 different mHealth campaigns (i.e., scientific studies). Using this dataset, one could derive clusters from app user event data. One approach could be to differentiate between two phases: a process mining phase and a clustering phase. In the process mining phase, one may derive from the dataset the processes (i.e., sequences of app actions) that users undertake. In the clustering phase, based on the processes different users engaged in, one may cluster similar users (i.e., users that perform similar sequences of app actions).

    List of files

    0-list-of-variables.pdf includes an overview of the different variables within the dataset.
    1-description-of-endpoints.pdf includes a description of the unique endpoints that appear in the dataset.
    2-requests.csv includes the dataset with actual app user event data.
    2-requests-by-session.csv includes the dataset with actual app user event data with a session variable, to differentiate between user requests that were made in the same session.
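    A very rough sketch of the two-phase idea, with hypothetical column names (see 0-list-of-variables.pdf for the real ones): build per-user action sequences, represent them as bags of actions, and cluster the users:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Column names (user_id, endpoint, timestamp) are assumptions about 2-requests.csv.
events = pd.read_csv("2-requests.csv")

# "Process mining" flavoured step: order each user's events and join them into an action sequence.
sequences = (events.sort_values("timestamp")
                   .groupby("user_id")["endpoint"]
                   .apply(lambda s: " ".join(s.astype(str))))

# Clustering step: bag-of-actions with bigrams to keep some ordering, then K-Means over users.
X = CountVectorizer(ngram_range=(1, 2), token_pattern=r"[^ ]+").fit_transform(sequences)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(pd.Series(labels, index=sequences.index).value_counts())
```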

  20. Data from the paper "Learning to clusterize urban areas: two competitive approaches and an empirical validation"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 12, 2022
    Cite
    Camila Vera; Francesca Lucchini; Naim Bro; Marcelo Mendoza; Hans Lobel; Felipe Gutierrez; Jan Dimter; Gabriel Cuchacovic; Axel Reyes; Hernán Validivieso; Nicolás Alvarado; Sergio Toro (2022). Data from the paper "Learning to clusterize urban areas: two competitive approaches and an empirical validation" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6821927
    Explore at:
    Dataset updated
    Jul 12, 2022
    Dataset provided by
    Universidad Técnica Federico Santa María
    Pontificia Universidad Católica de Chile
    Universidad de Concepción
    Universidad Adolfo Ibáñez
    Authors
    Camila Vera; Francesca Lucchini; Naim Bro; Marcelo Mendoza; Hans Lobel; Felipe Gutierrez; Jan Dimter; Gabriel Cuchacovic; Axel Reyes; Hernán Validivieso; Nicolás Alvarado; Sergio Toro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Data for urban clustering used in the paper "Learning to clusterize urban areas: two competitive approaches and an empirical validation". We release two datasets for urban clustering based on data acquired in Santiago de Chile. The first dataset is computed at the level of urban blocks. The second dataset is computed at the level of individuals, using a uniform sample of Santiago inhabitants. Both datasets comprise features based on social characteristics (e.g., SES), land use, and aesthetic visual perception of the city. The features of each data unit (blocks or individuals) are provided using row packing (each row is a data unit) in CSV files. We release PCA (Principal Component Analysis) features for both datasets.
