42 datasets found
  1. student clustering

    • kaggle.com
    zip
    Updated Aug 31, 2022
    + more versions
    Cite
    Deepansh Saxena1 (2022). student clustering [Dataset]. https://www.kaggle.com/datasets/deepanshsaxena1/student-clusteringg
    Explore at:
    Available download formats: zip (875 bytes)
    Dataset updated
    Aug 31, 2022
    Authors
    Deepansh Saxena1
    Description

    Dataset

    This dataset was created by Deepansh Saxena1

    Contents

  2. Replication Data for: kluster: An Efficient Scalable Procedure for...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 22, 2023
    + more versions
    Cite
    Estiri, Hossein (2023). Replication Data for: kluster: An Efficient Scalable Procedure for Approximating the Number of Clusters in Unsupervised Learning [Dataset]. http://doi.org/10.7910/DVN/LLIOHM
    Explore at:
    Dataset updated
    Nov 22, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Estiri, Hossein
    Description

    182 simulated datasets with different cluster compositions (i.e., different numbers of clusters and separation values), generated using the clusterGeneration package in R. The datasets come in two sets: the first contains small datasets and the second contains large datasets. Each set consists of 91 datasets in comma-separated values (CSV) format (182 CSV files in total) with 3-15 clusters and separation values of 0.1 to 0.7. Separation values can range over (−0.999, 0.999), where a higher value indicates a cluster structure with more separable clusters. The size of the dataset, the number of clusters, and the separation value are encoded in the file name as size_X_n_Y_sepval_Z.csv, where X is the size of the dataset, Y is the number of clusters, and Z is the separation value of the clusters.
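
    The naming scheme size_X_n_Y_sepval_Z.csv can be decoded mechanically. The following is a minimal sketch, not part of the replication package: the function name, the SimInfo type, and the example file name are hypothetical, and it assumes X and Y are written as plain integers and Z as a decimal number.

```python
import re
from typing import NamedTuple


class SimInfo(NamedTuple):
    size: int         # X: number of rows in the dataset
    n_clusters: int   # Y: number of clusters
    sepval: float     # Z: separation value


# ASSUMPTION: X and Y are plain integers and Z is a decimal such as 0.3.
_PATTERN = re.compile(r"^size_(\d+)_n_(\d+)_sepval_(-?\d+(?:\.\d+)?)\.csv$")


def parse_sim_filename(name: str) -> SimInfo:
    """Extract size, cluster count and separation value from a file name."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return SimInfo(int(match.group(1)), int(match.group(2)), float(match.group(3)))


# Hypothetical file name, used only to illustrate the pattern.
info = parse_sim_filename("size_1000_n_5_sepval_0.3.csv")
```

    Grouping the files by info.n_clusters or info.sepval is then a matter of sorting on the parsed fields.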

  3. Synthetic Clustering Dataset (K=20)

    • data.mendeley.com
    Updated Jan 18, 2020
    + more versions
    Cite
    Julian Lee (2020). Synthetic Clustering Dataset (K=20) [Dataset]. http://doi.org/10.17632/fgsx9hn8zh.1
    Explore at:
    Dataset updated
    Jan 18, 2020
    Authors
    Julian Lee
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The synthetic data set has 600 points that form 20 clusters with 30 points each in 2 dimensions. The offset between a given point and its true center in each dimension is determined by Rand[0.02, 0.04] ∗ G where G is a random Gaussian number.
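
    A data set with the same shape can be generated along the following lines. This is only a sketch under stated assumptions, not the author's generator: the description does not say how the 20 centers are placed, so they are drawn uniformly from the unit square here, G is taken to be a standard normal draw, and the Rand[0.02, 0.04] factor is drawn independently per point and per dimension.

```python
import random


def make_synthetic(n_clusters=20, points_per_cluster=30, seed=0):
    """Return (points, labels) with n_clusters * points_per_cluster 2-D points."""
    rng = random.Random(seed)
    points, labels = [], []
    for k in range(n_clusters):
        # ASSUMPTION: true centers are uniform in the unit square.
        cx, cy = rng.random(), rng.random()
        for _ in range(points_per_cluster):
            # Offset per dimension: Rand[0.02, 0.04] * G, with G ~ N(0, 1) (assumed).
            x = cx + rng.uniform(0.02, 0.04) * rng.gauss(0.0, 1.0)
            y = cy + rng.uniform(0.02, 0.04) * rng.gauss(0.0, 1.0)
            points.append((x, y))
            labels.append(k)
    return points, labels


points, labels = make_synthetic()
```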

  4. ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of...

    • data.niaid.nih.gov
    • elki-project.github.io
    • +1 more
    Updated May 2, 2024
    Cite
    Schubert, Erich (2024). ELKI Multi-View Clustering Data Sets Based on the Amsterdam Library of Object Images (ALOI) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6355683
    Explore at:
    Dataset updated
    May 2, 2024
    Dataset provided by
    Schubert, Erich
    Zimek, Arthur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These data sets were originally created for the following publications:

    M. E. Houle, H.-P. Kriegel, P. Kröger, E. Schubert, A. Zimek. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), Heidelberg, Germany, 2010.

    H.-P. Kriegel, E. Schubert, A. Zimek. Evaluation of Multiple Clustering Solutions. In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings, held in conjunction with ECML PKDD 2011, Athens, Greece, 2011.

    The outlier data set versions were introduced in:

    E. Schubert, R. Wojdanowski, A. Zimek, H.-P. Kriegel. On Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the 12th SIAM International Conference on Data Mining (SDM), Anaheim, CA, 2012.

    They are derived from the original image data available at https://aloi.science.uva.nl/

    The image acquisition process is documented in the original ALOI work: J. M. Geusebroek, G. J. Burghouts, and A. W. M. Smeulders, The Amsterdam library of object images, Int. J. Comput. Vision, 61(1), 103-112, January, 2005

    Additional information is available at: https://elki-project.github.io/datasets/multi_view

    The following views are currently available:

    | Feature type | Description | Files |
    |:---|:---|:---|
    | Object number | Sparse 1000-dimensional vectors that give the true object assignment | objs.arff.gz |
    | RGB color histograms | Standard RGB color histograms (uniform binning) | aloi-8d.csv.gz aloi-27d.csv.gz aloi-64d.csv.gz aloi-125d.csv.gz aloi-216d.csv.gz aloi-343d.csv.gz aloi-512d.csv.gz aloi-729d.csv.gz aloi-1000d.csv.gz |
    | HSV color histograms | Standard HSV/HSB color histograms in various binnings | aloi-hsb-2x2x2.csv.gz aloi-hsb-3x3x3.csv.gz aloi-hsb-4x4x4.csv.gz aloi-hsb-5x5x5.csv.gz aloi-hsb-6x6x6.csv.gz aloi-hsb-7x7x7.csv.gz aloi-hsb-7x2x2.csv.gz aloi-hsb-7x3x3.csv.gz aloi-hsb-14x3x3.csv.gz aloi-hsb-8x4x4.csv.gz aloi-hsb-9x5x5.csv.gz aloi-hsb-13x4x4.csv.gz aloi-hsb-14x5x5.csv.gz aloi-hsb-10x6x6.csv.gz aloi-hsb-14x6x6.csv.gz |
    | Color similarity | Average similarity to 77 reference colors (not histograms): 18 colors x 2 sat x 2 bri + 5 grey values (incl. white, black) | aloi-colorsim77.arff.gz (feature subsets are meaningful here, as these features are computed independently of each other) |
    | Haralick features | First 13 Haralick features (radius 1 pixel) | aloi-haralick-1.csv.gz |
    | Front to back | Vectors representing front face vs. back faces of individual objects | front.arff.gz |
    | Basic light | Vectors indicating basic light situations | light.arff.gz |
    | Manual annotations | Manually annotated object groups of semantically related objects such as cups | manual1.arff.gz |
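
    The dimensionalities in the RGB file names (8, 27, 64, ..., 1000) are the cubes of 2 through 10, which is what a uniform binning with the same number of bins per channel produces. The sketch below illustrates that idea only; it is not the ELKI authors' extraction code, and the function name, the 0-255 channel range, and the normalization to relative frequencies are assumptions.

```python
def rgb_histogram(pixels, bins_per_channel):
    """Uniform-binning RGB histogram with bins_per_channel ** 3 entries.

    pixels: iterable of (r, g, b) integer triples in 0..255 (assumed range).
    """
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        # Map each channel value to one of n equally wide bins.
        ri, gi, bi = r * n // 256, g * n // 256, b * n // 256
        hist[(ri * n + gi) * n + bi] += 1.0
        count += 1
    # ASSUMPTION: normalize to relative frequencies so images of
    # different sizes remain comparable.
    if count:
        hist = [h / count for h in hist]
    return hist


# 2 bins per channel gives an 8-dimensional vector, 10 bins gives 1000.
dims = [n ** 3 for n in range(2, 11)]
example = rgb_histogram([(0, 0, 0), (255, 255, 255)], bins_per_channel=2)
```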

    Outlier Detection Versions

    Additionally, we generated a number of subsets for outlier detection:

    | Feature type | Description | Files |
    |:---|:---|:---|
    | RGB Histograms | Downsampled to 100000 objects (553 outliers) | aloi-27d-100000-max10-tot553.csv.gz aloi-64d-100000-max10-tot553.csv.gz |
    | | Downsampled to 75000 objects (717 outliers) | aloi-27d-75000-max4-tot717.csv.gz aloi-64d-75000-max4-tot717.csv.gz |
    | | Downsampled to 50000 objects (1508 outliers) | aloi-27d-50000-max5-tot1508.csv.gz aloi-64d-50000-max5-tot1508.csv.gz |
    
  5. MovieLens full 25-million recommendation data 🎬

    • kaggle.com
    Updated Apr 15, 2023
    + more versions
    Cite
    iulia (2023). MovieLens full 25-million recommendation data 🎬 [Dataset]. https://www.kaggle.com/datasets/patriciabrezeanu/movielens-full-25-million-recommendation-data
    Explore at:
    Dataset updated
    Apr 15, 2023
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    iulia
    Description

    Summary: This dataset (ml-25m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 25000095 ratings and 1093360 tag applications across 62423 movies. These data were created by 162541 users between January 09, 1995, and November 21, 2019. This dataset was generated on November 21, 2019. Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided. The data are contained in the files genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv, and tags.csv. More details about the contents and use of all these files follow. This and other GroupLens data sets are publicly available for download at
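
    The claim that every selected user rated at least 20 movies can be checked directly against ratings.csv. The following is a minimal standard-library sketch; it assumes ratings.csv has a header row with a userId column (alongside movieId, rating, and timestamp), and the helper names are hypothetical.

```python
import csv
import io
from collections import Counter


def ratings_per_user(csv_file):
    """Count rating rows per user from an open ratings.csv-style file."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # ASSUMPTION: the user identifier column is named "userId".
        counts[row["userId"]] += 1
    return counts


def users_below_minimum(counts, minimum=20):
    """Return the ids of users with fewer than `minimum` ratings."""
    return sorted(user for user, n in counts.items() if n < minimum)


# Tiny in-memory stand-in for ratings.csv, for illustration only.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,100\n"
    "1,11,3.5,101\n"
    "2,10,5.0,102\n"
)
counts = ratings_per_user(sample)
```

    On the real file, users_below_minimum(counts) would be expected to return an empty list if the stated selection rule holds.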

  6. SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests...

    • b2find.dkrz.de
    Updated Sep 17, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    (2024). SDOstreamclust: Stream Clustering Robust to Concept Drift - Evaluation Tests - Dataset - B2FIND [Dataset]. https://b2find.dkrz.de/dataset/7e9eb5b9-f166-567e-a521-f3b3be884bf2
    Explore at:
    Dataset updated
    Sep 17, 2024
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    SDOstreamclust Evaluation Tests, conducted for the paper: Stream Clustering Robust to Concept Drift

    Context and methodology

    SDOstreamclust is a stream clustering algorithm able to process data incrementally or in batches. It is a combination of the previous SDOstream (anomaly detection in data streams) and SDOclust (static clustering). SDOstreamclust retains the characteristics of SDO algorithms: lightweight, intuitive, self-adjusting, resistant to noise, capable of identifying non-convex clusters, and built upon robust parameters and interpretable models. Moreover, it shows excellent adaptation to concept drift.

    In this repository, SDOclust is evaluated with 165 datasets (both synthetic and real) and compared with CluStream, DBstream, DenStream, StreamKMeans. This repository is framed within the research on the following domains: algorithm evaluation, stream clustering, unsupervised learning, machine learning, data mining, streaming data analysis. Datasets and algorithms can be used for experiment replication and for further evaluation and comparison.

    Docker

    A Docker version is also available at: https://hub.docker.com/r/fiv5/sdostreamclust

    Technical details

    Experiments are conducted in Python v3.8.14. The file and folder structure is as follows:

    • [algorithms] contains a script with functions related to algorithm configurations.
    • [data] contains datasets in ARFF format.
    • [results] contains CSV files with algorithms' performances obtained from running the "run.sh" script (as shown in the paper).
    • "dependencies.sh" lists and installs Python dependencies.
    • "pysdoclust-stream-main.zip" contains the SDOstreamclust Python package.
    • "README.md" shows details and instructions to use this repository.
    • "run.sh" runs the complete experiments.
    • "run_comp.py" runs experiments specified by arguments.
    • "TSindex.py" implements functions for the Temporal Silhouette index.

    Note: if the code in SDOstreamclust is modified, the SWIG (v4.2.1) wrappers have to be rebuilt and SDOstreamclust then reinstalled with pip.

  7. MNAD Dataset

    • paperswithcode.com
    Updated May 16, 2023
    + more versions
    Cite
    (2023). MNAD Dataset [Dataset]. https://paperswithcode.com/dataset/mnad
    Explore at:
    Dataset updated
    May 16, 2023
    Description

    About the MNAD Dataset

    The MNAD corpus is a collection of over 1 million Moroccan news articles written in the modern Arabic language. These news articles have been gathered from 11 prominent electronic news sources. The dataset is made available to the academic community for research purposes, such as data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), and other non-commercial activities.

    Dataset Fields

    • Title: The title of the article
    • Body: The body of the article
    • Category: The category of the article
    • Source: The electronic newspaper source of the article

    About Version 1 of the Dataset (MNAD.v1)

    Version 1 of the dataset comprises 418,563 articles classified into 19 categories. The data was collected from well-known electronic news sources, namely Akhbarona.ma, Hespress.ma, Hibapress.com, and Le360.com. The articles were stored in four separate CSV files, each corresponding to the news website source. Each CSV file contains three fields: Title, Body, and Category of the news article.

    The dataset is rich in Arabic vocabulary, with approximately 906,125 unique words. It has been utilized as a benchmark in the research paper: "A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization". In 2021 International Conference on Decision Aid Sciences and Application (DASA).

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv1
    • Huggingface Datasets: MNADv1

    About Version 2 of the Dataset (MNAD.v2)

    Version 2 of the MNAD dataset includes an additional 653,901 articles, bringing the total number of articles to over 1 million (1,069,489), classified into the same 19 categories as in version 1. The new documents were collected from seven additional prominent Moroccan news websites, namely al3omk.com, medi1news.com, alayam24.com, anfaspress.com, alyaoum24.com, barlamane.com, and SnrtNews.com.

    The newly collected articles have been merged with the articles from the previous version into a single CSV file named MNADv2.csv. This file includes an additional column called "Source" to indicate the source of each news article.

    Furthermore, MNAD.v2 incorporates improved pre-processing techniques and data cleaning methods. These enhancements involve removing duplicates, eliminating multiple spaces, discarding rows with NaN values, replacing new lines with " ", excluding very long and very short articles, and removing non-Arabic articles. These additions and improvements aim to enhance the usability and value of the MNAD dataset for researchers and practitioners in the field of Arabic text analysis.
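
    The cleaning steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MNAD authors' pipeline: the length thresholds, the 50% Arabic-letter ratio, the use of the basic Arabic Unicode block (U+0600 to U+06FF) to detect Arabic text, the duplicate key, and the order in which the steps run are all hypothetical choices.

```python
import re

FIELDS = ("Title", "Body", "Category", "Source")


def arabic_ratio(text):
    """Share of alphabetic characters that fall in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def clean_articles(rows, min_chars=50, max_chars=20000, min_arabic=0.5):
    """Apply the listed cleaning steps to a list of dict rows.

    All thresholds are hypothetical; the source does not state them.
    """
    seen, cleaned = set(), []
    for row in rows:
        # Discard rows with missing values (the analogue of NaN rows).
        if any(row.get(field) in (None, "") for field in FIELDS):
            continue
        # Replace new lines with " ", then collapse multiple spaces.
        body = re.sub(r" {2,}", " ", row["Body"].replace("\n", " ")).strip()
        # Exclude very short and very long articles.
        if not (min_chars <= len(body) <= max_chars):
            continue
        # Remove non-Arabic articles.
        if arabic_ratio(body) < min_arabic:
            continue
        # Remove duplicates (ASSUMPTION: same title and cleaned body).
        key = (row["Title"], body)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**row, "Body": body})
    return cleaned
```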

    This dataset is available for download from the following sources:

    • Kaggle Datasets: MNADv2
    • Huggingface Datasets: MNADv2

    Citation

    If you use our data, please cite the following paper:

    @inproceedings{MNAD2021,
      author = {Mourad Jbene and Smail Tigani and Rachid Saadane and Abdellah Chehri},
      title = {A Moroccan News Articles Dataset ({MNAD}) For Arabic Text Categorization},
      year = {2021},
      publisher = {{IEEE}},
      booktitle = {2021 International Conference on Decision Aid Sciences and Application ({DASA})},
      doi = {10.1109/dasa53625.2021.9682402},
      url = {https://doi.org/10.1109/dasa53625.2021.9682402},
    }

  8. Robust phenotyping of highly multiplexed tissue imaging data using...

    • zenodo.org
    zip
    Updated Jul 6, 2023
    Cite
    Candace C Liu; Michael Angelo (2023). Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering (lymph node MIBI-TOF data) [Dataset]. http://doi.org/10.5281/zenodo.8096953
    Explore at:
    Available download formats: zip
    Dataset updated
    Jul 6, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Candace C Liu; Michael Angelo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MIBI-TOF data for lymph node dataset reported in Liu et al., Robust phenotyping of highly multiplexed tissue imaging data using pixel-level clustering

    1. mibi_single_channel_tifs.zip: Single-channel MIBI-TOF images

    Folders are labeled according to the field-of-view (FOV) number. Each folder contains single-channel TIFFs for each marker in the panel. Images are 1024x1024 pixels, 500 µm. See paper for details.

    2. segmentation.zip: Segmentation output of MIBI-TOF images

    Cell segmentation was performed using Mesmer (Greenwald NF, Nature Biotechnology 2021). Output of Mesmer that delineates the single cells in each of the images is included.

    3. source_data.zip: Source data files for figures

    • pixel_ccs_allpreprocessing.csv: Cluster consistency score (CCS) for all pixels using all preprocessing steps, related to Fig. 2d-f, Supp. Fig. 4,5,9,10
    • pixel_ccs_nopixelnorm.csv: CCS for all pixels where pixel normalization was left out, related to Fig. 2f, Supp. Fig. 6
    • pixel_ccs_nochannelnorm.csv: CCS for all pixels where channel normalization was left out, related to Fig. 2f, Supp. Fig. 8
    • pixel_ccs_passes1.csv: CCS for all pixels where 1 pass was used for SOM training, related to Supp. Fig. 10
    • pixel_ccs_passes100.csv: CCS for all pixels where 100 passes were used for SOM training, related to Supp. Fig. 10
    • pixel_ccs_sigma0.csv: CCS for all pixels where a Gaussian blur sigma of 0 was used for preprocessing, related to Supp. Fig. 5
    • pixel_ccs_sigma1.csv: CCS for all pixels where a Gaussian blur sigma of 1 was used for preprocessing, related to Supp. Fig. 5
    • pixel_ccs_sigma3.csv: CCS for all pixels where a Gaussian blur sigma of 3 was used for preprocessing, related to Supp. Fig. 5
    • pixel_ccs_nodes15.csv: CCS for all pixels where 15 nodes were used for SOM training, related to Supp. Fig. 9
    • pixel_ccs_threshold80.csv: CCS for all pixels where a threshold of 80% was used for CCS calculation, related to Supp. Fig. 4b
    • pixel_ccs_threshold98.csv: CCS for all pixels where a threshold of 98% was used for CCS calculation, related to Supp. Fig. 4b
    • pixel_info_comparison_table.csv: Number of pixels that were assigned to a cluster outside of cell segmentation masks, related to Fig. 3d
    • single_cell_pixel_composition_table.csv: Pixel composition information for each single cell, related to Fig. 5, Supp. Fig. 16
    • single_cell_integrated_expression_table.csv: Integrated expression per cell, output by Mesmer, related to Fig. 5, Supp. Fig. 16
    • cell_silhouette_scores.csv: Silhouette scores for comparing integrated expression and pixel composition, related to Fig. 5d
    • cell_ccs_pixel_composition.csv: CCS for all cells using pixel composition for clustering, related to Supp. Fig. 16e, 17c
    • cell_ccs_integrated_expression.csv: CCS for all cells using integrated expression for clustering, related to Supp. Fig. 16e-f
    • cell_ccs_integrated_expression_preprocessed.csv: CCS for all cells using integrated expression for clustering where data was preprocessed before integrating, related to Supp. Fig. 17
    • cytof_ccs.csv: CCS of the CyTOF dataset used as a benchmark, related to Supp. Fig. 4c,d
    • scrnaseq_ccs.csv: CCS of the scRNA-seq dataset used as a benchmark, related to Supp. Fig. 4c,e
    • pixel_phenotype_maps: TIFFs where pixel value corresponds to pixel cluster number as reported in the paper
    • cell_phenotype_maps: TIFFs where pixel value corresponds to cell cluster number as reported in the paper
  9. Data from: MusicOSet: An Enhanced Open Dataset for Music Data Mining

    • zenodo.org
    • data.niaid.nih.gov
    bin, zip
    Updated Jun 7, 2021
    Cite
    Mariana O. Silva; Laís Mota; Mirella M. Moro (2021). MusicOSet: An Enhanced Open Dataset for Music Data Mining [Dataset]. http://doi.org/10.5281/zenodo.4904639
    Explore at:
    Available download formats: zip, bin
    Dataset updated
    Jun 7, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Mariana O. Silva; Laís Mota; Mirella M. Moro
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MusicOSet is an open and enhanced dataset of musical elements (artists, songs and albums) based on musical popularity classification. It provides a directly accessible collection of data suitable for numerous tasks in music data mining (e.g., data visualization, classification, clustering, similarity search, MIR, HSS and so forth). To create MusicOSet, the potential information sources were divided into three main categories: music popularity sources, metadata sources, and acoustic and lyrical features sources. Data from all three categories were initially collected between January and May 2019. The data were then updated and enhanced in June 2019.

    The attractive features of MusicOSet include:

    • Integration and centralization of different musical data sources
    • Calculation of popularity scores and classification of hit and non-hit musical elements, covering 1962 to 2018
    • Enriched metadata for music, artists, and albums from the US popular music industry
    • Availability of acoustic and lyrical resources
    • Unrestricted access in two formats: SQL database and compressed .csv files

    |       Data        | # Records |
    |:-----------------:|:---------:|
    | Songs             | 20,405    |
    | Artists           | 11,518    |
    | Albums            | 26,522    |
    | Lyrics            | 19,664    |
    | Acoustic Features | 20,405    |
    | Genres            | 1,561     |

  10. The BenchStab dataset: a dataset for comparing mutational predictors of...

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Oct 4, 2024
    Cite
    Jan Velecký; Matej Berezný; Miloš Musil; Jiri Damborsky; David Bednar; Stanislav Mazurenko (2024). The BenchStab dataset: a dataset for comparing mutational predictors of stability [Dataset]. http://doi.org/10.5281/zenodo.10637728
    Explore at:
    Available download formats: csv
    Dataset updated
    Oct 4, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jan Velecký; Matej Berezný; Miloš Musil; Jiri Damborsky; David Bednar; Stanislav Mazurenko
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is a part of BenchStab, a command-line tool for querying and benchmarking web-based protein stability predictors. We created the dataset to independently evaluate 18 structure-enabled and 4 sequence-based predictors of a stability change upon mutation. We suggest that this dataset should be excluded from training and validation of future stability predictors.

    The dataset consists of single-point mutations and their experimentally determined ΔΔG from FireProtDB, utilizing only records with both a ΔΔG measurement and a PDB accession code available. We eliminated all records similar to the data used in the training set of any of the predictors considered in BenchStab using UniRef50 clusters. This resulted in 289 records for 36 proteins, of which 28 % display a stabilizing effect (negative value of ΔΔG; see DDG distribution.png for the exact distribution). We further confirmed, by employing SCOP fold-based structure clustering, that the folds of 25 of our proteins were not present in the training sets.
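
    The sign convention above (a negative ΔΔG counts as stabilizing) can be made concrete with a short sketch. It only illustrates how the stabilizing share is computed from a list of ΔΔG values; the function name and the toy values are hypothetical, and no column names of dataset.csv are assumed.

```python
def stabilizing_fraction(ddg_values):
    """Fraction of mutations with a stabilizing effect (negative ΔΔG)."""
    values = list(ddg_values)
    if not values:
        raise ValueError("no ΔΔG values given")
    stabilizing = sum(1 for value in values if value < 0)
    return stabilizing / len(values)


# Toy ΔΔG values for illustration only, not taken from the dataset.
share = stabilizing_fraction([-1.2, 0.5, 2.0, -0.3])
```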

    The file dataset.csv contains specifications of mutations (including the chain) and the ground truth ΔΔG reported from the literature alongside accession codes from FireProtDB (experiment ID), UniProt and Protein Data Bank, and UniRef50 cluster IDs. The file benchstab_input.csv contains the same data in the input format of the BenchStab tool.

    For more statistics and details about the dataset, please read the supplement of the paper or get in touch with us.

  11. Hierarchical Representations of Freebase Topics

    • figshare.com
    • dataverse.harvard.edu
    • +1 more
    application/x-rar
    Updated May 31, 2023
    Cite
    Mahmoud Elbattah (2023). Hierarchical Representations of Freebase Topics [Dataset]. http://doi.org/10.6084/m9.figshare.6530825.v3
    Explore at:
    Available download formats: application/x-rar
    Dataset updated
    May 31, 2023
    Dataset provided by
    figshare
    Authors
    Mahmoud Elbattah
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The dataset contains more than 21M hierarchical relationships about ≈10M topics extracted from the Freebase knowledge base. The topics span the various categories of Freebase including Science & Technology, Arts & Entertainment, Sports, Society, Products & Services, Transportation, Time & Space, Special Interests, and Commons. The relationships describe the hierarchies of topics in terms of Types, Domains, and Categories. For example, ‘Albert Einstein’ can be found as a topic that is a sub-class of ‘Person’, belonging to the ‘People’ domain and ‘Society’ category, while another entity named ‘Albert Einstein’ can also be found as a sub-class of ‘Book’, belonging to the ‘Books’ domain and ‘Arts & Entertainment’ category.

    The dataset is published in JSON and CSV formats, and sample files are provided to help explore how the dataset is structured. The dataset is believed to be useful for studying the inter-related connections among topics in different domains of knowledge. The first author may be contacted at (mahmoud.elbattah@nuigalway.ie) for more information.

    The following paper may kindly be cited in case of using the dataset: Mahmoud Elbattah, Mohamed Roushdy, Mostafa Aref, Abdel-Badeeh M. Salem. “Large-Scale Entity Clustering Using Graph-Based Structural Similarity within Knowledge Graphs”, Big Data Analytics: Tools, Technology for Effective Planning, CRC Press. https://www.researchgate.net/publication/321716589_Large-Scale_Entity_Clustering_Based_on_Structural_Similarity_within_Knowledge_Graphs

  12. NUIG_EyeGaze01(Labelled eye gaze dataset)

    • data.mendeley.com
    Updated Feb 27, 2019
    Cite
    NUIG_EyeGaze01(Labelled eye gaze dataset) [Dataset]. https://data.mendeley.com/datasets/cfm4d9y7bh/1
    Explore at:
    Dataset updated
    Feb 27, 2019
    Authors
    Anuradha Kar
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    The NUIG_EyeGaze01(Labelled eye gaze dataset) is a rich and diverse gaze dataset, built using eye gaze data from experiments done under a wide range of operating conditions on three user platforms (desktop, laptop, tablet). Gaze data is collected under one condition at a time.

    The dataset includes gaze (fixation) data collected under 17 different head poses, 4 user distances, 6 platform poses and 3 display screen sizes and resolutions. Each gaze data file is labelled with the operating condition under which it was collected and has the name format: USERNUMBER_CONDITION_PLATFORM.CSV

    CONDITION:

    • RP: Roll plus, in degrees
    • PP: Pitch plus, in degrees
    • YP: Yaw plus, in degrees
    • RM: Roll minus, in degrees
    • PM: Pitch minus, in degrees
    • YM: Yaw minus, in degrees
    • 50, 60, 70, 80: User distances

    PLATFORM: desk (Desktop), lap (Laptop), tab (Tablet)
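
    The USERNUMBER_CONDITION_PLATFORM.CSV naming scheme can be decoded with a small parser. The sketch below is an assumption-heavy illustration rather than a tool shipped with the dataset: it treats the user number as an opaque string, assumes the angle in degrees follows the two-letter condition code directly (e.g. RP20), and uses hypothetical example file names. Files for neutral poses, whose naming is not described here, are rejected.

```python
import re

PLATFORMS = {"desk": "Desktop", "lap": "Laptop", "tab": "Tablet"}
AXES = {"R": "roll", "P": "pitch", "Y": "yaw"}
DISTANCES = {"50", "60", "70", "80"}


def parse_gaze_filename(name):
    """Decode USERNUMBER_CONDITION_PLATFORM.CSV into a dict."""
    stem, dot, ext = name.rpartition(".")
    if not dot or ext.lower() != "csv":
        raise ValueError(f"not a CSV file name: {name!r}")
    parts = stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"expected three underscore-separated parts: {name!r}")
    user, condition, platform = parts
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform code: {platform!r}")
    info = {"user": user, "platform": PLATFORMS[platform]}
    # ASSUMPTION: pose codes look like RP20 (axis, plus/minus, degrees).
    pose = re.fullmatch(r"([RPY])([PM])(\d+)", condition)
    if pose:
        sign = 1 if pose.group(2) == "P" else -1
        info["condition"] = AXES[pose.group(1)]
        info["degrees"] = sign * int(pose.group(3))
    elif condition in DISTANCES:
        info["condition"] = "user_distance"
        info["distance"] = int(condition)
    else:
        raise ValueError(f"unknown condition code: {condition!r}")
    return info


# Hypothetical file name, for illustration only.
example = parse_gaze_filename("us01_RP20_desk.csv")
```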

    • Desktop display: 22 inch, 1680 x 1050 pixels
    • Laptop display: 14 inch, 1366 x 768 pixels
    • Tablet display: 10.1 inch, 1920 x 800 pixels

    Eye tracker accuracy: 0.5 degrees (for neutral head and tracker position)

    The dataset has 3 folders called “Desktop”, “Laptop”, and “Tablet”, containing gaze data from the respective platforms. The Desktop folder has 2 sub-folders: user_distance and head_pose. These have data for different user distances and head poses (neutral, roll, pitch, yaw) measured with the desktop setup. The Tablet folder has 2 sub-folders: user_distance and tablet_pose. These have data for different user distances and tablet+tracker poses (neutral, roll, pitch, yaw) measured with the tablet setup. The Laptop folder has one sub-folder called user_distance, which has data for different user distances measured with the laptop setup.

    All data files are in CSV format. Each file contains the following data header fields:

    ("TIM REL","GTX", "GTY","XRAW", "YRAW","GT Xmm", "GT Ymm","Xmm", "Ymm","YAW GT", "YAW DATA","PITCH GT", "PITCH DATA","GAZE GT","GAZE ANG", "DIFF GZ", "AOI_IND","AOI_X","AOI_Y","MEAN_ERR","STD ERR")

    The meanings of the header fields are as follows:

    • "TIM REL": Relative time stamp for each gaze data point (measured during data collection)
    • "GTX", "GTY": Ground truth x, y positions in pixels
    • "XRAW", "YRAW": Raw gaze data x, y coordinates in pixels
    • "GT Xmm", "GT Ymm": Ground truth x, y positions in mm
    • "Xmm", "Ymm": Gaze x, y positions in mm
    • "YAW GT", "YAW DATA": Ground truth and estimated yaw angles
    • "PITCH GT", "PITCH DATA": Ground truth and estimated pitch angles
    • "GAZE GT", "GAZE ANG": Ground truth and estimated gaze angles
    • "DIFF GZ": Gaze angular accuracy
    • "AOI_IND", "AOI_X", "AOI_Y": Index of the stimuli locations and their x, y coordinates
    • "MEAN_ERR", "STD ERR": Mean and standard deviation of error at the stimuli locations

    For more details on the purpose of this dataset and the data collection method, please consult the paper by the authors of this dataset:

    Anuradha Kar, Peter Corcoran: Performance Evaluation Strategies for Eye Gaze Estimation Systems with Quantitative Metrics and Visualizations. Sensors 18(9): 3151 (2018)

  13. Datasets for evaluating scalable supervised learning for...

    • zenodo.org
    application/gzip, bin +1
    Updated Sep 22, 2023
    Cite
    Moayad Alnammi; Shengchao Liu; Spencer S. Ericksen; Gene E. Ananiev; Andrew F. Voter; Song Guo; James L. Keck; F. Michael Hoffmann; Scott A. Wildman; Anthony Gitter (2023). Datasets for evaluating scalable supervised learning for synthesize-on-demand chemical libraries [Dataset]. http://doi.org/10.5281/zenodo.5348291
    Explore at:
    Available download formats: bin, application/gzip, tsv
    Dataset updated
    Sep 22, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Moayad Alnammi; Shengchao Liu; Spencer S. Ericksen; Gene E. Ananiev; Andrew F. Voter; Song Guo; James L. Keck; F. Michael Hoffmann; Scott A. Wildman; Anthony Gitter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository contains datasets for the manuscript "Evaluating scalable supervised learning for synthesize-on-demand chemical libraries":

    • ams_all_preds.csv.gz: The AMS dataset predictions when using an RF or baseline model trained on the training dataset. Includes the predicted score and rank from each model for each compound. We started with 8,434,707 AMS compounds and detected that 247,025 were in the LC or MLPCN training data. These were removed from the AMS list, leaving 8,187,682 compounds to score. The compound matching was done on the SMILES that we canonicalized in rdkit.
    • ams_order_results.csv.gz: Information about the 1,024 compounds purchased from the AMS library. Excludes the 4 AMS compounds that were incompletely dissolved. Includes the chemical feature representation, information from the vendor, RF and baseline model predictions, screening results, and clustering results.
    • baseline_weight.npy: The saved Similarity Baseline model, which consists of the active compounds in the training data. This model was used to score the AMS library. See the GitHub repository for code to load the model and make predictions on new compounds.
    • cdd_training_data.tar.gz: The LC1234 and MLPCN PriA-SSB screening data exported from CDD.
    • enamine_costs_clustered_v3_with_nneighbor.csv.gz: Contains 5,620 Enamine compounds that were selected based on the RF prediction score and availability. This file also contains the Taylor-Butina cluster ID when clustering the training compounds, 1,024 tested AMS compounds, and top-ranked Enamine compounds at a 0.4 threshold. The nearest neighbor compounds in the training and AMS sets are also included along with compound information from Enamine, RF model scores, and chemical feature representations.
    • enamine_dose_reponse_curves.tsv: The dose response curve summaries from all three runs on the 68 Enamine compounds. If a compound was tested multiple times, only the highest-quality dose response curve was used.
    • enamine_final_list.csv.gz: The final 100 filtered compounds from enamine_top_10000.csv.gz. Contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.
    • enamine_PriA-SSB_dose_response_data.tar.gz: The dose response screening data from all three runs on the 68 Enamine compounds. The 2021-06-16 run was originally screened on 2020-08-24. 2021-06-16 is the date the compound identities were corrected. This run contains two 1,536 well plates.
    • enamine_top_10000.csv.gz: Top 10,000 predictions from the Enamine REAL dataset using the selected RF model. Contains compound information from Enamine as well as RF model scores, chemical feature representations, and clustering results.
    • master_df.csv.gz: The output of preprocessing the files in cdd_training_data.tar.gz. Contains 441,900 rows.
    • random_forest_classification_139.pkl: The saved RF classification model with hyperparameter ID 139. This model was used to score the AMS and Enamine REAL libraries. See the GitHub repository directory for code to load the model and make predictions on new compounds.
    • train_ams_real_cluster.csv.gz: Contains cluster IDs for Taylor-Butina clustering at a 0.4 threshold applied to the training compounds, 1,024 tested AMS compounds, and top-ranked compounds from Enamine. Includes the chemical features, dataset to which the compound belongs, leader compound for each cluster, and whether the compound is a known hit.
    • training_df_single_fold.csv.gz: This is all ten folds in training_folds.tar.gz merged for convenience. Contains 427,300 compounds.
    • training_df_single_fold_with_ams_clustering.csv.gz: Contains cluster IDs for Taylor-Butina clustering applied to the 427,300 training compounds and the 1,024 tested AMS compounds. Different clustering results are shown at the 0.2, 0.3, and 0.4 thresholds. Includes the leader compound for each cluster. Although the training and AMS compounds were clustered jointly, only the training compounds' clusters are shown. The AMS compounds' clusters are in ams_order_results.csv.gz.
    • training_folds.tar.gz: The LC1234 and MLPCN training data split into ten folds. This dataset with 427,300 compounds was used for cross validation and model selection. This dataset is derived from master_df.csv.gz.
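    Several of the files above carry Taylor-Butina cluster IDs and a leader compound for each cluster. The following is a minimal, pure-Python sketch of that style of clustering, not the authors' code (see their GitHub repository for that). It assumes the 0.4 threshold is a Tanimoto distance cutoff, and it represents each fingerprint as a toy set of on-bit indices; the function names and sample fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_cluster(fingerprints, cutoff=0.4):
    """Greedy Taylor-Butina style clustering.

    `cutoff` is treated as a distance (1 - Tanimoto similarity); this
    interpretation of the 0.4 threshold is an assumption.
    Returns a list of clusters; the first index in each cluster is its leader.
    """
    n = len(fingerprints)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fingerprints[i], fingerprints[j]) <= cutoff:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Visit compounds with the most neighbors first; each becomes a leader
    # and claims all of its still-unassigned neighbors.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(j for j in neighbors[i] if j not in assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy fingerprints (hypothetical data, not taken from the dataset).
toy_fps = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4, 6},
    {10, 11},
    {10, 11, 12},
    {20},
]
clusters = butina_cluster(toy_fps, cutoff=0.4)
```

    Here compounds 0 to 2 form one cluster led by compound 0, compounds 3 and 4 form a second, and compound 5 is a singleton. For real molecules, a cheminformatics toolkit would supply the fingerprints and a much faster neighbor search.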

    If you use these datasets in a publication, please cite:

    Moayad Alnammi, Shengchao Liu, Spencer S. Ericksen, Gene E. Ananiev, Andrew F. Voter, Song Guo, James L. Keck, F. Michael Hoffmann, Scott A. Wildman, Anthony Gitter. Evaluating scalable supervised learning for synthesize-on-demand chemical libraries. 2021.

    See PubChem AID 1272365 and the associated publications for the original PriA-SSB screening data.

  14. 8 years of dayside Magnetospheric Multiscale (MMS) unsupervised clustering...

    • data.niaid.nih.gov
    Updated Jan 11, 2024
    + more versions
    Cite
    Toy-Edens, Vicki (2024). 8 years of dayside Magnetospheric Multiscale (MMS) unsupervised clustering plasma regions classifications [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10491877
    Explore at:
    Dataset updated
    Jan 11, 2024
    Dataset authored and provided by
    Toy-Edens, Vicki
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    These files contain the 1-minute resolution dataset (“labeled_sunside_data.csv”) and 15 minute or longer region list (“

    If you use any part of this dataset, please cite Toy-Edens et al., "Classifying 8 years of MMS Dayside Plasma Regions via Unsupervised Machine Learning" (submitted for review and publication to Journal of Geophysical Research: Space Research, 1/2024; DOI to be created upon acceptance).

    This work was funded by grant 2225463 from the NSF GEM program.

    The following tables detail the contents of the described files:

    labeled_sunside_data.csv description

    Column Name: Description

    • Epoch: Epoch in datetime
    • probe: MMS probe name
    • ratio_max_width: Ratio of the width of the most prominent ion spectra peak (in number of energy channels) to max number of energy channels. See paper for more information
    • ratio_high_low: Ratio of the mean of the log intensity of high energies in the ion spectra to the mean of the log intensity of low energies in the ion spectra. See paper for more information
    • norm_Btot: Magnitude of the total magnetic field normalized to 50 nT. See paper for more information
    • small_energy_mean: The denominator in ratio_high_low
    • large_energy_mean: The numerator in ratio_high_low
    • temp_total: Total temperature from the DIS moments. See paper for more information
    • r_gse_x: x position of the spacecraft in GSE
    • r_gse_y: y position of the spacecraft in GSE
    • r_gse_z: z position of the spacecraft in GSE
    • r_gsm_x: x position of the spacecraft in GSM
    • r_gsm_y: y position of the spacecraft in GSM
    • r_gsm_z: z position of the spacecraft in GSM
    • mlat: Magnetic latitude of spacecraft
    • mlt: Magnetic local time of spacecraft
    • raw_named_label: Raw cluster assigned plasma region label (allowed values: magnetosheath, magnetosphere, solar wind, ion foreshock)
    • modified_named_label: Cleansed cluster assigned plasma region label (use these unless you have a specific reason to use raw labels). See paper for more information
    • transition_name: Transition names (e.g. quasi-perpendicular bow shock, magnetopause). See paper for more information
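    As a rough illustration of how the columns above might be used, the sketch below counts 1-minute samples per probe and plasma region using the cleansed labels. It uses only the Python standard library and an in-memory sample. The sample rows, the probe value "mms1", and the Epoch string format are assumptions for illustration; only the column names come from the description.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; a real run would open labeled_sunside_data.csv instead.
SAMPLE = """Epoch,probe,raw_named_label,modified_named_label
2016-01-01 00:00:00,mms1,magnetosheath,magnetosheath
2016-01-01 00:01:00,mms1,solar wind,magnetosheath
2016-01-01 00:02:00,mms1,solar wind,solar wind
"""


def count_regions(handle, label_column="modified_named_label"):
    """Count rows (1-minute samples) per (probe, region label) pair."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["probe"], row[label_column])] += 1
    return counts


region_counts = count_regions(io.StringIO(SAMPLE))
raw_counts = count_regions(io.StringIO(SAMPLE), label_column="raw_named_label")
```

    Passing label_column="raw_named_label" switches to the raw cluster labels, which the description recommends only when there is a specific reason to use them.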

    Column Name: Description

    • start: Starting Epoch in datetime
    • stop: Stopping Epoch in datetime
    • probe: MMS probe name
    • region: Cleansed cluster name associated with 1-minute resolution “modified_named_label”

  15. Reporting behavior from WHO COVID-19 public data

    • data.niaid.nih.gov
    • search.dataone.org
    • +1more
    zip
    Updated Dec 16, 2022
    + more versions
    Cite
    Auss Abbood (2022). Reporting behavior from WHO COVID-19 public data [Dataset]. http://doi.org/10.5061/dryad.9s4mw6mmb
    Explore at:
    zipAvailable download formats
    Dataset updated
    Dec 16, 2022
    Dataset provided by
    Robert Koch Institute (https://www.rki.de/)
    Authors
    Auss Abbood
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Objective: Daily COVID-19 data reported by the World Health Organization (WHO) may provide the basis for political ad hoc decisions including travel restrictions. Data reported by countries, however, is heterogeneous and metrics to evaluate its quality are scarce. In this work, we analyzed COVID-19 case counts provided by WHO and developed tools to evaluate country-specific reporting behaviors.

    Methods: In this retrospective cross-sectional study, COVID-19 data reported daily to WHO from 3rd January 2020 until 14th June 2021 were analyzed. We proposed the concepts of binary reporting rate and relative reporting behavior and performed descriptive analyses for all countries with these metrics. We developed a score to evaluate the consistency of incidence and binary reporting rates. Further, we performed spectral clustering of the binary reporting rate and relative reporting behavior to identify salient patterns in these metrics.

    Results: Our final analysis included 222 countries and regions. Reporting scores varied between -0.17, indicating discrepancies between incidence and binary reporting rate, and 1.0, suggesting high consistency of these two metrics. Median reporting score for all countries was 0.71 (IQR 0.55 to 0.87). Descriptive analyses of the binary reporting rate and relative reporting behavior showed constant reporting with a slight “weekend effect” for most countries, while spectral clustering demonstrated that some countries had even more complex reporting patterns.

    Conclusion: The majority of countries reported COVID-19 cases when they did have cases to report. The identification of a slight “weekend effect” suggests that COVID-19 case counts reported in the middle of the week may represent the best data basis for political ad hoc decisions. A few countries, however, showed unusual or highly irregular reporting that might require more careful interpretation. Our score system and cluster analyses might be applied by epidemiologists advising policymakers to consider country-specific reporting behaviors in political ad hoc decisions.

    Methods

    Data collection: COVID-19 data was downloaded from WHO. Using a public repository, we have added the countries' full names to the WHO data set using the two-letter abbreviations for each country to merge both data sets. The provided COVID-19 data covers January 2020 until June 2021. We uploaded the final data set used for the analyses of this paper.

    Data processing: We processed data using a Jupyter Notebook with a Python kernel and publicly available external libraries. This upload contains the required Jupyter Notebook (reporting_behavior.ipynb) with all analyses and some additional work, a README, and the conda environment yml (env.yml).
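    The data collection step (attaching full country names through two-letter codes) amounts to a keyed merge. A minimal sketch of that idea follows, using only the standard library. The column names (Country_code, New_cases, alpha-2, name) and the sample rows are assumptions for illustration; the actual notebook (reporting_behavior.ipynb) may do this differently.

```python
import csv
import io

# Hypothetical excerpts; the real inputs are the WHO download and a
# public country-code list.
WHO_SAMPLE = """Date_reported,Country_code,New_cases
2020-03-01,DE,51
2020-03-01,FR,43
2020-03-01,XX,0
"""

NAMES_SAMPLE = """alpha-2,name
DE,Germany
FR,France
"""


def add_country_names(who_handle, names_handle):
    """Left-join full country names onto WHO rows via the two-letter code.

    Rows whose code has no match keep a country_name of None so that
    unmatched codes stay visible instead of being silently dropped.
    """
    names = {row["alpha-2"]: row["name"] for row in csv.DictReader(names_handle)}
    merged = []
    for row in csv.DictReader(who_handle):
        row = dict(row)
        row["country_name"] = names.get(row["Country_code"])
        merged.append(row)
    return merged


merged = add_country_names(io.StringIO(WHO_SAMPLE), io.StringIO(NAMES_SAMPLE))
unmatched = [r["Country_code"] for r in merged if r["country_name"] is None]
```

    Keeping unmatched rows (a left join) makes it easy to list codes that need manual attention before any per-country analysis.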

  16. Additional TAU datasets for Wi-Fi fingerprinting-based positioning

    • data.niaid.nih.gov
    • zenodo.org
    Updated May 13, 2020
    Cite
    Lohan (2020). Additional TAU datasets for Wi-Fi fingerprinting-based positioning [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_3819916
    Explore at:
    Dataset updated
    May 13, 2020
    Dataset authored and provided by
    Lohan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Contents

    This document describes two datasets collected at Tampere University facilities with samples taken from a Wi-Fi network interface for experiments with indoor positioning based on Wi-Fi fingerprinting.

    To reference this dataset, please use

    E.S. Lohan et al. “Additional TAU datasets for Wi-Fi fingerprinting-based positioning” 10.5281/zenodo.3819917

    Additional reference using these datasets

    Torres-Sospedra, J.; Quezada-Gaibor, D.; Mendoza-Silva, G. M.; Nurmi, J.; Koucheryavy, Y. and Huerta, J. New Cluster Selection and Fine-grained Search for k-Means Clustering and Wi-Fi Fingerprinting Proceedings of the Tenth International Conference on Localization and GNSS (ICL-GNSS), 2020.

    Dataset format

    Two independent datasets are provided in separate folders, “Database_Building01” and “Database_Building02”. Each dataset includes two sets of samples:

    radio map – a set of Wi-Fi samples collected at a grid of points (reference points);

    evaluation – a set of Wi-Fi samples randomly collected in the evaluation area.

    Two files are provided for each set: one with the RSS vectors and one with the coordinates. Radio map file names start with “rm_” and evaluation file names start with “eval_”. For instance, the radio map files are:

    rm_crd.csv: holds the coordinates (x, y) and floor identifier (z) where the samples were collected;

    rm_rss.csv: holds the measured RSSI values from each of the Access Points (AP) detected in each sample.

    All files use the same format: CSV (Comma Separated Values) plain text, UTF-8 encoded.

    Coordinates: Each sample is associated with a pair of coordinates in a 2D Euclidean reference system. The origin of the reference system was chosen arbitrarily for convenience. The units are meters, so distances between points can be easily calculated. Moreover, the floor identifier is included to enable 3D positioning.

    RSSI values: The RSSI values are provided as read from the Wi-Fi network interface through the Android API. In each sample, a value of +100 was assigned to each AP not detected during a measurement. No information is provided about the MAC addresses of the APs. However, the same column order is used for all samples, so the values in each column are all associated with the same AP.

    Both datasets are independent and none of the provided files include an identifier for each sample. The values in the two provided files are associated by the line number, meaning that the coordinates and RSSI values in the same line, in each file, refer to the same sample.
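    Because the two files are linked only by line number, a loader has to read them in lockstep. The sketch below pairs coordinates with RSSI vectors, maps the +100 placeholder to None, and computes a planar distance in meters. It assumes the files have no header row and that the coordinate columns are ordered x, y, floor; the sample values are invented.

```python
import csv
import io
import math

NOT_DETECTED = 100.0  # placeholder value for APs not seen in a sample

# Hypothetical two-sample excerpts of rm_crd.csv and rm_rss.csv.
RM_CRD = "0.0,0.0,1\n3.0,4.0,1\n"
RM_RSS = "-60,100,-75\n-70,-80,100\n"


def load_samples(crd_handle, rss_handle):
    """Pair coordinate and RSSI rows by line number."""
    samples = []
    for crd, rss in zip(csv.reader(crd_handle), csv.reader(rss_handle)):
        values = [None if float(v) == NOT_DETECTED else float(v) for v in rss]
        samples.append(
            {"x": float(crd[0]), "y": float(crd[1]), "floor": crd[2], "rss": values}
        )
    return samples


def planar_distance_m(a, b):
    """Euclidean distance in meters between two samples on the same floor."""
    return math.hypot(a["x"] - b["x"], a["y"] - b["y"])


samples = load_samples(io.StringIO(RM_CRD), io.StringIO(RM_RSS))
distance = planar_distance_m(samples[0], samples[1])
```

    Replacing +100 with None keeps the placeholder from being mistaken for a very strong signal when vectors are compared.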

  17. Dataset of mHealth event logs

    • figshare.com
    pdf
    Updated May 1, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Raoul Nuijten; Pieter Van Gorp (2022). Dataset of mHealth event logs [Dataset]. http://doi.org/10.6084/m9.figshare.19688730.v2
    Explore at:
    pdfAvailable download formats
    Dataset updated
    May 1, 2022
    Dataset provided by
    figshare
    Authors
    Raoul Nuijten; Pieter Van Gorp
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    How does Facebook always seem to know what the next funny video should be to sustain your attention with the platform? Facebook has not asked you whether you like videos of cats doing something funny: it just seems to know. In fact, Facebook learns through your behavior on the platform (e.g., how long you have engaged with similar movies, what posts you have previously liked or commented on, etc.). As a result, Facebook is able to sustain the attention of its users for a long time. On the other hand, typical mHealth apps suffer from rapidly collapsing user engagement levels. To sustain engagement levels, mHealth apps nowadays employ all sorts of intervention strategies. Of course, it would be powerful to know, as Facebook does, what strategy should be presented to which individual to sustain their engagement. To be able to do that, a first step could be to cluster similar users (and then derive intervention strategies from there). This dataset was collected through a single mHealth app over 8 different mHealth campaigns (i.e., scientific studies). Using this dataset, one could derive clusters from app user event data. One approach could be to differentiate between two phases: a process mining phase and a clustering phase. In the process mining phase, one may derive from the dataset the processes (i.e., sequences of app actions) that users undertake. In the clustering phase, based on the processes different users engaged in, one may cluster similar users (i.e., users that perform similar sequences of app actions).

    List of files

    • 0-list-of-variables.pdf includes an overview of different variables within the dataset.
    • 1-description-of-endpoints.pdf includes a description of the unique endpoints that appear in the dataset.
    • 2-requests.csv includes the dataset with actual app user event data.
    • 2-requests-by-session.csv includes the dataset with actual app user event data with a session variable, to differentiate between user requests that were made in the same session.
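    The two-phase idea in the description (derive each user's sequence of app actions, then group users with similar sequences) can be sketched in a few lines. This is only one possible reading, not the dataset authors' method: the event tuples, endpoint names, the bigram-based Jaccard similarity, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical events as (user_id, timestamp, endpoint); the real column
# names are documented in 0-list-of-variables.pdf.
EVENTS = [
    ("u1", 1, "login"), ("u1", 2, "view_task"), ("u1", 3, "submit"),
    ("u2", 1, "login"), ("u2", 2, "view_task"), ("u2", 3, "submit"),
    ("u3", 1, "login"), ("u3", 2, "leaderboard"),
]


def sequences_by_user(events):
    """Process-mining phase (simplified): ordered action sequence per user."""
    sequences = defaultdict(list)
    for user, _timestamp, action in sorted(events, key=lambda e: (e[0], e[1])):
        sequences[user].append(action)
    return dict(sequences)


def bigrams(sequence):
    """Set of consecutive action pairs, a crude summary of a process."""
    return set(zip(sequence, sequence[1:]))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_users(sequences, min_similarity=0.5):
    """Clustering phase (simplified): greedy grouping against each cluster's first user."""
    clusters = []
    for user in sorted(sequences):
        for cluster in clusters:
            representative = sequences[cluster[0]]
            if jaccard(bigrams(sequences[user]), bigrams(representative)) >= min_similarity:
                cluster.append(user)
                break
        else:
            clusters.append([user])
    return clusters


user_sequences = sequences_by_user(EVENTS)
user_clusters = cluster_users(user_sequences)
```

    A real analysis would likely use a dedicated process mining tool and a proper clustering algorithm; the greedy pass here only shows how sequences feed into a similarity measure.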

  18. Data from: Defaunation increases the spatial clustering of lowland Western...

    • datadryad.org
    • data.subak.org
    • +2more
    zip
    Updated Jan 18, 2018
    Cite
    Robert Bagchi; Varun Swamy; Jean-Paul Latorre Farfan; John Terborgh; César I. A. Vela; Nigel C. A. Pitman; Washington Galiano Sanchez (2018). Defaunation increases the spatial clustering of lowland Western Amazonian tree communities [Dataset]. http://doi.org/10.5061/dryad.88bq8
    Explore at:
    zipAvailable download formats
    Dataset updated
    Jan 18, 2018
    Dataset provided by
    Dryad
    Authors
    Robert Bagchi; Varun Swamy; Jean-Paul Latorre Farfan; John Terborgh; César I. A. Vela; Nigel C. A. Pitman; Washington Galiano Sanchez
    Time period covered
    2018
    Area covered
    South America, Madre de Dios river basin, Peru
    Description

    1. Declines of large vertebrates in tropical forests may reduce dispersal of tree species that rely on them, and the resulting undispersed seedlings might suffer increased distance- and density-dependent mortality. Consequently, extirpation of large vertebrates may alter the composition and spatial structure of plant communities and impair ecosystem functions like carbon storage.

    2. We analysed spatial patterns of tree recruitment within six forest plots along a defaunation gradient in western Amazonia. We divided recruits into two size cohorts (“saplings”, ≥1 m tall and <1 cm diameter at breast height [dbh], and juveniles, 1 – 2 cm dbh) and examined the spatial organization of conspecific recruits within each cohort (within-cohort) and around conspecific reproductive-sized trees (between-cohort). We used replicated spatial point pattern analysis to quantify relationships between recruit clustering and cohort, defaunation intensity, each tree species’ reliance on hunted dispersers an...

  19. 2010 County and City-Level Water-Use Data and Associated Explanatory...

    • catalog.data.gov
    • data.usgs.gov
    • +4more
    Updated Jul 6, 2024
    + more versions
    Cite
    U.S. Geological Survey (2024). 2010 County and City-Level Water-Use Data and Associated Explanatory Variables [Dataset]. https://catalog.data.gov/dataset/2010-county-and-city-level-water-use-data-and-associated-explanatory-variables
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    This data release contains the input-data files and R scripts associated with the analysis presented in [citation of manuscript]. The spatial extent of the data is the contiguous U.S. The input-data files include one comma separated value (csv) file of county-level data, and one csv file of city-level data.

    The county-level csv (“county_data.csv”) contains data for 3,109 counties. This data includes two measures of water use, descriptive information about each county, three grouping variables (climate region, urban class, and economic dependency), and contains 18 explanatory variables: proportion of population growth from 2000-2010, fraction of withdrawals from surface water, average daily water yield, mean annual maximum temperature from 1970-2010, 2005-2010 maximum temperature departure from the 40-year maximum, mean annual precipitation from 1970-2010, 2005-2010 mean precipitation departure from the 40-year mean, Gini income disparity index, percent of county population with at least some college education, Cook Partisan Voting Index, housing density, median household income, average number of people per household, median age of structures, percent of renters, percent of single family homes, percent apartments, and a numeric version of urban class.

    The city-level csv (city_data.csv) contains data for 83 cities. This data includes descriptive information for each city, water-use measures, one grouping variable (climate region), and 6 explanatory variables: type of water bill (increasing block rate, decreasing block rate, or uniform), average price of water bill, number of requirement-oriented water conservation policies, number of rebate-oriented water conservation policies, aridity index, and regional price parity.

    The R scripts construct fixed-effects and Bayesian Hierarchical regression models. The primary difference between these models relates to how they handle possible clustering in the observations that define unique water-use settings. Fixed-effects models address possible clustering in one of two ways. In a "fully pooled" fixed-effects model, any clustering by group is ignored, and a single, fixed estimate of the coefficient for each covariate is developed using all of the observations. Conversely, in an unpooled fixed-effects model, separate coefficient estimates are developed only using the observations in each group. A hierarchical model provides a compromise between these two extremes. Hierarchical models extend single-level regression to data with a nested structure, whereby the model parameters vary at different levels in the model, including a lower level that describes the actual data and an upper level that influences the values taken by parameters in the lower level.

    The county-level models were compared using the Watanabe-Akaike information criterion (WAIC) which is derived from the log pointwise predictive density of the models and can be shown to approximate out-of-sample predictive performance. All script files are intended to be used with R statistical software (R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org) and Stan probabilistic modeling software (Stan Development Team. 2017. RStan: the R interface to Stan. R package version 2.16.2. http://mc-stan.org).
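    The difference between fully pooled and unpooled estimates can be seen with a one-covariate least-squares fit. The sketch below is in Python rather than the R and Stan used by the data release, covers only the two fixed-effects extremes (not the hierarchical model), and uses invented group names and values.

```python
def ols_fit(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical groups (e.g. climate regions) with a covariate x and response y.
GROUPS = {
    "arid": ([1, 2, 3], [2, 4, 6]),
    "humid": ([1, 2, 3], [3, 4, 5]),
}

# Fully pooled: ignore group membership and fit one coefficient to everything.
all_x = [x for xs, _ in GROUPS.values() for x in xs]
all_y = [y for _, ys in GROUPS.values() for y in ys]
pooled_slope, pooled_intercept = ols_fit(all_x, all_y)

# Unpooled: fit a separate coefficient using only each group's observations.
unpooled = {name: ols_fit(xs, ys) for name, (xs, ys) in GROUPS.items()}
```

    With these toy values the pooled slope (1.5) sits between the two group slopes (2.0 and 1.0). A hierarchical model would instead estimate group-level slopes that are drawn toward a shared distribution.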

  20. Data from: Interannual variation in climate contributes to contingency in...

    • data.niaid.nih.gov
    • data.nkn.uidaho.edu
    • +2more
    zip
    Updated May 19, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Allison Simler-Williamson; Cara Applestein; Matthew Germino (2022). Interannual variation in climate contributes to contingency in post-fire restoration outcomes in seeded sagebrush steppe [Dataset]. http://doi.org/10.25338/B87H16
    Explore at:
    zipAvailable download formats
    Dataset updated
    May 19, 2022
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Boise State University
    Authors
    Allison Simler-Williamson; Cara Applestein; Matthew Germino
    License

    https://spdx.org/licenses/CC0-1.0.html

    Description

    Interannual variation, especially weather, is an often-cited reason for restoration “failures”; yet its importance is difficult to experimentally isolate across broad spatiotemporal extents, due to correlations between weather and site characteristics. In the analysis associated with this dataset, we examined post-fire treatments within sagebrush-steppe ecosystems to ask: 1) Is weather following seeding efforts a primary reason why restoration outcomes depart from predictions? and 2) Does the management-relevance of weather differ across space and with time since treatment? This dataset integrates remotely sensed estimates of sagebrush (Artemisia spp.) cover from the RCMAP product (https://www.mrlc.gov/data-services-page), areas that received post-fire seeding, identified using the Land Treatment Digital Library (https://ltdl.wr.usgs.gov/), and GridMet surface meteorological data (https://www.climatologylab.org/gridmet.html) to describe the impacts of weather on sagebrush recovery following restoration treatments.

    Methods

    This dataset integrates remotely sensed estimates of sagebrush (Artemisia spp.) cover from the RCMAP product (https://www.mrlc.gov/data-services-page), areas that received post-fire seeding, identified using the Land Treatment Digital Library (LTDL; https://ltdl.wr.usgs.gov/), and GridMet surface meteorological data (https://www.climatologylab.org/gridmet.html) to describe the impacts of weather on sagebrush recovery following restoration treatments.

    We identified observations from the LTDL in which at least one Artemisia species had been seeded following fire, within the extent covered by the RCMAP (NLCD back-in-time sagebrush cover) that burned between 1980 and 2005 and that were subsequently seeded. We then removed all areas that burned or were seeded multiple times between 1980 and 2015. We then selected all RCMAP pixels that overlapped these burned, seeded areas and extracted sagebrush cover for all years of the record for each pixel. Data was processed in chunks, due to the large number of pixels included in analysis.

    In order to reduce the data dimensions and redundancy, we next clustered the pixel data using the algorithm spatially contiguous multivariate clustering in ArcGIS. The number of clusters was set at 1/1000 of the initial number of pixels and the spatial constraint was set to contiguity edges only. The analysis fields (data attributes upon which the algorithm was run to decide on cluster membership) were elevation, TWI, heatload, Level 3 ecoregion (coded as a dummy variable), and slope. If the algorithm failed with the initial number of clusters, the number of clusters was increased by 10% until the algorithm would run. We did allow for spatial non-contiguous clusters in the case that the algorithm was not solvable with contiguous clusters only.

    Post-processing of the clusters included checking to make sure the relative standard error for elevation was less than 20% within a cluster and to screen for multiple fires being combined into one cluster. If multiple fires were combined into a cluster initially, they were separated into different clusters. We also assessed whether dividing the data into chunks significantly influenced the clustering process. Comparisons of the data chunks suggested that each chunk had a similar distribution of relative standard deviations for elevation, slope, heatload, and TWI among the clusters contained within it.

    In R, using the extract function in the raster package (Hijmans & van Etten, 2012), we extracted sagebrush cover for each year following fire and the following GridMet variables using the centerpoint of each RCMAP pixel as the point to extract to: daily precipitation, minimum temperature, and maximum temperature for the February-April in the first four years after fire, 30-year climate means, and monthly SPEI for the two years before and the four years after fire (calculated from the SPEI package in R). We extracted additional covariates for each pixel, which included elevation, TWI, heatload, Level 3 ecoregion, and slope. We then described the mean characteristics of each cluster for each of these variables. For each climate variable, we calculated each year's deviation (mean - year's observation) from the long-term (30 year) mean.

    This process resulted in the dataset entitled “longtermsage.csv”. For the autoregressive model (Question 3 in associated manuscript), we formatted the data to allow for statistical modeling of annual changes in sagebrush cover, in a second dataset entitled “growthannualsage.csv”. Specific variable names are described in the README file.
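    Two of the numeric rules in the methods are easy to misread, so a small sketch may help: the cluster count starts at 1/1000 of the pixel count and grows by 10% after each failed run, and each year's climate deviation is the long-term mean minus that year's observation. The sketch below is in Python, not the ArcGIS and R workflow the authors used. The run_clustering callable is a hypothetical stand-in for the ArcGIS tool, and compounding the 10% increase on the current count (rather than the initial one) is an assumption.

```python
def cluster_with_retry(n_pixels, run_clustering, max_attempts=50):
    """Start at n_pixels / 1000 clusters; grow the count by 10% after each failure.

    `run_clustering(k)` is a hypothetical callable that raises RuntimeError
    when the clustering cannot be solved with k clusters.
    """
    k = max(1, n_pixels // 1000)
    for _ in range(max_attempts):
        try:
            return k, run_clustering(k)
        except RuntimeError:
            # Integer ceiling of k * 1.1, always advancing by at least one.
            k = max(k + 1, (k * 11 + 9) // 10)
    raise RuntimeError("clustering did not succeed within max_attempts")


def deviations_from_mean(long_term_mean, yearly_values):
    """Deviation per year, defined in the text as (mean - year's observation)."""
    return {year: long_term_mean - value for year, value in yearly_values.items()}


def fake_clustering(k):
    """Toy stand-in that only succeeds with at least 12 clusters."""
    if k < 12:
        raise RuntimeError("not solvable")
    return list(range(k))


final_k, labels = cluster_with_retry(10_000, fake_clustering)
deviations = deviations_from_mean(10.0, {2001: 12.0, 2002: 7.5})
```

    Note the sign convention: a year wetter or warmer than the long-term mean gets a negative deviation under this definition.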
