100+ datasets found
  1. P

    Film (60%/20%/20% random splits) Dataset

    • library.toponeai.link
    • paperswithcode.com
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Film (60%/20%/20% random splits) Dataset [Dataset]. https://library.toponeai.link/dataset/film-60-20-20-random-splits
    Explore at:
    Description

    Node classification on Film with 60%/20%/20% random splits for training/validation/test.

  2. d

    Training dataset for NABat Machine Learning V1.0

    • catalog.data.gov
    • data.usgs.gov
    • +1more
    Updated Jul 6, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    U.S. Geological Survey (2024). Training dataset for NABat Machine Learning V1.0 [Dataset]. https://catalog.data.gov/dataset/training-dataset-for-nabat-machine-learning-v1-0
    Explore at:
    Dataset updated
    Jul 6, 2024
    Dataset provided by
    U.S. Geological Survey
    Description

    Bats play crucial ecological roles and provide valuable ecosystem services, yet many populations face serious threats from various ecological disturbances. The North American Bat Monitoring Program (NABat) aims to assess status and trends of bat populations while developing innovative and community-driven conservation solutions using its unique data and technology infrastructure. To support scalability and transparency in the NABat acoustic data pipeline, we developed a fully-automated machine-learning algorithm. This dataset includes audio files of bat echolocation calls that were considered to develop V1.0 of the NABat machine-learning algorithm, however the test set (i.e., holdout dataset) has been excluded from this release. These recordings were collected by various bat monitoring partners across North America using ultrasonic acoustic recorders for stationary acoustic and mobile acoustic surveys. For more information on how these surveys may be conducted, see Chapters 4 and 5 of “A Plan for the North American Bat Monitoring Program” (https://doi.org/10.2737/SRS-GTR-208). These data were then post-processed by bat monitoring partners to remove noise files (or those that do not contain recognizable bat calls) and apply a species label to each file. There is undoubtedly variation in the steps that monitoring partners take to apply a species label, but the steps documented in “A Guide to Processing Bat Acoustic Data for the North American Bat Monitoring Program” (https://doi.org/10.3133/ofr20181068) include first processing with an automated classifier and then manually reviewing to confirm or downgrade the suggested species label. Once a manual ID label was applied, audio files of bat acoustic recordings were submitted to the NABat database in Waveform Audio File format. From these available files in the NABat database, we considered files from 35 classes (34 species and a noise class). Files for 4 species were excluded due to low sample size (Corynorhinus rafinesquii, N=3; Eumops floridanus, N =3; Lasiurus xanthinus, N = 4; Nyctinomops femorosaccus, N =11). From this pool, files were randomly selected until files for each species/grid cell combination were exhausted or the number of recordings reach 1250. The dataset was then randomly split into training, validation, and test sets (i.e., holdout dataset). This data release includes all files considered for training and validation, including files that had been excluded from model development and testing due to low sample size for a given species or because the threshold for species/grid cell combinations had been met. The test set (i.e., holdout dataset) is not included. Audio files are grouped by species, as indicated by the four-letter species code in the name of each folder. Definitions for each four-letter code, including Family, Genus, Species, and Common name, are also included as a dataset in this release.

  3. Z

    Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Explore at:
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The python code for splitting a dataset into train, test and validation set.

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The python code for translating the cleaned dataset.

  4. Pubmed Journal Recommendation System dataset

    • zenodo.org
    • data.niaid.nih.gov
    csv
    Updated Dec 18, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jiayun Liu; Manuel Castillo Cara; Manuel Castillo Cara; RaĂşl GarcĂ­a Castro; RaĂşl GarcĂ­a Castro; Jiayun Liu (2023). Pubmed Journal Recommendation System dataset [Dataset]. http://doi.org/10.5281/zenodo.8386011
    Explore at:
    csvAvailable download formats
    Dataset updated
    Dec 18, 2023
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Jiayun Liu; Manuel Castillo Cara; Manuel Castillo Cara; RaĂşl GarcĂ­a Castro; RaĂşl GarcĂ­a Castro; Jiayun Liu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset for Journal recommendation, includes title, abstract, keywords, and journal.

    We extracted the journals and more information of:

    Jiasheng Sheng. (2022). PubMed-OA-Extraction-dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6330817.

    Dataset Components:

    • data_pubmed_all: This dataset encompasses all articles, each containing the following columns: 'pubmed_id', 'title', 'keywords', 'journal', 'abstract', 'conclusions', 'methods', 'results', 'copyrights', 'doi', 'publication_date', 'authors', 'AKE_pubmed_id', 'AKE_pubmed_title', 'AKE_abstract', 'AKE_keywords', 'File_Name'.

    • data_pubmed: To focus on recent and relevant publications, we have filtered this dataset to include articles published within the last five years, from January 1, 2018, to December 13, 2022—the latest date in the dataset. Additionally, we have exclusively retained journals with more than 200 published articles, resulting in 262,870 articles from 469 different journals.

    • data_pubmed_train, data_pubmed_val, and data_pubmed_test: For machine learning and model development purposes, we have partitioned the 'data_pubmed' dataset into three subsets—training, validation, and test—using a random 60/20/20 split ratio. Notably, this division was performed on a per-journal basis, ensuring that each journal's articles are proportionally represented in the training (60%), validation (20%), and test (20%) sets. The resulting partitions consist of 157,540 articles in the training set, 52,571 articles in the validation set, and 52,759 articles in the test set.

  5. Z

    Data from: Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes,...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 6, 2021
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4903957
    Explore at:
    Dataset updated
    Jun 6, 2021
    Dataset provided by
    Kauwe, K. Steven
    Sparks, D. Taylor
    Henderson, N. Ashley
    Description

    This benchmark data is comprised of 50 different datasets for materials properties obtained from 16 previous publications. The data contains both experimental and computational data, data suited for regression as well as classification, sizes ranging from 12 to 6354 samples, and materials systems spanning the diversity of materials research. In addition to cleaning the data where necessary, each dataset was split into train, validation, and test splits.

    For datasets with more than 100 values, train-val-test splits were created, either with a 5-fold or 10-fold cross-validation method, depending on what each respective paper did in their studies. Datasets with less than 100 values had train-test splits created using the Leave-One-Out cross-validation method.

    For further information, as well as directions on how to access the data, please go to the corresponding GitHub repository: https://github.com/anhender/mse_ML_datasets/tree/v1.0

  6. P

    WDC LSPM Dataset

    • paperswithcode.com
    Updated May 31, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    WDC LSPM Dataset [Dataset]. https://paperswithcode.com/dataset/wdc-products
    Explore at:
    Dataset updated
    May 31, 2022
    Description

    Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.

    In order to support the evaluation of machine learning-based matching methods, the data is split into training, validation and test sets. For each product category, we provide training sets in four different sizes (2.000-70.000 pairs). Furthermore there are sets of ids for each training set for a possible validation split (stratified random draw) available. The test set for each product category consists of 1.100 product pairs. The labels of the test sets were manually checked while those of the training sets were derived using shared product identifiers from the Web via weak supervision.

    The data stems from the WDC Product Data Corpus for Large-Scale Product Matching - Version 2.0 which consists of 26 million product offers originating from 79 thousand websites.

  7. R

    Data from: Fashion Mnist Dataset

    • universe.roboflow.com
    • opendatalab.com
    • +4more
    zip
    Updated Aug 10, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Popular Benchmarks (2022). Fashion Mnist Dataset [Dataset]. https://universe.roboflow.com/popular-benchmarks/fashion-mnist-ztryt/model/3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Aug 10, 2022
    Dataset authored and provided by
    Popular Benchmarks
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Clothing
    Description

    Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

    Authors:

    Dataset Obtained From: https://github.com/zalandoresearch/fashion-mnist

    All images were sized 28x28 in the original dataset

    Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits. * Source

    Here's an example of how the data looks (each class takes three-rows): https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png" alt="Visualized Fashion MNIST dataset">

    Version 1 (original-images_Original-FashionMNIST-Splits):

    • Original images, with the original splits for MNIST: train (86% of images - 60,000 images) set and test (14% of images - 10,000 images) set only.
    • This version was not trained

    Version 3 (original-images_trainSetSplitBy80_20):

    • Original, raw images, with the train set split to provide 80% of its images to the training set and 20% of its images to the validation set
    • https://blog.roboflow.com/train-test-split/ https://i.imgur.com/angfheJ.png" alt="Train/Valid/Test Split Rebalancing">

    Citation:

    @online{xiao2017/online,
     author    = {Han Xiao and Kashif Rasul and Roland Vollgraf},
     title    = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
     date     = {2017-08-28},
     year     = {2017},
     eprintclass = {cs.LG},
     eprinttype  = {arXiv},
     eprint    = {cs.LG/1708.07747},
    }
    
  8. Data from: Time-Split Cross-Validation as a Method for Estimating the...

    • acs.figshare.com
    txt
    Updated Jun 2, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Robert P. Sheridan (2023). Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. [Dataset]. http://doi.org/10.1021/ci400084k.s001
    Explore at:
    txtAvailable download formats
    Dataset updated
    Jun 2, 2023
    Dataset provided by
    ACS Publications
    Authors
    Robert P. Sheridan
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Cross-validation is a common method to validate a QSAR model. In cross-validation, some compounds are held out as a test set, while the remaining compounds form a training set. A model is built from the training set, and the test set compounds are predicted on that model. The agreement of the predicted and observed activity values of the test set (measured by, say, R2) is an estimate of the self-consistency of the model and is sometimes taken as an indication of the predictivity of the model. This estimate of predictivity can be optimistic or pessimistic compared to true prospective prediction, depending how compounds in the test set are selected. Here, we show that time-split selection gives an R2 that is more like that of true prospective prediction than the R2 from random selection (too optimistic) or from our analog of leave-class-out selection (too pessimistic). Time-split selection should be used in addition to random selection as a standard for cross-validation in QSAR model building.

  9. Z

    ProxyFAUG: Proximity-based Fingerprint Augmentation (data)

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jun 13, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anagnostopoulos, Grigorios (2022). ProxyFAUG: Proximity-based Fingerprint Augmentation (data) [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_4457390
    Explore at:
    Dataset updated
    Jun 13, 2022
    Dataset provided by
    Kalousis, Alexandros
    Anagnostopoulos, Grigorios
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The supplementary data of the paper "ProxyFAUG: Proximity-based Fingerprint Augmentation".

    Open access Author’s accepted manuscript version: https://arxiv.org/abs/2102.02706v2

    Published paper: https://ieeexplore.ieee.org/document/9662590

    The train/validation/test sets used in the paper "ProxyFAUG: Proximity-based Fingerprint Augmentation", after having passed the preprocessing process described in the paper, are made available here. Moreover, the augmentations produced by the proposed ProxyFAUG method are also made available with the files (x_aug_train.csv, y_aug_train.csv). More specifically:

    x_train_pre.csv : The features side (x) information of the preprocessed training set.

    x_val_pre.csv : The features side (x) information of the preprocessed validation set.

    x_test_pre.csv : The features side (x) information of the preprocessed test set.

    x_aug_train.csv : The features side (x) information of the fingerprints generated by ProxyFAUG.

    y_train.csv : The location ground truth information (y) of the training set.

    y_val.csv : The location ground truth information (y) of the validation set.

    y_test.csv : The location ground truth information (y) of the test set.

    y_aug_train.csv : The location ground truth information (y) of the fingerprints generated by ProxyFAUG.

    Note that in the paper, the original training set (x_train_pre.csv) is used as a baseline, and is compared against the scenario where the concatenation of the original and the generated training sets (concatenation of x_train_pre.csv and x_aug_train.csv) is used.

    The full code implementation related to the paper is available here:

    Code: https://zenodo.org/record/4457353

    The original full dataset used in this study, is the public dataset sigfox_dataset_antwerp.csv which can be access here:

    https://zenodo.org/record/3904158#.X4_h7y8RpQI

    The above link is related to the publication "Sigfox and LoRaWAN Datasets for Fingerprint Localization in Large Urban and Rural Areas", in which the original full dataset was published. The publication is available here:

    http://www.mdpi.com/2306-5729/3/2/13

    The credit for the creation of the original full dataset goes to Aernouts, Michiel; Berkvens, Rafael; Van Vlaenderen, Koen; and Weyn, Maarten.

    The train/validation/test split of the original dataset that is used in this paper, is taken from our previous work "A Reproducible Analysis of RSSI Fingerprinting for Outdoors Localization Using Sigfox: Preprocessing and Hyperparameter Tuning". Using the same train/validation/test split in different works strengthens the consistency of the comparison of results. All relevant material of that work is listed below:

    Preprint: https://arxiv.org/abs/1908.06851

    Paper: https://ieeexplore.ieee.org/document/8911792

    Code: https://zenodo.org/record/3228752

    Data: https://zenodo.org/record/3228744

  10. wcep_dense_oracle

    • huggingface.co
    Updated Dec 12, 2023
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Ai2 (2023). wcep_dense_oracle [Dataset]. https://huggingface.co/datasets/allenai/wcep_dense_oracle
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Dec 12, 2023
    Dataset provided by
    Allen Institute for AIhttp://allenai.org/
    Authors
    Ai2
    License

    https://choosealicense.com/licenses/other/https://choosealicense.com/licenses/other/

    Description

    This is a copy of the WCEP-10 dataset, except the input source documents of the train, validation, and test splits have been replaced by a dense retriever. The retrieval pipeline used:

    query: The summary field of each example corpus: The union of all documents in the train, validation and test splits retriever: facebook/contriever-msmarco via PyTerrier with default settings top-k strategy: "oracle", i.e. the number of documents retrieved, k, is set as the original number of input documents… See the full description on the dataset page: https://huggingface.co/datasets/allenai/wcep_dense_oracle.

  11. R

    License Plates Dataset

    • universe.roboflow.com
    zip
    Updated Oct 15, 2022
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Samrat Sahoo (2022). License Plates Dataset [Dataset]. https://universe.roboflow.com/samrat-sahoo/license-plates-f8vsn/model/3
    Explore at:
    zipAvailable download formats
    Dataset updated
    Oct 15, 2022
    Dataset authored and provided by
    Samrat Sahoo
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Variables measured
    Plates Bounding Boxes
    Description

    Overview

    The License Plates dataset is a object detection dataset of different vehicles (i.e. cars, vans, etc.) and their respective license plate. Annotations also include examples of "vehicle" and "license-plate". This dataset has a train/validation/test split of 245/70/35 respectively. https://i.imgur.com/JmRgjBq.png" alt="Dataset Example">

    Use Cases

    This dataset could be used to create a vehicle and license plate detection object detection model. Roboflow provides a great guide on creating a license plate and vehicle object detection model.

    Using this Dataset

    This dataset is a subset of the Open Images Dataset. The annotations are licensed by Google LLC under CC BY 4.0 license. Some annotations have been combined or removed using Roboflow's annotation management tools to better align the annotations with the purpose of the dataset. The images have a CC BY 2.0 license.

    About Roboflow

    Roboflow creates tools that make computer vision easy to use for any developer, even if you're not a machine learning expert. You can use it to organize, label, inspect, convert, and export your image datasets. And even to train and deploy computer vision models with no code required. https://i.imgur.com/WHFqYSJ.png" alt="https://roboflow.com">

  12. Web Data Commons Phones Dataset, Augmented Version, Fixed Splits

    • linkagelibrary.icpsr.umich.edu
    delimited
    Updated Nov 23, 2020
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Anna Primpeli; Christian Bizer (2020). Web Data Commons Phones Dataset, Augmented Version, Fixed Splits [Dataset]. http://doi.org/10.3886/E127243V1
    Explore at:
    delimitedAvailable download formats
    Dataset updated
    Nov 23, 2020
    Dataset provided by
    University of Mannheim (Germany)
    Authors
    Anna Primpeli; Christian Bizer
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Motivation: Entity Matching is the task of determining which records from different data sources describe the same real-world entity. It is an important task for data integration and has been the focus of many research works. A large number of entity matching/record linkage tasks has been made available for evaluating entity matching methods. However, the lack of fixed development and test splits as well as correspondence sets including both matching and non-matching record pairs hinders the reproducibility and comparability of benchmark experiments. In an effort to enhance the reproducibility and comparability of the experiments, we complement existing entity matching benchmark tasks with fixed sets of non-matching pairs as well as fixed development and test splits. Dataset Description: An augmented version of the wdc phones dataset for benchmarking entity matching/record linkage methods found at:http://webdatacommons.org/productcorpus/index.html#toc4 The augmented version adds fixed splits for training, validation and testing as well as their corresponding feature vectors. The feature vectors are built using data type specific similarity metrics.The dataset contains 447 records describing products deriving from 17 e-shops which are matched against a product catalog of 50 products. The gold standards have manual annotations for 258 matching and 22,092 non-matching pairs. The total number of attributes used to decribe the product records are 26 while the attribute density is 0.25. The augmented dataset enhances the reproducibility of matching methods and the comparability of matching results. The dataset is part of the CompERBench repository which provides 21 complete benchmark tasks for entity matching for public download: http://data.dws.informatik.uni-mannheim.de/benchmarkmatchingtasks/index.html

  13. u

    Gaussian Process kernels comparison - Datasets and python code

    • figshare.unimelb.edu.au
    bin
    Updated Jun 24, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Jiabo Lu; Niels Fraehr; QJ Wang; Xiaohua Xiang; Xiaoling Wu (2024). Gaussian Process kernels comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/26087719.v1
    Explore at:
    binAvailable download formats
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    Jiabo Lu; Niels Fraehr; QJ Wang; Xiaohua Xiang; Xiaoling Wu
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    OverviewData used for publication in "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". We investigate the impact of 13 Gaussian Process (GP) kernels, consisting of five single kernels and eight composite kernels, on the prediction accuracy and computational efficiency of the Low-fidelity, Spatial analysis, and Gaussian process learning (LSG) modelling approach. The GP kernels are compared for three distinct case studies namely Carlisle (United Kingdom), Chowilla floodplain (Australia), and Burnett River (Australia). The high- and low-fidelity model simulation results are obtained from the data repository Fraehr, N. (2024, January 19). Surrogate flood model comparison - Datasets and python code (Version 1). The University of Melbourne. https://doi.org/10.26188/24312658.v1.Dataset structureThe dataset is structured in 5 file folders:CarlisleChowillaBurnettRVComparison_resultsPython_dataThe first three folders contain simulation data and analysis codes. The "Comparison_results" folder contains plotting codes, figures and tables for comparison results. The "Python_data" folder contains LSG model functions and Python environment requirement.Carlisle, Chowilla, and BurnettRVThese files contain high- and low-fidelity hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the LSG model with different GP kernels in each case study. There are only small differences between each folder, depending on the hydrodynamic model simulation results and EOF analysis results.Each case study file has the following folders:Geometry_dataDEM files.npz files containing of the high-fidelity models grid (XYZ-coordinates) and areas (Same data is available for the low-fidelity model used in the LSG model).shp files indicating location of boundaries and main flow pathsXXX_modeldataFolder to storage trained model data for each XXX kernel LSG model. For example, EXP_modeldata contains files used to store the trainined LSG model using exponential Gaussian Process kernel.ME3LIN means ME3 + LIN. ME3mLIN means ME3 x LIN.EXPLow mean inducing points percentage for Sparse GP is 5%.EXPMid mean inducing points percentage for Sparse GP is 15%.EXPHigh mean inducing points percentage for Sparse GP is 35%.EXPFULL mean inducing points percentage for Sparse GP is 100%.HD_model_dataHigh-fidelity simulation results for all flood events of that case studyLow-fidelity simulation results for all flood events of that case studyAll boundary input conditionsHF_EOF_analysisStoring of data used in the EOF analysis for the LSG model.Results_dataStoring results of running the evaluation of the LSG models with different GP kernel candidates.Train_test_split_dataThe train-test-validation data split is the same for all LSG models with different GP kernel candidates. The specific split for each cross-validation fold is stored in this folder.YYY_event_summary.csv, YYY_Extrap_event_summary.csvFiles containing overview of all events, and which events are connected between the low- and high-fidelity models for each YYY case study.EOF_analysis_HFdata_preprocessing.py, EOF_analysis_HFdata.pyPreprocessing before EOF analysis and the EOF analysis of the high-fidelity data.Evaluation.py, Evaluation_extrap.pyScripts for evaluating the LSG model for that case study and saving the results for each cross-validation fold.train_test_split.pyScript for splitting the flood datasets for each cross-validation fold, so all LSG models with different GP kernel candidates train on the same data.XXX_training.pyScript for training each LSG model using the XXX GP kernel.ME3LIN means ME3 + LIN. ME3mLIN means ME3 x LIN.EXPLow mean inducing points percentage for Sparse GP is 5%.EXPMid mean inducing points percentage for Sparse GP is 15%.EXPHigh mean inducing points percentage for Sparse GP is 35%.EXPFULL mean inducing points percentage for Sparse GP is 100%.XXX_training.batBatch scripts for training all LSG models using different GP kernel candidates.Comparison_resultsFiles used for comparing LSG models using different GP kernel candidates and generate the figures in the paper "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". Figures are also included.Python_dataFolder containing Python script with utility functions for setting up, training, and running the LSG models, as well as for evaluating the LSG models. Python environmentThis folder also contains two python environment file with all Python package versions and dependencies. You can install CPU version or GPU version of environment. GPU version environment can use GPU to speed up the GPflow training process. It will install cuda and CUDnn package.You can choose to install environment online or offline. Offline installation reduces dependency issues, but it requires that you also use the same Windows 10 operating system as I do.Online installationLSG_CPU_environment.yml: python environment for running LSG models using CPU of the computerLSG_GPU_environment.yml: python environment for running LSG models using GPU of the computer, mainly using GPU to speed up the GPflow training process. It need to install cuda and CUDnn package.In the directory where the .yml file is located, use the console to enter the following commandconda env create -f LSG_CPU_environment.yml -n myenv_nameorconda env create -f LSG_GPU_environment.yml -n myenv_nameOffline installationIf you also use Windows 10 system as I do, you can directly unzip environment packed by conda-pack.LSG_CPU.tar.gz: Zip file containing all packages in the virtual environment for CPU onlyLSG_GPU.tar.gz: Zip file containing all packages in the virtual environment for GPU accelerationIn Windows system, create a new LSG_CPU or LSG_GPU folder in the Anaconda environment folder and extract the packaged LSG_CPU.tar.gz or LSG_GPU.tar.gz file into that folder.tar -xzvf LSG_CPU.tar.gz -C ./LSG_CPUortar -xzvf LSG_GPU.tar.gz -C ./LSG_GPUAccess to the environment pathcd ./LSG_GPUactivation environment.\Scripts\activate.batRemove prefixes from the activation environment.\Scripts\conda-unpack.exeExit environment.\Scripts\deactivate.batLSG_mods_and_funcPython scripts for using the LSG model.Evaluation_metrics.pyMetrics used to evaluate the prediction accuracy and computational efficiency of the LSG models.

  14. OGBN-Products (Processed for PyG)

    • kaggle.com
    Updated Feb 27, 2021
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Redao da Taupl (2021). OGBN-Products (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbn-products/data
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 27, 2021
    Dataset provided by
    Kagglehttp://kaggle.com/
    Authors
    Redao da Taupl
    Description

    OGBN-Products

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-products

    Usage in Python

    import os.path as osp
    import pandas as pd
    import datatable as dt
    import torch
    import torch_geometric as pyg
    from ogb.nodeproppred import PygNodePropPredDataset
    
    class PygOgbnProducts(PygNodePropPredDataset):
      def _init_(self, meta_csv = None):
        root, name, transform = '/kaggle/input', 'ogbn-products', None
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super()._init_(name = name, root = root, transform = transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        if osp.isfile(os.path.join(path, 'split_dict.pt')):
          return torch.load(os.path.join(path, 'split_dict.pt'))
        if self.is_hetero:
          train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
          for nodetype in train_idx_dict.keys():
            train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
            valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
            test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
            return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
          train_idx = dt.fread(osp.join(path, 'train.csv'), header = None).to_numpy().T[0]
          train_idx = torch.from_numpy(train_idx).to(torch.long)
          valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = None).to_numpy().T[0]
          valid_idx = torch.from_numpy(valid_idx).to(torch.long)
          test_idx = dt.fread(osp.join(path, 'test.csv'), header = None).to_numpy().T[0]
          test_idx = torch.from_numpy(test_idx).to(torch.long)
          return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}
    
    dataset = PygOgbnProducts()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object
    

    Description

    Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold in Amazon, and edges between two products indicate that the products are purchased together. The authors follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions followed by a Principal Component Analysis to reduce the dimension to 100.

    Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

    Dataset splitting: The authors consider a more challenging and realistic dataset splitting that differs from the one used in [2] Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without use of a validation set), use the sales ranking (popularity) to split nodes into training/validation/test sets. Specifically, the authors sort the products according to their sales ranking and use the top 8% for training, next top 2% for validation, and the rest for testing. This is a more challenging splitting procedure that closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.

    Note 1: A very small number of self-connecting edges are repeated (see here); you may remove them if necessary.

    Note 2: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.

    Summary

    Package#Nodes#EdgesSplit TypeTask TypeMetric
    ogb>=1.1.12,449,02961,859,140Sales rankMulti-class classificationAccuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] http://manikvarma.org/downloads/XC/XMLRepository.html [2] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 257–266, 2019. [3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    License: Amazon License

    By accessing the Amazon Customer Reviews Library ("Reviews Library"), you agree that the Reviews Library is an Amazon Service subject to the Amazon.com Conditions of Use (https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088) and you agree to be bound by them, with the following additional conditions: In addition to the license rights granted under the Conditions of Use, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable, revocable license to access and use the Reviews Library for purposes of academic research. You may not resell, republish, or make any commercial use of the Reviews Library or its contents, including use of the Reviews Library for commercial research, such as research related to a funding or consultancy contract, internship, or other relationship in which the results are provided for a fee or delivered to a for-profit organization. You may not (a) link or associate content in the Reviews Library with any personal information (including Amazon customer accounts), or (b) attempt to determine the identity of the author of any content in the Reviews Library. If you violate any of the foregoing conditions, your license to access and use the Reviews Library will automatically terminate without prejudice to any of the other rights or remedies Amazon may have.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.

  15. Data for the manuscript "Spatially resolved uncertainties for machine...

    • zenodo.org
    • data.niaid.nih.gov
    application/gzip, bin +1
    Updated May 3, 2024
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen (2024). Data for the manuscript "Spatially resolved uncertainties for machine learning potentials" [Dataset]. http://doi.org/10.5281/zenodo.11093925
    Explore at:
    application/gzip, bin, text/x-pythonAvailable download formats
    Dataset updated
    May 3, 2024
    Dataset provided by
    Zenodohttp://zenodo.org/
    Authors
    Esther Heid; Esther Heid; Johannes Schörghuber; Johannes Schörghuber; Ralf Wanzenböck; Ralf Wanzenböck; Georg K. H. Madsen; Georg K. H. Madsen
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:

    • mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).

    • aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.

    • data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].

    • data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.

    • data_h2o.tar.gz contains:

      • full_db.extxyz: The full dataset of 1.5k structures.

      • iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.

      • the subfolders in the folders random and uncertain contain the training and validation sets for the random and uncertainty-based active learning loops.

  16. P

    MS COCO Dataset

    • paperswithcode.com
    Updated Apr 15, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár, MS COCO Dataset [Dataset]. https://paperswithcode.com/dataset/coco
    Explore at:
    Dataset updated
    Apr 15, 2024
    Authors
    Tsung-Yi Lin; Michael Maire; Serge Belongie; Lubomir Bourdev; Ross Girshick; James Hays; Pietro Perona; Deva Ramanan; C. Lawrence Zitnick; Piotr Dollár
    Description

    The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.

    Splits: The first version of MS COCO dataset was released in 2014. It contains 164K images split into training (83K), validation (41K) and test (41K) sets. In 2015 additional test set of 81K images was released, including all the previous test images and 40K new images.

    Based on community feedback, in 2017 the training/validation split was changed from 83K/41K to 118K/5K. The new split uses the same images and annotations. The 2017 test set is a subset of 41K images of the 2015 test set. Additionally, the 2017 release contains a new unannotated dataset of 123K images.

    Annotations: The dataset has annotations for

    object detection: bounding boxes and per-instance segmentation masks with 80 object categories, captioning: natural language descriptions of the images (see MS COCO Captions), keypoints detection: containing more than 200,000 images and 250,000 person instances labeled with keypoints (17 possible keypoints, such as left eye, nose, right hip, right ankle), stuff image segmentation – per-pixel segmentation masks with 91 stuff categories, such as grass, wall, sky (see MS COCO Stuff), panoptic: full scene segmentation, with 80 thing categories (such as person, bicycle, elephant) and a subset of 91 stuff categories (grass, sky, road), dense pose: more than 39,000 images and 56,000 person instances labeled with DensePose annotations – each labeled person is annotated with an instance id and a mapping between image pixels that belong to that person body and a template 3D model. The annotations are publicly available only for training and validation images.

  17. f

    Number of insect classes in the dataset used for training, validation and...

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Kim Bjerge; Jamie Alison; Mads Dyrmann; Carsten Eie Frigaard; Hjalte M. R. Mann; Toke Thomas Høye (2023). Number of insect classes in the dataset used for training, validation and test. [Dataset]. http://doi.org/10.1371/journal.pstr.0000051.t001
    Explore at:
    xlsAvailable download formats
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS Sustainability and Transformation
    Authors
    Kim Bjerge; Jamie Alison; Mads Dyrmann; Carsten Eie Frigaard; Hjalte M. R. Mann; Toke Thomas Høye
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Insects were annotated from the images with Sedum plants. The background class contains images without any insects. A train and validation dataset with six species (6cls) was used to train the first YOLO model. A benchmark training and validation dataset with an approximate split percentages of 85/15 with nine species (9cls) was used to train the final evaluated YOLO models. The test component of the benchmark dataset contains additional unfamiliar classes 10–19 that contain other species of bees, flies and beetles, unclear insects and other animals. Classes 1–9 are listed in brackets next to their closest relative of the unfamiliar classes 10–19.

  18. h

    code-knowledge-eval

    • huggingface.co
    Updated Oct 26, 2024
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    code-knowledge-eval [Dataset]. https://huggingface.co/datasets/kimsan0622/code-knowledge-eval
    Explore at:
    CroissantCroissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Oct 26, 2024
    Authors
    san kim
    Description

    Code Knowledge Value Evaluation Dataset

    This dataset is created by evaluating the knowledge value of code sourced from the bigcode/the-stack repository. It is designed to assess the educational and knowledge potential of different code samples.

      Dataset Overview
    

    The dataset is split into training, validation, and test sets with the following number of samples:

    Training set: 22,786 samples Validation set: 4,555 samples Test set: 18,232 samples

      Usage… See the full description on the dataset page: https://huggingface.co/datasets/kimsan0622/code-knowledge-eval.
    
  19. m

    pinterest_dataset

    • data.mendeley.com
    Updated Oct 27, 2017
    + more versions
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    pinterest_dataset [Dataset]. https://data.mendeley.com/datasets/fs4k2zc5j5/2
    Explore at:
    Dataset updated
    Oct 27, 2017
    Authors
    Juan Carlos Gomez
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Dataset with 72000 pins from 117 users in Pinterest. Each pin contains a short raw text and an image. The images are processed using a pretrained Convolutional Neural Network and transformed into a vector of 4096 features.

    This dataset was used in the paper "User Identification in Pinterest Through the Refinement of a Cascade Fusion of Text and Images" to idenfity specific users given their comments. The paper is publishe in the Research in Computing Science Journal, as part of the LKE 2017 conference. The dataset includes the splits used in the paper.

    There are nine files. text_test, text_train and text_val, contain the raw text of each pin in the corresponding split of the data. imag_test, imag_train and imag_val contain the image features of each pin in the corresponding split of the data. train_user and val_test_users contain the index of the user of each pin (between 0 and 116). There is a correspondance one-to-one among the test, train and validation files for images, text and users. There are 400 pins per user in the train set, and 100 pins per user in the validation and test sets each one.

    If you have questions regarding the data, write to: jc dot gomez at ugto dot mx

  20. a

    Tree Point Classification - New Zealand

    • geoportal-pacificcore.hub.arcgis.com
    • pacificgeoportal.com
    • +1more
    Updated Jul 25, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Eagle Technology Group Ltd (2022). Tree Point Classification - New Zealand [Dataset]. https://geoportal-pacificcore.hub.arcgis.com/content/0e2e3d0d0ef843e690169cac2f5620f9
    Explore at:
    Dataset updated
    Jul 25, 2022
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Description

    This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data.The classification of point cloud datasets to identify Trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response.Trees could have a complex irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results.This model is designed to extract Tree in both urban and rural area in New Zealand.The Training/Testing/Validation dataset are taken within New Zealand resulting of a high reliability to recognize the pattern of NZ common building architecture.Licensing requirementsArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS ProUsing the modelThe model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.InputThe model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification.Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc, similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. Therefore, it is important to make the full set of points available to the neural network while predicting - allowing it to better discriminate points of 'class of interest' versus background points. It is recommended to use 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and scenarios with false positives.The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc. which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see Training data section below).OutputThe model will classify the point cloud into the following classes with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS) described below: 0 Background 5 Trees / High-vegetationApplicable geographiesThe model is expected to work well in the New Zealand. It's seen to produce favorable results as shown in many regions. However, results can vary for datasets that are statistically dissimilar to training data.Training dataset - Wellington CityTesting dataset - Tawa CityValidation/Evaluation dataset - Christchurch City Dataset City Training Wellington Testing Tawa Validating ChristchurchModel architectureThis model uses the PointCNN model architecture implemented in ArcGIS API for Python.Accuracy metricsThe table below summarizes the accuracy of the predictions on the validation dataset. - Precision Recall F1-score Never Classified 0.991200 0.975404 0.983239 High Vegetation 0.933569 0.975559 0.954102Training dataThis model is trained on classified dataset originally provided by Open TopoGraphy with < 1% of manual labelling and correction.Train-Test split percentage {Train: 80%, Test: 20%} Chosen this ratio based on the analysis from previous epoch statistics which appears to have a descent improvementThe training data used has the following characteristics: X, Y, and Z linear unitMeter Z range-121.69 m to 26.84 m Number of Returns1 to 5 Intensity16 to 65520 Point spacing0.2 ± 0.1 Scan angle-15 to +15 Maximum points per block8192 Block Size20 Meters Class structure[0, 5]Sample resultsModel to classify a dataset with 5pts/m density Christchurch city dataset. The model's performance are directly proportional to the dataset point density and noise exlcuded point clouds.To learn how to use this model, see this story

Share
FacebookFacebook
TwitterTwitter
Email
Click to copy link
Link copied
Close
Cite
Film (60%/20%/20% random splits) Dataset [Dataset]. https://library.toponeai.link/dataset/film-60-20-20-random-splits

Film (60%/20%/20% random splits) Dataset

Explore at:
Description

Node classification on Film with 60%/20%/20% random splits for training/validation/test.

Search
Clear search
Close search
Google apps
Main menu