24 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Aug 8, 2022
    Cite
    Köhler, Juliane (2022). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6957841
    Dataset updated
    Aug 8, 2022
    Dataset authored and provided by
    Köhler, Juliane
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.

    Data_Cleaning.ipynb – The Jupyter Notebook with python code for the analysis and cleaning of the original dataset.

    ger_train.csv – The German training set as CSV file.

    ger_validation.csv – The German validation set as CSV file.

    en_test.csv – The English test set as CSV file.

    en_train.csv – The English training set as CSV file.

    en_validation.csv – The English validation set as CSV file.

    splitting.py – The python code for splitting a dataset into train, test and validation set.

    DataSetTrans_de.csv – The final German dataset as a CSV file.

    DataSetTrans_en.csv – The final English dataset as a CSV file.

    translation.py – The python code for translating the cleaned dataset.
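The file list above mentions a script that splits a dataset into train, test and validation sets. A minimal sketch of such a three-way split is shown here, assuming a random shuffle and 80/10/10 proportions. The function name, fractions and seed are hypothetical, and the actual splitting.py may work differently.

```python
import random

def three_way_split(rows, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle rows and cut them into train, validation and test lists.

    Hypothetical helper: the fractions and the seed are assumptions,
    not values taken from splitting.py.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything that remains
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets identical across runs, which matters when separate language versions of the same records need matching splits.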

  2. Gaussian Process kernels comparison - Datasets and python code

    • figshare.unimelb.edu.au
    bin
    Updated Jun 24, 2024
    Cite
    Jiabo Lu; Niels Fraehr; QJ Wang; Xiaohua Xiang; Xiaoling Wu (2024). Gaussian Process kernels comparison - Datasets and python code [Dataset]. http://doi.org/10.26188/26087719.v1
    Available download formats: bin
    Dataset updated
    Jun 24, 2024
    Dataset provided by
    The University of Melbourne
    Authors
    Jiabo Lu; Niels Fraehr; QJ Wang; Xiaohua Xiang; Xiaoling Wu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Overview

    Data used for publication in "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". We investigate the impact of 13 Gaussian Process (GP) kernels, consisting of five single kernels and eight composite kernels, on the prediction accuracy and computational efficiency of the Low-fidelity, Spatial analysis, and Gaussian process learning (LSG) modelling approach. The GP kernels are compared for three distinct case studies, namely Carlisle (United Kingdom), Chowilla floodplain (Australia), and Burnett River (Australia). The high- and low-fidelity model simulation results are obtained from the data repository Fraehr, N. (2024, January 19). Surrogate flood model comparison - Datasets and python code (Version 1). The University of Melbourne. https://doi.org/10.26188/24312658.v1.

    Dataset structure

    The dataset is structured in 5 file folders:

    • Carlisle
    • Chowilla
    • BurnettRV
    • Comparison_results
    • Python_data

    The first three folders contain simulation data and analysis codes. The "Comparison_results" folder contains plotting codes, figures and tables for comparison results. The "Python_data" folder contains LSG model functions and the Python environment requirements.

    Carlisle, Chowilla, and BurnettRV

    These folders contain high- and low-fidelity hydrodynamic modelling data for training and validation for each individual case study, as well as specific Python scripts for training and running the LSG model with different GP kernels in each case study. There are only small differences between the folders, depending on the hydrodynamic model simulation results and EOF analysis results.

    Each case study folder has the following contents:

    • Geometry_data: DEM files; .npz files containing the high-fidelity model's grid (XYZ coordinates) and areas (the same data is available for the low-fidelity model used in the LSG model); .shp files indicating the location of boundaries and main flow paths.
    • XXX_modeldata: Folder to store trained model data for each XXX kernel LSG model. For example, EXP_modeldata contains files used to store the trained LSG model using the exponential Gaussian Process kernel. ME3LIN means ME3 + LIN. ME3mLIN means ME3 x LIN. EXPLow means the inducing points percentage for the Sparse GP is 5%. EXPMid means the inducing points percentage for the Sparse GP is 15%. EXPHigh means the inducing points percentage for the Sparse GP is 35%. EXPFULL means the inducing points percentage for the Sparse GP is 100%.
    • HD_model_data: High-fidelity simulation results for all flood events of that case study; low-fidelity simulation results for all flood events of that case study; all boundary input conditions.
    • HF_EOF_analysis: Stores the data used in the EOF analysis for the LSG model.
    • Results_data: Stores the results of running the evaluation of the LSG models with different GP kernel candidates.
    • Train_test_split_data: The train-test-validation data split is the same for all LSG models with different GP kernel candidates. The specific split for each cross-validation fold is stored in this folder.
    • YYY_event_summary.csv, YYY_Extrap_event_summary.csv: Files containing an overview of all events, and which events are connected between the low- and high-fidelity models for each YYY case study.
    • EOF_analysis_HFdata_preprocessing.py, EOF_analysis_HFdata.py: Preprocessing before the EOF analysis and the EOF analysis of the high-fidelity data.
    • Evaluation.py, Evaluation_extrap.py: Scripts for evaluating the LSG model for that case study and saving the results for each cross-validation fold.
    • train_test_split.py: Script for splitting the flood datasets for each cross-validation fold, so all LSG models with different GP kernel candidates train on the same data.
    • XXX_training.py: Script for training each LSG model using the XXX GP kernel. The kernel names (ME3LIN, ME3mLIN, EXPLow, EXPMid, EXPHigh, EXPFULL) have the same meaning as for XXX_modeldata.
    • XXX_training.bat: Batch scripts for training all LSG models using different GP kernel candidates.

    Comparison_results

    Files used for comparing LSG models using different GP kernel candidates and generating the figures in the paper "Comparing Gaussian Process Kernels Used in LSG Models for Flood Inundation Predictions". Figures are also included.

    Python_data

    Folder containing Python scripts with utility functions for setting up, training, and running the LSG models, as well as for evaluating the LSG models.

    • LSG_mods_and_func: Python scripts for using the LSG model.
    • Evaluation_metrics.py: Metrics used to evaluate the prediction accuracy and computational efficiency of the LSG models.

    Python environment

    This folder also contains two Python environment files with all Python package versions and dependencies. You can install the CPU version or the GPU version of the environment. The GPU version can use the GPU to speed up the GPflow training process; it installs the CUDA and cuDNN packages. You can choose to install the environment online or offline. Offline installation reduces dependency issues, but it requires that you use the same Windows 10 operating system as the author.

    Online installation

    • LSG_CPU_environment.yml: Python environment for running LSG models using the CPU of the computer.
    • LSG_GPU_environment.yml: Python environment for running LSG models using the GPU of the computer, mainly to speed up the GPflow training process. It needs to install the CUDA and cuDNN packages.

    In the directory where the .yml file is located, use the console to enter one of the following commands:

        conda env create -f LSG_CPU_environment.yml -n myenv_name
        conda env create -f LSG_GPU_environment.yml -n myenv_name

    Offline installation

    If you also use a Windows 10 system, you can directly unpack the environment packed by conda-pack.

    • LSG_CPU.tar.gz: Archive containing all packages in the virtual environment for CPU only.
    • LSG_GPU.tar.gz: Archive containing all packages in the virtual environment for GPU acceleration.

    On Windows, create a new LSG_CPU or LSG_GPU folder in the Anaconda environment folder and extract the packaged LSG_CPU.tar.gz or LSG_GPU.tar.gz file into that folder:

        tar -xzvf LSG_CPU.tar.gz -C ./LSG_CPU
        tar -xzvf LSG_GPU.tar.gz -C ./LSG_GPU

    Change to the environment path:

        cd ./LSG_GPU

    Activate the environment:

        .\Scripts\activate.bat

    Remove prefixes from the activated environment:

        .\Scripts\conda-unpack.exe

    Exit the environment:

        .\Scripts\deactivate.bat
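The description says that ME3LIN denotes ME3 + LIN and ME3mLIN denotes ME3 x LIN. A minimal sketch of how such sum and product kernels are composed is given below, assuming that ME3 stands for a Matérn 3/2 kernel and LIN for a linear kernel (the source does not spell these out). It uses scalar inputs and plain Python and is not the GPflow code shipped with the dataset. All function names are illustrative.

```python
import math

def matern32(x, y, lengthscale=1.0, variance=1.0):
    """Matern 3/2 kernel for scalar inputs (assumed meaning of ME3)."""
    r = abs(x - y) / lengthscale
    return variance * (1.0 + math.sqrt(3.0) * r) * math.exp(-math.sqrt(3.0) * r)

def linear(x, y, variance=1.0):
    """Linear kernel for scalar inputs (assumed meaning of LIN)."""
    return variance * x * y

def kernel_sum(k1, k2):
    """Composite kernel k1 + k2."""
    return lambda x, y: k1(x, y) + k2(x, y)

def kernel_product(k1, k2):
    """Composite kernel k1 x k2."""
    return lambda x, y: k1(x, y) * k2(x, y)

me3_plus_lin = kernel_sum(matern32, linear)       # naming pattern "ME3LIN"
me3_times_lin = kernel_product(matern32, linear)  # naming pattern "ME3mLIN"

def gram(kernel, xs):
    """Covariance (Gram) matrix of the kernel over the points xs."""
    return [[kernel(a, b) for b in xs] for a in xs]

xs = [0.0, 0.5, 2.0]
K_sum = gram(me3_plus_lin, xs)
K_prod = gram(me3_times_lin, xs)
print(K_sum[1][1], K_prod[1][1])
```

Sums and products of valid kernels are again valid kernels, which is why composite candidates can be built from a small set of single kernels.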

  3. OGBN-Products (Processed for PyG)

    • kaggle.com
    Updated Feb 27, 2021
    + more versions
    Cite
    Redao da Taupl (2021). OGBN-Products (Processed for PyG) [Dataset]. https://www.kaggle.com/datasets/dataup1/ogbn-products/data
    Dataset updated
    Feb 27, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Redao da Taupl
    Description

    OGBN-Products

    Webpage: https://ogb.stanford.edu/docs/nodeprop/#ogbn-products

    Usage in Python

    import os.path as osp
    import pandas as pd
    import datatable as dt
    import torch
    import torch_geometric as pyg
    from ogb.nodeproppred import PygNodePropPredDataset
    # Helper used by the heterogeneous branch below
    from ogb.io.read_graph_raw import read_nodesplitidx_split_hetero
    
    class PygOgbnProducts(PygNodePropPredDataset):
      def __init__(self, meta_csv = None):
        # Point the loader at the Kaggle input directory instead of downloading
        root, name, transform = '/kaggle/input', 'ogbn-products', None
        if meta_csv is None:
          meta_csv = osp.join(root, name, 'ogbn-master.csv')
        master = pd.read_csv(meta_csv, index_col = 0)
        meta_dict = master[name]
        meta_dict['dir_path'] = osp.join(root, name)
        super().__init__(name = name, root = root, transform = transform, meta_dict = meta_dict)
      def get_idx_split(self, split_type = None):
        if split_type is None:
          split_type = self.meta_info['split']
        path = osp.join(self.root, 'split', split_type)
        # Use the cached split file if it exists
        if osp.isfile(osp.join(path, 'split_dict.pt')):
          return torch.load(osp.join(path, 'split_dict.pt'))
        if self.is_hetero:
          train_idx_dict, valid_idx_dict, test_idx_dict = read_nodesplitidx_split_hetero(path)
          for nodetype in train_idx_dict.keys():
            train_idx_dict[nodetype] = torch.from_numpy(train_idx_dict[nodetype]).to(torch.long)
            valid_idx_dict[nodetype] = torch.from_numpy(valid_idx_dict[nodetype]).to(torch.long)
            test_idx_dict[nodetype] = torch.from_numpy(test_idx_dict[nodetype]).to(torch.long)
          # Return only after every node type has been converted
          return {'train': train_idx_dict, 'valid': valid_idx_dict, 'test': test_idx_dict}
        else:
          # Read the index files with datatable, which is fast for large CSVs
          train_idx = dt.fread(osp.join(path, 'train.csv'), header = None).to_numpy().T[0]
          train_idx = torch.from_numpy(train_idx).to(torch.long)
          valid_idx = dt.fread(osp.join(path, 'valid.csv'), header = None).to_numpy().T[0]
          valid_idx = torch.from_numpy(valid_idx).to(torch.long)
          test_idx = dt.fread(osp.join(path, 'test.csv'), header = None).to_numpy().T[0]
          test_idx = torch.from_numpy(test_idx).to(torch.long)
          return {'train': train_idx, 'valid': valid_idx, 'test': test_idx}
    
    dataset = PygOgbnProducts()
    split_idx = dataset.get_idx_split()
    train_idx, valid_idx, test_idx = split_idx['train'], split_idx['valid'], split_idx['test']
    graph = dataset[0] # PyG Graph object
    

    Description

    Graph: The ogbn-products dataset is an undirected and unweighted graph, representing an Amazon product co-purchasing network [1]. Nodes represent products sold in Amazon, and edges between two products indicate that the products are purchased together. The authors follow [2] to process node features and target categories. Specifically, node features are generated by extracting bag-of-words features from the product descriptions followed by a Principal Component Analysis to reduce the dimension to 100.

    Prediction task: The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

    Dataset splitting: The authors consider a more challenging and realistic dataset splitting that differs from the one used in [2]. Instead of randomly assigning 90% of the nodes for training and 10% of the nodes for testing (without use of a validation set), they use the sales ranking (popularity) to split nodes into training/validation/test sets. Specifically, the authors sort the products according to their sales ranking and use the top 8% for training, the next top 2% for validation, and the rest for testing. This is a more challenging splitting procedure that closely matches the real-world application where labels are first assigned to important nodes in the network and ML models are subsequently used to make predictions on less important ones.
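The rank-based split described above can be sketched as follows. This is an illustration only, assuming a list that holds one sales rank per node (rank 1 being the best-selling product). The function name and input layout are hypothetical, and the dataset itself ships precomputed split files.

```python
def rank_based_split(sales_rank, train_frac=0.08, valid_frac=0.02):
    """Split node ids by sales rank (rank 1 = best-selling product).

    sales_rank[i] is the rank of node i; this input layout is an assumption.
    """
    # Node ids ordered from most to least popular
    order = sorted(range(len(sales_rank)), key=lambda i: sales_rank[i])
    n_train = round(len(order) * train_frac)
    n_valid = round(len(order) * valid_frac)
    return {
        "train": order[:n_train],
        "valid": order[n_train:n_train + n_valid],
        "test": order[n_train + n_valid:],
    }

# Toy example: 100 nodes whose ranks run opposite to their ids
ranks = list(range(100, 0, -1))
split = rank_based_split(ranks)
print({k: len(v) for k, v in split.items()})
```

Unlike a random split, the test set here contains only the less popular products, so a model is evaluated on nodes that differ systematically from those it was trained on.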

    Note 1: A very small number of self-connecting edges are repeated (see here); you may remove them if necessary.

    Note 2: For undirected graphs, the loaded graphs will have the doubled number of edges because the bidirectional edges will be added automatically.
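To illustrate Note 2, the sketch below shows what storing an undirected graph as bidirectional edges means for the edge count. It is a plain-Python illustration of the idea, not the loader's implementation, and the function name is hypothetical.

```python
def to_bidirectional(edges):
    """Return every edge in both directions, deduplicated and sorted."""
    both = set()
    for u, v in edges:
        both.add((u, v))
        both.add((v, u))  # reverse direction; a self-loop maps to itself
    return sorted(both)

undirected = [(0, 1), (1, 2)]
directed = to_bidirectional(undirected)
print(len(undirected), len(directed))
```

Each distinct edge between two different nodes appears twice in the result, which is why the loaded edge count is double the number of undirected edges.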

    Summary

    Package      #Nodes      #Edges       Split Type   Task Type                    Metric
    ogb>=1.1.1   2,449,029   61,859,140   Sales rank   Multi-class classification   Accuracy

    Open Graph Benchmark

    Website: https://ogb.stanford.edu

    The Open Graph Benchmark (OGB) [3] is a collection of realistic, large-scale, and diverse benchmark datasets for machine learning on graphs. OGB datasets are automatically downloaded, processed, and split using the OGB Data Loader. The model performance can be evaluated using the OGB Evaluator in a unified manner.

    References

    [1] http://manikvarma.org/downloads/XC/XMLRepository.html

    [2] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 257–266, 2019.

    [3] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, pp. 22118–22133, 2020.

    License: Amazon License

    By accessing the Amazon Customer Reviews Library ("Reviews Library"), you agree that the Reviews Library is an Amazon Service subject to the Amazon.com Conditions of Use (https://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&nodeId=508088) and you agree to be bound by them, with the following additional conditions: In addition to the license rights granted under the Conditions of Use, Amazon or its content providers grant you a limited, non-exclusive, non-transferable, non-sublicensable, revocable license to access and use the Reviews Library for purposes of academic research. You may not resell, republish, or make any commercial use of the Reviews Library or its contents, including use of the Reviews Library for commercial research, such as research related to a funding or consultancy contract, internship, or other relationship in which the results are provided for a fee or delivered to a for-profit organization. You may not (a) link or associate content in the Reviews Library with any personal information (including Amazon customer accounts), or (b) attempt to determine the identity of the author of any content in the Reviews Library. If you violate any of the foregoing conditions, your license to access and use the Reviews Library will automatically terminate without prejudice to any of the other rights or remedies Amazon may have.

    Disclaimer

    I am NOT the author of this dataset. It was downloaded from its official website. I assume no responsibility or liability for the content in this dataset. Any questions, problems or issues, please contact the original authors at their website or their GitHub repo.

  4. MC-LSTM papers, model runs

    • beta.hydroshare.org
    • hydroshare.org
    • +1more
    zip
    Updated Jan 17, 2022
    Cite
    Jonathan Martin Frame (2022). MC-LSTM papers, model runs [Dataset]. http://doi.org/10.4211/hs.d750278db868447dbd252a8c5431affd
    Available download formats: zip (5.9 GB)
    Dataset updated
    Jan 17, 2022
    Dataset provided by
    HydroShare
    Authors
    Jonathan Martin Frame
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Time period covered
    Jan 1, 1989 - Jan 1, 2015
    Description

    Runs from two papers exploring the use of mass conserving LSTM. Model results used in the papers are 1) model_outputs_for_analysis_extreme_events_paper.tar.gz, and 2) model_outputs_for_analysis_mass_balance_paper.tar.gz.

    The models here are trained/calibrated on three different time periods. Standard Time Split (time split 1): the test period (1989-1999) is the same period used by previous studies, which allows us to confirm that the deep learning models (LSTM and MC-LSTM) trained for this project perform as expected relative to prior work. NWM Time Split (time split 2): the second test period (1995-2014) allows us to benchmark against the NWM-Rv2, which does not provide data prior to 1995. Return period split: the third test period (based on return periods) allows us to benchmark only on water years that contain streamflow events that are larger (per basin) than anything seen in the training data (<= 5-year return periods in training and > 5-year return periods in testing).

    Also included are an ensemble of model runs for LSTM, MC-LSTM for the "standard" training period and two forcing products. These files are provided in the format "

    IMPORTANT NOTE: This python environment should be used to extract and load the data: https://github.com/jmframe/mclstm_2021_extrapolate/blob/main/python_environment.yml, as the pickle files serialized the data with specific versions of python libraries. Specifically, the pickle serialization was done with xarray=0.16.1.

    Code to interpret these runs can be found here: https://github.com/jmframe/mclstm_2021_extrapolate https://github.com/jmframe/mclstm_2021_mass_balance

    Papers are available here: https://hess.copernicus.org/preprints/hess-2021-423/

  5. wider_face

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). wider_face [Dataset]. https://www.tensorflow.org/datasets/catalog/wider_face
    Dataset updated
    Dec 6, 2022
    Description

    WIDER FACE dataset is a face detection benchmark dataset, of which images are selected from the publicly available WIDER dataset. We choose 32,203 images and label 393,703 faces with a high degree of variability in scale, pose and occlusion as depicted in the sample images. WIDER FACE dataset is organized based on 61 event classes. For each event class, we randomly select 40%/10%/50% data as training, validation and testing sets. We adopt the same evaluation metric employed in the PASCAL VOC dataset. Similar to MALF and Caltech datasets, we do not release bounding box ground truth for the test images. Users are required to submit final prediction files, which we shall proceed to evaluate.
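The PASCAL VOC detection metric mentioned above matches predicted boxes to ground-truth boxes by their intersection over union (IoU), conventionally counting a detection as correct when the IoU exceeds 0.5. A minimal sketch of the IoU computation follows, assuming boxes given as (x_min, y_min, x_max, y_max) tuples. The box layout and the function name are assumptions for illustration and may not match the format used in the dataset's annotation files.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if any
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap, overlap > 0.5)
```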

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wider_face', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wider_face-0.1.0.png

  6. Training and test data for the preparation of the article: Convolutional...

    • 4tu.edu.hpc.n-helix.com
    • data.4tu.nl
    zip
    Updated May 29, 2020
    Cite
    Dmytro Kolenov; D. (Davy) Davidse (2020). Training and test data for the preparation of the article: Convolutional Neural Network Applied for Nanoparticle Classification using Coherent Scaterometry Data [Dataset]. http://doi.org/10.4121/uuid:516ab2fa-4c47-42f8-b614-5e283889b218
    Available download formats: zip
    Dataset updated
    May 29, 2020
    Dataset provided by
    4TU (https://www.4tu.nl/)
    Authors
    Dmytro Kolenov; D. (Davy) Davidse
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Here we supply the training and test data as used in the prepared publication of "Convolutional Neural Network Applied for Nanoparticle Classification using Coherent Scaterometry Data" by D. Kolenov, D. Davidse, J. Le Cam, S.F. Pereira.

    We present the "main dataset" samples at pixel sizes of both 150x150 and 100x100; for the three "fooling datasets" the pixel size is 100x100. On average, each dataset contains 1100 images with the .mat extension. The .mat format is native to MATLAB, but the files can also be opened in Python or MS Excel. For the "main dataset", the pixels represent the sampling points, and the magnitude of each pixel represents the EM field registered as the photocurrent on the split detector. For the three types of "fooling data", the images of the 1) noisy and 2) mirrored sets are also based on the photocurrent; 3) the elephant set is based on the open-source Animal-10 data.

  7. STEAD subsample 4 CDiffSD

    • zenodo.org
    bin
    Updated Apr 30, 2024
    Cite
    Daniele Trappolini; Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
    Available download formats: bin
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele Trappolini; Daniele Trappolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    STEAD Subsample Dataset for CDiffSD Training

    Overview

    This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, which can be opened in Python with the `h5py` library.

    Dataset Files

    The dataset includes the following files:

    • train: Used for both training and validation phases (with validation train split). Contains earthquake ground truth traces.
    • noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.
    • test: Used for the testing phase, structured similarly to train.
    • noise_test: Used for the testing phase, contains noise data for testing.

    Each file is structured to support the training and evaluation of seismic denoising models.

    Data

    The HDF5 files named noise contain two main datasets:

    • traces: This dataset includes N number of events, with each event being 6000 in size, representing the length of the traces. Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).
    • metadata: This dataset contains the names of the traces for each event.

    Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

    • p_arrival: Contains the arrival indices of P-waves, expressed in counts.
    • s_arrival: Contains the arrival indices of S-waves, also expressed in counts.
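Because the P- and S-wave arrivals are stored as sample counts, converting them to seconds requires the sampling rate. The sketch below assumes a rate of 100 Hz, which would make a 6000-sample trace 60 seconds long. That rate is an assumption to verify against the dataset, and the helper names are hypothetical.

```python
SAMPLING_RATE_HZ = 100.0  # assumption: 6000 samples would then span 60 s

def arrival_seconds(index, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Convert an arrival index (in samples) to seconds from trace start."""
    return index / sampling_rate_hz

def s_minus_p_seconds(p_index, s_index, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Time between the P-wave and S-wave arrivals, in seconds."""
    return (s_index - p_index) / sampling_rate_hz

# Illustrative indices, not values read from the dataset
print(arrival_seconds(600), s_minus_p_seconds(600, 1100))
```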


    Usage

    To load these files in a Python environment, use the following approach:

    ```python
    import h5py
    import numpy as np

    # Open the HDF5 file in read mode
    with h5py.File('train_noise.hdf5', 'r') as file:
        # Print all the main keys in the file
        print("Keys in the HDF5 file:", list(file.keys()))

        if 'traces' in file:
            # Access the dataset
            data = file['traces'][:10]  # Load the first 10 traces

        if 'metadata' in file:
            # Access the dataset
            trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
    ```

    Ensure that the path to the file is correctly specified relative to your Python script.

    Requirements

    To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:

    ```bash
    pip install numpy
    pip install h5py
    ```

  8. V2 Balloon Detection Dataset Dataset

    • paperswithcode.com
    Updated Sep 5, 2024
    + more versions
    Cite
    (2024). V2 Balloon Detection Dataset Dataset [Dataset]. https://paperswithcode.com/dataset/v2-balloon-detection-dataset
    Dataset updated
    Sep 5, 2024
    Description

    This dataset was created to serve as an easy-to-use image dataset, perfect for experimenting with object detection algorithms. The main goal was to provide a simplified dataset that allows for quick setup and minimal effort in exploratory data analysis (EDA). This dataset is ideal for users who want to test and compare object detection models without spending too much time navigating complex data structures. Unlike datasets like chest x-rays, which require expert interpretation to evaluate model performance, the simplicity of balloon detection enables users to visually verify predictions without domain expertise.

    The original Balloon dataset was more complex, as it was split into separate training and testing sets, with annotations stored in two separate JSON files. To streamline the experience, this updated version of the dataset merges all images into a single folder and replaces the JSON annotations with a single, easy-to-use CSV file. This new format ensures that the dataset can be loaded seamlessly with tools like Pandas, simplifying the workflow for researchers and developers.

    The dataset contains a total of 74 high-quality JPG images, each featuring one or more balloons in different scenes and contexts. Accompanying the images is a CSV file that provides annotation data, such as bounding box coordinates and labels for each balloon within the images. This structure makes the dataset easily accessible for a range of machine learning and computer vision tasks, including object detection and image classification. The dataset is versatile and can be used to test algorithms like YOLO, Faster R-CNN, SSD, or other popular object detection models.

    Key Features:

    Image Format: 74 JPG images, ensuring high compatibility with most machine learning frameworks.

    Annotations: A single CSV file that contains structured data, including bounding box coordinates, class labels, and image file names, which can be loaded with Python libraries like Pandas.

    Simplicity: Designed for users to quickly start training object detection models without needing to preprocess or deeply explore the dataset.

    Variety: The images feature balloons in various sizes, colors, and scenes, making it suitable for testing the robustness of detection models.
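As a rough illustration of working with a single annotation CSV like the one described, the sketch below groups bounding boxes by image using only the standard library. The column names (fname, xmin, ymin, xmax, ymax, label) and the sample rows are made up for the example. The real file may use a different layout, so check its header first.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in a hypothetical column layout; the real CSV may differ.
SAMPLE_CSV = """fname,xmin,ymin,xmax,ymax,label
img_001.jpg,10,20,110,220,balloon
img_001.jpg,300,40,380,160,balloon
img_002.jpg,5,5,50,80,balloon
"""

def load_boxes(csv_file):
    """Group (box, label) pairs by image file name."""
    boxes = defaultdict(list)
    for row in csv.DictReader(csv_file):
        box = tuple(int(row[k]) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes[row["fname"]].append((box, row["label"]))
    return dict(boxes)

boxes = load_boxes(io.StringIO(SAMPLE_CSV))
for fname, items in boxes.items():
    print(fname, len(items), "box(es)")
```

To read an actual file, pass an open file handle to load_boxes in place of the in-memory sample.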

    This dataset is sourced from Kaggle.

  9. imdb_reviews

    • tensorflow.org
    Updated Sep 20, 2024
    + more versions
    Cite
    (2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
    Dataset updated
    Sep 20, 2024
    Description

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imdb_reviews', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. Tree Point Classification - New Zealand

    • pacificgeoportal.com
    • geoportal-pacificcore.hub.arcgis.com
    • +1more
    Updated Jul 25, 2022
    Cite
    Eagle Technology Group Ltd (2022). Tree Point Classification - New Zealand [Dataset]. https://www.pacificgeoportal.com/content/0e2e3d0d0ef843e690169cac2f5620f9
    Dataset updated
    Jul 25, 2022
    Dataset authored and provided by
    Eagle Technology Group Ltd
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This New Zealand Point Cloud Classification Deep Learning Package will classify point clouds into tree and background classes. This model is optimized to work with New Zealand aerial LiDAR data.The classification of point cloud datasets to identify Trees is useful in applications such as high-quality 3D basemap creation, urban planning, forestry workflows, and planning climate change response.Trees could have a complex irregular geometrical structure that is hard to capture using traditional means. Deep learning models are highly capable of learning these complex structures and giving superior results.This model is designed to extract Tree in both urban and rural area in New Zealand.The Training/Testing/Validation dataset are taken within New Zealand resulting of a high reliability to recognize the pattern of NZ common building architecture.Licensing requirementsArcGIS Desktop - ArcGIS 3D Analyst extension for ArcGIS ProUsing the modelThe model can be used in ArcGIS Pro's Classify Point Cloud Using Trained Model tool. Before using this model, ensure that the supported deep learning frameworks libraries are installed. For more details, check Deep Learning Libraries Installer for ArcGIS.Note: Deep learning is computationally intensive, and a powerful GPU is recommended to process large datasets.InputThe model is trained with classified LiDAR that follows the LINZ base specification. The input data should be similar to this specification.Note: The model is dependent on additional attributes such as Intensity, Number of Returns, etc, similar to the LINZ base specification. This model is trained to work on classified and unclassified point clouds that are in a projected coordinate system, in which the units of X, Y and Z are based on the metric system of measurement. If the dataset is in degrees or feet, it needs to be re-projected accordingly. The model was trained using a training dataset with the full set of points. 
Therefore, it is important to make the full set of points available to the neural network during prediction, allowing it to better discriminate points of the 'class of interest' from background points. It is recommended to use the 'selective/target classification' and 'class preservation' functionalities during prediction to have better control over the classification and over scenarios with false positives.

The model was trained on airborne lidar datasets and is expected to perform best with similar datasets. Classification of terrestrial point cloud datasets may work but has not been validated. For such cases, this pre-trained model may be fine-tuned to save on cost, time, and compute resources while improving accuracy. Another example where fine-tuning this model can be useful is when the object of interest is tram wires, railway wires, etc., which are geometrically similar to electricity wires. When fine-tuning this model, the target training data characteristics such as class structure, maximum number of points per block and extra attributes should match those of the data originally used for training this model (see the Training data section below).

Output

The model will classify the point cloud into the following classes, with their meaning as defined by the American Society for Photogrammetry and Remote Sensing (ASPRS):

0    Background
5    Trees / High-vegetation

Applicable geographies

The model is expected to work well in New Zealand, and it has been seen to produce favorable results in many regions. However, results can vary for datasets that are statistically dissimilar to the training data.

Dataset                  City
Training                 Wellington
Testing                  Tawa
Validation/Evaluation    Christchurch

Model architecture

This model uses the PointCNN model architecture implemented in ArcGIS API for Python.

Accuracy metrics

The table below summarizes the accuracy of the predictions on the validation dataset.

Class               Precision    Recall      F1-score
Never Classified    0.991200     0.975404    0.983239
High Vegetation     0.933569     0.975559    0.954102

Training data

This model is trained on a classified dataset originally provided by OpenTopography, with < 1% manual labelling and correction.

Train-test split percentage: {Train: 80%, Test: 20%}. This ratio was chosen based on an analysis of the statistics from previous epochs, which appeared to show a decent improvement.

The training data used has the following characteristics:

X, Y, and Z linear unit     Meter
Z range                     -121.69 m to 26.84 m
Number of Returns           1 to 5
Intensity                   16 to 65520
Point spacing               0.2 ± 0.1
Scan angle                  -15 to +15
Maximum points per block    8192
Block Size                  20 Meters
Class structure             [0, 5]

Sample results

The model was used to classify the Christchurch city dataset, which has a density of 5 pts/m. The model's performance is directly proportional to the point density of the dataset and is better on point clouds from which noise has been excluded.

To learn how to use this model, see this story
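As a quick consistency check on the accuracy metrics reported above, the F1-score can be recomputed as the harmonic mean of precision and recall. The sketch below is only an illustration; the function name `f1_score` and the dictionary layout are assumptions made for this example, not part of the model's tooling.

```python
# Sketch: recompute F1 from the precision and recall values reported above.
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) taken from the accuracy table.
reported = {
    "Never Classified": (0.991200, 0.975404, 0.983239),
    "High Vegetation": (0.933569, 0.975559, 0.954102),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, (_, _, f1) in reported.items():
    print(f"{name}: reported F1={f1:.6f}, recomputed F1={recomputed[name]:.6f}")
```

Both recomputed values agree with the reported F1-scores to within rounding.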

  11. m

    Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • data.mendeley.com
    • narcis.nl
    Updated Jan 11, 2021
    Cite
    TaeKeun Yoo (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Dataset updated
    Jan 11, 2021
    Authors
    TaeKeun Yoo
    License

    Attribution-NonCommercial 3.0 (CC BY-NC 3.0): https://creativecommons.org/licenses/by-nc/3.0/
    License information was derived automatically

    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim, MD, In Sik Lee, MD, PhD, Jin Kook Kim, MD, Wakako Ando, CO, Nobuyuki Shoji, MD, PhD, Tomofusa Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect to the data in your Google Drive
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data
    # In this case, we used ICL vault prediction using preop measurement
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size 2. ICL power 3. LV 4. CLR 5. ACD 6. ATA
    # 7. MSE 8. Age 9. Pupil size 10. WTW 11. CCT 12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, we can split the data 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)
    # In our study, the training (B&VIIT Eye Center, n=1455) and test
    # (Kitasato University, n=290) datasets were already defined, so this
    # split was not necessary to perform our analysis.

    # An optimal parameter search could be performed in this section.
    # Note: scikit-learn 1.0 and later name this criterion 'absolute_error';
    # older versions call it 'mae'.
    parameters = {'bootstrap': True, 'min_samples_leaf': 3, 'n_estimators': 500,
                  'criterion': 'absolute_error', 'min_samples_split': 10,
                  'max_features': 'sqrt', 'max_depth': 6, 'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_
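The description notes that the study used a predefined split by clinical site (training on B&VIIT Eye Center data, testing on Kitasato University data) rather than a random 8:2 split. A minimal standard-library sketch of that idea follows. The field names `center` and `vault_1m`, the sample values, and the mean-value baseline are all hypothetical and chosen only for illustration; they are not the authors' data or model.

```python
from statistics import mean

# Hypothetical records: field names and values are made up for illustration.
records = [
    {"center": "B&VIIT", "vault_1m": 520.0},
    {"center": "B&VIIT", "vault_1m": 610.0},
    {"center": "B&VIIT", "vault_1m": 480.0},
    {"center": "Kitasato", "vault_1m": 550.0},
    {"center": "Kitasato", "vault_1m": 430.0},
]

def split_by_center(rows, train_center, test_center):
    """Split records by site instead of at random."""
    train = [r for r in rows if r["center"] == train_center]
    test = [r for r in rows if r["center"] == test_center]
    return train, test

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

train, test = split_by_center(records, "B&VIIT", "Kitasato")

# Trivial baseline: always predict the mean vault of the training site.
baseline = mean(r["vault_1m"] for r in train)
y_true = [r["vault_1m"] for r in test]
mae = mean_absolute_error(y_true, [baseline] * len(y_true))
print(f"train n={len(train)}, test n={len(test)}, baseline MAE={mae:.1f}")
```

Evaluating on a site that contributed no training data gives a stricter estimate of how a model transfers to another clinic than a random split of pooled data would.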

  12. T

    cifar100

    • tensorflow.org
    • universe.roboflow.com
    • +4more
    Updated Jun 1, 2024
    Cite
    (2024). cifar100 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar100
    Dataset updated
    Jun 1, 2024
    Description

    This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cifar100', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar100-3.0.2.png

  13. T

    ag_news_subset

    • tensorflow.org
    Updated Dec 6, 2022
    Cite
    (2022). ag_news_subset [Dataset]. http://identifiers.org/arxiv:1509.01626
    Dataset updated
    Dec 6, 2022
    Description

    AG is a collection of more than 1 million news articles. News articles have been gathered from more than 2000 news sources by ComeToMyHead in more than 1 year of activity. ComeToMyHead is an academic news search engine which has been running since July, 2004. The dataset is provided by the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity. For more information, please refer to the link http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .

    The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the dataset above. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

    The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ag_news_subset', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  14. T

    cifar10_h

    • tensorflow.org
    Updated Jan 18, 2023
    Cite
    (2023). cifar10_h [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10_h
    Dataset updated
    Jan 18, 2023
    Description

    A re-labeled version of CIFAR-10's test set with soft-labels coming from real human annotators. For every pair (image, label) in the original CIFAR-10 test set, it provides several additional labels given by real human annotators as well as the average soft-label. The training set is identical to the one of the original dataset.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cifar10_h', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10_h-1.0.0.png
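The cifar10_h description mentions an "average soft-label" derived from several human annotations per image. One common way to build such a label is to take the fraction of annotators who chose each class. The sketch below illustrates that idea under this assumption; the votes are invented, and the helper name `soft_label` is hypothetical rather than part of the dataset or of tensorflow_datasets.

```python
from collections import Counter

NUM_CLASSES = 10  # CIFAR-10 has 10 classes

def soft_label(annotations, num_classes=NUM_CLASSES):
    """Return the fraction of annotators who chose each class index."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts.get(c, 0) / total for c in range(num_classes)]

# Hypothetical votes from 5 annotators for one test image (class indices).
votes = [3, 3, 5, 3, 3]
probs = soft_label(votes)
print(probs)
```

The resulting vector sums to 1 and can be used as a training or evaluation target in place of a one-hot label.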

  15. T

    savee

    • tensorflow.org
    Updated Dec 23, 2022
    Cite
    (2022). savee [Dataset]. https://www.tensorflow.org/datasets/catalog/savee
    Dataset updated
    Dec 23, 2022
    Description

    SAVEE (Surrey Audio-Visual Expressed Emotion) is an emotion recognition dataset. It consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically-balanced for each emotion. This release contains only the audio stream from the original audio-visual recording.

    The data is split so that the training set consists of 2 speakers, and the validation and test sets each consist of samples from 1 speaker.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('savee', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

  16. T

    celeb_a

    • tensorflow.org
    • datasetninja.com
    • +2more
    Updated Jun 1, 2024
    Cite
    (2024). celeb_a [Dataset]. https://www.tensorflow.org/datasets/catalog/celeb_a
    Dataset updated
    Jun 1, 2024
    Description

    CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter. CelebA has large diversities, large quantities, and rich annotations, including 10,177 identities, 202,599 face images, and 5 landmark locations and 40 binary attribute annotations per image.

    The dataset can be employed as the training and test sets for the following computer vision tasks: face attribute recognition, face detection, and landmark (or facial part) localization.

    Note: CelebA dataset may contain potential bias. The fairness indicators example goes into detail about several considerations to keep in mind while using the CelebA dataset.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('celeb_a', split='train')
    for ex in ds.take(4):
        print(ex)

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/celeb_a-2.1.0.png

  17. Z

    Preprocessed Python Code Corpus

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jan 29, 2020
    Cite
    Rafael - Michael Karampatsis (2020). Preprocessed Python Code Corpus [Dataset]. https://data.niaid.nih.gov/resources?id=ZENODO_3628635
    Dataset updated
    Jan 29, 2020
    Dataset provided by
    Rafael - Michael Karampatsis
    Romain Robbes
    Andrea Janes
    Charles Sutton
    Hlib Babii
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A preprocessed code corpus for the Python programming language. The corpus was used for the experiments in the paper Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code. It contains preprocessed, tokenized files for training, validation, testing, and BPE encoding learning. The BPE-segmented versions of the above files are also included for three different encoding sizes, i.e., 2000, 5000, and 10000 BPE merge operations, as well as the learned BPE encodings. Similar versions are also included for splitting compound identifiers on camelCase and snake_case as in (Allamanis et al., 2015), as well as the corresponding subtoken maps.
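To make the idea of splitting compound identifiers on camelCase and snake_case concrete, here is a minimal sketch using a regular expression. It is an illustration of the general technique only; the function name `split_identifier` and the exact boundary rules are assumptions, and the corpus authors' preprocessing may differ.

```python
import re

# Zero-width boundaries: lowercase/digit followed by uppercase ("parseHttp"),
# or an acronym followed by a capitalized word ("HTTPServer").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")

def split_identifier(identifier):
    """Split a compound identifier into subtokens."""
    subtokens = []
    for part in identifier.split("_"):  # snake_case boundaries
        if not part:
            continue  # skip empty pieces from leading or doubled underscores
        subtokens.extend(_CAMEL_BOUNDARY.sub(" ", part).split())
    return subtokens

for name in ["parseHttpResponse", "read_csv_file", "HTTPServer"]:
    print(name, "->", split_identifier(name))
```

Splitting identifiers this way shrinks the vocabulary a language model has to handle, which is the same motivation behind the BPE-segmented versions of the corpus.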

  18. OSM buildings noisy labels dataset

    • zenodo.org
    • explore.openaire.eu
    • +1more
    zip
    Updated Apr 27, 2022
    Cite
    Jonas Gütter; Jonas Gütter (2022). OSM buildings noisy labels dataset [Dataset]. http://doi.org/10.5281/zenodo.6477788
    Available download formats: zip
    Dataset updated
    Apr 27, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Jonas Gütter; Jonas Gütter
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset contains tile imagery from the OpenStreetMap project alongside label masks for buildings from OpenStreetMap. Besides the original clean label set, additional noisy label sets for random noise, removed and added buildings are provided.

    The purpose of this dataset is to provide training data for analysing the impact of noisy labels on the performance of models for semantic segmentation in Earth observation.

    The code for downloading and creating the datasets, as well as for performing some preliminary analyses, is also provided. However, it is necessary to have access to a tile server from which OpenStreetMap tiles can be downloaded in sufficient amounts.

    To reproduce the dataset and perform analysis on it, do the following:

    • Unzip data.zip and code.zip
    • Create the folder structure from data
    • Build and activate a Python environment from environment.yml
    • Insert the URL of a suitable tile server for OSM tiles in line 76 of utils.py
    • Execute download_OSM_dataset.py to download OSM image tiles alongside OSM labels
    • Execute create_noisy_labels.py for the OSM dataset to create noisy label sets
    • Divide the images and labels into train and test data. split_data.py can be used as a baseline for this, but pathnames have to be adjusted and the corresponding directories have to be created first.
    • Call train_model.py to train a model on the data. Specify the data size and the label set by giving command line arguments as shown in train_model.sh
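The dataset description mentions noisy label sets created with random noise. As a rough illustration of what pixel-level random label noise can look like, the sketch below flips a fixed fraction of pixels in a small binary building mask. This is not the logic of create_noisy_labels.py, which is not shown here; the function name `add_random_noise`, the `flip_fraction` parameter, and the toy mask are all assumptions.

```python
import random

def add_random_noise(mask, flip_fraction, seed=0):
    """Return a copy of a binary mask with a fraction of its pixels flipped."""
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    coords = [(r, c) for r in range(height) for c in range(width)]
    n_flip = round(flip_fraction * len(coords))
    noisy = [row[:] for row in mask]  # copy so the clean mask stays intact
    for r, c in rng.sample(coords, n_flip):
        noisy[r][c] = 1 - noisy[r][c]
    return noisy

# Toy 4x4 mask: 1 marks building pixels, 0 marks background.
clean = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
noisy = add_random_noise(clean, flip_fraction=0.25)
changed = sum(a != b for ra, rb in zip(clean, noisy) for a, b in zip(ra, rb))
print(f"{changed} of 16 pixels flipped")
```

Keeping the clean mask untouched makes it easy to train the same model on both label sets and compare the results.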

  19. h

    openai_humaneval

    • huggingface.co
    Updated Jan 1, 2022
    Cite
    OpenAI (2022). openai_humaneval [Dataset]. https://huggingface.co/datasets/openai/openai_humaneval
    Dataset updated
    Jan 1, 2022
    Dataset authored and provided by
    OpenAI (http://openai.com/)
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for OpenAI HumanEval

      Dataset Summary

    The HumanEval dataset released by OpenAI includes 164 programming problems with a function signature, docstring, body, and several unit tests. They were handwritten to ensure that they were not included in the training sets of code generation models.

      Supported Tasks and Leaderboards

      Languages


    The programming problems are written in Python and contain English natural text in comments and… See the full description on the dataset page: https://huggingface.co/datasets/openai/openai_humaneval.
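To show how a problem made of a function signature, docstring, body, and unit tests can be used to check generated code, here is a small self-contained sketch. The toy problem and the field names (`prompt`, `canonical_solution`, `test`, `entry_point`) are assumptions made for illustration; consult the dataset page for the actual schema. Running untrusted model output with `exec` is unsafe outside a sandbox.

```python
# Hypothetical problem record; field names and contents are illustrative.
problem = {
    "prompt": 'def add(a, b):\n    """Return the sum of a and b."""\n',
    "canonical_solution": "    return a + b\n",
    "test": (
        "def check(candidate):\n"
        "    assert candidate(1, 2) == 3\n"
        "    assert candidate(-1, 1) == 0\n"
    ),
    "entry_point": "add",
}

def passes_tests(problem, completion):
    """Run the prompt plus a completion against the problem's unit tests."""
    namespace = {}
    program = problem["prompt"] + completion + "\n" + problem["test"]
    try:
        exec(program, namespace)  # defines the function and check()
        namespace["check"](namespace[problem["entry_point"]])
    except Exception:
        return False
    return True

ok = passes_tests(problem, problem["canonical_solution"])
bad = passes_tests(problem, "    return a - b\n")
print(ok, bad)
```

A completion counts as correct only if every assertion in the unit tests passes, which is what makes functional correctness measurable without comparing text.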

  20. Z

    HTRCatalogs: Dataset for historical catalogs HTR and Segmentation

    • data.niaid.nih.gov
    • zenodo.org
    Updated Sep 7, 2021
    Cite
    Joyeux-Prunel, Beatrice (2021). HTRCatalogs: Dataset for historical catalogs HTR and Segmentation [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5458349
    Dataset updated
    Sep 7, 2021
    Dataset provided by
    Joyeux-Prunel, Beatrice
    Janes, Juliette
    Gabay, Simon
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This release contains 465 XML files and their corresponding images from a large corpus of 19th-, 20th- and 21st-century exhibition catalogs, manuscripts' fair catalogs and directories. The new catalogs added here were created using the HTR and segmentation models accessible in the repository. It includes a CSV file describing the XML files and various tools to create a training dataset: different Bash scripts, a Python program to divide the XML files into testing, training and evaluation datasets, and several fixed tests. An XSL transformation sheet is also available to delete the Entry and EntryEnd zones from the XML files in order to obtain a SegmOnto-like dataset. The XML files have been corrected since the 4.0 release thanks to the addition of a GitHub action (SegmOntoKraken).
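Dividing a set of files into training, testing and evaluation sets, as the release's Python program is described as doing, can be sketched as follows. This is not the repository's script; the function name `split_files`, the 80/10/10 ratios, the seed, and the file names are assumptions chosen for illustration.

```python
import random

def split_files(filenames, train=0.8, test=0.1, seed=42):
    """Shuffle file names and divide them into three disjoint sets."""
    files = sorted(filenames)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train)
    n_test = int(len(files) * test)
    return {
        "train": files[:n_train],
        "test": files[n_train:n_train + n_test],
        "eval": files[n_train + n_test:],  # remainder goes to evaluation
    }

# Hypothetical file names standing in for the 465 XML files of the release.
xml_files = [f"catalog_{i:03d}.xml" for i in range(465)]
splits = split_files(xml_files)
print({name: len(items) for name, items in splits.items()})
```

Fixing the seed keeps the three sets stable between runs, so results obtained on the evaluation set remain comparable.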

