100+ datasets found
  1. A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution...

    • dataverse.harvard.edu
    Updated Jan 19, 2021
    Cite
    Lianfa, Li; Jiajie, Wu (2021). A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China [Dataset]. http://doi.org/10.7910/DVN/RNSWRH
    Explore at:
    Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 19, 2021
    Dataset provided by
    Harvard Dataverse
    Authors
    Lianfa, Li; Jiajie, Wu
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Time period covered
    Jan 1, 2015 - Dec 31, 2018
    Area covered
    China
    Description

    We share a complete aerosol optical depth (AOD) dataset with high spatial (1x1 km^2) and temporal (daily) resolution in the Beijing 1954 projection (https://epsg.io/2412) for mainland China (2015-2018). The original AOD images are from the Multi-Angle Implementation of Atmospheric Correction Aerosol Optical Depth product (MAIAC AOD, https://lpdaac.usgs.gov/products/mcd19a2v006/), which has a similar spatiotemporal resolution and uses the sinusoidal projection (https://en.wikipedia.org/wiki/Sinusoidal_projection). After projection conversion, eighteen tiles of MAIAC AOD were merged to obtain a single large AOD image covering all of mainland China. Because of clouds and high surface reflectance, each original MAIAC AOD image usually has many missing values; the average missing percentage per image may exceed 60%. Such a high percentage of missing values severely limits the applicability of the original MAIAC AOD product. We used full residual deep networks (Li et al., 2020, https://ieeexplore.ieee.org/document/9186306) to impute the daily missing MAIAC AOD, thus obtaining a complete (no missing values) high-resolution AOD data product covering mainland China. The covariates used in imputation included coordinates, elevation, MERRA-2 coarse-resolution PBLH and AOD variables, cloud fraction, high-resolution meteorological variables (air pressure, air temperature, relative humidity and wind speed) and a time index. Ground monitoring data were used to generate the high-resolution meteorological variables to ensure reliable interpolation. Overall, our daily imputation models achieved an average training R^2 of 0.90 (range 0.75-0.97; average RMSE 0.075, range 0.026-0.32) and an average test R^2 of 0.90 (range 0.75-0.97; average RMSE 0.075, range 0.026-0.32).
    With almost no difference between training and test metrics, the high test R^2 and low test RMSE show the reliability of the AOD imputation. In an evaluation against ground AOD data from Aerosol Robotic Network (AERONET) monitoring stations in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, further confirming its reliability. This database contains four datasets:
    - Daily complete high-resolution AOD images for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
    - The table "CHN_AOD_INFO.xlsx" describes the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum predicted AOD.
    - The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx" describes the performance metrics of the high-resolution meteorological interpolation.
    - The table "Evaluation_Using_AERONET_AOD.xlsx" shows the AERONET evaluation results, including R^2, RMSE, and the monitoring information used in this study.
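    The R^2 and RMSE figures quoted above can be reproduced for any pair of observed/imputed vectors with a few lines of NumPy. This is a generic sketch of the two metrics, not the authors' code:

```python
import numpy as np

def r2_rmse(y_true, y_pred):
    """R^2 (coefficient of determination) and RMSE between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return r2, rmse
```

    Comparing held-out pixels against their imputed values with such a function yields exactly the kind of per-image training/test R^2 and RMSE tabulated in "CHN_AOD_INFO.xlsx".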

  2. Data from: Data for Machine Learning Predictions of Nitrate in Shallow...

    • catalog.data.gov
    • data.usgs.gov
    • +1 more
    Updated Nov 26, 2025
    + more versions
    Cite
    U.S. Geological Survey (2025). Data for Machine Learning Predictions of Nitrate in Shallow Groundwater in the Conterminous United States [Dataset]. https://catalog.data.gov/dataset/data-for-machine-learning-predictions-of-nitrate-in-shallow-groundwater-in-the-conterminou
    Explore at:
    Dataset updated
    Nov 26, 2025
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Area covered
    United States, Contiguous United States
    Description

    An extreme gradient boosting (XGB) machine learning model was developed to predict the distribution of nitrate in shallow groundwater across the conterminous United States (CONUS). Nitrate was predicted at a 1-square-kilometer resolution at a depth of 10 m below the water table. The model builds on a previous XGB machine learning model developed to predict nitrate at domestic and public supply groundwater zones (Ransom and others, 2022) by incorporating additional monitoring well samples and by modifying and adding predictor variables. The shallow zone model included variables representing well characteristics, hydrologic conditions, soil type, geology, climate, oxidation/reduction, and nitrogen inputs. Predictor variables derived from empirical or numerical process-based models were also included to integrate information on controlling processes and conditions. This data release documents the model and provides the model results. Included in this data release are 1) a model archive of the R project: source code and input files (including model training and testing data, rasters of all final predictor variables, and an output raster representing predicted nitrate concentration in the shallow zone), 2) a read_me.txt file describing the model archive, its use and the modeling details, and 3) a table describing the model variables.
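    The full XGB model lives in the R model archive; the core mechanism it builds on, gradient boosting, can be illustrated with single-feature regression stumps in plain NumPy. This is a didactic toy under simplified assumptions, not the USGS model:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single split on one feature; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in np.unique(x)[:-1]:              # a split at the max would leave one side empty
        left, right = residual[x <= t], residual[x > t]
        pred = np.where(x <= t, left.mean(), right.mean())
        err = np.mean((residual - pred) ** 2)
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

def boost(x, y, n_rounds=60, lr=0.1):
    """Gradient boosting for squared error: repeatedly fit stumps to the residuals."""
    pred = np.full(len(y), np.mean(y))
    stumps = []
    for _ in range(n_rounds):
        t, lm, rm = fit_stump(x, y - pred)   # each stump corrects what remains unexplained
        pred = pred + lr * np.where(x <= t, lm, rm)
        stumps.append((t, lm, rm))
    return pred, stumps
```

    XGB adds regularisation, multi-feature trees and clever split finding on top of this additive scheme, but the residual-fitting loop is the same idea.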

  3. Prediction R-squared evaluated in testing data sets (average over 30...

    • plos.figshare.com
    xls
    Updated Jun 8, 2023
    Cite
    Gustavo de los Campos; Ana I. Vazquez; Rohan Fernando; Yann C. Klimentidis; Daniel Sorensen (2023). Prediction R-squared evaluated in testing data sets (average over 30 randomly drawn testing data sets, each having 500 individuals) by training and validation data sets and model. [Dataset]. http://doi.org/10.1371/journal.pgen.1003608.t006
    Explore at:
    xls (available download formats)
    Dataset updated
    Jun 8, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Gustavo de los Campos; Ana I. Vazquez; Rohan Fernando; Yann C. Klimentidis; Daniel Sorensen
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    N-FHS = number of records from Framingham; N-GEN = number of records from GENEVA. G-BLUP uses 400 K SNPs; wG-BLUP also uses 400 K SNPs, but the contribution of each SNP to the genomic relationship matrix was weighted, with the weight a function of the SNP-associated p-value reported by [5].

  4. TC and DOD data for Zhu et al., 2024

    • purl.stanford.edu
    Updated May 11, 2024
    Cite
    Yuan Wang; Laiyin Zhu (2024). TC and DOD data for Zhu et al., 2024 [Dataset]. http://doi.org/10.25740/vh400jc1009
    Explore at:
    Dataset updated
    May 11, 2024
    Authors
    Yuan Wang; Laiyin Zhu
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    - Our dataset contains both the R data (April23_24.RData) and the Matlab data (NOV1_23.mat) that we created for the machine learning model and the plots included in the article.
    - All codes for model training and testing are shared under names such as "CodesForNoGeoModel". All R codes for the figures in the article are shared as "NewFig1-4".
    - Codes for the SI figures are also shared, such as "scatter_core_outer" and "SI_2".
    - The R libraries "caret", "xgboost", "gridExtra" and "cowplot" need to be installed and loaded before running the codes listed.

  5. Engine Ratng Prediction

    • kaggle.com
    zip
    Updated Feb 28, 2023
    Cite
    Ved Prakash (2023). Engine Ratng Prediction [Dataset]. https://www.kaggle.com/datasets/ved1104/engine-ratng-prediction
    Explore at:
    zip, 3540393 bytes (available download formats)
    Dataset updated
    Feb 28, 2023
    Authors
    Ved Prakash
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)

    Description

    Your task is to write a small Python or R script that predicts the engine rating from the inspection parameters using only the provided dataset. You need to find all the cases/outliers where the rating given is inconsistent with the current condition of the engine.

    This task is designed to test your Python or R ability, your knowledge of data science techniques, your ability to find trends and outliers and the relative importance of variables with respect to deviations in the target variable, and your ability to work effectively, efficiently, and independently within a commercial setting. It also tests your hyper-parameter tuning abilities and lateral thinking.

    Deliverables:
    - One Python or R script
    - One requirements text file with an exhaustive list of packages and version numbers used in your solution
    - A summary of your insights
    - A list of cases that are outliers/incorrectly rated as high or low, backed with analysis/reasons
    - Model object files for reproducibility

    Your solution should at a minimum do the following:
    - Load the data into memory
    - Prepare the data for modeling
    - EDA of the variables
    - Build a model on training data
    - Test the model on testing data
    - Provide some measure of performance
    - Outlier analysis and detection
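    One common way to tackle the misrating part is to fit a simple model of rating versus the inspection parameters and flag the cases whose residuals are unusually large. The sketch below is illustrative only: the helper name `flag_misrated` and the z-score cutoff are assumptions, not part of the task materials.

```python
import numpy as np

def flag_misrated(features, rating, z_cut=3.0):
    """Fit rating ~ features by least squares and return the indices whose
    residual z-score exceeds z_cut, i.e. likely incorrectly rated engines."""
    features = np.atleast_2d(np.asarray(features, dtype=float))
    if features.shape[0] != len(rating):     # accept a single 1-D feature vector
        features = features.T
    X = np.column_stack([np.ones(len(rating)), features])
    coef, *_ = np.linalg.lstsq(X, np.asarray(rating, dtype=float), rcond=None)
    resid = rating - X @ coef                # deviation from the fitted rating
    z = (resid - resid.mean()) / resid.std()
    return np.flatnonzero(np.abs(z) > z_cut)
```

    A real solution would use the actual inspection columns and a stronger model, but the residual-screening principle carries over directly.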

  6. Data from: Spaceborne GNSS-R for Sea Ice Classification Using Machine...

    • resodate.org
    Updated Dec 13, 2021
    Cite
    Yongchao Zhu; Tingye Tao; Jiangyang Li; Kegen Yu; Lei Wang; Xiaochuan Qu; Shuiping Li; Maximilian Semmling; Jens Wickert (2021). Spaceborne GNSS-R for Sea Ice Classification Using Machine Learning Classifiers [Dataset]. http://doi.org/10.14279/depositonce-12822
    Explore at:
    Dataset updated
    Dec 13, 2021
    Dataset provided by
    Technische Universität Berlin
    DepositOnce
    Authors
    Yongchao Zhu; Tingye Tao; Jiangyang Li; Kegen Yu; Lei Wang; Xiaochuan Qu; Shuiping Li; Maximilian Semmling; Jens Wickert
    Description

    The knowledge of Arctic sea ice coverage is of particular importance in studies of climate change. This study develops a new sea ice classification approach based on machine learning (ML) classifiers, analyzing spaceborne GNSS-R features derived from TechDemoSat-1 (TDS-1) data collected over open water (OW), first-year ice (FYI), and multi-year ice (MYI). A total of eight features extracted from GNSS-R observables collected over five months are used to classify OW, FYI, and MYI with random forest (RF) and support vector machine (SVM) classifiers in a two-step strategy. First, a randomly selected 30% of the samples of the whole dataset is used as a training set to build classifiers for discriminating OW from sea ice. Performance is evaluated on the remaining 70% of samples, validated against the sea ice type from the Special Sensor Microwave Imager Sounder (SSMIS) data provided by the Ocean and Sea Ice Satellite Application Facility (OSISAF). The overall accuracies of the RF and SVM classifiers are 98.83% and 98.60%, respectively, for distinguishing OW from sea ice. Then, the sea ice samples (FYI and MYI) are randomly split into training and test datasets, and the features of the training set are used to train FYI-MYI classifiers, which achieve overall accuracies of 84.82% (RF) and 71.71% (SVM). Finally, the features of each month are used as training and testing sets in turn to cross-validate the performance of the proposed classifiers. The results indicate the strong sensitivity of GNSS signals to sea ice types and the great potential of ML classifiers for GNSS-R applications.
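    The split-and-evaluate step of the strategy can be sketched in NumPy. Here a nearest-centroid classifier stands in for RF/SVM, and the two synthetic clusters are purely illustrative; only the 30%/70% split logic mirrors the abstract:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in features: two well-separated 8-feature clusters
X = np.vstack([rng.normal(0, 1, (200, 8)), rng.normal(4, 1, (200, 8))])
y = np.array([0] * 200 + [1] * 200)          # toy labels: 0 = OW, 1 = sea ice

# Random 30% training / 70% evaluation split, as in step one of the paper
idx = rng.permutation(len(y))
n_train = int(0.3 * len(y))
train, test = idx[:n_train], idx[n_train:]

# Nearest-centroid classifier as a simple stand-in for RF/SVM
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((X[test, None, :] - centroids) ** 2).sum(axis=2), axis=1)

accuracy = (pred == y[test]).mean()          # overall accuracy on the held-out 70%
```

    Swapping in real GNSS-R features and an RF or SVM fitter recovers the paper's first classification step.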

  7. Random forest regression model and prediction rasters of fluoride in...

    • data.usgs.gov
    • s.cnmilf.com
    • +1 more
    + more versions
    Cite
    Celia Rosecrans; Katherine Ransom; Olga Rodriguez, Random forest regression model and prediction rasters of fluoride in groundwater in basin-fill aquifers of western United States [Dataset]. http://doi.org/10.5066/P991L1ZR
    Explore at:
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Authors
    Celia Rosecrans; Katherine Ransom; Olga Rodriguez
    License

    U.S. Government Works (https://www.usa.gov/government-works)
    License information was derived automatically

    Time period covered
    Oct 1, 2000 - Jul 1, 2018
    Area covered
    Western United States, United States
    Description

    A random forest regression (RFR) model was developed to predict groundwater fluoride concentrations in four western United States principal aquifers: the California Coastal basin-fill aquifers, the Central Valley aquifer system, the Basin and Range basin-fill aquifers, and the Rio Grande aquifer system. The selected basin-fill aquifers are a vital resource for drinking-water supplies. The RFR model was developed with a dataset of over 12,000 wells sampled for fluoride between 2000 and 2018. This data release provides rasters of predicted fluoride concentrations at depths typical of domestic and public supply wells in the selected basin-fill aquifers and includes the final RFR model, which documents the prediction modeling process and verifies and reproduces the model fit metrics and mapped predictions in the accompanying publication. Included in this data release are 1) a model archive of the R project including source code, input files (model training and testing data and rasters of predictor ...

  8. UNSW Training and Testing Dataset

    • kaggle.com
    zip
    Updated Feb 8, 2024
    Cite
    Subhasri R A (2024). UNSW Training and Testing Dataset [Dataset]. https://www.kaggle.com/datasets/subhasrira/unsw-training-and-testing-dataset
    Explore at:
    zip, 12484064 bytes (available download formats)
    Dataset updated
    Feb 8, 2024
    Authors
    Subhasri R A
    Description

    Dataset

    This dataset was created by Subhasri R A


  9. Baseline characteristics of combined training, validation, and testing...

    • datasetcatalog.nlm.nih.gov
    Updated Apr 4, 2019
    Cite
    Fehlings, Michael G.; Wilson, Jefferson R.; Witiw, Christopher D.; Merali, Zamir G.; Badhiwala, Jetan H. (2019). Baseline characteristics of combined training, validation, and testing dataset. [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0000089421
    Explore at:
    Dataset updated
    Apr 4, 2019
    Authors
    Fehlings, Michael G.; Wilson, Jefferson R.; Witiw, Christopher D.; Merali, Zamir G.; Badhiwala, Jetan H.
    Description

    Baseline characteristics of combined training, validation, and testing dataset.

  10. SP500_data

    • kaggle.com
    zip
    Updated May 28, 2023
    Cite
    Franco Dicosola (2023). SP500_data [Dataset]. https://www.kaggle.com/datasets/francod/s-and-p-500-data
    Explore at:
    zip, 39005 bytes (available download formats)
    Dataset updated
    May 28, 2023
    Authors
    Franco Dicosola
    Description

    Project Documentation: Predicting the S&P 500 Price

    Problem Statement: The goal of this project is to develop a machine learning model that can predict the future price of the S&P 500 index based on historical data and relevant features. By accurately predicting price movements, we aim to assist investors and financial professionals in making informed decisions and managing their portfolios effectively.

    Dataset Description: The dataset contains historical data of the S&P 500 index, along with several other features such as dividends, earnings, consumer price index (CPI), interest rates, and more. The dataset spans a certain time period and includes daily values of these variables.

    Steps Taken:
    1. Data Preparation and Exploration:
    - Loaded the dataset and performed initial exploration.
    - Checked for missing values and handled them if any.
    - Explored the statistical summary and distributions of the variables.
    - Conducted correlation analysis to identify potential features for prediction.
    2. Data Visualization and Analysis:
    - Plotted time series graphs to visualize the S&P 500 index and other variables over time.
    - Examined the trends, seasonality, and residual behavior of the time series using decomposition techniques.
    - Analyzed the relationships between the S&P 500 index and other features using scatter plots and correlation matrices.
    3. Feature Engineering and Selection:
    - Selected relevant features based on correlation analysis and domain knowledge.
    - Explored feature importance using tree-based models and selected informative features.
    - Prepared the final feature set for model training.
    4. Model Training and Evaluation:
    - Split the dataset into training and testing sets.
    - Selected a regression model (linear regression) for price prediction.
    - Trained the model using the training set.
    - Evaluated the model's performance using mean squared error (MSE) and R-squared (R^2) metrics on both training and testing sets.
    5. Prediction and Interpretation:
    - Obtained predictions for future S&P 500 prices using the trained model.
    - Interpreted the predicted prices in the context of current market conditions and the percentage change from the current price.

    Limitations and Future Improvements:
    - The predictive performance of the model is based on the available features and historical data, and it may not capture all the complexities and factors influencing the S&P 500 index.
    - The model's accuracy and reliability are subject to the quality and representativeness of the training data.
    - The model assumes that the historical patterns and relationships observed in the data will continue in the future, which may not always hold true.
    - Future improvements could include incorporating additional relevant features, exploring different regression algorithms, and considering more sophisticated techniques such as time series forecasting models.
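    Step 4 of the workflow (train/test split, linear regression, MSE and R^2) can be sketched with ordinary least squares in NumPy. The synthetic series below merely stands in for the Kaggle columns:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in features and price (the real columns live in the Kaggle CSV)
earnings = rng.normal(100, 10, n)
cpi = rng.normal(250, 5, n)
price = 20 * earnings + 4 * cpi + rng.normal(0, 50, n)

# Chronological 80/20 train/test split and ordinary least squares fit
X = np.column_stack([np.ones(n), earnings, cpi])
train, test = slice(0, 240), slice(240, n)
coef, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)

# MSE and R^2 on the held-out test set
resid = price[test] - X[test] @ coef
mse = float(np.mean(resid ** 2))
r2 = 1.0 - np.sum(resid ** 2) / np.sum((price[test] - price[test].mean()) ** 2)
```

    A chronological split is used deliberately: shuffling a financial time series leaks future information into training.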

  11. Data from: Leveraging Supervised Machine Learning Algorithms for System...

    • acs.figshare.com
    zip
    Updated Sep 3, 2024
    Cite
    Russell R. Kibbe; Alexandria L. Sohn; David C. Muddiman (2024). Leveraging Supervised Machine Learning Algorithms for System Suitability Testing of Mass Spectrometry Imaging Platforms [Dataset]. http://doi.org/10.1021/acs.jproteome.4c00360.s001
    Explore at:
    zip (available download formats)
    Dataset updated
    Sep 3, 2024
    Dataset provided by
    ACS Publications
    Authors
    Russell R. Kibbe; Alexandria L. Sohn; David C. Muddiman
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    Quality control and system suitability testing are vital protocols implemented to ensure the repeatability and reproducibility of data in mass spectrometry investigations. However, mass spectrometry imaging (MSI) analyses present added complexity since both chemical and spatial information are measured. Herein, we employ various machine learning algorithms and a novel quality control mixture to classify the working conditions of an MSI platform. Each algorithm was evaluated in terms of its performance on unseen data, validated with negative control data sets to rule out confounding variables or chance agreement, and utilized to determine the necessary sample size to achieve a high level of accurate classifications. In this work, a robust machine learning workflow was established where models could accurately classify the instrument condition as clean or compromised based on data metrics extracted from the analyzed quality control sample. This work highlights the power of machine learning to recognize complex patterns in MSI data and use those relationships to perform a system suitability test for MSI platforms.

  12. Marmoset - train and test data - Vdataset - LDM

    • service.tib.eu
    Updated May 16, 2025
    + more versions
    Cite
    (2025). Marmoset - train and test data - Vdataset - LDM [Dataset]. https://service.tib.eu/ldmservice/dataset/goe-doi-10-25625-dyg3kv
    Explore at:
    Dataset updated
    May 16, 2025
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    Contains recordings and manual annotations of calls from pairs of male and female marmosets. Manual annotations were created by the original authors and manually corrected for training and testing DAS. Original data source for the recordings and the annotations: https://osf.io/q4bm3/ Original reference: Landman R, Sharma J, Hyman JB, Fanucci-Kiss A, Meisner O, Parmar S, Feng G, Desimone R. 2020. Close-range vocal interaction in the common marmoset (Callithrix jacchus). PLOS ONE 15:e0227392. doi:10.1371/journal.pone.0227392

  13. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Explore at:
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Andrzej Perzanowski; Tony Lindeberg; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object always centred in the frame. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k an integer in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
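    The scale tokens in these file names encode the nine factors 2^(k/4); they can be regenerated as follows (a convenience sketch, not part of the dataset):

```python
# Regenerate the nine scaling factors 2**(k/4), k = -4..4, and the
# corresponding "scteXpYYY" tokens used in the file names above
factors = [2 ** (k / 4) for k in range(-4, 5)]
tokens = ["scte" + f"{f:.3f}".replace(".", "p") for f in factors]
```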

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5", "r") as f:  # or any test file listed above
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', '/x_test');  % or any test file listed above

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  14. Codes in R for spatial statistics analysis, ecological response models and...

    • data.niaid.nih.gov
    • data-staging.niaid.nih.gov
    Updated Feb 6, 2023
    Cite
    Rössel-Ramírez, D. W.; Palacio-Núñez, J.; Espinosa, S.; Martínez-Montoya, J. F. (2023). Codes in R for spatial statistics analysis, ecological response models and spatial distribution models [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7603556
    Explore at:
    Dataset updated
    Feb 6, 2023
    Dataset provided by
    Facultad de Ciencias, Universidad Autónoma de San Luis Potosí. San Luis Potosí, S.L.P. México.
    Campus San Luis, Colegio de Postgraduados. Salinas de Hidalgo, S.L.P. México.
    Authors
    Rössel-Ramírez, D. W.; Palacio-Núñez, J.; Espinosa, S.; Martínez-Montoya, J. F.
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    In the last decade, a plethora of algorithms has been developed for spatial ecology studies. In our case, we use some of this code for underwater research in the applied ecological analysis of threatened endemic fishes and their natural habitat. For this, we developed scripts in the RStudio® environment to run spatial and statistical analyses for ecological response and spatial distribution models (e.g., Hijmans & Elith, 2017; Den Burg et al., 2020). The employed R packages are as follows: caret (Kuhn et al., 2020), corrplot (Wei & Simko, 2017), devtools (Wickham, 2015), dismo (Hijmans & Elith, 2017), gbm (Freund & Schapire, 1997; Friedman, 2002), ggplot2 (Wickham et al., 2019), lattice (Sarkar, 2008; Musa & Mansor, 2021), maptools (Hijmans & Elith, 2017), ModelMetrics (Hvitfeldt & Silge, 2021), pander (Wickham, 2015), plyr (Wickham & Wickham, 2015), pROC (Robin et al., 2011), raster (Hijmans & Elith, 2017), RColorBrewer (Neuwirth, 2014), Rcpp (Eddelbeuttel & Balamura, 2018), rgdal (Verzani, 2011), sdm (Naimi & Araujo, 2016), sf (e.g., Zainuddin, 2023), sp (Pebesma, 2020) and usethis (Gladstone, 2022).

    It is important to follow all the codes in order to obtain results from the ecological response and spatial distribution models. In particular, for the ecological scenario we selected the Generalized Linear Model (GLM), and for the geographic scenario we selected DOMAIN, also known as Gower's metric (Carpenter et al., 1993). We selected this regression method and this distance similarity metric because of their adequacy and robustness for studies with endemic or threatened species (e.g., Naoki et al., 2006). Next, we explain the statistical parameterization of the code underlying the GLM and DOMAIN runs:

    First, we generated the background points and extracted the values of the variables (Code2_Extract_values_DWp_SC.R). Barbet-Massin et al. (2012) recommend using 10,000 background points with regression methods (e.g., Generalized Linear Models) or distance-based models (e.g., DOMAIN). However, we consider factors such as the extent of the study area and the type of study species important for the correct choice of the number of points (pers. obs.). We then extracted the values of the predictor variables (e.g., bioclimatic, topographic, demographic, habitat) at the presence and background points (e.g., Hijmans and Elith, 2017).

    Subsequently, we subdivided both the presence and the background point groups into 75% training data and 25% test data, following the method of Soberón & Nakamura (2009) and Hijmans & Elith (2017). For training control, the 10-fold cross-validation method was selected, with the response variable presence assigned as a factor. If some other variable is important for the study species, it should also be assigned as a factor (Kim, 2009).
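The 75/25 subdivision described above can be sketched in a few lines. The repository itself uses R; this Python illustration (function and variable names are ours, not from the repository) just shows the mechanics of shuffling and splitting each group independently:

```python
import random

def split_75_25(points, seed=42):
    """Shuffle a list of points and split it into 75% training / 25% test."""
    rng = random.Random(seed)
    shuffled = points[:]              # copy; leave the caller's order intact
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.75)
    return shuffled[:cut], shuffled[cut:]

# The same split is applied independently to presences and backgrounds.
presences = [("presence", i) for i in range(100)]
backgrounds = [("background", i) for i in range(10_000)]

train_p, test_p = split_75_25(presences)
train_b, test_b = split_75_25(backgrounds)
print(len(train_p), len(test_p))   # 75 25
print(len(train_b), len(test_b))   # 7500 2500
```

Splitting presences and backgrounds separately keeps the presence/background ratio identical in the training and test sets.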

    After that, we ran the code for the GBM method (Gradient Boosting Machine; Code3_GBM_Relative_contribution.R and Code4_Relative_contribution.R), from which we obtained the relative contribution of the variables used in the model. We parameterized the code with a Gaussian distribution and 5,000 cross-validation iterations (e.g., Friedman, 2002; Kim, 2009; Hijmans and Elith, 2017). In addition, we selected a validation interval of 4 random training points (personal test). The plots obtained were the partial dependence blocks, as a function of each predictor variable.

    Subsequently, the correlation of the variables was assessed with Pearson's method (Code5_Pearson_Correlation.R) to evaluate multicollinearity between variables (Guisan & Hofer, 2003). It is recommended to use a bivariate correlation threshold of ±0.70 to discard highly correlated variables (e.g., Awan et al., 2021).

    Once the above code had been run, we loaded the same subgroups (i.e., presence and background groups with 75% training and 25% test data) (Code6_Presence&backgrounds.R) for the GLM method code (Code7_GLM_model.R). Here, we first ran the GLM models per variable to obtain the p-value of each variable (alpha ≤ 0.05); we selected the value one (i.e., presence) as the likelihood factor. The generated models are of polynomial degree, to obtain linear and quadratic responses (e.g., Fielding and Bell, 1997; Allouche et al., 2006). From these results, we ran ecological response curve models, where the resulting plots include the probability of occurrence and the values for continuous variables, or the categories for discrete variables. The points of the presence and background training groups are also included.

    A global GLM was also run, from which the generalized model is evaluated by means of a 2 x 2 contingency matrix including both observed and predicted records. A representation of this is shown in Table 1 (adapted from Allouche et al., 2006). In this process we selected an arbitrary threshold of 0.5 to obtain better modeling performance and avoid a high percentage of type I (omission) or type II (commission) errors (e.g., Carpenter et al., 1993; Fielding and Bell, 1997; Allouche et al., 2006; Kim, 2009; Hijmans and Elith, 2017).

    Table 1. Example of 2 x 2 contingency matrix for calculating performance metrics for GLM models. A represents true presence records (true positives), B represents false presence records (false positives - error of commission), C represents true background points (true negatives) and D represents false backgrounds (false negatives - errors of omission).

                   Validation set
    Model          True         False
    Presence       A            B
    Background     C            D

    We then calculated the Overall accuracy and True Skill Statistic (TSS) metrics. The first assesses the proportion of correctly predicted cases, while the second assesses the prevalence of correctly predicted cases (Olden and Jackson, 2002). The TSS also gives equal weight to the prevalence of presence predictions and to the correction for random performance (Fielding and Bell, 1997; Allouche et al., 2006).
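With the Table 1 notation (A = true presences, B = false presences, C = true backgrounds, D = false backgrounds), both metrics reduce to simple ratios. A Python sketch with invented counts for illustration (the original analysis is done in R):

```python
def overall_accuracy(a, b, c, d):
    """(A + C) / total: proportion of correctly predicted cases."""
    return (a + c) / (a + b + c + d)

def tss(a, b, c, d):
    """True Skill Statistic = sensitivity + specificity - 1
    (Allouche et al., 2006), using the Table 1 cell labels."""
    sensitivity = a / (a + d)   # correctly predicted presences
    specificity = c / (c + b)   # correctly predicted backgrounds
    return sensitivity + specificity - 1

# Hypothetical counts from a 0.5-threshold confusion matrix:
a, b, c, d = 80, 10, 90, 20
print(round(overall_accuracy(a, b, c, d), 3))  # 0.85
print(round(tss(a, b, c, d), 3))               # 0.7
```

TSS ranges from -1 to +1, with 0 meaning no better than random, which is why it is preferred over overall accuracy when presences and backgrounds are imbalanced.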

    The last code (i.e., Code8_DOMAIN_SuitHab_model.R) is for species distribution modelling using the DOMAIN algorithm (Carpenter et al., 1993). Here, we loaded the variable stack and the presence and background groups, each subdivided into 75% training and 25% test data. Only the presence training subset and the predictor variable stack were included in the calculation of the DOMAIN metric, as well as in the evaluation and validation of the model.

    Regarding the model evaluation and estimation, we selected the following estimators:

    1) Partial ROC, which evaluates the separation between the curves of positive (i.e., correctly predicted presence) and negative (i.e., correctly predicted absence) cases. The farther apart these curves are, the better the model's prediction performance for the correct spatial distribution of the species (Manzanilla-Quiñones, 2020).

    2) ROC/AUC curve for model validation, where an optimal performance threshold is estimated so as to have an expected confidence of 75% to 99% probability (DeLong et al., 1988).
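The AUC behind this validation step has a useful rank interpretation: it is the probability that a randomly chosen positive case receives a higher suitability score than a randomly chosen negative one. A small Python sketch with invented scores (the study itself computes this in R, e.g. via pROC):

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a
    random negative (Mann-Whitney form; ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical suitability scores: positives mostly above negatives.
pos = [0.9, 0.8, 0.75, 0.6]
neg = [0.7, 0.4, 0.3, 0.2]
print(auc(pos, neg))  # 0.9375
```

An AUC of 0.5 corresponds to overlapping positive and negative curves (no discrimination); values near 1 correspond to the well-separated curves the partial ROC criterion looks for.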

  15. A Consensus of In-silico Sequence-based Modeling Techniques for...

    • narcis.nl
    • data.mendeley.com
    Updated Nov 1, 2020
    + more versions
    Cite
    Mall, R (via Mendeley Data) (2020). A Consensus of In-silico Sequence-based Modeling Techniques for Compound-Viral Protein Activity Prediction for SARS-COV-2 [Dataset]. http://doi.org/10.17632/8rrwnbcgmx.1
    Explore at:
    Dataset updated
    Nov 1, 2020
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Mall, R (via Mendeley Data)
    Description

    Here we provide the datasets used for training and testing of the end-to-end supervised deep learning models as well as the datasets used with vector representations of compounds and proteins and passed to supervised state-of-the-art machine learning models (XGBoost, RF, SVM). We also provide the full list of viral proteins with their sequences used for the protein autoencoder along with the list of SMILES representations of compounds used for the compound autoencoder.

  16. Atlas of virus specific CD8+ Tcells

    • zenodo.org
    bin, csv, html, txt +1
    Updated Oct 20, 2023
    Cite
    Florian Schmidt; Florian Schmidt (2023). Atlas of virus specific CD8+ Tcells [Dataset]. http://doi.org/10.5281/zenodo.8330231
    Explore at:
    Available download formats: csv, zip, bin, html, txt
    Dataset updated
    Oct 20, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Florian Schmidt; Florian Schmidt
    Description

    Code and Data accompanying the manuscript " "

    TS logistic regression model:

    • TargetScape_Model_FCS_Input.zip contains the TS data in pickled format to be imported in python
    • targetscape-model-paper-20221001.html contains the python code to perform data analysis, model training, testing and interpretation

    TAP logistic regression model:

    • Generate_Pseudobulk_Markers_ADT_TAP_Cohort.R to generate candidate features by pseudo-bulking surface markers across donors.
    • TAP_ParameterBenchmarking_Model_Training_Evaluation.R to perform model training and parameter benchmark.
    • TAP_DiscoveryCohort_Epitopes.rds file containing a Seurat object of the TAP-Cohort.
    • Cohort_Pseudobulk_ADT_Virus_markers.RDS RDS file containing the candidate markers and their fold-changes/p-values, generated by Generate_Pseudobulk_Markers_ADT_TAP_Cohort.R
    • Leave-one-epitope-out-TAP.R: R script to train a model using one epitope as a validation set.

    Figures related to Machine Learning approaches:

    • Model_Analysis_Figure_Table_Generation.R script to generate Figures 3 and Figure 4 of the main manuscript as well as corresponding Supplementary Figures
    • TargetScapeModel_Feature_Importance.csv contains the regression coefficients of the model learned on TargetScape (TS) data.
    • TargetScape_Cohort_Logicle.csv contains the actual data of the TS cohort after logicle transformation.
    • TargetScape_Validation_Cohort_Logicle.csv contains the actual data of the TS validation cohort after logicle transformation
    • TargetScape_Model_Predictions.csv contains predictions of the TS model made on the TS validation cohort
    • Performance_Benchmarking_TAP.zip contains all parameter benchmarking performance files generated by the TAP model above while executing TAP_ParameterBenchmarking_Model_Training_Evaluation.R.
    • Performance_Selected_Model.txt contains model performance information on the selected TAP model
    • TAP_Model_MultinomialElaNetLogRegModel_minCellsPerPatient_30_minCellsPerVirus_200_ResourcePaper.RDS contains the best model trained on TAP data.
    • TAP-Validation.RDS contains the Seurat object of the TAP validation cohort.
  17. Additional Tennessee Eastman Process Simulation Data for Anomaly Detection...

    • dataverse.harvard.edu
    • dataone.org
    Updated Jul 6, 2017
    + more versions
    Cite
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook (2017). Additional Tennessee Eastman Process Simulation Data for Anomaly Detection Evaluation [Dataset]. http://doi.org/10.7910/DVN/6C3JR1
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Jul 6, 2017
    Dataset provided by
    Harvard Dataverse
    Authors
    Cory A. Rieth; Ben D. Amsel; Randy Tran; Maia B. Cook
    License

    https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0/customlicense?persistentId=doi:10.7910/DVN/6C3JR1

    Description

    User Agreement, Public Domain Dedication, and Disclaimer of Liability. By accessing or downloading the data or work provided here, you, the User, agree that you have read this agreement in full and agree to its terms. The person who owns, created, or contributed a work to the data or work provided here dedicated the work to the public domain and has waived his or her rights to the work worldwide under copyright law. You can copy, modify, distribute, and perform the work, for any lawful purpose, without asking permission. In no way are the patent or trademark rights of any person affected by this agreement, nor are the rights that any other person may have in the work or in how the work is used, such as publicity or privacy rights. Pacific Science & Engineering Group, Inc., its agents and assigns, make no warranties about the work and disclaim all liability for all uses of the work, to the fullest extent permitted by law. When you use or cite the work, you shall not imply endorsement by Pacific Science & Engineering Group, Inc., its agents or assigns, or by another author or affirmer of the work. This Agreement may be amended, and the use of the data or work shall be governed by the terms of the Agreement at the time that you access or download the data or work from this Website.

    Description. This dataverse contains the data referenced in Rieth et al. (2017), "Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems," to be presented at Applied Human Factors and Ergonomics 2017. Each .RData file is an external representation of an R dataframe that can be read into an R environment with the 'load' function. The variables loaded are named ‘fault_free_training’, ‘fault_free_testing’, ‘faulty_testing’, and ‘faulty_training’, corresponding to the RData files. Each dataframe contains 55 columns:

    • Column 1 ('faultNumber') ranges from 1 to 20 in the “Faulty” datasets and represents the fault type in the TEP. The “FaultFree” datasets only contain fault 0 (i.e., normal operating conditions).
    • Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (note: the actual seeds used to generate the training and testing datasets were non-overlapping).
    • Column 3 ('sample') ranges from 1 to 500 (“Training” datasets) or from 1 to 960 (“Testing” datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes, for total durations of 25 hours and 48 hours, respectively. Note that the faults were introduced 1 and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.
    • Columns 4 to 55 contain the process variables; the column names retain the original variable names.

    Acknowledgments. This work was sponsored by the Office of Naval Research, Human & Bioengineered Systems (ONR 341), program officer Dr. Jeffrey G. Morrison, under contract N00014-15-C-5003. The views expressed are those of the authors and do not reflect the official policy or position of the Office of Naval Research, the Department of Defense, or the US Government.
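The quoted dimensions are internally consistent with the 3-minute sampling interval, which is easy to verify; a short Python check (the constant and function names are ours, not part of the dataset):

```python
SAMPLE_MINUTES = 3

def n_samples(hours):
    """Number of 3-minute samples in a run of the given duration."""
    return hours * 60 // SAMPLE_MINUTES

print(n_samples(25))  # 500 -> 'sample' range in the Training datasets
print(n_samples(48))  # 960 -> 'sample' range in the Testing datasets
print(n_samples(1), n_samples(8))  # 20 160 -> samples after which faults begin
```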

  18. vfillDL: A geomorphology deep learning dataset of valley fill faces...

    • figshare.com
    bin
    Updated Mar 22, 2023
    Cite
    Aaron Maxwell (2023). vfillDL: A geomorphology deep learning dataset of valley fill faces resulting from mountaintop removal coal mining (southern West Virginia, eastern Kentucky, and southwestern Virginia, USA) [Dataset]. http://doi.org/10.6084/m9.figshare.22318522.v2
    Explore at:
    Available download formats: bin
    Dataset updated
    Mar 22, 2023
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Aaron Maxwell
    License

    Attribution 4.0 International (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Southwest Virginia, Southern West Virginia, West Virginia, United States
    Description

    scripts.zip

    arcgisTools.atbx:

    • terrainDerivatives: make terrain derivatives from a digital terrain model (Band 1 = TPI (50 m radius circle), Band 2 = square root of slope, Band 3 = TPI (annulus), Band 4 = hillshade, Band 5 = multidirectional hillshades, Band 6 = slopeshade).
    • rasterizeFeatures: convert vector polygons to raster masks (1 = feature, 0 = background).

    • makeChips.R: R function to break terrain derivatives into image chips of a defined size.
    • makeTerrainDerivatives.R: R function to generate 6-band terrain derivatives from digital terrain data (same as the ArcGIS Pro tool).
    • merge_logs.R: R script to merge training logs into a single file.
    • predictToExtents.ipynb: Python notebook that uses a trained model to predict to new data.
    • trainExperiments.ipynb: Python notebook used to train semantic segmentation models using PyTorch and the Segmentation Models package.
    • assessmentExperiments.ipynb: Python code to generate assessment metrics using PyTorch and the torchmetrics library.
    • graphs_results.R: R code to make graphs with ggplot2 to summarize results.
    • makeChipsList.R: R code to generate lists of chips in a directory.
    • makeMasks.R: R function to make raster masks from vector data (same as the rasterizeFeatures ArcGIS Pro tool).
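The chipping performed by makeChips.R can be sketched language-agnostically. This Python illustration (function and variable names are ours; the repository does this in R on multi-band rasters) splits a toy 2-D grid into non-overlapping square chips of a defined size:

```python
def make_chips(array, chip_size):
    """Split a 2-D grid (list of rows) into non-overlapping square chips,
    discarding partial chips at the right/bottom edges."""
    rows, cols = len(array), len(array[0])
    chips = []
    for r in range(0, rows - chip_size + 1, chip_size):
        for c in range(0, cols - chip_size + 1, chip_size):
            chips.append([row[c:c + chip_size] for row in array[r:r + chip_size]])
    return chips

# A 6x6 toy grid split into 3x3 chips yields 4 chips.
grid = [[r * 6 + c for c in range(6)] for r in range(6)]
chips = make_chips(grid, 3)
print(len(chips), len(chips[0]), len(chips[0][0]))  # 4 3 3
```

The same grid of offsets is applied to the derivative bands and to the label masks, so each image chip stays aligned with its mask chip.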

    vfillDL.zip

    • dems: LiDAR DTM data partitioned into training, three testing, and two validation datasets. Original DTM data were obtained from 3DEP (https://www.usgs.gov/3d-elevation-program) and the WV GIS Technical Center (https://wvgis.wvu.edu/).
    • extents: extents of the training, testing, and validation areas, as defined by the researchers.
    • vectors: vector features representing valley fills, partitioned into separate training, testing, and validation datasets. Extents were created by the researchers.

  19. Galaxy, star, quasar dataset

    • scidb.cn
    Updated Feb 3, 2023
    Cite
    Li Xin (2023). Galaxy, star, quasar dataset [Dataset]. http://doi.org/10.57760/sciencedb.07177
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets. Learn more about this at mlcommons.org/croissant.
    Dataset updated
    Feb 3, 2023
    Dataset provided by
    Science Data Bank
    Authors
    Li Xin
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    The data used in this paper are from the 16th data release of SDSS (SDSS-DR16), which contains a total of 930,268 photometric images, with 1.2 billion observed sources and tens of millions of spectra. The data were downloaded from the official SDSS website; specifically, they were obtained through the SkyServer API by running SQL queries in the CasJobs sub-site. Because the current SDSS photometric table PhotoObj can only classify observed sources as point sources or extended sources, the target sources can be better classified as galaxies, stars and quasars through their spectra. We therefore obtained calibrated sources in CasJobs by cross-matching SpecPhoto with the PhotoObj catalog, together with the target position information (right ascension and declination). Calibrated sources can be told apart precisely and quickly: each calibrated source is labeled by the parameter "Class" as "galaxy", "star" or "quasar". In this paper, four sky areas in SDSS-DR16, including observation areas 3462, 3478 and 3530, were selected as experimental data, because a large number of sources can be obtained in these areas, providing rich sample data for the experiment. For example, there are 9,891 sources in area 3462, including 2,790 galaxy sources, 2,378 stellar sources and 4,723 quasar sources; there are 3,862 sources in area 3478, including 1,759 galaxy sources, 577 stellar sources and 1,526 quasar sources. FITS files are a commonly used data format in the astronomical community. By cross-matching the catalog and the FITS files in the local sky region, we obtained images in the five bands u, g, r, i and z for 12,499 galaxy sources, 16,914 quasar sources and 16,908 star sources as training and testing data.

    1.1 Image Synthesis

    SDSS photometric data include photometric images in five bands (u, g, r, i and z), packaged per band in FITS files. Images in different bands contain different information. Since the g, r and i bands contain more feature information and less noise, astronomical researchers typically map them to the R, G and B channels of an image to synthesize photometric images. In general, different bands cannot be synthesized directly: if three bands are combined naively, the images in the different bands may not be aligned. Therefore, this paper adopts the RGB multi-band image synthesis software written by He Zhendong et al. to synthesize images from the g, r and i bands, which effectively avoids the alignment problem. Each photometric image in this paper is 2048×1489 pixels.

    1.2 Data tailoring

    We first cropped the target images; this can be handled with image segmentation tools, and this paper implements the process in Python. During cropping, we convert the right ascension and declination of each source in the catalog into pixel coordinates on the photometric image through the coordinate conversion formula, and determine the specific position of the source through these pixel coordinates. The coordinates are taken as the center point, and cropping is carried out as a rectangular box. We found that the input image size affects the experimental results. Therefore, according to the target sizes of the sources, we tested three different cutout sizes: 40×40, 60×60 and 80×80. Through experiments and analysis, we found that the convolutional neural network has better learning ability and higher accuracy on data with a small image size. In the end, we chose to cut the extended-source galaxies and the point-source quasars and stars into 40×40 cutouts.

    1.3 Division of training and test data

    In order for the algorithm to achieve more accurate recognition performance, we need enough image samples. The selection of the training, validation and test sets is an important factor affecting the final recognition accuracy. In this paper, the training, validation and test sets are split in the ratio 8:1:1. The validation set is used to revise the algorithm, and the test set is used to evaluate the generalization ability of the final algorithm. Table 1 shows the specific data partitioning information. The total sample size is 34,000 source images, including 11,543 galaxy sources, 11,967 star sources and 10,490 quasar sources.

    1.4 Data preprocessing

    In this experiment, the training and test sets can be used as the training and test inputs of the algorithm only after data preprocessing. Data quantity and quality largely determine the recognition performance of the algorithm. Preprocessing differs between the training and test sets. In the training set, we first apply vertical flips, horizontal flips and scaling to the cropped images to enrich the data samples and enhance the generalization ability of the algorithm; since the features of celestial sources are flip-invariant, the labels of galaxies, stars and quasars do not change after these transformations. In the test set, preprocessing is comparatively simple: we apply simple scaling to the input images and use the result as test input.
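Assuming the RA/Dec-to-pixel conversion has already been done, the cutout step in section 1.2 can be sketched as a slice around a center pixel. The paper does its cropping in Python; this particular function is our illustration, not the authors' code:

```python
def crop_around(image, cx, cy, size):
    """Cut a size x size box centred on pixel (cx, cy) from a 2-D image
    (list of rows); the box is clipped at the image borders."""
    half = size // 2
    top, left = max(cy - half, 0), max(cx - half, 0)
    return [row[left:left + size] for row in image[top:top + size]]

# 100x100 toy image; a 40x40 cutout around a source at pixel (50, 50).
image = [[0] * 100 for _ in range(100)]
cutout = crop_around(image, 50, 50, 40)
print(len(cutout), len(cutout[0]))  # 40 40
```

On the real 2048×1489 photometric images the same slicing applies per band, so the g, r and i cutouts of one source stay pixel-aligned.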

  20. Fan Zhang, Mariana Afonso, David R. Bull (2024). Dataset: JVET Common Test...

    • service.tib.eu
    Updated Dec 3, 2024
    + more versions
    Cite
    (2024). Fan Zhang, Mariana Afonso, David R. Bull (2024). Dataset: JVET Common Test Conditions. https://doi.org/10.57702/24smssy2 [Dataset]. https://service.tib.eu/ldmservice/dataset/jvet-common-test-conditions
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    The dataset used in the paper is a collection of video sequences with varying resolutions and bit depths, used for training and testing the proposed video compression framework.

A Complete Aerosol Optical Depth Dataset with High Spatiotemporal Resolution for Mainland China

With almost no difference between training metrics and test metrics, the high test R^2 and low test RMSE show the reliability of AOD imputation. In the evaluation using the ground AOD data from the monitoring stations of the Aerosol Robotic Network (AERONET) in mainland China, our method obtained an R^2 of 0.78 and an RMSE of 0.27, which further illustrates the reliability of the method. This database contains four datasets:

- Daily complete high-resolution AOD image dataset for mainland China from January 1, 2015 to December 31, 2018. The archived resources contain 1461 images stored in 1461 files, plus 3 summary Excel files.
- The table "CHN_AOD_INFO.xlsx", describing the properties of the 1461 images, including projection, training R^2 and RMSE, testing R^2 and RMSE, and the minimum, mean, median and maximum AOD that we predicted.
- The table "Model_and_Accuracy_of_Meteorological_Elements.xlsx", describing the statistics of performance metrics in the interpolation of the high-resolution meteorological dataset.
- The table "Evaluation_Using_AERONET_AOD.xlsx", showing the evaluation results against AERONET, including R^2, RMSE, and the monitoring information used in this study.
