The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. To ensure that all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor 2^(k/4), with k an integer in the range [-4, 4] (a short snippet after the file list reproduces the resulting factors):
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
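For reference, a short Python snippet (not part of the dataset distribution) that reproduces the nine scaling factors encoded in the scte tags of the file names above:

# Scaling factors 2^(k/4) for k = -4, ..., 4, matching the scte tags above.
factors = [2 ** (k / 4) for k in range(-4, 5)]
print([f"{s:.3f}" for s in factors])
# ['0.500', '0.595', '0.707', '0.841', '1.000', '1.189', '1.414', '1.682', '2.000']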
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
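The arrays can then be wrapped into PyTorch data loaders; the following is a minimal sketch (not part of the official dataset description), which also scales the unnormalised pixel values from [0, 255] to [0, 1]:

import torch
from torch.utils.data import TensorDataset, DataLoader

# Wrap the arrays loaded above into PyTorch datasets; the batch size is arbitrary.
train_ds = TensorDataset(torch.from_numpy(x_train / 255.0).float(),
                         torch.from_numpy(y_train).long())
val_ds = TensorDataset(torch.from_numpy(x_val / 255.0).float(),
                       torch.from_numpy(y_val).long())
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=128, shuffle=False)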
The test datasets can be loaded in Python as:
with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5", "r") as f:  # pick the file for the desired test scale
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5', '/x_test');
y_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5', '/y_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
The following fruits, vegetables and nuts are included: Apples (different varieties: Crimson Snow, Golden, Golden-Red, Granny Smith, Pink Lady, Red, Red Delicious), Apricot, Avocado, Avocado ripe, Banana (Yellow, Red, Lady Finger), Beans, Beetroot Red, Blackberry, Blueberry, Cabbage, Caju seed, Cactus fruit, Cantaloupe (2 varieties), Carambula, Carrot, Cauliflower, Cherimoya, Cherry (different varieties, Rainier), Cherry Wax (Yellow, Red, Black), Chestnut, Clementine, Cocos, Corn (with husk), Cucumber (ripened, regular), Dates, Eggplant, Fig, Ginger Root, Goosberry, Granadilla, Grape (Blue, Pink, White (different varieties)), Grapefruit (Pink, White), Guava, Hazelnut, Huckleberry, Kiwi, Kaki, Kohlrabi, Kumsquats, Lemon (normal, Meyer), Lime, Lychee, Mandarine, Mango (Green, Red), Mangostan, Maracuja, Melon Piel de Sapo, Mulberry, Nectarine (Regular, Flat), Nut (Forest, Pecan), Onion (Red, White), Orange, Papaya, Passion fruit, Peach (different varieties), Pepino, Pear (different varieties, Abate, Forelle, Kaiser, Monster, Red, Stone, Williams), Pepper (Red, Green, Orange, Yellow), Physalis (normal, with Husk), Pineapple (normal, Mini), Pistachio, Pitahaya Red, Plum (different varieties), Pomegranate, Pomelo Sweetie, Potato (Red, Sweet, White), Quince, Rambutan, Raspberry, Redcurrant, Salak, Strawberry (normal, Wedge), Tamarillo, Tangelo, Tomato (different varieties, Maroon, Cherry Red, Yellow, not ripened, Heart), Walnut, Watermelon, Zucchini (green and dark).
The dataset has 5 major branches:
-The 100x100 branch, where all images have 100x100 pixels. See _fruits-360_100x100_ folder.
-The original-size branch, where all images are at their original (captured) size. See _fruits-360_original-size_ folder.
-The meta branch, which contains additional information about the objects in the Fruits-360 dataset. See _fruits-360_dataset_meta_ folder.
-The multi branch, which contains images with multiple fruits, vegetables, nuts and seeds. These images are not labeled. See _fruits-360_multi_ folder.
-The _3_body_problem_ branch, where the Training and Test folders contain different varieties of the 3 fruits and vegetables (Apples, Cherries and Tomatoes). See _fruits-360_3-body-problem_ folder.
Mihai Oltean, Fruits-360 dataset, 2017-
100x100 branch:
- Total number of images: 138704.
- Training set size: 103993 images.
- Test set size: 34711 images.
- Number of classes: 206 (fruits, vegetables, nuts and seeds).
- Image size: 100x100 pixels.

Original-size branch:
- Total number of images: 58363.
- Training set size: 29222 images.
- Validation set size: 14614 images.
- Test set size: 14527 images.
- Number of classes: 90 (fruits, vegetables, nuts and seeds).
- Image size: various (original captured size).

3-body-problem branch:
- Total number of images: 47033.
- Training set size: 34800 images.
- Test set size: 12233 images.
- Number of classes: 3 (Apples, Cherries, Tomatoes).
- Number of varieties: Apples = 29; Cherries = 12; Tomatoes = 19.
- Image size: 100x100 pixels.
Number of classes: 26 (fruits, vegetables, nuts and seeds).
Number of images: 150.
Image filename format (100x100 branch):
image_index_100.jpg (e.g. 31_100.jpg) or
r_image_index_100.jpg (e.g. r_31_100.jpg) or
r?_image_index_100.jpg (e.g. r2_31_100.jpg)
where "r" stands for rotated fruit, "r2" means that the fruit was rotated around the 3rd axis, and "100" comes from the image size (100x100 pixels).
Different varieties of the same fruit (apple, for instance) are stored as belonging to different classes.
Image filename format (original-size branch):
r?_image_index.jpg (e.g. r2_31.jpg)
where "r" stands for rotated fruit and "r2" means that the fruit was rotated around the 3rd axis.
The name of the image files in the new version does NOT contain the "_100" suffix anymore. This will help you to make the distinction between the original-size branch and the 100x100 branch.
In the multi branch, the file name is the concatenation of the names of the fruits inside that picture.
The Fruits-360 dataset can be downloaded from:
Kaggle https://www.kaggle.com/moltean/fruits
GitHub https://github.com/fruits-360
Fruits and vegetables were planted in the shaft of a low-speed motor (3 rpm) and a short movie of 20 seconds was recorded.
A Logitech C920 camera was used for filming the fruits. This is one of the best webcams available.
Behind the fruits, we placed a white sheet of paper as a background.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
This dataset was created to help researchers gain a deep understanding of human-generated movie reviews. It offers insights into aspects such as sentiment labels and the underlying rationales for these reviews. By analysing the information within, users can discover patterns and correlations that are invaluable for developing models capable of uncovering the importance of unique human perspectives in interpreting movie reviews. It aims to provide useful insights into better understanding user intent when reviewing movies.
The dataset is provided in CSV format and includes distinct train, test, and validation sets. Specific numbers for rows or records are not explicitly available within the provided information. However, the 'evidences' column within the test set contains 199 unique values, and the 'label' column also has 199 unique values, indicating the scale of some of the contained data.
This dataset is ideal for various applications, including: * Analysing human-generated movie reviews, their sentiments, and the rationales behind them. * Developing advanced models to interpret human perspectives and user intent in movie reviews. * Natural Language Processing (NLP) tasks and other Artificial Intelligence (AI) applications. * Building an automated movie review summariser based on user ratings. * Predicting review sentiment by combining machine learning models with human-annotated rationales. * Creating AI systems to detect linguistic markers of deception in reviews. * Developing simple machine learning recommendation systems.
To use the dataset, one needs a suitable working environment, such as Python or R, with access to NLP libraries. The recommended steps involve importing the CSV files, preprocessing text data in the 'review' and 'label' columns, training and testing machine learning algorithms using feature extraction techniques like Bag Of Words, TF-IDF, or Word2Vec, and then measuring performance accuracy.
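As a rough illustration of this workflow, here is a minimal scikit-learn sketch; the file names and the 'review'/'label' column names are assumptions based on the description above and may need to be adapted to the actual CSV schema:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumed file and column names; adjust to the dataset's actual layout.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X_train = vectorizer.fit_transform(train["review"].astype(str))
X_test = vectorizer.transform(test["review"].astype(str))

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])
print("accuracy:", accuracy_score(test["label"], clf.predict(X_test)))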
The dataset's regional scope is Global. No specific information regarding time range or demographic scope is detailed in the available sources.
CC0.
This dataset is particularly useful for: * Data scientists seeking to explore and analyse movie review data. * Researchers interested in AI applications, machine learning, and understanding human behaviour in online reviews. * Developers looking to build or enhance systems related to sentiment analysis, recommendation, or text summarisation. * Anyone aiming to gain insights into human perspectives when interpreting movie reviews.
Original Data Source: Movie Rationales (Rationales For Movie Reviews)
Attribution-NonCommercial-NoDerivs 4.0 (CC BY-NC-ND 4.0): https://creativecommons.org/licenses/by-nc-nd/4.0/
“I remember that, though of humble origin, the sea was always the living pantry. The memories of my uncles spear-fishing in the waters off Inch Marlowe are fond memories. Unfortunately, they are just that, memories!
My children love visiting Barbados. However, their ancestral waters do not have the abundance of life I recalled. They cannot live the childhood that I had and that saddens me.”
S. Antonio Hollingsworth, Founder BDCI Barbados
This dataset was created to give Caribbean developers in the field of artificial intelligence and machine learning a head start in training the next generation of A.I. and machine learning applications. We believe that to meet the challenges of reef collapse due to human activity, artificial intelligence will give small island developing states the edge needed to remain competitive and survive in a rapidly changing world.
This dataset contains image data of target fish species. It is categorical in nature and is intended for use in computer vision.
This dataset contains images of fish in different natural positions, lighting and water conditions.
The fish are presented in their natural environment.
Some images may contain more than one member of the target species, or another species that, while not dominant, may influence the training process.
Data collection period: August - November 2020. Data collection location: Barbados. General data coordinates: 13.1939° N, 59.5432° W. Data collection depth range: 0 m to 5 m. Data collection climate: tropical, marine, sea. Average water temperature: 29 °C.
Data collector: S. Antonio Hollingsworth. Camera used: BW Space Pro 4K Zoom. Platform: underwater robot.
Thanks to:
The UNDP Accelerator Labs for Barbados & the Eastern Caribbean for funding The Blue-Bot Project.
Stacy R. Phillips for project proposal presentations.
S. Antonio Hollingsworth for piloting the remote underwater robot and curating the images of this dataset.
Youcan Robotics for their technical and customer support.
Those dear to us who inspire us to dream of a better tomorrow.
tensorflow.org: MobileNet V2 pre-trained model used in the transfer learning process of BlueNet (a minimal transfer-learning sketch follows this list)
python.org
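For readers curious about the transfer-learning setup mentioned above, here is a minimal, hypothetical Keras sketch using a frozen MobileNet V2 backbone; the directory layout and all hyperparameters are assumptions for illustration, not the actual BlueNet configuration:

import tensorflow as tf

# Hypothetical folder layout: one subdirectory per target fish species.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "fish_images/train", image_size=(224, 224), batch_size=32)
num_classes = len(train_ds.class_names)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze the pre-trained backbone

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects inputs in [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_ds, epochs=5)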
How can we improve the data collection process in the blue economy?
What is the best way to use A.I. in the blue economy?
Can we use computer vision and artificial intelligence to find and learn the complex patterns that exist on coral reefs?
How do we use this insight to create effective and long term conservation and resilience policies for small island developing states that depend on coral reefs for economic survival?
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
A novel methodology was developed to build Free-Wilson like local QSAR models by combining R-group signatures and the SVM algorithm. Unlike Free-Wilson analysis this method is able to make predictions for compounds with R-groups not present in a training set. Eleven public data sets were chosen as test cases for comparing the performance of our new method with several other traditional modeling strategies, including Free-Wilson analysis. Our results show that the R-group signature SVM models achieve better prediction accuracy compared with Free-Wilson analysis in general. Moreover, the predictions of R-group signature models are also comparable to the models using ECFP6 fingerprints and signatures for the whole compound. Most importantly, R-group contributions to the SVM model can be obtained by calculating the gradient for R-group signatures. For most of the studied data sets, a significant correlation with that of a corresponding Free-Wilson analysis is shown. These results suggest that the R-group contribution can be used to interpret bioactivity data and highlight that the R-group signature based SVM modeling method is as interpretable as Free-Wilson analysis. Hence the signature SVM model can be a useful modeling tool for any drug discovery project.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
Sustainable application of nuclear energy requires efficient sequestration of actinides, which relies on extensive understanding of actinide–ligand interactions to guide rational design of ligands. Currently, the design of novel ligands adopts mainly the time-consuming and labor-intensive trial-and-error strategy and is impeded by the heavy-metal toxicity and radioactivity of actinides. The advancement of machine learning techniques brings new opportunities given a sensible choice of appropriate descriptors. In this study, by using the binding equilibrium constant (log K1) to represent the binding affinity of ligand with metal ion, 14 typical algorithms were used to train machine learning models toward accurate predictions of log K1 between actinide ions and ligands, among which the Gradient Boosting model outperforms the others, and the most relevant 15 out of the 282 descriptors of ligands, metals, and solvents were identified, encompassing key physicochemical properties of ligands, solvents, and metals. The Gradient Boosting model achieved R2 values of 0.98 and 0.93 on the training and test sets, respectively, showing its ability to establish qualitative correlations between the features and log K1 for accurate prediction of log K1 values. The impact of these properties on log K1 values was discussed, and a quantitative correlation was derived using the SISSO model. The model was then applied to eight recently reported ligands for Am3+, Cm3+, and Th4+ outside of the training set, and the predicted values agreed with the experimental ones. This study enriches the understanding of the fundamental properties of actinide–ligand interactions and demonstrates the feasibility of machine-learning-assisted discovery and design of ligands for actinides.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
# README
These files contain R data objects and R files that represent the key details of the paper, "Evaluating health facility access using Bayesian spatial models and location analysis methods".
The following datasources are available for simulation of some of the ideas in the paper.
- dat_grid_sim: simulated data of the grid and grid cells
- dat_ohca_cv_sim: simulated data containing the cross validated test/training sets of OHCA data
- dat_ohca_sim: simulated OHCA event data
- dat_aed_sim: simulated AED location data
- dat_bldg_sim: simulated building location data
- dat_municipality_sim: simulated municipality information
- table_1: Table 1 information containing key demographic data
These data were produced using the code in 01-create-sim-data.R, and one of the statistical models is demonstrated in 02-demo-inla-model.R
In terms of the paper itself, the functions and code used in the manuscript are located in:
* 01_tidy.Rmd - analysis code used to tidy up the data
* 02_fit_fixed_all_cv.Rmd - analysis code used to place AEDs
* 02_model.Rmd - analysis code used to fit the model in INLA
* 03_manuscript.Rmd - Full code and text used to create the paper
* 04_supp_materials.Rmd - full code and text used to create the supplementary materials
The following files are a part of an R package "swatial" that was developed along with the paper. These files are:
* DESCRIPTION
* NAMESPACE
* LICENSE
* LICENSE.md
* decay.R
* spherical-distance.R
* test-figure-data-matches.R
* test-table-data-matches.R
* testthat.R
* tidy-inla.R
* tidy-posterior-coefs.R
* tidy-predictions.R
* utils-pipe.R
* All files that end in .Rd are documentation files for the functions.
## Regarding data sources
Census information for Ticino was transcribed from the Annual Statistical Report of Canton Ticino from years 2010 to 2015. This data was taken from their publicly accessible annual reports - for example: (https://www3.ti.ch/DFE/DR/USTAT/allegati/volume/ast_2015.pdf). The raw data was extracted from these annual reports, and placed into the file: "swiss_census_popn_2010_2015.xlsx". These data are put into analysis ready format in the file “01_tidy.Rmd”
Housing and other relevant geospatial data can be accessed via http://map.housing-stat.ch/ and https://data.geo.admin.ch/. The maps of buildings from the REA (Register of Buildings and Dwellings) can be found here: https://map.geo.admin.ch/?zoom=11&bgLayer=ch.swisstopo.pixelkarte-grau&lang=en&topic=ech&layers=ch.bfs.gebaeude_wohnungs_register,ch.swisstopo.swissboundaries3d-gemeinde-flaeche.fill,ch.bfs.volkszaehlung-gebaeudestatistik_gebaeude,ch.bfs.volkszaehlung-gebaeudestatistik_wohnungen,ch.swisstopo.swissbuildings3d_1.metadata,ch.swisstopo.swissbuildings3d_2.metadata&E=2717616.28&N=1096597.25&catalogNodes=687,696&layers_timestamp=,,2016,2016,,&layers_visibility=true,false,false,false,false,false&layers_opacity=1,1,1,1,1,0.75
For further enquiries on this data, contact the Swiss federal Office of Statistics at the details listed here: https://www.bfs.admin.ch/bfs/en/home/services/contact.html
The shapefiles of the Comuni can be accessed here: https://www4.ti.ch/dfe/de/ucr/documentazione/download-file/?noMobile=1
Data from the people living in the Municipalities in Ticino can be downloaded here: https://www3.ti.ch/DFE/DR/USTAT/index.php?fuseaction=dati.home&tema=33&id2=61&id3=65&c1=01&c2=02&c3=02
## Future work
In the future, these functions from the paper may be generalised and put into their own package. If that happens, this repository will be updated with a link to updated functions.
The dataset contains light curves of 6 rocket body types from the Mini Mega Tortora (MMT) database. The dataset was created to be used as a benchmark for rocket body light curve classification. For more information, see the original paper: "RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification".
Class labels:
- ARIANE 5 R/B
- ATLAS 5 CENTAUR R/B
- CZ-3B R/B
- DELTA 4 R/B
- FALCON 9 R/B
- H-2A R/B
Dataset description

Usage:

```python
from datasets import load_dataset

dataset = load_dataset("kyselica/RoBo6", data_files={"train": "train.csv", "test": "test.csv"})
dataset
# DatasetDict({
#     train: Dataset({
#         features: ['label', ' id', ' part', ' period', ' mag', ' phase', ' time'],
#         num_rows: 5676
#     })
#     test: Dataset({
#         features: ['label', ' id', ' part', ' period', ' mag', ' phase', ' time'],
#         num_rows: 1404
#     })
# })
```
Columns:
- label - class name
- id - unique identifier of the light curve from MMT
- part - part number of the light curve
- period - rotational period of the object
- mag - relative path to the magnitude values file
- phase - relative path to the phase values file
- time - relative path to the time values file
Mean and standard deviation of magnitudes are stored in mean_std.csv file.
File structure

The data directory contains 5 subdirectories, one for each class. Light curves are stored in file triplets (a magnitude file, a phase file and a time file), laid out as follows:

MMT Rocket Bodies
├── README.md
├── train.csv
├── test.csv
├── mean_std.csv
├── data
│   ├── ARIANE 5 R_B
│   │   ├── ...
Data preprocessing

To create data suitable for both CNN- and RNN-based models, the light curves were preprocessed in the following way (a small illustrative sketch of the first step follows this list):

1. Split the light curves if the gap between two consecutive measurements is larger than the object's rotational period.
2. Split the light curves to have a maximum span of 1 000 seconds.
3. Filter out light curves whose folded form, divided into 100 bins, has more than 25% of the bins empty.
4. Resample the light curves to 10 000 points with a step of 0.1 seconds.
5. Filter out light curves with fewer than 100 measurements.
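The sketch below shows, under the assumption that the time and magnitude values of one light curve are available as plain NumPy arrays, how the first splitting step could look; it is an illustration, not the authors' exact code:

```python
import numpy as np

def split_at_gaps(time, mag, period):
    """Split one light curve wherever the gap between two consecutive
    measurements exceeds the object's rotational period."""
    time = np.asarray(time, dtype=float)
    mag = np.asarray(mag, dtype=float)
    cut_points = np.where(np.diff(time) > period)[0] + 1
    return list(zip(np.split(time, cut_points), np.split(mag, cut_points)))
```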
Citation:
@article{kyselica2024robo6,
  title={RoBo6: Standardized MMT Light Curve Dataset for Rocket Body Classification},
  author={Kyselica, Daniel and {\v{S}}uppa, Marek and {\v{S}}ilha, Ji{\v{r}}{\'\i} and {\v{D}}urikovi{\v{c}}, Roman},
  journal={arXiv preprint arXiv:2412.00544},
  year={2024}
}
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Given a blurred image, image deblurring aims to produce a clear, high-quality image that accurately represents the original scene. Blurring can be caused by various factors such as camera shake, fast motion, out-of-focus objects, etc. making it a particularly challenging computer vision problem. This has led to the recent development of a large spectrum of deblurring models and unique datasets.
Despite the rapid advancement in image deblurring, the process of finding and pre-processing a number of datasets for training and testing purposes has been both time exhaustive and unnecessarily complicated for both experts and non-experts alike. Moreover, there is a serious lack of ready-to-use domain-specific datasets such as face and text deblurring datasets.
To this end, the following card contains a curated list of ready-to-use image deblurring datasets for training and testing various deblurring models. Additionally, we have created an extensive, highly customizable Python package for single image deblurring called DBlur that can be used to train and test various SOTA models on the given datasets with just 2-3 lines of code.
Following is a list of the datasets that are currently provided:
- GoPro: The GoPro dataset for deblurring consists of 3,214 blurred images with a size of 1,280×720 that are divided into 2,103 training images and 1,111 test images.
- HIDE: HIDE is a motion-blurred dataset that includes 2025 blurred images for testing. It mainly focuses on pedestrians and street scenes.
- RealBlur: The RealBlur testing dataset consists of two subsets. The first is RealBlur-J, consisting of 1900 camera JPEG outputs. The second is RealBlur-R, consisting of 1900 RAW images. The RAW images are generated by using white balance, demosaicking, and denoising operations.
- CelebA: A face deblurring dataset created using the CelebA dataset, which consists of 2 000 000 training images, 1299 validation images, and 1300 testing images. The blurred images were created using the blur kernels provided by Shen et al. (2018).
- Helen: A face deblurring dataset created using the Helen dataset, which consists of 2 000 training images, 155 validation images, and 155 testing images. The blurred images were created using the blur kernels provided by Shen et al. (2018).
- Wider-Face: A face deblurring dataset created using the Wider-Face dataset, which consists of 4080 training images, 567 validation images, and 567 testing images. The blurred images were created using the blur kernels provided by Shen et al. (2018).
- TextOCR: A text deblurring dataset created using the TextOCR dataset, which consists of 5000 training images, 500 validation images, and 500 testing images. The blurred images were created using the blur kernels provided by Shen et al. (2018).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict the physiographic region in which a hillshade occurs).

Purpose: Datasets for geomorphic deep learning. Predict the physiographic region of an area based on a hillshade image. Terrain data were derived from the 30 m (1 arc-second) 3DEP product across the entirety of CONUS. Each chip has a spatial resolution of 30 m and 256 rows and columns of pixels; as a result, each chip measures 7,680 meters by 7,680 meters. Two datasets are provided: chips in the hs folder represent a multidirectional hillshade, while chips in the ths folder represent a tinted multidirectional hillshade. Data are represented in 8-bit (0 to 255 scale, integer values) and are projected to the Web Mercator projection relative to the WGS84 datum. Data were split into training, test, and validation partitions using stratified random sampling by region: 70% of the samples per region were selected for training, 15% for testing, and 15% for validation. There are a total of 16,325 chips. The following 22 physiographic regions are represented: "ADIRONDACK", "APPALACHIAN PLATEAUS", "BASIN AND RANGE", "BLUE RIDGE", "CASCADE-SIERRA MOUNTAINS", "CENTRAL LOWLAND", "COASTAL PLAIN", "COLORADO PLATEAUS", "COLUMBIA PLATEAU", "GREAT PLAINS", "INTERIOR LOW PLATEAUS", "MIDDLE ROCKY MOUNTAINS", "NEW ENGLAND", "NORTHERN ROCKY MOUNTAINS", "OUACHITA", "OZARK PLATEAUS", "PACIFIC BORDER", "PIEDMONT", "SOUTHERN ROCKY MOUNTAINS", "SUPERIOR UPLAND", "VALLEY AND RIDGE", and "WYOMING BASIN". Input digital terrain models and hillshades are not provided due to the large file size (> 100 GB).

Files:
- physioDL.csv: Table listing all image chips and the associated physiographic region (id = unique ID for each chip; region = physiographic region; fnameHS = file name of the associated chip in the hs folder; fnameTHS = file name of the associated chip in the ths folder; set = data split (train, test, or validation)).
- chipCounts.csv: Number of chips in each data partition per physiographic province.
- map.png: Map of the data.
- makeChips.R: R script used to process the data into image chips and create the CSV files.
- inputVectors: chipBounds.shp = square extent of each chip; chipCenters.shp = center coordinate of each chip; provinces.shp = physiographic provinces; provinces10km.shp = physiographic provinces with a 10 km negative buffer.
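A small pandas sketch for working with the chip table described above; the exact split labels ('train', 'test', 'validation') and relative paths are assumptions based on this description and may need adjusting:

import pandas as pd

# Load the chip table and select the training partition.
chips = pd.read_csv("physioDL.csv")
train_chips = chips[chips["set"] == "train"]

# Paths of the corresponding multidirectional hillshade chips (hs folder).
train_hs_files = ["hs/" + str(f) for f in train_chips["fnameHS"]]
print(len(chips), "chips in total,", len(train_chips), "in the training split")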
The goal of introducing the Rescaled Fashion-MNIST with translations dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data, and to additionally provide a way to test network object detection and object localisation abilities on image data where the objects are not centred.
The Rescaled Fashion-MNIST with translations dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST with translations dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST with translations dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled Fashion-MNIST with translations dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72. The objects within the images have also been randomly shifted in the spatial domain, with the object always at least 4 pixels away from the image boundary. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
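For intuition, the following NumPy/Pillow sketch approximates the described procedure (rescale, embed in a 72×72 black image, and translate with a 4-pixel margin); the actual dataset was generated with Matlab's imresize, so this is an illustration rather than the exact pipeline:

import numpy as np
from PIL import Image

def rescale_and_translate(img28, s, rng, out_size=72, margin=4):
    # img28: (28, 28) uint8 Fashion-MNIST image; s: scaling factor in [0.5, 2].
    new_size = int(round(28 * s))
    resized = Image.fromarray(img28).resize((new_size, new_size), Image.BICUBIC)
    obj = np.clip(np.asarray(resized, dtype=np.float32), 0, 255)  # keep values in [0, 255]

    canvas = np.zeros((out_size, out_size), dtype=np.float32)  # black background
    max_offset = out_size - new_size - margin
    top = rng.integers(margin, max_offset + 1)
    left = rng.integers(margin, max_offset + 1)
    canvas[top:top + new_size, left:left + new_size] = obj
    return canvas

rng = np.random.default_rng(0)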
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_and_translations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled Fashion-MNIST with translations dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor 2^(k/4), with k an integer in the range [-4, 4]:
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte2p000.h5
These dataset files were used for the experiments presented in Figure 8 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("fashionmnist_with_scale_variations_and_translations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File("fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte2p000.h5", "r") as f:  # pick the file for the desired test scale
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte2p000.h5', '/x_test');
y_test = h5read('fashionmnist_with_scale_variations_and_translations_te10000_outsize72-72_scte2p000.h5', '/y_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Rescaled Fashion-MNIST dataset, which contains the same scaling variations but keeps the objects centred in the frame, meaning that no spatial translations are used.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
I prepared this dataset for the students of my Deep Learning and NLP course.
But I am also very happy to see kagglers play around with it.
Have fun!
Description:
There are two channels of data provided in this dataset:
News data: I crawled historical news headlines from Reddit WorldNews Channel (/r/worldnews). They are ranked by reddit users' votes, and only the top 25 headlines are considered for a single date. (Range: 2008-06-08 to 2016-07-01)
Stock data: Dow Jones Industrial Average (DJIA) is used to "prove the concept". (Range: 2008-08-08 to 2016-07-01)
I provided three data files in .csv format:
RedditNews.csv: two columns. The first column is the "date", and the second column is the "news headlines". All news items are ranked from top to bottom based on how hot they are; hence, there are 25 lines for each date.
DJIA_table.csv: Downloaded directly from Yahoo Finance: check out the web page for more info.
Combined_News_DJIA.csv: To make things easier for my students, I provide this combined dataset with 27 columns. The first column is "Date", the second is "Label", and the following ones are news headlines ranging from "Top1" to "Top25".
=========================================
To my students:
I made this a binary classification task. Hence, there are only two labels:
"1" when the DJIA Adj Close value rose or stayed the same;
"0" when the DJIA Adj Close value decreased.
For task evaluation, please use data from 2008-08-08 to 2014-12-31 as the Training Set; the Test Set is then the following two years of data (from 2015-01-02 to 2016-07-01). This is roughly an 80%/20% split.
And, of course, use AUC as the evaluation metric.
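A minimal pandas/scikit-learn sketch of this split and the AUC evaluation (the model scores below are a constant placeholder, only to show the evaluation call):

import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("Combined_News_DJIA.csv", parse_dates=["Date"])

train = df[(df["Date"] >= "2008-08-08") & (df["Date"] <= "2014-12-31")]
test = df[(df["Date"] >= "2015-01-02") & (df["Date"] <= "2016-07-01")]

# Replace `scores` with your model's predicted probabilities for label 1 on `test`.
scores = [0.5] * len(test)
print("train days:", len(train), "test days:", len(test))
print("baseline AUC:", roc_auc_score(test["Label"], scores))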
=========================================
+++++++++++++++++++++++++++++++++++++++++
To all kagglers:
Please upvote this dataset if you like this idea for market prediction.
If you think you coded an amazing trading algorithm,
friendly advice
do play safe with your own money :)
+++++++++++++++++++++++++++++++++++++++++
Feel free to contact me if there is any question~
And, remember me when you become a millionaire :P
Note: If you'd like to cite this dataset in your publications, please use:
Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved [Date You Retrieved This Data] from https://www.kaggle.com/aaron7sun/stocknews.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
Pistachio Image Dataset https://www.kaggle.com/datasets/muratkokludataset/pistachio-image-dataset
DATASET: https://www.muratkoklu.com/datasets/
Citation Request :
OZKAN IA., KOKLU M. and SARACOGLU R. (2021). Classification of Pistachio Species Using Improved K-NN Classifier. Progress in Nutrition, Vol. 23, N. 2, pp. DOI:10.23751/pn.v23i2.9686. (Open Access) https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178
SINGH D, TASPINAR YS, KURSUN R, CINAR I, KOKLU M, OZKAN IA, LEE H-N., (2022). Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models, Electronics, 11 (7), 981. https://doi.org/10.3390/electronics11070981. (Open Access)
Article Download (PDF): 1: https://www.mattioli1885journals.com/index.php/progressinnutrition/article/view/9686/9178 2: https://doi.org/10.3390/electronics11070981
ABSTRACT: In order to keep the economic value of pistachio nuts which have an important place in the agricultural economy, the efficiency of post-harvest industrial processes is very important. To provide this efficiency, new methods and technologies are needed for the separation and classification of pistachios. Different pistachio species address different markets, which increases the need for the classification of pistachio species. In this study, it is aimed to develop a classification model different from traditional separation methods, based on image processing and artificial intelligence which are capable to provide the required classification. A computer vision system has been developed to distinguish two different species of pistachios with different characteristics that address different market types. 2148 sample image for these two kinds of pistachios were taken with a high-resolution camera. The image processing techniques, segmentation and feature extraction were applied on the obtained images of the pistachio samples. A pistachio dataset that has sixteen attributes was created. An advanced classifier based on k-NN method, which is a simple and successful classifier, and principal component analysis was designed on the obtained dataset. In this study; a multi-level system including feature extraction, dimension reduction and dimension weighting stages has been proposed. Experimental results showed that the proposed approach achieved a classification success of 94.18%. The presented high-performance classification model provides an important need for the separation of pistachio species and increases the economic value of species. In addition, the developed model is important in terms of its application to similar studies. Keywords: Classification, Image processing, k nearest neighbor classifier, Pistachio species
ABSTRACT: Pistachio is a shelled fruit from the anacardiaceae family. The homeland of pistachio is the Middle East. The Kirmizi pistachios and Siirt pistachios are the major types grown and exported in Turkey. Since the prices, tastes, and nutritional values of these types differs, the type of pistachio becomes important when it comes to trade. This study aims to identify these two types of pistachios, which are frequently grown in Turkey, by classifying them via convolutional neural networks. Within the scope of the study, images of Kirmizi and Siirt pistachio types were obtained through the computer vision system. The pre-trained dataset includes a total of 2148 images, 1232 of Kirmizi type and 916 of Siirt type. Three different convolutional neural network models were used to classify these images. Models were trained by using the transfer learning method, with AlexNet and the pre-trained models VGG16 and VGG19. The dataset is divided as 80% training and 20% test. As a result of the performed classifications, the success rates obtained from the AlexNet, VGG16, and VGG19 models are 94.42%, 98.84%, and 98.14%, respectively. Models’ performances were evaluated through sensitivity, specificity, precision, and F-1 score metrics. In addition, ROC curves and AUC values were used in the performance evaluation. The highest classification success was achieved with the VGG16 model. The obtained results reveal that these methods can be used successfully in the determination of pistachio types. Keywords: pistachio; genetic varieties; machine learning; deep learning; food recognition
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
The scripts and the data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are aimed at being user-friendly starting points for understanding and testing how to implement CORAL. Demo 3 is included mainly for reproducibility.
System requirements
· The software can be used in any operating system where R can be installed.
· We have developed and tested the software in a windows environment with R version 4.3.1.
· Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).
· The use of the software does not require any non-standard hardware.
Installation guide
· The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus does not require any installation other than the installation of R.
Demo 1: Software demo with simulated data
The software demonstration consists of two R-markdown files:
· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree), and studyDesign (list of sampling units). Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to environmental predictors).
· D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.
Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from few seconds to at most one minute.
Demo 2: Software demo with a small subset of the data used in the paper
The software demonstration consists of one R-markdown file:
MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with a prevalence of at least 40 and less than 50, and common species as those with a prevalence of at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.
Scripts and data for reproducing the results presented in the paper (Demo 3)
The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:
· S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.
· S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.
· S03_import_posterior – imports the posterior distributions sampled by the initial model.
· S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.
· S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.
· S06_construct_coral_priors – calculate CORAL prior parameters.
The remaining scripts evaluate the model:
· S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.
· S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.
· S09_cross-validate – fits CORAL and the baseline model to the train/test splits and calculates performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = !training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.
· S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R2 produced by cross-validation. Generates Fig. 4 of the paper.
· S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.
· S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.
· S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.
Pre-processing scripts:
· P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.
· P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.
· P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).
Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
ENA Accession numbers
All raw sequence data are archived on mBRAVE and are publicly available in the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena; project accession number PRJEB86111; run accession numbers ERR15018787-ERR15009869; sample IDs for each accession and download URLs are provided in the file ENA_read_accessions.tsv).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Fig 7, Table 1, and Fig 8 were produced using S1_Data.csv, which contains predictions and truth labels for each model and task. (CSV)
Introduction
Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics making them more expensive, less portable, and with a reduced field of view (FOV) compared to fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.
We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.
In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".
Fundus Imaging
We include 45-degree macula-centered fundus images that cover both the macula and the optic disc. All images were acquired using an OptoVue iVue fundus camera without pupil dilation.
The full images are available in the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/fundus/disc and cropped/fundus/macula.
Enface OCT-A
We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low quality images with errors in the retina layer segmentations were not included.
En-face OCT-A images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images where only vessels are included. This has been performed automatically using the ROSE algorithm (Ma et al. 2021). These can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh; the former contains the probabilities of the ROSE algorithm, the latter a binary map.
Synthetic OCT-A
We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.
The full images are available in the fov45/synthetic_octa directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we performed the same ROSE denoising (Ma et al. 2021) used for the original en-face OCT-A images; the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh, the former containing the probabilities of the ROSE algorithm, the latter a binary map.
Other Fundus Vessel Segmentations Included
In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations: SA-Unet (Guo et al., 2021) and IterNet (Li et al., 2020).
SA-Unet. The full images are available in the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.
IterNet. The full images are available in the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/Iternet/disc and cropped/Iternet/macula.
Train/Validation/Test Replication
In order to replicate or compare your model to the results of our paper, we report below the data split used; a small helper encoding this split is sketched after the list.
Training subjects IDs: 1 - 25
Validation subjects IDs: 26 - 30
Testing subjects IDs: 31 - 112
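A small, hypothetical helper that encodes the subject-level split above; it assumes you can map each image file to its numeric subject ID:

def split_for_subject(subject_id: int) -> str:
    # Subject-level split reported above: 1-25 train, 26-30 validation, 31-112 test.
    if 1 <= subject_id <= 25:
        return "train"
    if 26 <= subject_id <= 30:
        return "val"
    if 31 <= subject_id <= 112:
        return "test"
    raise ValueError(f"unexpected subject ID: {subject_id}")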
Data Acquisition
This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.
User Agreement
The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In case of any publication the following paper needs to be cited
Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
Funding
This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.
Research Team and Acknowledgements
Here are the people behind this data acquisition effort:
Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo
We would also like to acknowledge for their support: the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at University of Dundee, UK and Memorial Hermann Hospital System.
References
I. Coronado, S. Pachade, E. Trucco, R. Abdelkhaleq, J. Yan, S. Salazar-Marioni, A. Jagolino-Cole, M. Bahrainian, R. Channa, S. A. Sheth, and L. Giancardo, "Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks," Sci. Rep., vol. 13, 15325, 2023, doi: 10.1038/s41598-023-42062-9.
C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346.
L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621.
Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
Objective: Flywheel resistance training (FRT) is a training modality for developing lower limb athletic performance. The relationship between FRT load parameters and barbell squat loading remains ambiguous in practice, resulting in experience-driven load selection during training. Therefore, this study investigates optimal FRT loading for specific training goals (maximal strength, power, muscular endurance) by analyzing concentric velocity at varying barbell 1RM percentages (%1RM), establishes correlations between flywheel load, velocity, and %1RM, and integrates force-velocity profiling to develop evidence-based guidelines for individualized load prescription.
Methods: Thirty-nine participants completed 1RM barbell squats to establish submaximal loads (20-90% 1RM). Concentric velocities were monitored via a linear-position transducer (Gymaware) for FRT inertial load quantification, with test-retest measurements confirming protocol reliability. Simple and multiple linear regression modeled load-velocity interactions and multivariable relationships, while Pearson's r and R² quantified correlations and model fit. Predictive equations estimated inertial loads (kg·m²), supported by ICC(2,1) and CV assessments of relative/absolute reliability.
Results: A strong inverse correlation (r = −0.88) and high linearity (R² = 0.78) emerged between rotational inertia and velocity. The multivariate model demonstrated an excellent fit (R² = 0.81) and robust correlation (r = 0.90), yielding the predictive equation y = 0.769 − 0.846v + 0.002 kg.
Conclusion: The strong linear inertial load-velocity relationship enables individualized load prescription through regression equations incorporating velocity and strength parameters. While FRT demonstrates limited efficacy for developing speed-strength, its longitudinal periodization effects require further investigation. Optimal FRT loading ranges were identified: 40-60% 1RM for strength-speed, 60-80% 1RM for power development, and 80-100% + 1RM for maximal strength adaptations.
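Purely as an illustration of the kind of load-velocity regression described above (the numbers below are placeholders, not the study's data), such a model can be fitted in Python as:

import numpy as np
from scipy import stats

# Placeholder measurements, NOT from the study.
inertia = np.array([0.025, 0.050, 0.075, 0.100])   # inertial loads (kg·m²)
velocity = np.array([0.95, 0.80, 0.68, 0.55])      # mean concentric velocities (m/s)

slope, intercept, r, p, se = stats.linregress(velocity, inertia)
print(f"inertia ≈ {intercept:.3f} + {slope:.3f}·v   (r = {r:.2f}, R² = {r**2:.2f})")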
The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
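For reference, the nine scale factors 2^(k/4) and the corresponding tags in the file names above can be reproduced with a few lines of Python:

scales = [2 ** (k / 4) for k in range(-4, 5)]
tags = [f"{s:.3f}".replace(".", "p") for s in scales]
print(tags)  # ['0p500', '0p595', '0p707', '0p841', '1p000', '1p189', '1p414', '1p682', '2p000']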
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named '/x_train', '/x_val', '/x_test', '/y_train', '/y_test' and '/y_val'; which of these are present depends on the data split contained in the file.
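To check which partitions a given file actually contains, the top-level keys can be listed, for example with one of the test files above:

import h5py

with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5", "r") as f:
    print(list(f.keys()))  # a test-only file contains the x_test and y_test partitions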
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5", "r") as f:  # or any other test file listed above
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/x_test');
y_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/y_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in the HDF5 files. The pixel intensity values are not normalised and lie in the range [0, 255].
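If the data are to be used in PyTorch, the arrays loaded and permuted above can be wrapped as tensors; rescaling to [0, 1] is optional and not part of the dataset itself:

import torch

x_train_t = torch.from_numpy(x_train) / 255.0   # optional rescaling to [0, 1]
y_train_t = torch.from_numpy(y_train).long()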