License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find good settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a Python 2 routine which will open such a file and return a dictionary:
```python
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
```
And a Python 3 version:

```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
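Putting the unpickle routine above to work: the sketch below loads one batch and reshapes each 3072-byte row into a 32×32×3 image array. It assumes NumPy is installed; the file path passed in is illustrative and should point at your extracted cifar-10-batches-py directory.

```python
import pickle
import numpy as np

def load_batch(path):
    """Load one CIFAR-10 batch file and reshape its rows into images.

    Assumes the Python 3 pickle layout described above: a dict with
    b'data' (10000 x 3072 uint8) and b'labels' (list of 10000 ints).
    """
    with open(path, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
    data = batch[b'data']      # (N, 3072) uint8
    labels = batch[b'labels']  # list of N ints in 0-9
    # Each row holds the 1024-byte R, G, B planes in row-major order;
    # reshape to (N, 3, 32, 32), then move channels last for plotting.
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels
```

For example, `load_batch('cifar-10-batches-py/data_batch_1')` returns an image array of shape (10000, 32, 32, 3) alongside the label list.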
The dataset contains another file, called batches.meta. It too contains a Python dictionary object, with the following entry:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
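Because the binary records have a fixed 3073-byte layout with no delimiters, a file can be parsed with little more than a reshape. A minimal NumPy sketch (the function name and path handling are illustrative):

```python
import numpy as np

def read_binary_batch(path):
    """Parse one CIFAR-10 binary batch file: 10000 records, each
    1 label byte followed by 3072 pixel bytes (R, G, B planes,
    row-major order)."""
    raw = np.fromfile(path, dtype=np.uint8)
    records = raw.reshape(-1, 3073)      # one row per <label><3072 pixels>
    labels = records[:, 0].astype(np.int64)
    images = records[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels
```

For the official files, `raw` will be exactly 30730000 bytes, so the reshape to (-1, 3073) yields 10000 records.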
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset Description This dataset presents historical crop yield data curated for the purpose of applying machine learning models—especially those involving hyperparameter tuning—to improve crop yield forecasting accuracy.
Context & Objective Agriculture is the backbone of many economies, and crop yield prediction plays a vital role in ensuring food security, optimizing resource use, and planning logistics. However, traditional statistical methods often fail to capture nonlinear relationships between multiple agricultural factors. This dataset enables the application of modern ML techniques to achieve more reliable yield forecasts.
The main objective behind compiling this dataset is to support academic, research, and industry projects focusing on:
Source Information
This dataset is based on open government agricultural records from India, particularly data released by:
- The Ministry of Agriculture & Farmers Welfare, Government of India
- State-level agricultural departments and public datasets
It has been cleaned, preprocessed, and standardized to be ML-ready, with categorical encoding and structured formats suitable for both beginner and advanced ML workflows.
Dataset Structure Key features in the dataset include:
You can apply normalization, feature engineering, and one-hot encoding for categorical variables during preprocessing.
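As a concrete illustration of that preprocessing, here is a small pandas sketch. The column names (`state`, `crop`, `area_ha`, `yield_t_per_ha`) are hypothetical stand-ins for illustration only, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical rows; check the real dataset's columns before adapting this.
df = pd.DataFrame({
    'state': ['Punjab', 'Kerala', 'Punjab'],
    'crop': ['Wheat', 'Rice', 'Rice'],
    'area_ha': [120.0, 80.0, 95.0],
    'yield_t_per_ha': [3.1, 2.4, 2.9],
})

# One-hot encode the categorical variables.
encoded = pd.get_dummies(df, columns=['state', 'crop'])

# Min-max normalize the numeric feature to [0, 1].
encoded['area_ha'] = (encoded['area_ha'] - encoded['area_ha'].min()) / (
    encoded['area_ha'].max() - encoded['area_ha'].min())
```

The resulting frame is ready to feed to most scikit-learn or gradient-boosting regressors.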
Inspiration This dataset was inspired by the need to build AI-driven systems for smarter agriculture, especially those that can generalize across time periods, regions, and crop types. It is also intended to aid students and practitioners in learning about:
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
For more details about the dataset and its applications, please refer to our GitHub repository.
The BMO-GNN Dataset is curated to facilitate research on Bayesian Mesh Optimization and Graph Neural Networks for engineering performance prediction, specifically targeting applications in road wheel design. This dataset consists of multiple 3D wheel CAD models converted into graph structures, capturing the geometry (node coordinates) and connectivity (adjacency) necessary for GNN-based surrogate modeling.
High-fidelity finite element analyses (FEA) were performed to obtain mass, rim stiffness, and disk stiffness labels, which serve as ground truth for training and evaluating GNNs. By leveraging re-meshing and clustering techniques, each wheel geometry is represented in a graph form, allowing researchers to explore mesh-resolution effects on predictive accuracy through Bayesian Optimization.
Graph Representations of 3D Wheels:
- Each 3D CAD wheel is converted into a graph (via subdividing and clustering), resulting in a node–edge structure rather than traditional voxel, point cloud, or B-Rep data.
Label Data from FEA:
- Mass (kg)
- Rim Stiffness (kgf/mm)
- Disk Stiffness (kgf/mm)
Diverse Geometric Variations:
- Over 900 distinct wheel designs, each having unique shapes and structural properties.
Mesh Quality Variation:
- Subdivision and clustering parameters (e.g., num_subdivide, num_cluster) are varied to produce different mesh qualities—valuable for studying the trade-off between model accuracy and computational cost.
Designed for Bayesian Optimization + GNN:
- The dataset structure (graphs.pkl) supports iterative mesh-resolution optimization, making it ideal for advanced surrogate modeling, hyperparameter tuning, and robust performance prediction in automotive or mechanical contexts.
Geometry Acquisition
- We collected 3D CAD wheel models reflecting a broad range of shapes and design parameters.
- CAD files were processed using Python/Open3D to create initial polygon meshes.
FEA-based Label Computation
- Altair SimLab (or comparable CAE tools) performed modal or structural analyses.
- For each wheel, finite element solutions yielded the mass, rim stiffness, and disk stiffness.
- Tetrahedral mesh convergence was verified for accuracy in labeling.
Mesh to Graph Conversion
- Polygon meshes were subdivided (to refine detail) and clustered (to control node count) through pyacvd or a similar library, creating consistent mesh resolutions.
- Resulting re-meshed data were then converted into adjacency matrices (edge connections) and node-coordinate matrices (XYZ) for GNN input.
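The mesh-to-graph step above can be sketched as follows, assuming the re-meshed surface is available as an (N, 3) vertex array and an (F, 3) array of triangle indices. The function name and array conventions are illustrative, not the dataset's internal format.

```python
import numpy as np

def mesh_to_graph(vertices, faces):
    """Convert a triangle mesh into GNN inputs: a node-coordinate
    matrix (XYZ) and a symmetric binary adjacency matrix.

    vertices: (N, 3) float array of XYZ coordinates
    faces:    (F, 3) int array of vertex indices per triangle
    """
    n = len(vertices)
    adj = np.zeros((n, n), dtype=np.uint8)
    for a, b, c in faces:
        # Each triangle contributes three undirected edges.
        adj[a, b] = adj[b, a] = 1
        adj[b, c] = adj[c, b] = 1
        adj[a, c] = adj[c, a] = 1
    return np.asarray(vertices, dtype=np.float32), adj
```

Shared edges between adjacent triangles are written twice but stay binary, so the adjacency matrix reflects the unique edge set.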
Dataset Packaging (graphs.pkl)
- All graph data and corresponding normalized labels (mass, rim stiffness, disk stiffness) are compiled into a single serialized file.
- Graph elements include node coordinates, adjacency matrices, and shape IDs to trace back to original wheels if needed.
| Metric | Minimum | Maximum | Average |
|---|---|---|---|
| Number of nodes | ~600 | ~1700 | ~1000 |
| Number of edges | ~1900 | ~5100 | ~3300 |
| Number of faces(*) | ~1300 | ~4200 | ~2200 |
| Mass (kg) | ~15 | ~20 | ~17.5 |
(*) “Faces” here refer to the triangular faces in the polygon mesh before conversion. Depending on subdivision/clustering parameters, these numbers vary significantly.
Train–Validation–Test Split
- Typically, an 80–10–10 split (train–validation–test) is used.
- Min–Max scaling is applied to both node features (XYZ) and labels (mass, rim stiffness, disk stiffness).
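A minimal min–max scaler of the kind described, applied per coordinate axis; this is a generic sketch, not the authors' exact preprocessing code.

```python
import numpy as np

def min_max_scale(x, axis=0):
    """Min-max scale an array to [0, 1] along the given axis.

    Columns with zero range (constant features) are mapped to 0
    instead of dividing by zero.
    """
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / np.where(hi > lo, hi - lo, 1.0)
```

The same transform would be fit on the training split and reused on the validation and test splits to avoid leakage.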
File: graphs.pkl
- Contains a list of graph objects, each with:
- Node feature matrix (N × 3 for XYZ coordinates, normalized)
- Adjacency matrix (N × N, storing edge weights or connectivity)
- Label (mass, rim stiffness, or disk stiffness, normalized)
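A loader sketch for that file. The per-graph key names used here ('x', 'adj', 'y') are assumptions for illustration, since the exact field names inside graphs.pkl are not specified above.

```python
import pickle

def load_graphs(path):
    """Load the list of graph records from a graphs.pkl-style file.

    Each record is assumed to be a dict with keys 'x' (N x 3 node
    coordinates), 'adj' (N x N adjacency), and 'y' (normalized label);
    adjust the key names to match the actual serialization.
    """
    with open(path, 'rb') as f:
        graphs = pickle.load(f)
    for g in graphs:
        n = len(g['x'])
        assert len(g['adj']) == n, "adjacency must be N x N"
    return graphs
```

The returned list can then be wrapped into a Spektral or PyTorch Geometric dataset object for training.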
GNN Surrogate Modeling
- Researchers can feed graphs.pkl directly into frameworks like Spektral or PyTorch Geometric to train or evaluate a GNN for predicting mechanical performance.
Mesh Resolution Studies
- By comparing re-meshed versions, one can analyze how node count and clustering influence prediction accuracy and computational time.
Bayesian Optimization Experiments
- Ideal for iterative search of “best” subdivision/clustering parameters, balancing accuracy vs. training cost.
If you find the BMO-GNN Dataset useful, please cite:
@article{pa...