License: MIT (https://opensource.org/licenses/MIT)
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 million tiny images dataset. CIFAR-10 and CIFAR-100 were created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. (Sadly, the 80 million tiny images dataset has been thrown into the memory hole by its authors. Spotting the doublethink which was used to justify its erasure is left as an exercise for the reader.)
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
The classes are completely mutually exclusive. There is no overlap between automobiles and trucks. "Automobile" includes sedans, SUVs, things of that sort. "Truck" includes only big trucks. Neither includes pickup trucks.
Baseline results
You can find some baseline replicable results on this dataset on the project page for cuda-convnet. These results were obtained with a convolutional neural network. Briefly, they are 18% test error without data augmentation and 11% with. Additionally, Jasper Snoek has a new paper in which he used Bayesian hyperparameter optimization to find good settings of the weight decay and other hyperparameters, which allowed him to obtain a test error rate of 15% (without data augmentation) using the architecture of the net that got 18%.
Other results
Rodrigo Benenson has collected results on CIFAR-10/100 and other datasets on his website.
Dataset layout
Python / Matlab versions
I will describe the layout of the Python version of the dataset. The layout of the Matlab version is identical.
The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle. Here is a Python 2 routine which will open such a file and return a dictionary:
```python
def unpickle(file):
    import cPickle
    with open(file, 'rb') as fo:
        dict = cPickle.load(fo)
    return dict
```
And a Python 3 version:

```python
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
```
Loaded in this way, each of the batch files contains a dictionary with the following elements:
data -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
labels -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.
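Putting the unpickle routine above to work: the sketch below loads one batch and reshapes each 3072-byte row into a 32×32×3 image array. It assumes NumPy is installed; the file path passed in is illustrative and should point at your extracted cifar-10-batches-py directory.

```python
import pickle
import numpy as np

def load_batch(path):
    """Load one CIFAR-10 batch file and reshape its rows into images.

    Assumes the Python 3 pickle layout described above: a dict with
    b'data' (10000 x 3072 uint8) and b'labels' (list of 10000 ints).
    """
    with open(path, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
    data = batch[b'data']      # (N, 3072) uint8
    labels = batch[b'labels']  # list of N ints in 0-9
    # Each row holds the 1024-byte R, G, B planes in row-major order;
    # reshape to (N, 3, 32, 32), then move channels last for plotting.
    images = data.reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels
```

For example, `load_batch('cifar-10-batches-py/data_batch_1')` returns an image array of shape (10000, 32, 32, 3) alongside the label list.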
The dataset contains another file, called batches.meta. It too contains a Python dictionary object, with the following entry:
label_names -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, label_names[0] == "airplane", label_names[1] == "automobile", etc.
Binary version
The binary version contains the files data_batch_1.bin, data_batch_2.bin, ..., data_batch_5.bin, as well as test_batch.bin. Each of these files is formatted as follows:
<1 x label><3072 x pixel> ... <1 x label><3072 x pixel>
In other words, the first byte is the label of the first image, which is a number in the range 0-9. The next 3072 bytes are the values of the pixels of the image. The first 1024 bytes are the red channel values, the next 1024 the green, and the final 1024 the blue. The values are stored in row-major order, so the first 32 bytes are the red channel values of the first row of the image.
Each file contains 10000 such 3073-byte "rows" of images, although there is nothing delimiting the rows. Therefore each file should be exactly 30730000 bytes long.
There is another file, called batches.meta.txt. This is an ASCII file that maps numeric labels in the range 0-9 to meaningful class names. It is merely a list of the 10 class names, one per row. The class name on row i corresponds to numeric label i.
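Because the binary records have a fixed 3073-byte layout with no delimiters, a file can be parsed with little more than a reshape. A minimal NumPy sketch (the function name and path handling are illustrative):

```python
import numpy as np

def read_binary_batch(path):
    """Parse one CIFAR-10 binary batch file: 10000 records, each
    1 label byte followed by 3072 pixel bytes (R, G, B planes,
    row-major order)."""
    raw = np.fromfile(path, dtype=np.uint8)
    records = raw.reshape(-1, 3073)      # one row per <label><3072 pixels>
    labels = records[:, 0].astype(np.int64)
    images = records[:, 1:].reshape(-1, 3, 32, 32).transpose(0, 2, 3, 1)
    return images, labels
```

For the official files, `raw` will be exactly 30730000 bytes, so the reshape to (-1, 3073) yields 10000 records.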
The CIFAR-100 dataset
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). Her...
License: CC0 1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)
Dataset Description This dataset presents historical crop yield data curated for the purpose of applying machine learning models—especially those involving hyperparameter tuning—to improve crop yield forecasting accuracy.
Context & Objective Agriculture is the backbone of many economies, and crop yield prediction plays a vital role in ensuring food security, optimizing resource use, and planning logistics. However, traditional statistical methods often fail to capture nonlinear relationships between multiple agricultural factors. This dataset enables the application of modern ML techniques to achieve more reliable yield forecasts.
The main objective behind compiling this dataset is to support academic, research, and industry projects focusing on:
Source Information
This dataset is based on open government agricultural records from India, particularly data released by:
- The Ministry of Agriculture & Farmers Welfare, Government of India
- State-level agricultural departments and public datasets
It has been cleaned, preprocessed, and standardized to be ML-ready, with categorical encoding and structured formats suitable for both beginner and advanced ML workflows.
Dataset Structure Key features in the dataset include:
You can apply normalization, feature engineering, and one-hot encoding for categorical variables during preprocessing.
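As a concrete illustration of that preprocessing, here is a small pandas sketch. The column names (`state`, `crop`, `area_ha`, `yield_t_per_ha`) are hypothetical stand-ins for illustration only, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical rows; check the real dataset's columns before adapting this.
df = pd.DataFrame({
    'state': ['Punjab', 'Kerala', 'Punjab'],
    'crop': ['Wheat', 'Rice', 'Rice'],
    'area_ha': [120.0, 80.0, 95.0],
    'yield_t_per_ha': [3.1, 2.4, 2.9],
})

# One-hot encode the categorical variables.
encoded = pd.get_dummies(df, columns=['state', 'crop'])

# Min-max normalize the numeric feature to [0, 1].
encoded['area_ha'] = (encoded['area_ha'] - encoded['area_ha'].min()) / (
    encoded['area_ha'].max() - encoded['area_ha'].min())
```

The resulting frame is ready to feed to most scikit-learn or gradient-boosting regressors.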
Inspiration This dataset was inspired by the need to build AI-driven systems for smarter agriculture, especially those that can generalize across time periods, regions, and crop types. It is also intended to aid students and practitioners in learning about:
License: Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
For more details about the dataset and its applications, please refer to our GitHub repository.
The BMO-GNN Dataset is curated to facilitate research on Bayesian Mesh Optimization and Graph Neural Networks for engineering performance prediction, specifically targeting applications in road wheel design. This dataset consists of multiple 3D wheel CAD models converted into graph structures, capturing the geometry (node coordinates) and connectivity (adjacency) necessary for GNN-based surrogate modeling.
High-fidelity finite element analyses (FEA) were performed to obtain mass, rim stiffness, and disk stiffness labels, which serve as ground truth for training and evaluating GNNs. By leveraging re-meshing and clustering techniques, each wheel geometry is represented in a graph form, allowing researchers to explore mesh-resolution effects on predictive accuracy through Bayesian Optimization.
Graph Representations of 3D Wheels:
- Each 3D CAD wheel is converted into a graph (via subdividing and clustering), resulting in a node–edge structure rather than traditional voxel, point cloud, or B-Rep data.
Label Data from FEA:
- Mass (kg)
- Rim Stiffness (kgf/mm)
- Disk Stiffness (kgf/mm)
Diverse Geometric Variations:
- Over 900 distinct wheel designs, each having unique shapes and structural properties.
Mesh Quality Variation:
- Subdivision and clustering parameters (e.g., num_subdivide, num_cluster) are varied to produce different mesh qualities—valuable for studying the trade-off between model accuracy and computational cost.
Designed for Bayesian Optimization + GNN:
- The dataset structure (graphs.pkl) supports iterative mesh-resolution optimization, making it ideal for advanced surrogate modeling, hyperparameter tuning, and robust performance prediction in automotive or mechanical contexts.
Geometry Acquisition
- We collected 3D CAD wheel models reflecting a broad range of shapes and design parameters.
- CAD files were processed using Python/Open3D to create initial polygon meshes.
FEA-based Label Computation
- Altair SimLab (or comparable CAE tools) performed modal or structural analyses.
- For each wheel, finite element solutions yielded the mass, rim stiffness, and disk stiffness.
- Tetrahedral mesh convergence was verified for accuracy in labeling.
Mesh to Graph Conversion
- Polygon meshes were subdivided (to refine detail) and clustered (to control node count) through pyacvd or a similar library, creating consistent mesh resolutions.
- Resulting re-meshed data were then converted into adjacency matrices (edge connections) and node-coordinate matrices (XYZ) for GNN input.
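The mesh-to-graph step above can be sketched as follows, assuming the re-meshed surface is available as an (N, 3) vertex array and an (F, 3) array of triangle indices. The function name and array conventions are illustrative, not the dataset's internal format.

```python
import numpy as np

def mesh_to_graph(vertices, faces):
    """Convert a triangle mesh into GNN inputs: a node-coordinate
    matrix (XYZ) and a symmetric binary adjacency matrix.

    vertices: (N, 3) float array of XYZ coordinates
    faces:    (F, 3) int array of vertex indices per triangle
    """
    n = len(vertices)
    adj = np.zeros((n, n), dtype=np.uint8)
    for a, b, c in faces:
        # Each triangle contributes three undirected edges.
        adj[a, b] = adj[b, a] = 1
        adj[b, c] = adj[c, b] = 1
        adj[a, c] = adj[c, a] = 1
    return np.asarray(vertices, dtype=np.float32), adj
```

Shared edges between adjacent triangles are written twice but stay binary, so the adjacency matrix reflects the unique edge set.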
Dataset Packaging (graphs.pkl)
- All graph data and corresponding normalized labels (mass, rim stiffness, disk stiffness) are compiled into a single serialized file.
- Graph elements include node coordinates, adjacency matrices, and shape IDs to trace back to original wheels if needed.
| Metric | Minimum | Maximum | Average |
|---|---|---|---|
| Number of nodes | ~600 | ~1700 | ~1000 |
| Number of edges | ~1900 | ~5100 | ~3300 |
| Number of faces(*) | ~1300 | ~4200 | ~2200 |
| Mass (kg) | ~15 | ~20 | ~17.5 |
(*) “Faces” here refer to the triangular faces in the polygon mesh before conversion. Depending on subdivision/clustering parameters, these numbers vary significantly.
Train–Validation–Test Split
- Typically, an 80–10–10 split (train–validation–test) is used.
- Min–Max scaling is applied to both node features (XYZ) and labels (mass, rim stiffness, disk stiffness).
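A minimal min–max scaler of the kind described, applied per coordinate axis; this is a generic sketch, not the authors' exact preprocessing code.

```python
import numpy as np

def min_max_scale(x, axis=0):
    """Min-max scale an array to [0, 1] along the given axis.

    Columns with zero range (constant features) are mapped to 0
    instead of dividing by zero.
    """
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / np.where(hi > lo, hi - lo, 1.0)
```

The same transform would be fit on the training split and reused on the validation and test splits to avoid leakage.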
File: graphs.pkl
- Contains a list of graph objects, each with:
- Node feature matrix (N × 3 for XYZ coordinates, normalized)
- Adjacency matrix (N × N, storing edge weights or connectivity)
- Label (mass, rim stiffness, or disk stiffness, normalized)
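A loader sketch for that file. The per-graph key names used here ('x', 'adj', 'y') are assumptions for illustration, since the exact field names inside graphs.pkl are not specified above.

```python
import pickle

def load_graphs(path):
    """Load the list of graph records from a graphs.pkl-style file.

    Each record is assumed to be a dict with keys 'x' (N x 3 node
    coordinates), 'adj' (N x N adjacency), and 'y' (normalized label);
    adjust the key names to match the actual serialization.
    """
    with open(path, 'rb') as f:
        graphs = pickle.load(f)
    for g in graphs:
        n = len(g['x'])
        assert len(g['adj']) == n, "adjacency must be N x N"
    return graphs
```

The returned list can then be wrapped into a Spektral or PyTorch Geometric dataset object for training.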
GNN Surrogate Modeling
- Researchers can feed graphs.pkl directly into frameworks like Spektral or PyTorch Geometric to train or evaluate a GNN for predicting mechanical performance.
Mesh Resolution Studies
- By comparing re-meshed versions, one can analyze how node count and clustering influence prediction accuracy and computational time.
Bayesian Optimization Experiments
- Ideal for iterative search of “best” subdivision/clustering parameters, balancing accuracy vs. training cost.
If you find the BMO-GNN Dataset useful, please cite:
@article{pa...