44 datasets found
  1. cnn_c1

    • kaggle.com
    Updated Mar 9, 2021
    Cite
    satya (2021). cnn_c1 [Dataset]. https://www.kaggle.com/satyapr/cnn-c1/code
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 9, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    satya
    Description

    Dataset

    This dataset was created by satya. It contains a CNN model trained on MNIST handwritten digits, reaching 98.3% accuracy.

    Contents

  2. mnist1d

    • huggingface.co
    • opendatalab.com
    Updated Oct 9, 2024
    Cite
    Christopher Akiki (2024). mnist1d [Dataset]. https://huggingface.co/datasets/christopher/mnist1d
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Oct 9, 2024
    Authors
    Christopher Akiki
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Note: This dataset card is based on the README file of the authors' GitHub repository: https://github.com/greydanus/mnist1d

      The MNIST-1D Dataset
    

    Most machine learning models get around the same ~99% test accuracy on MNIST. The MNIST-1D dataset is 100x smaller (default sample size: 4000+1000; dimensionality: 40) and does a better job of distinguishing between models with/without nonlinearity and models with/without spatial inductive biases. MNIST-1D is a core teaching dataset in… See the full description on the dataset page: https://huggingface.co/datasets/christopher/mnist1d.
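
    For quick exploration, a minimal loading sketch using the Hugging Face datasets library (the repository id comes from the citation above; the split and feature names are assumptions to verify after loading):

    # Minimal sketch: load the MNIST-1D dataset from the Hugging Face Hub.
    # The split name "train" and the feature layout are assumptions; check
    # the printed dataset description for the actual names.
    from datasets import load_dataset

    ds = load_dataset("christopher/mnist1d")
    print(ds)                       # inspect available splits and features

    train = ds["train"]             # assumed split name
    first = train[0]                # a single 40-dimensional example
    print(list(first.keys()))       # confirm the actual feature names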

  3. mnist_augmented

    • huggingface.co
    Updated Jul 27, 2025
    Cite
    Muhammad Anis Ur Rahman (2025). mnist_augmented [Dataset]. https://huggingface.co/datasets/ianisdev/mnist_augmented
    Explore at:
    Dataset updated
    Jul 27, 2025
    Authors
    Muhammad Anis Ur Rahman
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Dataset Card for mnist_augmented

    This dataset contains augmented versions of the MNIST dataset, created to benchmark how various augmentation strategies impact digit classification accuracy using deep learning models. The dataset is provided as a .zip file and must be unzipped before use. It follows the ImageFolder structure compatible with PyTorch and other DL frameworks.

      📥 Download & Extract
    

    wget… See the full description on the dataset page: https://huggingface.co/datasets/ianisdev/mnist_augmented.
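
    Since the archive follows the ImageFolder layout, a loading sketch along these lines should work once the zip is extracted (the directory name mnist_augmented/train is hypothetical; use whatever path the extracted zip actually contains):

    # Sketch: loading an unzipped ImageFolder-style copy with torchvision.
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.Grayscale(),      # MNIST-style single channel
        transforms.ToTensor(),
    ])

    train_ds = datasets.ImageFolder("mnist_augmented/train", transform=transform)
    train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)

    images, labels = next(iter(train_loader))
    print(images.shape, labels.shape)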

  4. Model comparison results using MNIST-C and MNIST-C-shape datasets.

    • plos.figshare.com
    xls
    Updated Jun 13, 2024
    Cite
    Seoyoung Ahn; Hossein Adeli; Gregory J. Zelinsky (2024). Model comparison results using MNIST-C and MNIST-C-shape datasets. [Dataset]. http://doi.org/10.1371/journal.pcbi.1012159.t001
    Explore at:
    xls (available download format)
    Dataset updated
    Jun 13, 2024
    Dataset provided by
    PLOS Computational Biology
    Authors
    Seoyoung Ahn; Hossein Adeli; Gregory J. Zelinsky
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Recognition accuracy (means and standard deviations from 5 trained models, hereafter referred to as model “runs”) from ORA and two CNN baselines, both of which were trained using identical CNN encoders (one a 2-layer CNN and the other a Resnet-18), and a CapsNet model following the implementation in [51].

  5. MNIST NET

    • kaggle.com
    Updated Feb 9, 2022
    Cite
    Ritvik Rastogi (2022). MNIST NET [Dataset]. https://www.kaggle.com/datasets/ritvik1909/mnist-net/discussion
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Feb 9, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Ritvik Rastogi
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset contains SOTA models finetuned on the MNIST Handwritten Digits Classification task. The inspiration behind this was to implement FID for the evaluation of GANs trained on MNIST data.

    Contents: • mnist_net: MobileNetV2 model, 98.7% accurate
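
    As background for the FID use case, a sketch of the Fréchet distance between two sets of feature activations; feats_real and feats_fake are placeholder (N, D) arrays that would come from a fixed classifier such as the finetuned MobileNetV2 above:

    # Generic Fréchet distance between feature statistics, as used in FID.
    import numpy as np
    from scipy import linalg

    def frechet_distance(feats_real, feats_fake):
        """feats_real, feats_fake: (N, D) arrays of classifier activations."""
        mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        c1 = np.cov(feats_real, rowvar=False)
        c2 = np.cov(feats_fake, rowvar=False)
        covmean = linalg.sqrtm(c1 @ c2)
        if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
            covmean = covmean.real
        return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2 * covmean))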

  6. Different implementations on MNIST object detection accuracy (%) with input...

    • plos.figshare.com
    xls
    Updated Dec 30, 2024
    Cite
    Reyhane Ahmadi; Amirreza Ahmadnejad; Somayyeh Koohi (2024). Different implementations on MNIST object detection accuracy (%) with input image size. [Dataset]. http://doi.org/10.1371/journal.pone.0313547.t002
    Explore at:
    xls (available download format)
    Dataset updated
    Dec 30, 2024
    Dataset provided by
    PLOS ONE
    Authors
    Reyhane Ahmadi; Amirreza Ahmadnejad; Somayyeh Koohi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Different implementations on MNIST object detection accuracy (%) with input image size.

  7. MNIST Preprocessed

    • kaggle.com
    Updated Jul 24, 2019
    Cite
    Valentyn Sichkar (2019). MNIST Preprocessed [Dataset]. https://www.kaggle.com/valentynsichkar/mnist-preprocessed/kernels
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 24, 2019
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Valentyn Sichkar
    Description

    📰 Related Paper

    Sichkar V. N. Effect of various dimension convolutional layer filters on traffic sign classification accuracy. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 2019, vol. 19, no. 3, pp. 546-552. DOI: 10.17586/2226-1494-2019-19-3-546-552 (Full text available at ResearchGate.net/profile/Valentyn_Sichkar)

    Test online with custom Traffic Sign here: https://valentynsichkar.name/mnist.html


    🎓 Related course for classification tasks

    Design, Train & Test deep CNN for Image Classification. Join the course & enjoy new opportunities to get deep learning skills: https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/

    Course slideshow: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/slideshow_classification.gif?raw=true


    🗺️ Concept Map of the Course

    Concept map: https://github.com/sichkar-valentyn/1-million-images-for-Traffic-Signs-Classification-tasks/blob/main/images/concept_map.png?raw=true


    👉 Join the Course

    https://www.udemy.com/course/convolutional-neural-networks-for-image-classification/


    Content

    This is ready-to-use preprocessed data saved into a pickle file.
    Preprocessing stages are as follows:
    - Normalizing the whole dataset by dividing by 255.0.
    - Splitting the data into three datasets: train, validation, and test.
    - Normalizing the data by subtracting the mean image and dividing by the standard deviation.
    - Transposing every dataset to make channels come first.


    The mean image and standard deviation were calculated from the train dataset and then applied to all three datasets.
    A user's image has to be preprocessed in the same way before classification: divided by 255.0, mean-image subtracted, and divided by the standard deviation.


    Data written as dictionary with following keys:
    x_train: (59000, 1, 28, 28)
    y_train: (59000,)
    x_validation: (1000, 1, 28, 28)
    y_validation: (1000,)
    x_test: (1000, 1, 28, 28)
    y_test: (1000,)
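
    A minimal sketch of loading the dictionary and reusing the pipeline on a new image (the data file name data.pickle is hypothetical, and mean_image/std must be the train-set statistics described above):

    import pickle
    import numpy as np

    # Hypothetical file name for the pickled dictionary described above.
    with open("data.pickle", "rb") as f:
        d = pickle.load(f)

    for k in ("x_train", "y_train", "x_validation", "y_validation", "x_test", "y_test"):
        print(k, d[k].shape)    # e.g. x_train (59000, 1, 28, 28)

    def preprocess(img, mean_image, std):
        """Apply the same pipeline to a raw 28x28 grayscale user image.
        mean_image and std must be the train-set statistics described above."""
        x = img.astype(np.float32) / 255.0
        x = (x - mean_image) / std
        return x.reshape(1, 1, 28, 28)   # channels first, batch of one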


    Contains pretrained weights model_params_ConvNet1.pickle for a model with the following architecture:
    Input --> Conv --> ReLU --> Pool --> Affine --> ReLU --> Affine --> Softmax


    Parameters:

    • Input is a 1-channel grayscale image.
    • The convolutional layer has 32 filters.
    • Pooling uses a 2x2 window with stride 2.
    • The hidden layer has 500 neurons.
    • The output layer has 10 neurons.


    The architecture can also be visualized as follows:
    Model architecture diagram: https://www.googleapis.com/download/storage/v1/b/kaggle-user-content/o/inbox%2F3400968%2Fc23041248e82134b7d43ed94307b720e%2FModel_1_Architecture_MNIST.png?generation=1563654250901965&alt=media

    Acknowledgements

    Initial data is MNIST, which was collected by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges.

  8. [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image...

    • data.niaid.nih.gov
    Updated Nov 28, 2024
    Cite
    Rui Shi (2024). [MedMNIST+] 18x Standardized Datasets for 2D and 3D Biomedical Image Classification with Multiple Size Options: 28 (MNIST-Like), 64, 128, and 224 [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_5208229
    Explore at:
    Dataset updated
    Nov 28, 2024
    Dataset provided by
    Zequan Liu
    Hanspeter Pfister
    Bilian Ke
    Bingbing Ni
    Donglai Wei
    Lin Zhao
    Rui Shi
    Jiancheng Yang
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Code [GitHub] | Publication [Nature Scientific Data'23 / ISBI'21] | Preprint [arXiv]

    Abstract

    We introduce MedMNIST, a large-scale MNIST-like collection of standardized biomedical images, including 12 datasets for 2D and 6 datasets for 3D. All images are pre-processed into 28x28 (2D) or 28x28x28 (3D) with the corresponding classification labels, so that no background knowledge is required for users. Covering primary data modalities in biomedical images, MedMNIST is designed to perform classification on lightweight 2D and 3D images with various data scales (from 100 to 100,000) and diverse tasks (binary/multi-class, ordinal regression and multi-label). The resulting dataset, consisting of approximately 708K 2D images and 10K 3D images in total, could support numerous research and educational purposes in biomedical image analysis, computer vision and machine learning. We benchmark several baseline methods on MedMNIST, including 2D / 3D neural networks and open-source / commercial AutoML tools. The data and code are publicly available at https://medmnist.com/.

    Disclaimer: The only official distribution link for the MedMNIST dataset is Zenodo. We kindly request users to refer to this original dataset link for accurate and up-to-date data.

    Update: We are thrilled to release MedMNIST+ with larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D. As a complement to the previous 28-size MedMNIST, the large-size version could serve as a standardized benchmark for medical foundation models. Install the latest API to try it out!

    Python Usage

    We recommend our official code to download, parse and use the MedMNIST dataset:

    % pip install medmnist
    % python

    To use the standard 28-size (MNIST-like) version utilizing the downloaded files:

    from medmnist import PathMNIST

    train_dataset = PathMNIST(split="train")

    To enable automatic downloading by setting download=True:

    from medmnist import NoduleMNIST3D

    val_dataset = NoduleMNIST3D(split="val", download=True)

    Alternatively, you can access MedMNIST+ with larger image sizes by specifying the size parameter:

    from medmnist import ChestMNIST

    test_dataset = ChestMNIST(split="test", download=True, size=224)
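
    The datasets plug into PyTorch directly; a sketch following the pattern of the official getting-started examples:

    # Sketch: wrapping a MedMNIST dataset in a PyTorch DataLoader.
    import torch.utils.data as data
    import torchvision.transforms as transforms
    from medmnist import PathMNIST

    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.5], std=[0.5]),
    ])

    train_dataset = PathMNIST(split="train", transform=transform, download=True)
    train_loader = data.DataLoader(train_dataset, batch_size=128, shuffle=True)

    images, labels = next(iter(train_loader))
    print(images.shape, labels.shape)   # e.g. (128, 3, 28, 28) and (128, 1)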

    Citation

    If you find this project useful, please cite both the v1 and v2 papers:

    Jiancheng Yang, Rui Shi, Donglai Wei, Zequan Liu, Lin Zhao, Bilian Ke, Hanspeter Pfister, Bingbing Ni. "MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification." Scientific Data, 2023.

    Jiancheng Yang, Rui Shi, Bingbing Ni. "MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis". IEEE 18th International Symposium on Biomedical Imaging (ISBI), 2021.

    or using bibtex:

    @article{medmnistv2,
      title={MedMNIST v2 - A large-scale lightweight benchmark for 2D and 3D biomedical image classification},
      author={Yang, Jiancheng and Shi, Rui and Wei, Donglai and Liu, Zequan and Zhao, Lin and Ke, Bilian and Pfister, Hanspeter and Ni, Bingbing},
      journal={Scientific Data},
      volume={10},
      number={1},
      pages={41},
      year={2023},
      publisher={Nature Publishing Group UK London}
    }

    @inproceedings{medmnistv1,
      title={MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis},
      author={Yang, Jiancheng and Shi, Rui and Ni, Bingbing},
      booktitle={IEEE 18th International Symposium on Biomedical Imaging (ISBI)},
      pages={191--195},
      year={2021}
    }

    Please also cite the corresponding paper(s) of source data if you use any subset of MedMNIST as per the description on the project website.

    License

    The MedMNIST dataset is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0), except DermaMNIST under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

    The code is under Apache-2.0 License.

    Changelog

    v3.0 (this repository): Released MedMNIST+ featuring larger sizes: 64x64, 128x128, and 224x224 for 2D, and 64x64x64 for 3D.

    v2.2: Removed a small number of mistakenly included blank samples in OrganAMNIST, OrganCMNIST, OrganSMNIST, OrganMNIST3D, and VesselMNIST3D.

    v2.1: Addressed an issue in the NoduleMNIST3D file (i.e., nodulemnist3d.npz). Further details can be found in this issue.

    v2.0: Launched the initial repository of MedMNIST v2, adding 6 datasets for 3D and 2 for 2D.

    v1.0: Established the initial repository (in a separate repository) of MedMNIST v1, featuring 10 datasets for 2D.

    Note: This dataset is NOT intended for clinical use.

  9. Comparison results (mean ± STD%) of different methods on the MNIST database.

    • plos.figshare.com
    xls
    Updated Jul 17, 2025
    Cite
    Jinshan Qi; Rui Xu (2025). Comparison results (mean ± STD%) of different methods on the MNIST database. [Dataset]. http://doi.org/10.1371/journal.pone.0326950.t006
    Explore at:
    xls (available download format)
    Dataset updated
    Jul 17, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Jinshan Qi; Rui Xu
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Comparison results (mean ± STD%) of different methods on the MNIST database.

  10. Data and code underlying the research of: CCO-ADC for CIM Accelerators

    • data.4tu.nl
    zip
    Updated Feb 16, 2024
    Cite
    Abhairaj Singh (2024). Data and code underlying the research of: CCO-ADC for CIM Accelerators [Dataset]. http://doi.org/10.4121/e6614bef-e325-4555-b53c-1e236b8b23cd.v1
    Explore at:
    zip (available download format)
    Dataset updated
    Feb 16, 2024
    Dataset provided by
    4TU.ResearchData
    Authors
    Abhairaj Singh
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    DAIS
    Description

    This dataset targets image classification applications. This work presents a memory-periphery co-design to perform accurate A/D conversions of analog matrix-vector-multiplication (MVM) outputs. A novel scheme is introduced where select-lines and bit-lines in the memory are virtually fixed to improve conversion accuracy and aid a ring-oscillator-based A/D conversion, equipped with component sharing and inter-matching of the reference blocks. In addition, we deploy a self-timed technique to further ensure high robustness, addressing global design and cycle-to-cycle variations. The concept is demonstrated using a 4Kb CIM chip prototype using resistive bitcells on TSMC 40nm CMOS technology. This dataset includes schematic netlist files, chip photos, raw data in the Excel sheets for latency and power estimations/simulation results, and Matlab codes for generating the graphs and figures in the associated publication.

  11. MNIST, CIFAR10, and FEMNIST datasets

    • service.tib.eu
    Updated Dec 3, 2024
    Cite
    Shulai Zhang, Zirui Li, Quan Chen, Wenli Zheng, Jingwen Leng, Minyi Guo (2024). Dataset: MNIST, CIFAR10, and FEMNIST datasets [Dataset]. https://doi.org/10.57702/dictjdlh. https://service.tib.eu/ldmservice/dataset/mnist--cifar10--and-femnist-datasets
    Explore at:
    Dataset updated
    Dec 3, 2024
    Description

    The MNIST, CIFAR10, and FEMNIST datasets are used to evaluate accuracy across various datasets.

  12. Dataset features.

    • plos.figshare.com
    xls
    Updated Jun 12, 2025
    Cite
    Zilong Deng; Yizhang Wang; Mustafa Muwafak Alobaedy (2025). Dataset features. [Dataset]. http://doi.org/10.1371/journal.pone.0326145.t001
    Explore at:
    xls (available download format)
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Zilong Deng; Yizhang Wang; Mustafa Muwafak Alobaedy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Federated clustering is a distributed clustering algorithm that does not require the transmission of raw data and is widely used. However, it struggles to handle Non-IID data effectively because it is difficult to obtain accurate global consistency measures under Non-Independent and Identically Distributed (Non-IID) conditions. To address this issue, we propose a federated k-means clustering algorithm based on a cluster backbone called FKmeansCB. First, we add Laplace noise to all the local data, and run k-means clustering on the client side to obtain cluster centers, which faithfully represent the cluster backbone (i.e., the data structures of the clusters). The cluster backbone represents the client's features and can approximately capture the features of different labeled data points in Non-IID situations. We then upload these cluster centers to the server. Subsequently, the server aggregates all cluster centers and runs the k-means clustering algorithm to obtain global cluster centers, which are then sent back to the client. Finally, the client assigns all data points to the nearest global cluster center to produce the final clustering results. We have validated the performance of our proposed algorithm using six datasets, including the large-scale MNIST dataset. Compared with the leading non-federated and federated clustering algorithms, FKmeansCB offers significant advantages in both clustering accuracy and running time.
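
    A minimal sketch of the described pipeline using scikit-learn; the Laplace noise scale and k are illustrative, and the paper's exact formulation may differ:

    import numpy as np
    from sklearn.cluster import KMeans

    def client_centers(x, k, noise_scale=0.1, seed=0):
        """Client side: add Laplace noise, then run local k-means."""
        rng = np.random.default_rng(seed)
        x_noisy = x + rng.laplace(scale=noise_scale, size=x.shape)
        return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(x_noisy).cluster_centers_

    def federated_kmeans(client_data, k):
        # 1) each client clusters its noised data and uploads its centers
        uploaded = np.vstack([client_centers(x, k, seed=i)
                              for i, x in enumerate(client_data)])
        # 2) the server clusters the uploaded centers into global centers
        global_centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(uploaded).cluster_centers_
        # 3) each client assigns its points to the nearest global center
        labels = [np.argmin(((x[:, None, :] - global_centers[None]) ** 2).sum(-1), axis=1)
                  for x in client_data]
        return global_centers, labels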

  13. Accuracy for the fashion-mnist data set.

    • plos.figshare.com
    xls
    Updated Jun 21, 2023
    Cite
    Raoul Heese; Jochen Schmid; Michał Walczak; Michael Bortz (2023). Accuracy for the fashion-mnist data set. [Dataset]. http://doi.org/10.1371/journal.pone.0279876.t006
    Explore at:
    xls (available download format)
    Dataset updated
    Jun 21, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Raoul Heese; Jochen Schmid; Michał Walczak; Michael Bortz
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Top-1 to top-5 accuracy of our naive and of our informed CASIMAC on the fashion-mnist data set. In the naive approach we use a purely Euclidean distance metric between the images, whereas the informed approach also takes the structural image similarity into account. The best scores are highlighted in bold.
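
    A sketch of what an informed distance of this kind could look like, using scikit-image's SSIM; how the paper actually combines the two metrics is not specified here, so the weighting below is an illustrative assumption:

    import numpy as np
    from skimage.metrics import structural_similarity

    def informed_distance(img_a, img_b, alpha=0.5):
        """img_a, img_b: 2D float arrays in [0, 1] (e.g. 28x28 fashion-mnist).
        Mixes Euclidean distance with (1 - SSIM); lower means more similar."""
        euclid = np.linalg.norm(img_a - img_b)
        ssim = structural_similarity(img_a, img_b, data_range=1.0)
        return alpha * euclid + (1 - alpha) * (1.0 - ssim)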

  14. Galaxy10 SDSS

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 16, 2024
    Cite
    Leung, W. Henry (2024). Galaxy10 SDSS [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10844811
    Explore at:
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    Leung, W. Henry
    Bovy, Jo
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Galaxy10 SDSS is a dataset containing 21,785 colored 69x69-pixel galaxy images (g, r, and i band) separated into 10 classes. Galaxy10 SDSS images come from the Sloan Digital Sky Survey and labels come from Galaxy Zoo.

    These classes are mutually exclusive, but Galaxy Zoo relies on human volunteers to classify galaxy images, and the volunteers do not agree on all images. For this reason, Galaxy10 only contains images for which more than 55% of the votes agree on the class; that is, more than 55% of the votes among the 10 classes are for a single class for that particular image. If none of the classes gets more than 55%, the image is not included in Galaxy10, as no agreement was reached. As a result, 21,785 images remain after the cut.

    The justification of 55% as the threshold is based on validation. Galaxy10 is meant to be an alternative to MNIST or Cifar10 as a deep learning toy dataset for astronomers, so astroNN.models.Cifar10_CNN was used with Cifar10 as a reference, and the validation was done on the same astroNN.models.Cifar10_CNN. A 50% threshold results in poor neural network classification accuracy: although the dataset then contains around 36,000 images, many are probably misclassified, and the neural network has a difficult time learning. A 60% threshold gives results similar to 55% (both reach classification accuracy similar to the Cifar10 dataset on the same network), but the 55% threshold includes more images in the dataset. Thus 55% was chosen as the threshold to cut the data.

    The original images are 424x424, but they were cropped to the central 207x207 region and then downscaled 3 times via bilinear interpolation to 69x69 in order to make them manageable for most computers and graphics card memory.

    There is no guarantee on the accuracy of the labels. Moreover, Galaxy10 is not a balanced dataset, and it should only be used for educational or experimental purposes. If you use Galaxy10 for research purposes, please cite Galaxy Zoo and the Sloan Digital Sky Survey.

    For more information on the original classification tree: Galaxy Zoo Decision Tree.
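
    A sketch of reading the Galaxy10 SDSS HDF5 file directly; the file and key names (Galaxy10.h5, images, ans) follow the astroNN documentation and should be verified against the downloaded copy:

    import h5py
    import numpy as np

    with h5py.File("Galaxy10.h5", "r") as f:
        images = np.array(f["images"])  # expected (21785, 69, 69, 3), g/r/i bands
        labels = np.array(f["ans"])     # expected (21785,), class ids 0-9

    print(images.shape, labels.shape)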

  15. Data from: SIDDA: SInkhorn Dynamic Domain Adaptation for Image...

    • data.niaid.nih.gov
    Updated Jan 23, 2025
    Cite
    Pandya, Sneh (2025). SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_14583106
    Explore at:
    Dataset updated
    Jan 23, 2025
    Dataset authored and provided by
    Pandya, Sneh
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Datasets used in the paper "SIDDA: SInkhorn Dynamic Domain Adaptation for Image Classification with Equivariant Neural Networks"

    Abstract:

    Modern deep learning models often do not generalize well in the presence of a "covariate shift"; that is, in situations where the training and test data distributions differ, but the conditional distribution of classification labels given the data remains unchanged. In such cases, neural network (NN) generalization can be reduced to a problem of learning more robust, domain-invariant features that enable the correct alignment of the two datasets in the network's latent space. Domain adaptation (DA) methods include a broad range of techniques aimed at achieving this, which allows the model to perform well on multiple datasets. However, these methods have struggled with the need for extensive hyperparameter tuning, which then incurs significant computational costs. In this work, we introduce SIDDA, an out-of-the-box DA training algorithm built upon the Sinkhorn divergence, that can achieve effective domain alignment with minimal hyperparameter tuning and computational overhead. We demonstrate the efficacy of our method on multiple simulated and real datasets of varying complexity, including simple shapes, handwritten digits, and real astronomical observational data. These datasets include covariate shifts induced by noise and blurring, as well as more complex differences between real astronomical data observed by different telescopes. SIDDA is compatible with a variety of NN architectures, and it works particularly well in improving classification accuracy and model calibration when paired with equivariant neural networks (ENNs), which respect data symmetries by design. We find that SIDDA consistently improves the generalization capabilities of NNs, enhancing classification accuracy in unlabeled target data by up to 40%. Simultaneously, the inclusion of SIDDA during training can improve performance on the labeled source data, though with a more modest enhancement of approximately 1%. We also study the efficacy of DA on ENNs with respect to the varying group orders of the dihedral group D_N, and find that the model performance improves as the degree of equivariance increases. Finally, we find that SIDDA can also improve the model calibration on both source and target data. The largest improvements are obtained when the model is applied to the unlabeled target domain, reaching more than an order of magnitude improvement in both the expected calibration error and the Brier score. SIDDA's versatility across various NN models and datasets, combined with its automated approach to domain alignment, has the potential to significantly advance multi-dataset studies by enabling the development of highly generalizable models.
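
    As a sketch of the basic ingredient SIDDA builds on (not the paper's full algorithm), a Sinkhorn-divergence alignment term between source and target latent features using the geomloss library:

    import torch
    from geomloss import SamplesLoss

    # Sinkhorn divergence between two point clouds of latent features.
    sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

    def da_loss(clf_loss, z_source, z_target, lam=1.0):
        """Total loss = supervised source loss + lam * domain alignment term."""
        return clf_loss + lam * sinkhorn(z_source, z_target)

    z_s = torch.randn(64, 128)   # placeholder latent features
    z_t = torch.randn(64, 128)
    print(sinkhorn(z_s, z_t).item())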

    Datasets:

    Dataset directories include train and test subdirectories, which include the source and target domain data within them. The simulated datasets of shapes and astronomical objects were generated using DeepBench, with code for noise and PSF blurring found on our Github. The MNIST-M dataset can be found publicly, and the Galaxy Zoo Evo dataset can be accessed following the steps on HuggingFace. Data was split into an 80%/20% train/test split.

    Simulated shapes:
      train: source, target (noise)
      test: source, target (noise)

    Simulated astronomical objects:
      train: source, target (noise)
      test: source, target (noise)

    MNIST-M:
      train: source, target (noise), target (PSF)
      test: source, target (noise), target (PSF)

    Galaxy Zoo Evo:
      train: source (GZ SDSS), target (GZ DESI)
      test: source (GZ SDSS), target (GZ DESI)

    Paper Data:

    Data for generating Figures 4 and 5 in the paper are included in isomap_plot_data.zip and js_distances_group_order.zip, respectively. The code for generating the figures can be found in the notebooks on our Github. Figures 2 and 3 are visualizations of the datasets included here.

  16. LLM prompts in the context of machine learning

    • kaggle.com
    Updated Jul 1, 2024
    Cite
    Jordan Nelson (2024). LLM prompts in the context of machine learning [Dataset]. https://www.kaggle.com/datasets/jordanln/llm-prompts-in-the-context-of-machine-learning
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Jul 1, 2024
    Dataset provided by
    Kaggle
    Authors
    Jordan Nelson
    License

    CC0 1.0 Universal: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is an extension of my previous work on creating a dataset for natural language processing tasks. It leverages binary representation to characterise various machine learning models. The attributes in the dataset are derived from a dictionary, which was constructed from a corpus of prompts typically provided to a large language model (LLM). These prompts reference specific machine learning algorithms and their implementations. For instance, consider a user asking an LLM or a generative AI to create a Multi-Layer Perceptron (MLP) model for a particular application. By applying this concept to multiple machine learning models, we constructed our corpus. This corpus was then transformed into the current dataset using a bag-of-words approach. In this dataset, each attribute corresponds to a word from our dictionary, represented as a binary value: 1 indicates the presence of the word in a given prompt, and 0 indicates its absence. At the end of each entry, there is a label. Each entry in the dataset pertains to a single class, where each class represents a distinct machine learning model or algorithm. This dataset is intended for multi-class classification tasks, not multi-label classification, as each entry is associated with only one label and does not belong to multiple labels simultaneously.

    This dataset has been utilised with a Convolutional Neural Network (CNN) using the Keras Automodel API, achieving impressive training and testing accuracy rates exceeding 97%. Post-training, the model's predictive performance was rigorously evaluated in a production environment, where it continued to demonstrate exceptional accuracy. For this evaluation, we employed a series of questions, which are listed below. These questions were intentionally designed to be similar to ensure that the model can effectively distinguish between different machine learning models, even when the prompts are closely related.
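
    A minimal sketch of the binary bag-of-words encoding described above, using scikit-learn; the prompts and labels are illustrative stand-ins for the dataset's corpus:

    from sklearn.feature_extraction.text import CountVectorizer

    prompts = [
        "Can you create a KNN model for classifying handwritten digits?",
        "Can you create a decision tree classifier for me?",
    ]
    labels = ["KNN", "DecisionTree"]

    vectorizer = CountVectorizer(binary=True)   # 1 = word present, 0 = absent
    X = vectorizer.fit_transform(prompts).toarray()
    print(X.shape, labels[0])                   # one binary row per prompt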

    KNN
    How would you create a KNN model to classify emails as spam or not spam based on their content and metadata? How could you implement a KNN model to classify handwritten digits using the MNIST dataset? How would you use a KNN approach to build a recommendation system for suggesting movies to users based on their ratings and preferences? How could you employ a KNN algorithm to predict the price of a house based on features such as its location, size, and number of bedrooms etc? Can you create a KNN model for classifying different species of flowers based on their petal length, petal width, sepal length, and sepal width? How would you utilise a KNN model to predict the sentiment (positive, negative, or neutral) of text reviews or comments? Can you create a KNN model for me that could be used in malware classification? Can you make me a KNN model that can detect a network intrusion when looking at encrypted network traffic? Can you make a KNN model that would predict the stock price of a given stock for the next week? Can you create a KNN model that could be used to detect malware when using a dataset relating to certain permissions a piece of software may have access to?

    Decision Tree
    Can you describe the steps involved in building a decision tree model to classify medical images as malignant or benign for cancer diagnosis and return a model for me? How can you utilise a decision tree approach to develop a model for classifying news articles into different categories (e.g., politics, sports, entertainment) based on their textual content? What approach would you take to create a decision tree model for recommending personalised university courses to students based on their academic strengths and weaknesses? Can you describe how to create a decision tree model for identifying potential fraud in financial transactions based on transaction history, user behaviour, and other relevant data? In what ways might you apply a decision tree model to classify customer complaints into different categories determining the severity of language used? Can you create a decision tree classifier for me? Can you make me a decision tree model that will help me determine the best course of action across a given set of strategies? Can you create a decision tree model for me that can recommend certain cars to customers based on their preferences and budget? How can you make a decision tree model that will predict the movement of star constellations in the sky based on data provided by the NASA website? How do I create a decision tree for time-series forecasting?

    Random Forest
    Can you describe the steps involved in building a random forest model to classify different types of anomalies in network traffic data for cybersecurity purposes and return the code for me? In what ways could you implement a random forest model to predict the severity of traffic congestion in urban areas based on historical traffic patterns, weather...

  17. American Sign Language dataset for semantic communications

    • zenodo.org
    • ieee-dataport.org
    zip
    Updated Jan 12, 2025
    Cite
    Vasileios Kouvakis; Lamprini Mitsiou; Stylianos E. Trevlakis; Alexandros-Apostolos A. Boulogeorgos; Theodoros Tsiftsis (2025). American Sign Language dataset for semantic communications [Dataset]. http://doi.org/10.21227/2c1z-8j21
    Explore at:
    zip (available download format)
    Dataset updated
    Jan 12, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Vasileios Kouvakis; Lamprini Mitsiou; Stylianos E. Trevlakis; Alexandros-Apostolos A. Boulogeorgos; Theodoros Tsiftsis
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    United States
    Description

    The dataset was developed as part of the NANCY project (https://nancy-project.eu/) to support tasks in the computer vision area. It is specifically designed for sign language recognition, focusing on representing joints and finger positions. The dataset comprises images of hands that represent the alphabet in American Sign Language (ASL), with the exception of the letters "J" and "Z," as these involve motion and the dataset is limited to static images. A significant feature of the dataset is the use of color-coding, where each finger is associated with a distinct color. This approach enhances the ability to extract features and distinguish between different fingers, offering significant advantages over traditional grayscale datasets like MNIST. The dataset consists of RGB images, which enhance the recognition process and support more effective learning, achieving high performance even with a relatively modest amount of training data. This format improves the ability to discriminate and extract features compared to grayscale images. Although the use of RGB images introduces additional complexity, such as increased data representation and storage requirements, the advantages in accuracy and feature extraction make it a valuable choice. The dataset is well-suited for applications involving gesture recognition, sign language interpretation, and other tasks requiring detailed analysis of joint and finger positions.

    The NANCY project has received funding from the Smart Networks and Services Joint Undertaking (SNS JU) under the European Union's Horizon Europe research and innovation programme under Grant Agreement No 101096456.

  18. notMNIST

    • opendatalab.com
    • datasets.activeloop.ai
    • +3more
    zip
    Updated Sep 8, 2011
    Cite
    Vrije University Amsterdam (2011). notMNIST [Dataset]. https://opendatalab.com/OpenDataLab/notMNIST
    Explore at:
    zip (available download format)
    Dataset updated
    Sep 8, 2011
    Dataset provided by
    University of Amsterdam
    Vrije University Amsterdam
    Skoltech
    Description

    The authors took some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A-J taken from different fonts. Judging by the examples, one would expect this to be a harder task than MNIST. This seems to be the case: logistic regression on top of a stacked auto-encoder with fine-tuning gets about 89% accuracy, whereas the same approach gets about 98% on MNIST. The dataset consists of a small hand-cleaned part of about 19k instances and a large uncleaned part of 500k instances. The two parts have approximately 0.5% and 6.5% label error rates, respectively, estimated by looking through glyphs and counting how often the guessed letter didn't match its unicode value in the font file.
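
    For reference, a sketch of a plain logistic-regression baseline on raw pixels (the notMNIST_small directory layout of one folder of 28x28 PNGs per letter is assumed; the quoted ~89% figure used a stacked auto-encoder in place of raw-pixel features):

    import pathlib
    import numpy as np
    from PIL import Image
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = [], []
    folders = sorted(p for p in pathlib.Path("notMNIST_small").iterdir() if p.is_dir())
    for label, folder in enumerate(folders):
        for png in folder.glob("*.png"):
            try:
                X.append(np.asarray(Image.open(png), dtype=np.float32).ravel() / 255.0)
                y.append(label)
            except OSError:      # the uncleaned portion contains unreadable files
                continue

    X_tr, X_te, y_tr, y_te = train_test_split(
        np.array(X), np.array(y), test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))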

  19. The influence of the privacy budget parameter (ε) on clustering results (ARI).

    • plos.figshare.com
    • figshare.com
    xls
    Updated Jun 12, 2025
    Cite
    Zilong Deng; Yizhang Wang; Mustafa Muwafak Alobaedy (2025). The influence of parameters privacy budget () on clustering results (ARI). [Dataset]. http://doi.org/10.1371/journal.pone.0326145.t003
    Explore at:
    xls (available download format)
    Dataset updated
    Jun 12, 2025
    Dataset provided by
    PLOS ONE
    Authors
    Zilong Deng; Yizhang Wang; Mustafa Muwafak Alobaedy
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The influence of the privacy budget parameter (ε) on clustering results (ARI).

  20. Replication Data for: Exploring Neural Network Weaknesses: Insights from...

    • search.dataone.org
    Updated Dec 16, 2023
    Cite
    Zhang, Jun-Jie; Deyu Meng (2023). Replication Data for: Exploring Neural Network Weaknesses: Insights from Quantum Principles [Dataset]. http://doi.org/10.7910/DVN/SWDL1S
    Explore at:
    Dataset updated
    Dec 16, 2023
    Dataset provided by
    Harvard Dataverse
    Authors
    Zhang, Jun-Jie; Deyu Meng
    Description

    The dataset contains the code and raw data for exploring the accuracy-robustness trade-off from the uncertainty principle in quantum physics. The folder contains two sub-folders: "data upload" and "figure&plot".

    In "data upload", the three network structures are used for cifar-10 and mnist. Take the sub-sub-folder "cifar conv" as an example. One starts with the two notebooks named "selected_train_netwrok1_test2.ipynb" and "selected_train_netwrok2_test2.ipynb": the former performs the training of the complete convolutional network, while the latter divides the convolutional layers into two parts, a feature extractor and a classifier. After running the two notebooks, the weights of the networks at each training epoch are saved in the folder "model". Then one runs the other two notebooks, "scanner-x.ipynb" and "scanner-feature-crt.ipynb": the former performs the Monte-Carlo integrations on multiple GPUs with respect to the normalized loss function of the complete convolutional network, while the latter only integrates the classifiers (the second part of the complete convolutional network). Last, one opens the notebook "plotter.ipynb" to see the results.

    In "figure&plot" we mainly plot the figures in the paper. The txt files are simply copied from the "data upload" folder. To see the figures, one needs to open the file "plot.nb" with Mathematica.
