100+ datasets found
  1. Data scaling using machine learning

    • kaggle.com
    zip
    Updated May 9, 2024
    Cite
    Muhammad Abbas (2024). Data scaling using machine learning [Dataset]. https://www.kaggle.com/datasets/muuhamadabbas/data-scaling-using-machine-learning
    Available download formats: zip (1688 bytes)
    Dataset updated
    May 9, 2024
    Authors
    Muhammad Abbas
    Description

    Dataset

    This dataset was created by Muhammad Abbas

    Contents

  2. Data from: Data Scaling and Generalization Insights for Medicinal Chemistry...

    • datasetcatalog.nlm.nih.gov
    Updated Jun 2, 2025
    Cite
    Chen, Jacky; Tynan, Jonathan; Yang, Song; Cheng, Alan C.; Cheng, Chen; Chung, Yunsie (2025). Data Scaling and Generalization Insights for Medicinal Chemistry Deep Learning Models [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002061833
    Dataset updated
    Jun 2, 2025
    Authors
    Chen, Jacky; Tynan, Jonathan; Yang, Song; Cheng, Alan C.; Cheng, Chen; Chung, Yunsie
    Description

    Predictive models hold considerable promise in enabling the faster discovery of safer, more efficacious therapeutics. To better understand and improve the performance of small-molecule predictive models for drug discovery, we conduct multiple experiments with deep learning and traditional machine learning approaches, leveraging our large internal data sets as well as publicly available data sets. The experiments include assessing performance on random, temporal, and reverse-temporal data ablation tasks as well as tasks testing model extrapolation to different property spaces. We identify factors that contribute to the higher performance of predictive models built using graph neural networks compared to traditional methods such as XGBoost and random forest. These insights were successfully used to develop a scaling relationship that explains 81% of the variance in model performance across various assays and data regimes. This relationship can be used to estimate the performance of models for ADMET (absorption, distribution, metabolism, excretion, and toxicity) end points, as well as for drug discovery assay data more broadly. The findings offer guidance for further improving model performance in drug discovery.

  3. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
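
    The embedding and clipping steps described above can be sketched in NumPy. This is only an illustration under stated assumptions: the dataset itself was produced with Matlab's imresize(), the helper name embed_centred is hypothetical, and the random array stands in for an already rescaled image.

```python
import numpy as np

def embed_centred(image, out_size=72):
    """Place a 2-D image at the centre of a black (all-zero) square canvas."""
    h, w = image.shape
    if h > out_size or w > out_size:
        raise ValueError("image is larger than the output canvas")
    canvas = np.zeros((out_size, out_size), dtype=image.dtype)
    top = (out_size - h) // 2
    left = (out_size - w) // 2
    canvas[top:top + h, left:left + w] = image
    return canvas

# Stand-in for a rescaled image; bicubic interpolation can overshoot [0, 255].
rescaled = np.random.default_rng(0).uniform(-10.0, 265.0, size=(28, 28))
clipped = np.clip(rescaled, 0.0, 255.0)   # remove the overshoot
embedded = embed_centred(clipped)
print(embedded.shape)  # (72, 72)
```

    At the largest scale factor listed for the test files (2), a 28×28 image becomes 56×56, which still fits inside the 72×72 canvas.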

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
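
    As a rough sketch of the split just described, the following uses index arrays in place of the actual images. The total of 60 000 follows from the 50 000 + 10 000 split stated above; the variable names are illustrative only.

```python
import numpy as np

# Index stand-in for the original Fashion-MNIST training set (50 000 + 10 000 images).
n_original = 60_000
indices = np.arange(n_original)

train_idx = indices[:50_000]    # initial 50 000 samples -> training
val_idx = indices[-10_000:]     # final 10 000 samples -> validation

# The two subsets are disjoint and together cover the original training set.
overlap = np.intersect1d(train_idx, val_idx)
print(len(train_idx), len(val_idx), overlap.size)  # 50000 10000 0
```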

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
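
    The scale tags in the file names above follow from the scaling factors 2^(k/4) rounded to three decimals. A small sketch that regenerates the nine test file names; the helper names scale_tag and scale_file_name are hypothetical and not part of the dataset's tooling:

```python
def scale_tag(scale):
    """Format a scale factor as in the file names, e.g. 0.5 -> '0p500'."""
    return f"{scale:.3f}".replace(".", "p")

def scale_file_name(k, prefix="fashionmnist", out_size=72):
    """Build the test file name for the scaling factor 2^(k/4)."""
    scale = 2.0 ** (k / 4)
    return (f"{prefix}_with_scale_variations_te10000_"
            f"outsize{out_size}-{out_size}_scte{scale_tag(scale)}.h5")

test_files = [scale_file_name(k) for k in range(-4, 5)]
for name in test_files:
    print(name)
```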

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

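    import h5py
    import numpy as np

    # Training file for scale 1 (named above); it also holds the validation and test splits
    filename = "fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5"
    with h5py.File(filename, "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)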

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

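    # Any of the nine test files listed above; the scale-2 file is used here as an example
    filename = "fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5"
    with h5py.File(filename, "r") as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)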

    The test datasets can be loaded in Matlab as:

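    x_test = h5read(filename, '/x_test');  % filename: path to one of the test files above
    y_test = h5read(filename, '/y_test');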

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
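
    Because the stored pixel values are unnormalised, a typical preprocessing step rescales them to [0, 1] and moves the channel axis forward. The sketch below applies both steps to an array in the stored layout; the function name to_network_input and the division by 255 are assumptions for illustration, not a procedure prescribed by the dataset authors.

```python
import numpy as np

def to_network_input(x):
    """Convert [N, x_dim, y_dim, C] arrays in [0, 255] to [N, C, x_dim, y_dim] in [0, 1]."""
    x = np.asarray(x, dtype=np.float32)
    x = np.transpose(x, (0, 3, 1, 2))   # channels-last -> channels-first
    return x / 255.0

# Synthetic stand-in with the stored layout of the Fashion-MNIST files.
batch = np.random.default_rng(0).integers(0, 256, size=(8, 72, 72, 1))
inputs = to_network_input(batch)
print(inputs.shape, inputs.dtype)  # (8, 1, 72, 72) float32
```

    The function expects data as read from the file, so it should not be combined with a separate np.transpose call on the same array.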

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  4. Data from: Discovering Organic Reactions with a Machine-Learning-Powered...

    • figshare.com
    txt
    Updated Feb 17, 2025
    Cite
    Konstantin S. Kozlov; Daniil Boiko; Julia V. Burykina; Valentina V. Ilyushenkova; Alexander Yu. Kostyukovich; Ekaterina D. Patil; Valentine Ananikov (2025). Discovering Organic Reactions with a Machine-Learning-Powered Deciphering of Tera-Scale Mass Spectrometry Data [Dataset]. http://doi.org/10.6084/m9.figshare.27949029.v1
    Available download formats: txt
    Dataset updated
    Feb 17, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    Konstantin S. Kozlov; Daniil Boiko; Julia V. Burykina; Valentina V. Ilyushenkova; Alexander Yu. Kostyukovich; Ekaterina D. Patil; Valentine Ananikov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The accumulation of large datasets by the scientific community has surpassed the capacity of traditional processing methods, underscoring the critical need for innovative and efficient algorithms capable of navigating through extensive existing experimental data. Addressing this challenge, our study introduces a machine learning (ML)-powered search engine specifically tailored for analyzing tera-scale high-resolution mass spectrometry (HRMS) data. This engine harnesses a novel isotope-distribution-centric search algorithm augmented by two synergistic ML models, assisting with discovery of hitherto unknown chemical reactions. This methodology enables the rigorous investigation of existing data, thus providing efficient support for chemical hypotheses while reducing the need for conducting additional experiments. Moreover, we extend this approach with baseline methods for automated reaction hypothesis generation. In its practical validation, our approach successfully identified several reactions, unveiling previously undescribed transformations. Among these, the heterocycle-vinyl coupling process within the Mizoroki-Heck reaction stands out, highlighting the capability of the engine to elucidate complex chemical phenomena.

  5. Data from: Auto-scaling dataset based on the gym-hpa framework

    • data-staging.niaid.nih.gov
    • data.niaid.nih.gov
    Updated Jun 9, 2023
    Cite
    Santos, Jose; Wauters, Tim; Volckaert, Bruno; De Turck, Filip (2023). Auto-scaling dataset based on the gym-hpa framework [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_7944660
    Dataset updated
    Jun 9, 2023
    Dataset provided by
    Ghent University - imec - IDLab
    Authors
    Santos, Jose; Wauters, Tim; Volckaert, Bruno; De Turck, Filip
    Description

    gym-hpa is a custom OpenAI Gym environment for training Reinforcement Learning (RL) agents for auto-scaling research on the Kubernetes (K8s) platform.

    Two environments exist based on the Redis Cluster and Online Boutique applications.

    Two collected datasets are shared here. The code has been released here: https://github.com/jpedro1992/gym-hpa

    Related Publication: Santos, J. et al. "gym-hpa: Efficient auto-scaling via reinforcement learning for complex microservice-based applications in Kubernetes." NOMS2023, the IEEE/IFIP Network Operations and Management Symposium. 2023.

  6. University students' educational scale data

    • kaggle.com
    zip
    Updated Apr 26, 2023
    Cite
    tommychihsinlee0120 (2023). University students' educational scale data [Dataset]. https://www.kaggle.com/datasets/tommychihsinlee0120/mymasterthesisdata
    Available download formats: zip (169346 bytes)
    Dataset updated
    Apr 26, 2023
    Authors
    tommychihsinlee0120
    Description

    This is my master's thesis dataset. The purpose of the thesis is to explore the impact of 4 different kinds of learning approach on self-efficacy, engagement, attention, achievement and brainwave signals. Try to make some visualizations and find something cool in the trends, relationships or distributions in this dataset.

  7. Data from: A Large-Scale Dataset of Twitter Chatter about Online Learning...

    • ieee-dataport.org
    Updated Aug 10, 2022
    + more versions
    Cite
    Nirmalya Thakur (2022). A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave [Dataset]. https://ieee-dataport.org/documents/large-scale-dataset-twitter-chatter-about-online-learning-during-current-covid-19-omicron
    Dataset updated
    Aug 10, 2022
    Authors
    Nirmalya Thakur
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    no. 8

  8. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. So that all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
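
    Mirror extension can be illustrated with NumPy padding. This is a sketch for the scale-1 case only (32×32 to 64×64), and it assumes NumPy's "symmetric" mode, in which the border pixel itself is repeated; the exact mirroring convention used to build the dataset is described in [1] and may differ. The helper name mirror_extend is hypothetical.

```python
import numpy as np

def mirror_extend(image, out_size=64):
    """Pad an [H, W, C] image to [out_size, out_size, C] by mirroring at the borders."""
    h, w, _ = image.shape
    pad_h, pad_w = out_size - h, out_size - w
    if pad_h < 0 or pad_w < 0:
        raise ValueError("image is larger than the requested output size")
    top, left = pad_h // 2, pad_w // 2
    pad = ((top, pad_h - top), (left, pad_w - left), (0, 0))
    return np.pad(image, pad, mode="symmetric")   # assumption: border pixel repeated

image = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
extended = mirror_extend(image)
print(extended.shape)  # (64, 64, 3)
```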

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

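    import h5py
    import numpy as np

    # Training file for scale 1 (named above); it also holds the validation and test splits
    filename = "cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5"
    with h5py.File(filename, "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)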

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

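    # Any of the nine test files listed above; the scale-2 file is used here as an example
    filename = "cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5"
    with h5py.File(filename, "r") as f:
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)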

    The test datasets can be loaded in Matlab as:

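    x_test = h5read(filename, '/x_test');  % filename: path to one of the test files above
    y_test = h5read(filename, '/y_test');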

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

  9. Data from: Transfer Learning Under Large-Scale Low-Rank Regression Models

    • tandf.figshare.com
    pdf
    Updated Oct 17, 2025
    Cite
    Seyoung Park; Eun Ryung Lee; Hyunjin Kim; Hongyu Zhao (2025). Transfer Learning Under Large-Scale Low-Rank Regression Models [Dataset]. http://doi.org/10.6084/m9.figshare.30153080.v2
    Available download formats: pdf
    Dataset updated
    Oct 17, 2025
    Dataset provided by
    Taylor & Francis
    Authors
    Seyoung Park; Eun Ryung Lee; Hyunjin Kim; Hongyu Zhao
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In high-dimensional multiple response regression problems, the large dimensionality of the coefficient matrix poses a challenge to parameter estimation. To address this challenge, low-rank matrix estimation methods have been developed to facilitate parameter estimation in the high-dimensional regime, where the number of parameters increases with sample size. Despite these methodological advances, accurately predicting multiple responses with limited target data remains a difficult task. To gain statistical power, the use of diverse datasets from source domains has emerged as a promising approach. In this article, we focus on the problem of transfer learning in a high-dimensional multiple response regression framework, which aims to improve estimation accuracy by transferring knowledge from informative source datasets. To reduce potential performance degradation due to the transfer of knowledge from irrelevant sources, we propose a novel transfer learning procedure including the forward selection of informative source sets. In particular, our forward source selection method is new compared to existing transfer learning framework, offering deeper theoretical insights and substantial methodological innovations. Theoretical results show that the proposed estimator achieves a faster convergence rate than the single-task penalized estimator using only target data. In addition, we develop an alternative transfer learning based on non-convex penalization to ensure rank consistency. Through simulations and real data experiments, we provide empirical evidence for the effectiveness of the proposed method and for its superiority over other methods. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

  10. Data from: Industry-scale Application and Evaluation of Deep Learning for...

    • data.europa.eu
    • data.niaid.nih.gov
    unknown
    Updated Apr 20, 2020
    Cite
    Zenodo (2020). Industry-scale Application and Evaluation of Deep Learning for Drug Target Prediction [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-3239499?locale=hr
    Available download formats: unknown (2146127270)
    Dataset updated
    Apr 20, 2020
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Artificial intelligence (AI) is undergoing a revolution thanks to the breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent works on publicly available pharmaceutical data showed that AI methods are highly promising for Drug Target prediction. However, the quality of public data might be different than that of industry data due to different labs reporting measurements, different measurement techniques, fewer samples and less diverse and specialized assays. As part of a European funded project (ExCAPE), that brought together expertise from pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed maintain their predictive power to a large degree when applied to industry data. Moreover, we observed that deep learning derived machine learning models outperformed comparable models, which were trained by other machine learning algorithms, when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study evaluating the potential of machine learning and especially deep learning directly at the level of industry-scale settings and moreover investigating the transferability of publicly learned target prediction models towards industrial bioactivity prediction pipelines.

  11. Learning Curves Database 1.1

    • data.4tu.nl
    zip
    Updated May 27, 2025
    Cite
    Cheng Yan; Felix Mohr; Tom Viering (2025). Learning Curves Database 1.1 [Dataset]. http://doi.org/10.4121/3bd18108-fad0-4e4c-affd-4341fba99306.v1
    Available download formats: zip
    Dataset updated
    May 27, 2025
    Dataset provided by
    4TU.ResearchData
    Authors
    Cheng Yan; Felix Mohr; Tom Viering
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Sample-wise learning curves plot performance versus training set size. They are useful for studying scaling laws and speeding up hyperparameter tuning and model selection. Learning curves are often assumed to be well-behaved: monotone (i.e. improving with more data) and convex. By constructing the Learning Curves Database 1.1 (LCDB 1.1), a large-scale database with high-resolution learning curves including more modern learners (CatBoost, TabNet, RealMLP and TabPFN), we show that learning curves are less often well-behaved than previously thought. Using statistically rigorous methods, we observe significant ill-behavior in approximately 15% of the learning curves, almost twice as much as in previous estimates. We also identify which learners are to blame and show that specific learners are more ill-behaved than others. Additionally, we demonstrate that different feature scalings rarely resolve ill-behavior. We evaluate the impact of ill-behavior on downstream tasks, such as learning curve fitting and model selection, and find it poses significant challenges, underscoring the relevance and potential of LCDB 1.1 as a challenging benchmark for future research.
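
    To make "well-behaved" concrete, the sketch below checks an error-rate learning curve for monotonicity (error never increases with more data) and convexity (slopes between consecutive training sizes never decrease). These are plain deterministic checks on made-up curves, not the statistically rigorous tests used for LCDB 1.1; the function names and the example numbers are illustrative assumptions.

```python
import numpy as np

def is_monotone(errors, tol=0.0):
    """True if the error never increases by more than tol as training size grows."""
    return bool(np.all(np.diff(errors) <= tol))

def is_convex(sizes, errors, tol=0.0):
    """True if the slopes between consecutive anchors never decrease by more than tol."""
    slopes = np.diff(errors) / np.diff(sizes)
    return bool(np.all(np.diff(slopes) >= -tol))

sizes = np.array([16, 32, 64, 128, 256], dtype=float)
well_behaved = np.array([0.40, 0.30, 0.24, 0.21, 0.20])
ill_behaved = np.array([0.40, 0.30, 0.33, 0.25, 0.24])   # error rises at n = 64

print(is_monotone(well_behaved), is_convex(sizes, well_behaved))  # True True
print(is_monotone(ill_behaved), is_convex(sizes, ill_behaved))    # False False
```

    With noisy, repeated measurements a nonzero tol (or a proper statistical test) would be needed before calling a curve ill-behaved.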

  12. Making Predictions using Large Scale Gaussian Processes

    • catalog.data.gov
    • s.cnmilf.com
    • +1more
    Updated Aug 22, 2025
    + more versions
    Cite
    Dashlink (2025). Making Predictions using Large Scale Gaussian Processes [Dataset]. https://catalog.data.gov/dataset/making-predictions-using-large-scale-gaussian-processes
    Dataset updated
    Aug 22, 2025
    Dataset provided by
    Dashlink
    Description

    One of the key problems that arises in many areas is to estimate a potentially nonlinear function $G(x, \theta)$ given input and output samples $(x, y)$ so that $y \approx G(x, \theta)$. There are many approaches to addressing this regression problem. Neural networks, regression trees, and many other methods have been developed to estimate $G$ given the input-output pair $(x, y)$. One method that I have worked with is called Gaussian process regression. There are many good texts and papers on the subject. For more technical information on the method and its applications see: http://www.gaussianprocess.org/ A key problem that arises in developing these models on very large data sets is that it ends up requiring an $O(N^3)$ computation, where N is the number of data points in the training sample. Obviously this becomes very problematic when N is large. I discussed this problem with Leslie Foster, a mathematics professor at San Jose State University. He, along with some of his students, developed a method to address this problem based on Cholesky decomposition and pivoting. He also shows that this leads to a numerically stable result. If you're interested in some light reading, I'd suggest you take a look at his recent paper (which was accepted in the Journal of Machine Learning Research) posted on dashlink. We've also posted code for you to try it out. Let us know how it goes. If you are interested in applications of this method in the area of prognostics, check out our new paper on the subject which was published in IEEE Transactions on Systems, Man, and Cybernetics.
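
    To show where the cubic cost comes from, here is a minimal sketch of standard (exact) Gaussian process regression in NumPy. It is not the pivoted-Cholesky method developed by Foster and his students; the squared-exponential kernel, the noise level, and the toy sine data are all assumptions chosen for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(x_train, y_train, x_test, noise=1e-2):
    """Posterior mean of exact GP regression with a zero prior mean."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    L = np.linalg.cholesky(K)                  # O(N^3): dominates for large N
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return rbf_kernel(x_test, x_train) @ alpha

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 6.0, 50)
y_train = np.sin(x_train) + 0.05 * rng.standard_normal(50)
x_test = np.array([1.0, 2.5, 4.0])
pred = gp_predict(x_train, y_train, x_test)
print(pred)            # should be close to sin(x_test)
print(np.sin(x_test))
```

    The Cholesky factorisation of the N×N kernel matrix is the step that scales as N^3, which is why approximations are needed once N grows large.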

  13. Data for: "Pore-scale pathways: A machine learning–ready dataset of...

    • zenodo.org
    zip
    Updated Aug 13, 2025
    Cite
    Saideep Pavuluri; Harris Rabbani; Ashraf Unais (2025). Data for: "Pore-scale pathways: A machine learning–ready dataset of multiphase flow in porous media" [Dataset]. http://doi.org/10.5281/zenodo.16533424
    Available download formats: zip
    Dataset updated
    Aug 13, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Saideep Pavuluri; Harris Rabbani; Ashraf Unais
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset provides a comprehensive collection of high-fidelity, pore-scale direct numerical simulations examining multiphase flow in porous media. The simulations systematically investigate the evolution of fluid invasion patterns as functions of pore structure heterogeneity, viscosity ratio between displacing and displaced fluids, and the wettability of the porous matrix. The dataset includes 540 individual simulation cases conducted using the OpenFOAM software. For each case, OpenFOAM input and output files are provided. The output files are provided for every simulated second and just before the invading fluid reaches the outlet, allowing detailed temporal analysis of flow-governing variables such as pressure and velocity. In addition, a single visualization snapshot is included for each case, highlighting the invasion profile immediately prior to breakthrough. This dataset enables in-depth analysis of multiphase displacement phenomena, serving as a valuable resource for studying the effects of structural heterogeneity, viscosity contrast, and wettability on fluid invasion at the pore scale. It is suited for validating machine-learning surrogate models and for benchmarking the flows predicted by other computational models, such as pore network models.

  14. Big Data Learning Analytics Dataset

    • kaggle.com
    zip
    Updated Aug 21, 2025
    Cite
    Ziya (2025). Big Data Learning Analytics Dataset [Dataset]. https://www.kaggle.com/datasets/ziya07/big-data-learning-analytics-dataset/code
    Explore at:
    Available download formats: zip (14637 bytes)
    Dataset updated
    Aug 21, 2025
    Authors
    Ziya
    License

    https://creativecommons.org/publicdomain/zero/1.0/

    Description

    This dataset is designed for Big Data Learning Analytics in Personalized Labor Education, providing a multi-modal view of learners’ progress by combining digital interactions, physical task performance, and behavioral signals. It is intended to support research in predictive modeling, clustering, and recommendation systems for adaptive instruction.

    🔑 Key Features

    Student Metadata

    Student_ID: Unique anonymized learner ID

    Cluster_Label: (Optional) Learner grouping such as struggling, average, or high achievers

    LMS Interaction Features

    login_frequency: Number of times a student logs into the LMS

    avg_session_duration: Average time spent per learning session (minutes)

    completed_assignments: Number of submitted tasks

    quiz_scores_avg: Mean quiz scores (%)

    forum_participation: Number of posts or replies in discussions

    resource_views: Count of accessed resources (videos, manuals, docs)

    Sensor-Based Task Data

    task_completion_time: Average task duration (minutes)

    task_accuracy: Task correctness (%)

    error_count: Number of mistakes per task

    repetition_needed: Attempts required to complete a task

    motion_intensity: Physical effort measured (0–1 scale)

    safety_violations: Number of safety protocol breaches

    Engagement & Behavioral Signals

    attention_span_score: Focus level (0–1 scale)

    fatigue_level: Fatigue measure (0–1 scale)

    collaboration_index: Peer learning and teamwork score (0–1 scale)

    Learning Outcomes & Labels

    performance_score: Overall score (0–100)

    efficiency_gain: Improvement percentage over baseline (%)

    instruction_recommendation: Personalized strategy category

    Competency-Based (reinforcement required)

    Adaptive (personalized task adjustments)

    Collaborative (peer mentoring and teamwork readiness)

    This dataset provides a comprehensive, real-world inspired resource for exploring AI-driven personalized education, early intervention systems, and smart learning environments.
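The documented value ranges above lend themselves to a quick sanity check before modeling. The following is a minimal sketch, not part of the dataset: the column names are taken from the description, the sample rows are invented for illustration, and treating the "(%)" columns as lying in 0 to 100 is an assumption.

```python
# Hypothetical range check for columns described in the listing.
# Column names come from the description; the rows below are invented.

# Columns documented as "0–1 scale".
UNIT_INTERVAL = (
    "motion_intensity",
    "attention_span_score",
    "fatigue_level",
    "collaboration_index",
)
# performance_score is documented as 0–100; the two "(%)" columns
# are ASSUMED to share that range.
PERCENT = ("quiz_scores_avg", "task_accuracy", "performance_score")


def out_of_range(row):
    """Return the names of columns whose values fall outside the expected scale."""
    bad = [c for c in UNIT_INTERVAL if c in row and not 0.0 <= row[c] <= 1.0]
    bad += [c for c in PERCENT if c in row and not 0.0 <= row[c] <= 100.0]
    return bad


# Invented example rows (not taken from the dataset).
sample_rows = [
    {"Student_ID": "S001", "motion_intensity": 0.42, "fatigue_level": 0.10, "performance_score": 87.5},
    {"Student_ID": "S002", "motion_intensity": 1.30, "fatigue_level": 0.55, "performance_score": 104.0},
]

# Map each learner ID to the list of columns that violate the expected range.
report = {row["Student_ID"]: out_of_range(row) for row in sample_rows}
print(report)
```

Rows flagged this way could be inspected or dropped before fitting a model; which of the two is appropriate depends on how the data was generated.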

  15. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Explore at:
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is in .csv format.

    Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the original kernels. The current version of the corpus includes two code block files: snippets from kernels published up to 2020 (сode_blocks_upto_20.csv) and those from 2021 (сode_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag for code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    Because the marked-up code block data contains the numeric id of the code block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  16. Dataset used in Design Analytics for Mobile Learning: Scaling up...

    • data.europa.eu
    • data.niaid.nih.gov
    • +1more
    unknown
    Updated Feb 28, 2022
    Cite
    Zenodo (2022). Dataset used in Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements [Dataset]. https://data.europa.eu/data/datasets/oai-zenodo-org-6320368?locale=ga
    Explore at:
    Available download formats: unknown
    Dataset updated
    Feb 28, 2022
    Dataset authored and provided by
    Zenodo (http://zenodo.org/)
    Description

    The following dataset has been used for the paper entitled "Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements".

    Abstract: This research was triggered by the need identified in the literature for large-scale studies about the kind of designs that teachers create for Mobile Learning (m-learning). These studies require analyses of large datasets of learning designs. The common approach followed by researchers when analysing designs has been to manually classify them following high-level pedagogically-guided coding strategies, which demands extensive work. Therefore, the first goal of this paper is to explore the use of Supervised Machine Learning (SML) to automatically classify the textual content of m-learning designs, through pedagogically-relevant classifications, such as the cognitive level demanded by students to carry out specific designed tasks, the phases of inquiry learning represented in the designs, or the role that the situated environment has in them. As not all SML models are transparent, while researchers often need to understand the behaviour behind them, the second goal of this paper considers the trade-off between models’ performance and interpretability in the context of design analytics for m-learning. To achieve these goals we compiled a dataset of designs deployed through two tools, Avastusrada and Smartzoos. With it, we trained and compared different models and feature extraction techniques. We further optimized and compared the best-performing and most interpretable algorithms (EstBERT and Logistic Regression) to consider the second goal through an illustrative case. We found that SML can reliably classify designs, with accuracy > 0.86 and Cohen’s kappa > 0.69.
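The two metrics reported in the abstract, accuracy and Cohen's kappa, can be computed from paired label lists. The sketch below shows the standard definitions only and is not the authors' evaluation code. The label names and the ten example predictions are invented for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of items on which the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if both label lists were independent draws
    # from their own marginal distributions.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Undefined when p_e == 1 (both lists use one identical label throughout).
    return (p_o - p_e) / (1 - p_e)


# Invented labels standing in for cognitive-level classes.
y_true = ["remember"] * 4 + ["apply"] * 3 + ["create"] * 3
y_pred = ["remember", "remember", "remember", "apply",
          "apply", "apply", "apply",
          "create", "create", "remember"]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"kappa    = {cohen_kappa(y_true, y_pred):.2f}")
```

Kappa is lower than accuracy here because part of the observed agreement would be expected by chance given the class frequencies, which is why the abstract reports both figures.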

  17. Data Collection And Labeling Report

    • datainsightsmarket.com
    doc, pdf, ppt
    Updated Aug 12, 2025
    + more versions
    Cite
    Data Insights Market (2025). Data Collection And Labeling Report [Dataset]. https://www.datainsightsmarket.com/reports/data-collection-and-labeling-1415734
    Explore at:
    Available download formats: pdf, doc, ppt
    Dataset updated
    Aug 12, 2025
    Dataset authored and provided by
    Data Insights Market
    License

    https://www.datainsightsmarket.com/privacy-policy

    Time period covered
    2025 - 2033
    Area covered
    Global
    Variables measured
    Market Size
    Description

    The Data Collection and Labeling market is experiencing robust growth, driven by the increasing demand for high-quality training data to fuel the advancements in artificial intelligence (AI) and machine learning (ML) technologies. The market's expansion is fueled by the burgeoning adoption of AI across diverse sectors, including healthcare, automotive, finance, and retail. Companies are increasingly recognizing the critical role of accurate and well-labeled data in developing effective AI models. This has led to a surge in outsourcing data collection and labeling tasks to specialized companies, contributing to the market's expansion. The market is segmented by data type (image, text, audio, video), labeling technique (supervised, unsupervised, semi-supervised), and industry vertical. We project a steady CAGR of 20% for the period 2025-2033, reflecting continued strong demand across various applications. Key trends include the increasing use of automation and AI-powered tools to streamline the data labeling process, resulting in higher efficiency and lower costs. The growing demand for synthetic data generation is also emerging as a significant trend, alleviating concerns about data privacy and scarcity. However, challenges remain, including data bias, ensuring data quality, and the high cost associated with manual labeling for complex datasets. These restraints are being addressed through technological innovations and improvements in data management practices. The competitive landscape is characterized by a mix of established players and emerging startups. Companies like Scale AI, Appen, and others are leading the market, offering comprehensive solutions that span data collection, annotation, and model validation. The presence of numerous companies suggests a fragmented yet dynamic market, with ongoing competition driving innovation and service enhancements. 
    The geographical distribution of the market is expected to be broad, with North America and Europe currently holding significant market share, followed by Asia-Pacific showing robust growth potential. Future growth will depend on technological advancements, increasing investment in AI, and the emergence of new applications that rely on high-quality data.
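To make the projected growth rate concrete, the arithmetic behind a compound annual growth rate (CAGR) can be sketched as follows. This assumes the 20% figure compounds once per year from a 2025 base through 2033 (eight periods). The report excerpt gives no base market size, so the base of 1.0 is a placeholder and the result is a multiple, not a market value.

```python
def project(base, rate, periods):
    """Value after compounding `base` at `rate` for `periods` periods."""
    return base * (1 + rate) ** periods


# Assumption: annual compounding from 2025 to 2033, i.e. 8 periods.
periods = 2033 - 2025
multiple = project(1.0, 0.20, periods)

print(f"Growth multiple over {periods} years at 20% CAGR: {multiple:.2f}x")
```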

  18. Dataset for Feature Scaling [Standardization]

    • kaggle.com
    zip
    Updated Nov 30, 2024
    Cite
    Mit Gandhi (2024). Dataset for Feature Scaling [Standardization] [Dataset]. https://www.kaggle.com/datasets/mitgandhi10/dataset-for-feature-scaling-standardization
    Explore at:
    Available download formats: zip (951 bytes)
    Dataset updated
    Nov 30, 2024
    Authors
    Mit Gandhi
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset contains information about three species of Iris flowers: Setosa, Versicolour, and Virginica. It is a well-known dataset in the machine learning and statistics communities, often used for classification and clustering tasks. Each row represents a sample of an Iris flower, with measurements of its physical attributes and the corresponding target label.

    Dataset Features

    sepal length (cm): The length of the sepal in centimeters.

    sepal width (cm): The width of the sepal in centimeters.

    petal length (cm): The length of the petal in centimeters.

    petal width (cm): The width of the petal in centimeters.

    target: A numerical label (0, 1, or 2) indicating the flower species

    0: Setosa

    1: Versicolour

    2: Virginica

    Purpose

    This dataset can be used for:

    Supervised learning tasks, particularly classification.

    Exploratory data analysis and visualization of flower attributes.

    Understanding the application of machine learning algorithms like decision trees, KNN, and support vector machines.

    Source

    This is a modified version of the classic Iris flower dataset, often used for beginner-level machine learning projects and demonstrations.

    Potential Use Cases

    Training machine learning models for flower classification.

    Practicing data preprocessing, feature scaling, and visualization techniques.

    Understanding the relationships between features through scatter plots and correlation analysis.
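Since the entry is titled for standardization, a minimal sketch of that scaling step may help. It rescales each feature column to zero mean and unit variance, z = (x - mean) / std, using the population standard deviation. The four rows are illustrative Iris-style measurements, not read from this dataset's file, and the column order is assumed to follow the feature list above.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit (population) variance."""
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Illustrative rows: sepal length, sepal width, petal length, petal width (cm).
rows = [
    (5.1, 3.5, 1.4, 0.2),
    (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4),
    (6.3, 3.3, 6.0, 2.5),
]

# Transpose to columns, scale each feature independently, transpose back.
columns = list(zip(*rows))
scaled_columns = [standardize(col) for col in columns]
scaled_rows = [list(r) for r in zip(*scaled_columns)]

for row in scaled_rows:
    print([round(v, 3) for v in row])
```

In a train/test setting the mean and standard deviation would normally be computed on the training split only and reused for the test split, so that no information leaks from held-out data.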

  19. Data from: Generalized DeepONets for Viscosity Prediction Using Learned...

    • darus.uni-stuttgart.de
    • search.nfdi4chem.de
    Updated Aug 14, 2025
    Cite
    Maximilian Fleck; Marcelle Spera; Samir Darouich; Timo Klenk; Niels Hansen (2025). Generalized DeepONets for Viscosity Prediction Using Learned Entropy Scaling References [Dataset]. http://doi.org/10.18419/DARUS-5256
    Explore at:
    Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 14, 2025
    Dataset provided by
    DaRUS
    Authors
    Maximilian Fleck; Marcelle Spera; Samir Darouich; Timo Klenk; Niels Hansen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    DFG
    Description

    Data-driven approaches used to predict thermophysical properties benefit from physical constraints because the extrapolation behavior can be improved and the amount of training data can be reduced. In the present work, the well-established entropy scaling approach is incorporated into a neural network architecture to predict the shear viscosity of a diverse set of pure fluids over a large temperature and pressure range. Instead of imposing a particular form of the reference entropy and reference shear viscosity, these properties are learned. The resulting architecture can be interpreted as two linked DeepONets with generalization capabilities.

  20. Data from: Papyrus - A large scale curated dataset aimed at bioactivity...

    • data.4tu.nl
    • figshare.com
    zip
    Updated Oct 29, 2021
    + more versions
    Cite
    Olivier Béquignon; Brandon Bongers; W. (Willem) Jespers; Adriaan P. IJzerman; Bob van de Water; Gerard JP Van westen (2021). Papyrus - A large scale curated dataset aimed at bioactivity predictions [Dataset]. http://doi.org/10.4121/16896406.v2
    Explore at:
    Available download formats: zip
    Dataset updated
    Oct 29, 2021
    Dataset provided by
    4TU.ResearchData
    Authors
    Olivier Béquignon; Brandon Bongers; W. (Willem) Jespers; Adriaan P. IJzerman; Bob van de Water; Gerard JP Van westen
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Dataset funded by
    European Commission
    Description

    This repository contains the Papyrus dataset, an aggregated dataset of small molecule bioactivities, as described in the manuscript "Papyrus - A large scale curated dataset aimed at bioactivity predictions" (Work in Progress).

    With the recent rapid growth of publicly available ligand-protein bioactivity data, there is a trove of viable data that can be used to train machine learning algorithms. However, not all data is equal in terms of size and quality, and a significant portion of researchers’ time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge of its own. In response, we have constructed the Papyrus dataset, comprising around 60 million data points. This dataset combines multiple large publicly available datasets, such as ChEMBL and ExCAPE-DB, with smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how the data can be filtered in a variety of ways, and also perform some rudimentary quantitative structure-activity relationship and proteochemometrics modeling. Our ambition is to create a benchmark set that can be used for constructing predictive models, while also providing a solid baseline for related research.
