100+ datasets found

Data scaling using machine learning
kaggle.com
zip
Updated May 9, 2024
Cite
Muhammad Abbas (2024). Data scaling using machine learning [Dataset]. https://www.kaggle.com/datasets/muuhamadabbas/data-scaling-using-machine-learning
Explore at:
Available download formats: zip (1688 bytes)
Dataset updated
May 9, 2024
Authors
Muhammad Abbas
Description
Dataset

This dataset was created by Muhammad Abbas

Contents
Europe Data for Feature Scaling
kaggle.com
zip
Updated Sep 21, 2020
Cite
Study Mart (2020). Europe Data for Feature Scaling [Dataset]. https://www.kaggle.com/datasets/studymart/europe-data-for-feature-scaling
Explore at:
Available download formats: zip (309 bytes)
Dataset updated
Sep 21, 2020
Authors
Study Mart
Area covered
Europe
Description
Dataset

This dataset was created by Study Mart

Contents
Data from: Data Scaling and Generalization Insights for Medicinal Chemistry...
datasetcatalog.nlm.nih.gov
Updated Jun 2, 2025
Cite
Chen, Jacky; Tynan, Jonathan; Yang, Song; Cheng, Alan C.; Cheng, Chen; Chung, Yunsie (2025). Data Scaling and Generalization Insights for Medicinal Chemistry Deep Learning Models [Dataset]. https://datasetcatalog.nlm.nih.gov/dataset?q=0002061833
Explore at:
Dataset updated
Jun 2, 2025
Authors
Chen, Jacky; Tynan, Jonathan; Yang, Song; Cheng, Alan C.; Cheng, Chen; Chung, Yunsie
Description
Predictive models hold considerable promise in enabling the faster discovery of safer, more efficacious therapeutics. To better understand and improve the performance of small-molecule predictive models for drug discovery, we conduct multiple experiments with deep learning and traditional machine learning approaches, leveraging our large internal data sets as well as publicly available data sets. The experiments include assessing performance on random, temporal, and reverse-temporal data ablation tasks as well as tasks testing model extrapolation to different property spaces. We identify factors that contribute to the higher performance of predictive models built using graph neural networks compared to traditional methods such as XGBoost and random forest. These insights were successfully used to develop a scaling relationship that explains 81% of the variance in model performance across various assays and data regimes. This relationship can be used to estimate the performance of models for ADMET (absorption, distribution, metabolism, excretion, and toxicity) end points, as well as for drug discovery assay data more broadly. The findings offer guidance for further improving model performance in drug discovery.
The datasets used in this research.
plos.figshare.com
xls
Updated Dec 6, 2024
Cite
Chantha Wongoutong (2024). The datasets used in this research. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t001
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310839.t001
Dataset updated
Dec 6, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Chantha Wongoutong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the popularity of k-means clustering, feature scaling before applying it can be an essential yet often neglected step. In this study, feature scaling beforehand via one of five methods (Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data to analyze datasets having features with different or the same units via k-means clustering. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than when using the raw data. Meanwhile, when features in the dataset had the same unit, scaling them beforehand provided similar results to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, which improves the clustering results and accuracy. Of the five feature-scaling methods used on the datasets with different units, Z-score standardization and Percentile transformation provided similar performances that were superior to the other methods and to using the raw data. Although Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
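To illustrate why scaling matters when features carry different units, the sketch below implements two of the methods named in the description (Z-score and Min-Max) using only the Python standard library. The sample values and function names are invented for illustration and are not taken from the study.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical features in different units: height in cm, income in dollars.
heights = [150.0, 160.0, 170.0, 180.0]
incomes = [20000.0, 40000.0, 60000.0, 80000.0]

scaled_heights = min_max(heights)
scaled_incomes = min_max(incomes)
standardised_incomes = z_score(incomes)
```

After Min-Max scaling both features span [0, 1], so a Euclidean distance such as the one k-means relies on is no longer dominated by the feature with the larger numeric range.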
Rescaled Fashion-MNIST dataset
zenodo.org
Updated Jun 27, 2025
Cite
Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15187793
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrzej Perzanowski; Tony Lindeberg
Time period covered
Apr 10, 2025
Description
Motivation

The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled Fashion-MNIST dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

Access and rights

The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
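The split described above can be sketched as plain index arithmetic. The sizes come from the text (the original Fashion-MNIST training set has 60 000 images); the variable names are illustrative and this is not the authors' generation code.

```python
# Sizes quoted in the text; the index arithmetic itself is an illustrative assumption.
N_ORIGINAL_TRAIN = 60_000  # size of the original Fashion-MNIST training set
N_TRAIN = 50_000
N_VAL = 10_000

indices = list(range(N_ORIGINAL_TRAIN))
train_idx = indices[:N_TRAIN]   # the initial 50 000 samples
val_idx = indices[-N_VAL:]      # the final 10 000 samples

# The two partitions should not share any sample.
overlap = set(train_idx) & set(val_idx)
```

With these sizes the two partitions are disjoint and together cover the whole original training set, so no validation image is seen during training.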

The h5 files containing the dataset

The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5

These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
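The scale tags in the file names listed above follow from the scaling factors 2^(k/4) for k = -4, ..., 4, rounded to three decimals with the decimal point written as "p". A minimal sketch that reproduces them (the helper names are hypothetical, and the file-name pattern is copied from the list above):

```python
def scale_factors(k_min=-4, k_max=4):
    """Return 2**(k/4) for every integer k in [k_min, k_max]."""
    return [2 ** (k / 4) for k in range(k_min, k_max + 1)]

def scale_tag(scale):
    """Format a scale factor the way the file names do, e.g. 0.5 -> '0p500'."""
    return f"{scale:.3f}".replace(".", "p")

def file_name_for_scale(scale):
    """Build a test-set file name following the pattern in the list above."""
    return (
        "fashionmnist_with_scale_variations_te10000_outsize72-72_"
        f"scte{scale_tag(scale)}.h5"
    )

tags = [scale_tag(s) for s in scale_factors()]
names = [file_name_for_scale(s) for s in scale_factors()]
```

This can be handy for looping over all nine test files without typing each name by hand.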

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

import h5py
import numpy as np

# filename: path to the training .h5 file named above
with h5py.File(filename, 'r') as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

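The test datasets can be loaded in Python as:

import h5py
import numpy as np

# filename: path to one of the test .h5 files listed above
with h5py.File(filename, 'r') as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)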

The test datasets can be loaded in Matlab as:

x_test = h5read(filename, '/x_test');  % filename: path to one of the test .h5 files

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
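Because the stored intensities lie in [0, 255], many training pipelines rescale them before use. The following is a minimal sketch, assuming NumPy is available and using a small random array in place of the real data; the helper name `to_model_input` and the division by 255 are illustrative choices, not part of the dataset's documentation.

```python
import numpy as np

def to_model_input(x):
    """Scale [0, 255] intensities to [0, 1] and move the channel axis to position 1."""
    x = x.astype(np.float32) / 255.0
    return np.transpose(x, (0, 3, 1, 2))

# Stand-in for a batch read from '/x_test': 4 gray-scale images of size 72x72.
rng = np.random.default_rng(0)
fake_batch = rng.integers(0, 256, size=(4, 72, 72, 1)).astype(np.float32)

batch = to_model_input(fake_batch)
```

The result has shape (4, 1, 72, 72), i.e. the channels-first layout, with all values in [0, 1].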

There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
The performance results for k-means clustering and testing the hypothesis...
plos.figshare.com
xls
Updated Dec 6, 2024
Cite
Chantha Wongoutong (2024). The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t003
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310839.t003
Dataset updated
Dec 6, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Chantha Wongoutong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units.
Data from: Study of scaling in hadronic production of dimuons
hepdata.net
Updated Oct 27, 2016
Cite
(2016). Study of scaling in hadronic production of dimuons [Dataset]. http://doi.org/10.17182/hepdata.3337.v1
Explore at:
Unique identifier
https://doi.org/10.17182/hepdata.3337.v1
Dataset updated
Oct 27, 2016
Description
PLAB=200,300,400 GEV/C. COLUMBIA-FERMILAB-STONY BROOK COLLABORATION.
Replication data for: Mokken Scale Analysis: A Nonparametric Version of...
dataverse.harvard.edu
Updated Mar 8, 2010
Share
Facebook
Twitter
Email
Click to copy link
Link copied
Cite
Wijbrandt van Schuur (2010). Replication data for: Mokken Scale Analysis: A Nonparametric Version of Guttman Scaling for Survey Research [Dataset]. http://doi.org/10.7910/DVN/8VWE6A
Explore at:
Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
Unique identifier
https://doi.org/10.7910/DVN/8VWE6A
Dataset updated
Mar 8, 2010
Dataset provided by
Harvard Dataverse
Authors
Wijbrandt van Schuur
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
This article introduces a model of ordinal unidimensional measurement known as Mokken scale analysis. Mokken scaling is based on principles of Item Response Theory (IRT) that originated in the Guttman scale. I compare the Mokken model with both Classical Test Theory (reliability or factor analysis) and parametric IRT models (especially with the one-parameter logistic model known as the Rasch model). Two nonparametric probabilistic versions of the Mokken model are described: the model of Monotone Homogeneity and the model of Double Monotonicity. I give procedures for dealing with both dichotomous and polytomous data, along with two scale analyses of data from the World Values Study that demonstrate the usefulness of the Mokken model.
Dataset used in Design Analytics for Mobile Learning: Scaling up...
data-staging.niaid.nih.gov
Updated Mar 1, 2022
Cite
Gerti (2022). Dataset used in Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_6320367
Explore at:
Dataset updated
Mar 1, 2022
Dataset provided by
Pishtari
Authors
Gerti
Description
The following dataset has been used for the paper entitled "Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements".

Abstract

This research was triggered by the identified need in literature for large-scale studies about the kind of designs that teachers create for Mobile Learning (m-learning). These studies require analyses of large datasets of learning designs. The common approach followed by researchers when analysing designs has been to manually classify them following high-level pedagogically-guided coding strategies, which demands extensive work. Therefore, the first goal of this paper is to explore the use of Supervised Machine Learning (SML) to automatically classify the textual content of m-learning designs, through pedagogically-relevant classifications, such as the cognitive level demanded by students to carry out specific designed tasks, the phases of inquiry learning represented in the designs, or the role that the situated environment has in them. As not all the SML models are transparent, while often researchers need to understand the behaviour behind them, the second goal of this paper considers the trade-off between models’ performance and interpretability in the context of design analytics for m-learning. To achieve these goals we compiled a dataset of designs deployed through two tools, Avastusrada and Smartzoos. With it, we trained and compared different models and feature extraction techniques. We further optimized and compared the best-performing and most interpretable algorithms (EstBERT and Logistic Regression) to consider the second goal through an illustrative case. We found that SML can reliably classify designs, with accuracy > 0.86 and Cohen’s kappa > 0.69.
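For readers unfamiliar with the second metric quoted in the abstract, Cohen's kappa measures agreement between predicted and true labels after correcting for the agreement expected by chance. A minimal standard-library sketch, with invented labels that are not drawn from the dataset:

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Agreement between two label sequences, corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical cognitive-level labels for six designs.
y_true = ["low", "low", "high", "high", "low", "high"]
y_pred = ["low", "low", "high", "low", "low", "high"]
kappa = cohens_kappa(y_true, y_pred)
```

Here five of six predictions agree (accuracy 0.83), but because half of that agreement would be expected by chance, kappa is lower than the raw accuracy.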
Rescaled CIFAR-10 dataset
zenodo.org
Updated Jun 27, 2025
Cite
Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
Explore at:
Unique identifier
https://doi.org/10.5281/zenodo.15188748
Dataset updated
Jun 27, 2025
Dataset provided by
Zenodo (http://zenodo.org/)
Authors
Andrzej Perzanowski; Tony Lindeberg
Description
Motivation

The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

The Rescaled CIFAR-10 dataset was introduced in the paper:

[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

with a pre-print available at arXiv:

[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.

Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

and is therefore significantly more challenging.

Access and rights

The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

and also for this new rescaled version, using the reference [1] above.

The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

The dataset

The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
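The mirror extension mentioned above can be approximated in NumPy with `np.pad`. This is only a sketch: the dataset itself was produced in Matlab, and the exact mirroring convention (here `mode="symmetric"`, which repeats the edge pixel) and the centred placement are assumptions, as is the helper name `mirror_extend`.

```python
import numpy as np

def mirror_extend(img, out_size):
    """Pad a square H x W x C image to out_size x out_size by mirroring its borders."""
    pad = (out_size - img.shape[0]) // 2
    return np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="symmetric")

# Stand-in for one 32x32 RGB image at scale factor 1.
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
extended = mirror_extend(img, 64)
```

The original image sits unchanged in the centre of the 64×64 result, and the border is filled with reflected copies of the image content rather than a constant colour.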

There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

The h5 files containing the dataset

The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

Instructions for loading the data set

The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

The training dataset can be loaded in Python as:

import h5py
import numpy as np

# filename: path to the training .h5 file named above
with h5py.File(filename, 'r') as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)

We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))

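The test datasets can be loaded in Python as:

import h5py
import numpy as np

# filename: path to one of the test .h5 files listed above
with h5py.File(filename, 'r') as f:
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)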

The test datasets can be loaded in Matlab as:

x_test = h5read(filename, '/x_test');  % filename: path to one of the test .h5 files

The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Data from: Auto-scaling dataset based on the gym-hpa framework
data.niaid.nih.gov
data-staging.niaid.nih.gov
Updated Jun 9, 2023
Cite
Santos, Jose; Wauters, Tim; Volckaert, Bruno; De Turck, Filip (2023). Auto-scaling dataset based on the gym-hpa framework [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7944660
Explore at:
Dataset updated
Jun 9, 2023
Dataset provided by
Ghent University - imec - IDLab
Authors
Santos, Jose; Wauters, Tim; Volckaert, Bruno; De Turck, Filip
Description
The implemented gym-hpa is a custom OpenAI Gym environment for the training of Reinforcement Learning (RL) agents for auto-scaling research in the Kubernetes (K8s) platform.

Two environments exist based on the Redis Cluster and Online Boutique applications.

Two collected datasets are shared here. The code has been released here: https://github.com/jpedro1992/gym-hpa

Related Publication: Santos, J. et al. "gym-hpa: Efficient auto-scaling via reinforcement learning for complex microservice-based applications in Kubernetes." NOMS2023, the IEEE/IFIP Network Operations and Management Symposium. 2023.
Binary classification using a confusion matrix.
plos.figshare.com
xls
Updated Dec 6, 2024
Cite
Chantha Wongoutong (2024). Binary classification using a confusion matrix. [Dataset]. http://doi.org/10.1371/journal.pone.0310839.t002
Explore at:
Available download formats: xls
Unique identifier
https://doi.org/10.1371/journal.pone.0310839.t002
Dataset updated
Dec 6, 2024
Dataset provided by
PLOS (http://plos.org/)
Authors
Chantha Wongoutong
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Despite the popularity of k-means clustering, feature scaling before applying it can be an essential yet often neglected step. In this study, feature scaling beforehand via one of five methods (Z-score, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data to analyze datasets having features with different or the same units via k-means clustering. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than when using the raw data. Meanwhile, when features in the dataset had the same unit, scaling them beforehand provided similar results to using the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, which improves the clustering results and accuracy. Of the five feature-scaling methods used on the datasets with different units, Z-score standardization and Percentile transformation provided similar performances that were superior to the other methods and to using the raw data. Although Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
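The four evaluation measures named in the description can all be derived from the cells of a binary confusion matrix. A minimal sketch using the standard definitions; the counts are invented for illustration and do not come from the published table.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }

# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives and 30 true negatives.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```

The F-score here is the harmonic mean of precision and recall, so it sits between the two and is pulled towards the lower one.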
Data from: The impact of variation in scaling factors on the estimation of...
gimi9.com
s.cnmilf.com
Updated Nov 24, 2018
Cite
(2018). The impact of variation in scaling factors on the estimation of internal dose metrics: a case study using bromodichloromethane (BDCM) [Dataset]. https://gimi9.com/dataset/data-gov_the-impact-of-variation-in-scaling-factors-on-the-estimation-of-internal-dose-metrics-a-ca/
Explore at:
Dataset updated
Nov 24, 2018
Description
This dataset contains model code and supporting analysis files necessary to evaluate the impact of variability in human hepatic scaling factors. Variation in scaling factor values impacts metabolic rate parameter estimates (Vmax) and hence estimates of internal dose used in dose response analysis and biomarkers of exposure that are important for interpretation of epidemiology studies. This dataset is associated with the following publication: Kenyon, E., C. Eklund, J. Lipscomb, and R. Pegram. The impact of variation in scaling factors on the estimation of internal dose metrics: a case study using bromodichloromethane (BDCM). Toxicology Mechanisms and Methods. Taylor & Francis, Inc., Philadelphia, PA, USA, 26(8): 620-626, (2016).
Musical Scale Classification Dataset using Chroma
kaggle.com
zip
Updated Apr 8, 2025
Cite
Om Avashia (2025). Musical Scale Classification Dataset using Chroma [Dataset]. https://www.kaggle.com/datasets/omavashia/synthetic-scale-chromagraph-tensor-dataset
Explore at:
Available download formats: zip (392580911 bytes)
Dataset updated
Apr 8, 2025
Authors
Om Avashia
License
https://cdla.io/sharing-1-0/
Description
Dataset Description

Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale

This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.

What’s Inside

chroma_tensor: A JSON-safe string representation of a PyTorch tensor with shape [1, 12, T], where:

12 = the 12 pitch classes (C, C#, D, ... B)

T = time steps

scale_index: An integer label from 0–23 identifying the scale the sample belongs to

Use Cases

This dataset is ideal for:
- Training deep learning models (CNNs, MLPs) to classify musical scales
- Exploring pitch-class distributions in Western tonal music
- Prototyping models for music key detection, chord prediction, or tonal analysis
- Teaching or demonstrating chromagram-based ML workflows

Labels

Index Scale
0 C major
1 C# major
... ...
11 B major
12 C minor
... ...
23 B minor
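The label table above can be turned into a small lookup helper. The rows shown (0, 1, 11, 12, 23) are consistent with chromatic ordering of the twelve pitch classes, major scales first; the names of the elided rows (for example "D#" rather than "Eb") are an assumption, as is the helper name `scale_name`.

```python
# Pitch classes in chromatic order, written with sharps (assumed spelling).
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def scale_name(scale_index):
    """Map a label in 0..23 to a scale name such as 'C major' or 'B minor'."""
    if not 0 <= scale_index <= 23:
        raise ValueError("scale_index must be between 0 and 23")
    mode = "major" if scale_index < 12 else "minor"
    return f"{PITCH_CLASSES[scale_index % 12]} {mode}"

all_names = [scale_name(i) for i in range(24)]
```

A helper like this is mainly useful for labelling confusion matrices or plots with readable scale names instead of integers.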

Quick Load Example (PyTorch)

Chroma tensors are of shape [1, 12, T], where:
- 1 is the channel dimension (for CNN input)
- 12 represents the 12 pitch classes (C through B)
- T is the number of time frames

import torch
import pandas as pd
from tqdm import tqdm

df = pd.read_csv("/content/scale_dataset.csv")

# Reconstruct chroma tensors
X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
y = df['scale_index'].tolist()

Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.

import torch
import pandas as pd

data = torch.load("chroma_tensors.pt")
X_pt = data['X']  # list of [1, 12, 302] tensors
y_pt = data['y']  # list of scale indices

How It Was Built

Notes generated from random melodies using music21

MIDI converted to WAV via FluidSynth

Chromagrams extracted with librosa.feature.chroma_stft

Tensors flattened and saved alongside scale index labels

File Format

Column Type Description
chroma_tensor str Flattened 1D chroma tensor [1×12×T]
scale_index int Label from 0 to 23

Notes

Data is synthetic but musically valid and well-balanced

Each of the 24 scales appears 300 times

All tensors have fixed length (T) for easy batching
Data from: A Large-Scale Dataset of Twitter Chatter about Online Learning...
ieee-dataport.org
Updated Aug 10, 2022
Cite
Nirmalya Thakur (2022). A Large-Scale Dataset of Twitter Chatter about Online Learning during the Current COVID-19 Omicron Wave [Dataset]. https://ieee-dataport.org/documents/large-scale-dataset-twitter-chatter-about-online-learning-during-current-covid-19-omicron
Explore at:
Dataset updated
Aug 10, 2022
Authors
Nirmalya Thakur
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
no. 8
New Visions for Large Scale Networks: Research and Applications
catalog.data.gov
datasets.ai
Updated May 14, 2025
Cite
NCO NITRD (2025). New Visions for Large Scale Networks: Research and Applications [Dataset]. https://catalog.data.gov/dataset/new-visions-for-large-scale-networks-research-and-applications
Explore at:
Dataset updated
May 14, 2025
Dataset provided by
NCO NITRD
Description
This paper documents the findings of the March 12-14, 2001 Workshop on New Visions for Large-Scale Networks: Research and Applications. The workshop's objectives were to develop a vision for the future of networking 10 to 20 years out and to identify needed Federal networking research to enable that vision...
Learning Curves Database 1.1
data.4tu.nl
zip
Updated May 27, 2025
Cite
Cheng Yan; Felix Mohr; Tom Viering (2025). Learning Curves Database 1.1 [Dataset]. http://doi.org/10.4121/3bd18108-fad0-4e4c-affd-4341fba99306.v1
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.4121/3bd18108-fad0-4e4c-affd-4341fba99306.v1
Dataset updated
May 27, 2025
Dataset provided by
4TU.ResearchData
Authors
Cheng Yan; Felix Mohr; Tom Viering
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
Sample-wise learning curves plot performance versus training set size. They are useful for studying scaling laws and speeding up hyperparameter tuning and model selection. Learning curves are often assumed to be well-behaved: monotone (i.e. improving with more data) and convex. By constructing the Learning Curves Database 1.1 (LCDB 1.1), a large-scale database with high-resolution learning curves including more modern learners (CatBoost, TabNet, RealMLP and TabPFN), we show that learning curves are less often well-behaved than previously thought. Using statistically rigorous methods, we observe significant ill-behavior in approximately 15% of the learning curves, almost twice as much as in previous estimates. We also identify which learners are to blame and show that specific learners are more ill-behaved than others. Additionally, we demonstrate that different feature scalings rarely resolve ill-behavior. We evaluate the impact of ill-behavior on downstream tasks, such as learning curve fitting and model selection, and find it poses significant challenges, underscoring the relevance and potential of LCDB 1.1 as a challenging benchmark for future research.
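The two properties of a "well-behaved" learning curve named above, monotone and convex, can be checked naively on a single error curve. The sketch below is a plain deterministic check on invented numbers; it is not the statistically rigorous procedure used for LCDB 1.1, and the function names and tolerance parameter are illustrative.

```python
def is_monotone(errors, tol=0.0):
    """True if the error never increases as the training set grows."""
    return all(b <= a + tol for a, b in zip(errors, errors[1:]))

def is_convex(sizes, errors, tol=0.0):
    """True if the slope between consecutive anchors never decreases."""
    points = list(zip(sizes, errors))
    slopes = [
        (e2 - e1) / (n2 - n1)
        for (n1, e1), (n2, e2) in zip(points, points[1:])
    ]
    return all(s2 >= s1 - tol for s1, s2 in zip(slopes, slopes[1:]))

# Hypothetical error rates measured at four training-set sizes.
sizes = [16, 32, 64, 128]
well_behaved = [0.50, 0.40, 0.35, 0.33]
ill_behaved = [0.50, 0.40, 0.45, 0.33]   # error rises between 32 and 64 samples
```

On real curves, noise across repeated runs means a raw check like this would flag many curves spuriously, which is presumably why a statistical test is needed in practice.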
Dataset used in Design Analytics for Mobile Learning: Scaling up...
data.niaid.nih.gov
nde-dev.biothings.io
Updated Mar 1, 2022
Cite
Gerti (2022). Dataset used in Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_6320367
Explore at:
Dataset updated
Mar 1, 2022
Dataset provided by
Pishtari
Authors
Gerti
Description
The following dataset has been used for the paper entitled "Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements".

Abstract

This research was triggered by the identified need in literature for large-scale studies about the kind of designs that teachers create for Mobile Learning (m-learning). These studies require analyses of large datasets of learning designs. The common approach followed by researchers when analysing designs has been to manually classify them following high-level pedagogically-guided coding strategies, which demands extensive work. Therefore, the first goal of this paper is to explore the use of Supervised Machine Learning (SML) to automatically classify the textual content of m-learning designs, through pedagogically-relevant classifications, such as the cognitive level demanded by students to carry out specific designed tasks, the phases of inquiry learning represented in the designs, or the role that the situated environment has in them. As not all the SML models are transparent, while often researchers need to understand the behaviour behind them, the second goal of this paper considers the trade-off between models’ performance and interpretability in the context of design analytics for m-learning. To achieve these goals we compiled a dataset of designs deployed through two tools, Avastusrada and Smartzoos. With it, we trained and compared different models and feature extraction techniques. We further optimized and compared the best-performing and most interpretable algorithms (EstBERT and Logistic Regression) to consider the second goal through an illustrative case. We found that SML can reliably classify designs, with accuracy > 0.86 and Cohen’s kappa > 0.69.
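For readers unfamiliar with the two metrics reported in the abstract, the following sketch shows how accuracy and Cohen's kappa are computed from a list of reference labels and a list of predicted labels. It is a minimal illustration only: the label values and the example data are hypothetical and are not taken from the dataset or the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of items where the predicted label equals the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohens_kappa(y_true, y_pred):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both label lists contain a single identical class.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Expected agreement if both labelings were independent.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten designs (not real data).
y_true = ["low", "low", "low", "low", "high", "high", "high", "high", "high", "high"]
y_pred = ["low", "low", "low", "high", "high", "high", "high", "high", "high", "low"]

print(accuracy(y_true, y_pred))
print(cohens_kappa(y_true, y_pred))
```

Kappa is lower than accuracy here because part of the raw agreement would be expected by chance given how often each label occurs.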
Raw data set for collaborative research with Univ of Toledo on BAF...
catalog.data.gov
s.cnmilf.com
Updated Aug 3, 2024
Cite
U.S. EPA Office of Research and Development (ORD) (2024). Raw data set for collaborative research with Univ of Toledo on BAF full-scale study [Dataset]. https://catalog.data.gov/dataset/raw-data-set-for-collaborative-research-with-univ-of-toledo-on-baf-full-scale-study
Explore at:
Dataset updated
Aug 3, 2024
Dataset provided by
United States Environmental Protection Agency http://www.epa.gov/
Description
The dataset includes the results of DNA concentrations, barcode 16S information for DNA sequencing, and data analysis. This dataset is associated with the following publication: Jeon, Y., L. Li, M. Bhatia, H. Ryu, J. SantoDomingo, J. Brown, J. Goetz, and Y. Seo. Impacts of severe harmful algal blooms on bacterial communities in full-scale biological filtration systems for drinking water treatment. Science of the Total Environment. Elsevier BV, Amsterdam, Netherlands, 927: e171301 (2024).
Data from: Comparing cal3 and other a posteriori time-scaling approaches in...
datadryad.org
data.niaid.nih.gov
zip
Updated Aug 13, 2016
Cite
David W. Bapst; Melanie J. Hopkins (2016). Comparing cal3 and other a posteriori time-scaling approaches in a case study with the pterocephaliid trilobites [Dataset]. http://doi.org/10.5061/dryad.292dd
Explore at:
Available download formats: zip
Unique identifier
https://doi.org/10.5061/dryad.292dd
Dataset updated
Aug 13, 2016
Dataset provided by
Dryad
Authors
David W. Bapst; Melanie J. Hopkins
Time period covered
Aug 6, 2016
Area covered
Global
Description
README: README file describes the file type and contents of all files in this data repository.
PteroGW-pruned: The pruned most-parsimonious tree of the pterocephaliid trilobites from Hopkins (2011) is provided as a NEXUS file
PCscores-pterocephaliid: The principal component scores from a PCA of the geometric morphometric landmark data for those pterocephaliid taxa is provided as a tab-delimited .txt table
FAD-LAD-rescaled-ptero: tab-delimited tables of first and last appearance dates across 100 CONOP solutions for just the pterocephaliids
FAD-LAD-rescaled-all: tab-delimited tables of first and last appearance dates across 100 CONOP solutions for all taxa
ptero-minsampling: minimum number of horizons each taxon was found at for just pterocephaliids (tab-delimited)
all-minsampling: minimum number of horizons each taxon was found at for just pterocephaliids (tab-delimited)
trilobite_cal3_03-25-16: Rmarkdown script used for all analyses in this study
trilobite_cal3_03-25-16.pdf: PDF created by knitting the Rmarkdow...

Index	Scale
0	C major
1	C# major
...	...
11	B major
12	C minor
...	...
23	B minor

Column	Type	Description
`chroma_tensor`	`str`	Flattened 1D chroma tensor `[1×12×T]`
`scale_index`	`int`	Label from 0 to 23
