This dataset was created by Muhammad Abbas
This dataset was created by Study Mart
Predictive models hold considerable promise in enabling the faster discovery of safer, more efficacious therapeutics. To better understand and improve the performance of small-molecule predictive models for drug discovery, we conduct multiple experiments with deep learning and traditional machine learning approaches, leveraging our large internal data sets as well as publicly available data sets. The experiments include assessing performance on random, temporal, and reverse-temporal data ablation tasks as well as tasks testing model extrapolation to different property spaces. We identify factors that contribute to the higher performance of predictive models built using graph neural networks compared to traditional methods such as XGBoost and random forest. These insights were successfully used to develop a scaling relationship that explains 81% of the variance in model performance across various assays and data regimes. This relationship can be used to estimate the performance of models for ADMET (absorption, distribution, metabolism, excretion, and toxicity) end points, as well as for drug discovery assay data more broadly. The findings offer guidance for further improving model performance in drug discovery.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Despite the popularity of k-means clustering, feature scaling before applying it can be an essential yet often neglected step. In this study, scaling the features beforehand with one of five methods (Z-score standardization, Min-Max normalization, Percentile transformation, Maximum absolute scaling, or RobustScaler) was compared with using the raw (i.e., non-scaled) data when analyzing datasets whose features have different or the same units via k-means clustering. The results of an experimental study show that, for features with different units, scaling them before k-means clustering provided better accuracy, precision, recall, and F-score values than using the raw data. Meanwhile, when the features in the dataset had the same unit, scaling them beforehand provided results similar to those obtained with the raw data. Thus, scaling the features beforehand is a very important step for datasets with different units, as it improves the clustering results and accuracy. Of the five feature-scaling methods applied to the dataset with different units, Z-score standardization and Percentile transformation provided similar performances that were superior to those of the other methods and of the raw data. Although Maximum absolute scaling performed slightly better than the other scaling methods and the raw data when the dataset contained features with the same unit, the improvement was not significant.
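To illustrate why units matter for a distance-based method such as k-means, the following is a minimal sketch of Z-score standardization on toy data. The data values and the helper names `z_score` and `euclidean` are hypothetical and not taken from the study; the sketch only shows how a feature with large numeric values can dominate Euclidean distances until the features are scaled.

```python
import math

def z_score(column):
    """Standardise one feature column to zero mean and unit (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy data with different units: height in metres, income in dollars.
heights = [1.50, 1.90, 1.52]
incomes = [30_000.0, 30_500.0, 45_000.0]

raw = list(zip(heights, incomes))
scaled = list(zip(z_score(heights), z_score(incomes)))

# Raw data: the income column dominates, so the height difference is practically ignored.
print(euclidean(raw[0], raw[1]), euclidean(raw[0], raw[2]))
# Scaled data: both features contribute on a comparable scale.
print(euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

In the raw data the two distances differ by a factor of about 30 purely because of the income unit, whereas after standardization they are of similar size.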
The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled Fashion-MNIST dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
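The split described above can be sketched as index ranges over the 60 000 images of the original Fashion-MNIST training set. The function name `split_train_val` is hypothetical and only illustrates the "first samples for training, last samples for validation" rule; it is not the authors' code.

```python
def split_train_val(n_total, n_val):
    """Return (train_indices, val_indices), taking the validation part from the end."""
    n_train = n_total - n_val
    return range(0, n_train), range(n_train, n_total)

# The original Fashion-MNIST training set has 60 000 images.
train_idx, val_idx = split_train_val(60_000, 10_000)
print(len(train_idx), train_idx[0], train_idx[-1])
print(len(val_idx), val_idx[0], val_idx[-1])
```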
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
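As a quick check on the naming scheme, the following sketch computes the nine scaling factors 2^(k/4) for k in [-4, 4] and formats them the way they appear in the file names listed above. The helper name `scale_suffix` is hypothetical, and the convention of writing `p` for the decimal point is inferred from the listed file names.

```python
def scale_suffix(k: int) -> str:
    """Format the scaling factor 2^(k/4) as it appears in the file names, e.g. '0p500'."""
    scale = 2 ** (k / 4)
    return f"{scale:.3f}".replace(".", "p")

# One suffix per test file, for k = -4, ..., 4.
suffixes = [scale_suffix(k) for k in range(-4, 5)]
print(suffixes)
```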
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

# `filename` is a placeholder for the path to the training .h5 file
with h5py.File(filename, "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
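import h5py
import numpy as np
with h5py.File(filename, "r") as f:  # `filename`: placeholder for the path to a test .h5 file
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)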
The test datasets can be loaded in Matlab as:
x_test = h5read(filename, '/x_test');  % filename: placeholder for the path to a test .h5 file
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The performance results for k-means clustering and testing the hypothesis for homogeneity between the true grouped data and feature scaling on datasets containing features with different units.
PLAB=200,300,400 GEV/C. COLUMBIA-FERMILAB-STONY BROOK COLLABORATION.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
This article introduces a model of ordinal unidimensional measurement known as Mokken scale analysis. Mokken scaling is based on principles of Item Response Theory (IRT) that originated in the Guttman scale. I compare the Mokken model with both Classical Test Theory (reliability or factor analysis) and parametric IRT models (especially with the one-parameter logistic model known as the Rasch model). Two nonparametric probabilistic versions of the Mokken model are described: the model of Monotone Homogeneity and the model of Double Monotonicity. I give procedures for dealing with both dichotomous and polytomous data, along with two scale analyses of data from the World Values Study that demonstrate the usefulness of the Mokken model.
The following dataset has been used for the paper entitled "Design Analytics for Mobile Learning: Scaling up the Classification of Learning Designs based on Cognitive and Contextual Elements".
Abstract
This research was triggered by the identified need in literature for large-scale studies about the kind of designs that teachers create for Mobile Learning (m-learning). These studies require analyses of large datasets of learning designs. The common approach followed by researchers when analysing designs has been to manually classify them following high-level pedagogically-guided coding strategies, which demands extensive work. Therefore, the first goal of this paper is to explore the use of Supervised Machine Learning (SML) to automatically classify the textual content of m-learning designs, through pedagogically-relevant classifications, such as the cognitive level demanded of students to carry out specific designed tasks, the phases of inquiry learning represented in the designs, or the role that the situated environment has in them. As not all SML models are transparent, while researchers often need to understand the behaviour behind them, the second goal of this paper considers the trade-off between models’ performance and interpretability in the context of design analytics for m-learning. To achieve these goals we compiled a dataset of designs deployed through two tools, Avastusrada and Smartzoos. With it, we trained and compared different models and feature extraction techniques. We further optimized and compared the best-performing and most interpretable algorithms (EstBERT and Logistic Regression) to consider the second goal through an illustrative case. We found that SML can reliably classify designs, with accuracy > 0.86 and Cohen’s kappa > 0.69.
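For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and Cohen’s kappa can be computed from true and predicted labels. The label values below are hypothetical and do not come from the dataset; the formula used is the standard kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical cognitive-level labels for four designed tasks.
y_true = ["remember", "remember", "apply", "apply"]
y_pred = ["remember", "remember", "apply", "remember"]

print(accuracy(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

Kappa is lower than accuracy here because part of the agreement would be expected from the label frequencies alone.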
The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
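To give an idea of what mirror extension does, here is a minimal sketch that pads a small single-channel image to a larger square size by reflecting it at its borders. The helper names `mirror_index` and `mirror_extend` are hypothetical, and the reflection convention used here (the edge pixel is repeated) is an assumption; the text above does not state which convention the authors used, so see [1] for the actual procedure.

```python
def mirror_index(i, n):
    """Map any integer index onto [0, n) by reflecting at the borders (edge pixel repeated)."""
    j = i % (2 * n)
    return j if j < n else 2 * n - 1 - j

def mirror_extend(img, out_size):
    """Centre a square image (list of rows) in an out_size x out_size frame filled by mirroring."""
    n = len(img)
    pad = (out_size - n) // 2
    return [
        [img[mirror_index(r - pad, n)][mirror_index(c - pad, n)] for c in range(out_size)]
        for r in range(out_size)
    ]

# Toy 2x2 "image" extended to 4x4.
small = [[1, 2],
         [3, 4]]
extended = mirror_extend(small, 4)
for row in extended:
    print(row)
```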
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

# `filename` is a placeholder for the path to the training .h5 file
with h5py.File(filename, "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
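import h5py
import numpy as np
with h5py.File(filename, "r") as f:  # `filename`: placeholder for the path to a test .h5 file
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)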
The test datasets can be loaded in Matlab as:
x_test = h5read(filename, '/x_test');  % filename: placeholder for the path to a test .h5 file
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
The implemented gym-hpa is a custom OpenAI Gym environment for the training of Reinforcement Learning (RL) agents for auto-scaling research in the Kubernetes (K8s) platform.
Two environments exist based on the Redis Cluster and Online Boutique applications.
Two collected datasets are shared here. The code has been released here: https://github.com/jpedro1992/gym-hpa
Related Publication: Santos, J. et al. "gym-hpa: Efficient auto-scaling via reinforcement learning for complex microservice-based applications in Kubernetes." NOMS2023, the IEEE/IFIP Network Operations and Management Symposium. 2023.
This dataset contains model code and supporting analysis files necessary to evaluate the impact of variability in human hepatic scaling factors. Variation in scaling factor values impacts metabolic rate parameter estimates (Vmax) and hence estimates of internal dose used in dose response analysis and biomarkers of exposure that are important for interpretation of epidemiology studies. This dataset is associated with the following publication: Kenyon, E., C. Eklund, J. Lipscomb, and R. Pegram. The impact of variation in scaling factors on the estimation of internal dose metrics: a case study using bromodichloromethane (BDCM). Toxicology Mechanisms and Methods. Taylor & Francis, Inc., Philadelphia, PA, USA, 26(8): 620-626, (2016).
https://cdla.io/sharing-1-0/
Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale
This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.
chroma_tensor: A JSON-safe string representation of a PyTorch tensor with shape [1, 12, T], where:
- 12 = the 12 pitch classes (C, C#, D, ... B)
- T = time steps
scale_index: An integer label from 0–23 identifying the scale the sample belongs to
This dataset is ideal for:
- Training deep learning models (CNNs, MLPs) to classify musical scales
- Exploring pitch-class distributions in Western tonal music
- Prototyping models for music key detection, chord prediction, or tonal analysis
- Teaching or demonstrating chromagram-based ML workflows
| Index | Scale |
|---|---|
| 0 | C major |
| 1 | C# major |
| ... | ... |
| 11 | B major |
| 12 | C minor |
| ... | ... |
| 23 | B minor |
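The index-to-scale mapping in the table can be sketched as a small lookup. The helper name `scale_name` is hypothetical, and the rows elided with "..." are assumed to follow chromatic order with sharp spellings (C, C#, D, ..., B); only the rows shown in the table are confirmed by the source.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def scale_name(scale_index: int) -> str:
    """Map a label in 0-23 to a scale name: 0-11 are major scales, 12-23 are minor scales."""
    if not 0 <= scale_index <= 23:
        raise ValueError("scale_index must be in the range 0-23")
    tonic = PITCH_CLASSES[scale_index % 12]  # assumed chromatic ordering of tonics
    mode = "major" if scale_index < 12 else "minor"
    return f"{tonic} {mode}"

print(scale_name(0), "|", scale_name(11), "|", scale_name(12), "|", scale_name(23))
```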
Chroma tensors are of shape [1, 12, T], where:
- 1 is the channel dimension (for CNN input)
- 12 represents the 12 pitch classes (C through B)
- T is the number of time frames
import ast
import torch
import pandas as pd
from tqdm import tqdm

df = pd.read_csv("/content/scale_dataset.csv")
# Reconstruct chroma tensors; ast.literal_eval parses the stored list literal without executing arbitrary code
X = [torch.tensor(ast.literal_eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
y = df['scale_index'].tolist()
Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.
import torch

data = torch.load("chroma_tensors.pt")
X_pt = data['X'] # list of [1, 12, 302] tensors
y_pt = data['y'] # list of scale indices
Tools used: music21, FluidSynth, librosa.feature.chroma_stft
| Column | Type | Description |
|---|---|---|
| chroma_tensor | str | Flattened 1D chroma tensor [1×12×T] |
| scale_index | int | Label from 0 to 23 |
T) for easy batching
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
no. 8
This paper documents the findings of the March 12-14, 2001 Workshop on New Visions for Large-Scale Networks: Research and Applications. The workshop's objectives were to develop a vision for the future of networking 10 to 20 years out and to identify needed Federal networking research to enable that vision...
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sample-wise learning curves plot performance versus training set size. They are useful for studying scaling laws and speeding up hyperparameter tuning and model selection. Learning curves are often assumed to be well-behaved: monotone (i.e. improving with more data) and convex. By constructing the Learning Curves Database 1.1 (LCDB 1.1), a large-scale database with high-resolution learning curves including more modern learners (CatBoost, TabNet, RealMLP and TabPFN), we show that learning curves are less often well-behaved than previously thought. Using statistically rigorous methods, we observe significant ill-behavior in approximately 15% of the learning curves, almost twice as much as in previous estimates. We also identify which learners are to blame and show that specific learners are more ill-behaved than others. Additionally, we demonstrate that different feature scalings rarely resolve ill-behavior. We evaluate the impact of ill-behavior on downstream tasks, such as learning curve fitting and model selection, and find it poses significant challenges, underscoring the relevance and potential of LCDB 1.1 as a challenging benchmark for future research.
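As a simple illustration of what "well-behaved" means here, the sketch below checks whether a learning curve, given as error rates at increasing training set sizes, is monotone and convex. The function names and example curves are hypothetical, and the checks are plain finite-difference tests that assume equally spaced anchors; they are not the statistically rigorous methods used to build LCDB 1.1.

```python
def is_monotone(errors, tol=1e-12):
    """True if the error never increases as the training set grows."""
    return all(b <= a + tol for a, b in zip(errors, errors[1:]))

def is_convex(errors, tol=1e-12):
    """True if successive improvements shrink, i.e. first differences are non-decreasing."""
    diffs = [b - a for a, b in zip(errors, errors[1:])]
    return all(d2 >= d1 - tol for d1, d2 in zip(diffs, diffs[1:]))

# Hypothetical error rates at four increasing training set sizes.
well_behaved = [0.50, 0.30, 0.20, 0.15]
peaking = [0.50, 0.30, 0.35, 0.20]  # the error temporarily gets worse with more data

print(is_monotone(well_behaved), is_convex(well_behaved))
print(is_monotone(peaking), is_convex(peaking))
```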
The dataset includes the results of DNA concentrations, barcode 16S information for DNA sequencing, and data analysis. This dataset is associated with the following publication: Jeon, Y., L. Li, M. Bhatia, H. Ryu, J. SantoDomingo, J. Brown, J. Goetz, and Y. Seo. Impacts of severe harmful algal blooms on bacterial communities in full-scale biological filtration systems for drinking water treatment. SCIENCE OF THE TOTAL ENVIRONMENT. Elsevier BV, AMSTERDAM, NETHERLANDS, 927: e171301, (2024).
- README: README file describes the file type and contents of all files in this data repository.
- PteroGW-pruned: The pruned most-parsimonious tree of the pterocephaliid trilobites from Hopkins (2011) is provided as a NEXUS file
- PCscores-pterocephaliid: The principal component scores from a PCA of the geometric morphometric landmark data for those pterocephaliid taxa is provided as a tab-delimited .txt table
- FAD-LAD-rescaled-ptero: tab-delimited tables of first and last appearance dates across 100 CONOP solutions for just the pterocephaliids
- FAD-LAD-rescaled-all: tab-delimited tables of first and last appearance dates across 100 CONOP solutions for all taxa
- ptero-minsampling: minimum number of horizons each taxon was found at for just pterocephaliids (tab-delimited)
- all-minsampling: minimum number of horizons each taxon was found at for just pterocephaliids (tab-delimited)
- trilobite_cal3_03-25-16: Rmarkdown script used for all analyses in this study
- trilobite_cal3_03-25-16.pdf: PDF created by knitting the Rmarkdow...