Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.
Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.
The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.
Facebook
TwitterThis dataset was created by abhishek
Facebook
TwitterThe goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled Fashion-MNIST dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled FashionMNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original FashionMNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72x72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled FashionMNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2k/4, with k being integers in the range [-4, 4]:
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
with h5py.File(`
x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)
We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File(`
x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read(`
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
Facebook
Twitterhttps://cdla.io/sharing-1-0/https://cdla.io/sharing-1-0/
Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale
This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.
chroma_tensor: A JSON-safe formatted of a PyTorch tensor with shape [1, 12, T], where:
12 = the 12 pitch classes (C, C#, D, ... B)T = time steps scale_index: An integer label from 0–23 identifying the scale the sample belongs toThis dataset is ideal for: - Training deep learning models (CNNs, MLPs) to classify musical scales - Exploring pitch-class distributions in Western tonal music - Prototyping models for music key detection, chord prediction, or tonal analysis - Teaching or demonstrating chromagram-based ML workflows
| Index | Scale |
|---|---|
| 0 | C major |
| 1 | C# major |
| ... | ... |
| 11 | B major |
| 12 | C minor |
| ... | ... |
| 23 | B minor |
Chroma tensors are of shape [1, 12, T], where:
- 1 is the channel dimension (for CNN input)
- 12 represents the 12 pitch classes (C through B)
- T is the number of time frames
import torch
import pandas as pd
from tqdm import tqdm
df = pd.read_csv("/content/scale_dataset.csv")
# Reconstruct chroma tensors
X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
y = df['scale_index'].tolist()
Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.
import torch
import pandas as pd
data = torch.load("chroma_tensors.pt")
X_pt = data['X'] # list of [1, 12, 302] tensors
y_pt = data['y'] # list of scale indices
music21FluidSynthlibrosa.feature.chroma_stft| Column | Type | Description |
|---|---|---|
chroma_tensor | str | Flattened 1D chroma tensor [1×12×T] |
scale_index | int | Label from 0 to 23 |
T) for easy batching
Facebook
TwitterDataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. Includes educational performance data (student grades), economic statistics (World Bank GDP), and Python implementation of the AMIS algorithm with graphical interface. Contents: - Source data: educational grades and GDP statistics - AMIS normalization results (3, 5, 9, 17-point models) - Comparative analysis with linear normalization - Ready-to-use Python code for data processing Applications: - Educational data normalization and analysis - Economic indicators comparison - Development of unified metric systems - Methodology research in data scaling Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report. Annotation The annotation files are provided as Parquet files. They can be read using Python and the pandas and pyarrow library. The split into train, validation and test set follows the split of the original datasets. Installation
pip install pandas pyarrow Example
import pandas as pddf = pd.read_parquet('annotation_train.parquet', engine='pyarrow')print(df.iloc[0])
dataset AudioSet filename train/---2_BBVHAA.mp3 captions_visual [a man in a black hat and glasses.] captions_auditory [a man speaks and dishes clank.] tags [Speech] Description The annotation file consists of the following fields:filename: Name of the corresponding file (video or audio file)dataset: Source dataset associated with the data pointcaptions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual contentcaptions_auditory: A list of captions related to the auditory content of the videotags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided Data files The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de
Facebook
TwitterWikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.
There are two features: - text: wikihow answers texts. - headline: bold lines as summary.
There are two separate versions: - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries. - sep: consisting of each paragraph and its summary.
Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('wikihow', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
Facebook
TwitterThe goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2k/4, with k being integers in the range [-4, 4]:
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
with h5py.File(`
x_train = np.array( f["/x_train"], dtype=np.float32)
x_val = np.array( f["/x_val"], dtype=np.float32)
x_test = np.array( f["/x_test"], dtype=np.float32)
y_train = np.array( f["/y_train"], dtype=np.int32)
y_val = np.array( f["/y_val"], dtype=np.int32)
y_test = np.array( f["/y_test"], dtype=np.int32)
We also need to permute the data, since Pytorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File(`
x_test = np.array( f["/x_test"], dtype=np.float32)
y_test = np.array( f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read(`
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Facebook
Twitterhttps://creativecommons.org/publicdomain/zero/1.0/https://creativecommons.org/publicdomain/zero/1.0/
The Cloud Task Scheduling Dataset represents large-scale workload management across heterogeneous computing environments, including cloud, fog, and edge systems. It contains realistic data capturing the behavior of distributed tasks and virtual machines under varying computational loads and network conditions.
The dataset includes over 6000 tasks with parameters such as task length, priority, deadline, memory, bandwidth, execution time, completion time, energy use, and resource utilization metrics. Performance indicators such as makespan, cost, response time, imbalance, storage efficiency, and network path load are also provided.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
CoRNStack Python Dataset
The CoRNStack Dataset, accepted to ICLR 2025, is a large-scale high quality training dataset specifically for code retrieval across multiple programming languages. This dataset comprises of
CoRNStack Dataset Curation
Starting with the deduplicated Stackv2, we create text-code pairs from function docstrings and respective code. We filtered out… See the full description on the dataset page: https://huggingface.co/datasets/nomic-ai/cornstack-python-v1.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We produce a dataset that uses pattern scaling, a common method of emulating climate models. Our dataset is built on the Pangeo CMIP6 archive, which has the advantage that we don't need to actually download the climate model output. Here we demonstrate the utility of our dataset, called Pangeo-Enabled ESM Pattern Scaling (PEEPS). The dataset, which is encapsulated in a Jupyter notebook (and replicated in a Python file), is flexible and can be extended to multiple scenarios and multiple variables, as long as they are in the Pangeo-accessible archive.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Dataset Description
CodeRM-UnitTest dataset originates from the paper: Dynamic Scaling of Unit Tests for Code Reward Modeling available on arXiv. You can visit the homepage to learn more about the paper. It is a curated collection of high-quality synthetic Python unit tests, derived from two prominent code instruction tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO. This dataset is used for training CodeRM-8B, a small yet powerful unit test… See the full description on the dataset page: https://huggingface.co/datasets/KAKA22/CodeRM-UnitTest.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset for my thesis titled "Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software.
Contains:
Labeled issues data used to train and test the quantifier models.
Post-quantification datasets the analyses were performed on.
Scripts (R, Python) for all analyses performed.
Facebook
TwitterApache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
PySecDB: security commit dataset in Python
Description
To foster large-scale research on vulnerability mitigation and to enable a comparison of different detection approaches, we make our dataset PySecDB from our ICSME23 paper publicly available. PySecDB is a real-world Python security commit dataset that contains around 1.2K security commits and 2.8K non-security commits. You can find more details on the dataset in the paper "Exploring Security Commits in Python".… See the full description on the dataset page: https://huggingface.co/datasets/sunlab/PySecDB.
Facebook
TwitterThe WinoGrande, a large-scale dataset of 44k problems, inspired by the original Winograd Schema Challenge design, but adjusted to improve both the scale and the hardness of the dataset.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('winogrande', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more informations on tensorflow_datasets.
Facebook
TwitterThis dataset was created by Kaihua Zhang
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourage poor coding practices and that their results can be hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.
This repository contains two files:
The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.
The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:
In the remaining of this text, we give instructions for reproducing the analyses, by using the data provided in the dump and reproducing the collection, by collecting data from GitHub again.
Reproducing the Analysis
This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:
Ubuntu 18.04.1 LTS
PostgreSQL 10.6
Conda 4.5.11
Python 3.7.2
PdfCrop 2012/11/02 v1.38
First, download dump.tar.bz2 and extract it:
tar -xjf dump.tar.bz2
It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:
psql jupyter < db2019-03-13.dump
It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTTION:
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Create a conda environment with Python 3.7:
conda create -n analyses python=3.7
conda activate analyses
Go to the analyses folder and install all the dependencies of the requirements.txt
cd jupyter_reproducibility/analyses
pip install -r requirements.txt
For reproducing the analyses, run jupyter on this folder:
jupyter notebook
Execute the notebooks on this order:
Reproducing or Expanding the Collection
The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.
Requirements
This time, we have extra requirements:
All the analysis requirements
lbzip2 2.5
gcc 7.3.0
Github account
Gmail account
Environment
First, set the following environment variables:
export JUP_MACHINE="db"; # machine identifier
export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
export JUP_COMPRESSION="lbzip2"; # compression program
export JUP_VERBOSE="5"; # verbose level
export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlchemy connection
export JUP_GITHUB_USERNAME="github_username"; # your github username
export JUP_GITHUB_PASSWORD="github_password"; # your github password
export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 auhentication file
export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank
export JUP_WITH_EXECUTION="1"; # run execute python notebooks
export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependnecies
export JUP_EXECUTION_MODE="-1"; # run following the execution order
export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
# Frequenci of log report
export JUP_ASTROID_FREQUENCY="5";
export JUP_IPYTHON_FREQUENCY="5";
export JUP_NOTEBOOKS_FREQUENCY="5";
export JUP_REQUIREMENT_FREQUENCY="5";
export JUP_CRAWLER_FREQUENCY="1";
export JUP_CLONE_FREQUENCY="1";
export JUP_COMPRESS_FREQUENCY="5";
export JUP_DB_IP="localhost"; # postgres database IP
Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf
Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the directories. The second one should umount it. You can leave the scripts in blank, but it is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.
Scripts
Download and extract jupyter_reproducibility.tar.bz2:
tar -xjf jupyter_reproducibility.tar.bz2
Install 5 conda environments and 5 anaconda environments, for each python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (Note that it is a local package that has not been published to pypi. Make sure to use the -e option):
Conda 2.7
conda create -n raw27 python=2.7 -y
conda activate raw27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 2.7
conda create -n py27 python=2.7 anaconda -y
conda activate py27
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.4
It requires a manual jupyter and pathlib2 installation due to some incompatibilities found on the default installation.
conda create -n raw34 python=3.4 -y
conda activate raw34
conda install jupyter -c conda-forge -y
conda uninstall jupyter -y
pip install --upgrade pip
pip install jupyter
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
pip install pathlib2
Anaconda 3.4
conda create -n py34 python=3.4 anaconda -y
conda activate py34
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.5
conda create -n raw35 python=3.5 -y
conda activate raw35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.5
It requires the manual installation of other anaconda packages.
conda create -n py35 python=3.5 anaconda -y
conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
conda activate py35
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.6
conda create -n raw36 python=3.6 -y
conda activate raw36
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Anaconda 3.6
conda create -n py36 python=3.6 anaconda -y
conda activate py36
conda install -y anaconda-navigator jupyterlab_server navigator-updater
pip install --upgrade pip
pip install pipenv
pip install -e jupyter_reproducibility/archaeology
Conda 3.7
<code
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset overview
This dataset provides data and images of snowflakes in free fall collected with a Multi-Angle Snowflake Camera (MASC) The dataset includes, for each recorded snowflakes:
A triplet of gray-scale images corresponding to the three cameras of the MASC
A large quantity of geometrical, textural descriptors and the pre-compiled output of published retrieval algorithms as well as basic environmental information at the location and time of each measurement.
The pre-computed descriptors and retrievals are available either individually for each camera view or, some of them, available as descriptors of the triplet as a whole. A non exhaustive list of precomputed quantities includes for example:
Textural and geometrical descriptors as in Praz et al 2017
Hydrometeor classification, riming degree estimation, melting identification, as in Praz et al 2017
Blowing snow identification, as in Schaer et al 2020
Mass, volume, gyration estimation, as in Leinonen et al 2021
Data format and structure
The dataset is divided into four .parquet file (for scalar descriptors) and a Zarr database (for the images). A detailed description of the data content and of the data records is available here.
Supporting code
A python-based API is available to manipulate, display and organize the data of our dataset. It can be found on GitHub. See also the code documentation on ReadTheDocs.
Download notes
All files available here for download should be stored in the same folder, if the python-based API is used
MASCdb.zarr.zip must be unzipped after download
Field campaigns
A list of campaigns included in the dataset, with a minimal description is given in the following table
Campaign_name
Information
Shielded / Not shielded
DFIR = Double Fence Intercomparison Reference
APRES3-2016 & APRES3-2017
Instrument installed in Antarctica in the context of the APRES3 project. See for example Genthon et al, 2018 or Grazioli et al 2017
Not shielded
Davos-2015
Instrument installed in the Swiss Alps within the context of SPICE (Solid Precipitation InterComparison Experiment)
Shielded (DFIR)
Davos-2019
Instrument installed in the Swiss Alps within the context of RACLETS (Role of Aerosols and CLouds Enhanced by Topography on Snow)
Not shielded
ICEGENESIS-2021
Instrument installed in the Swiss Jura in a MeteoSwiss ground measurement site, within the context of ICE-GENESIS. See for example Billault-Roux et al, 2023
Not shielded
ICEPOP-2018
Instrument installed in Korea, in the context of ICEPOP. See for example Gehring et al 2021.
Shielded (DFIR)
Jura-2019 & Jura-2023
Instrument installed in the Swiss Jura within a MeteoSwiss measurement site
Not shielded
Norway-2016
Instrument installed in Norway during the High-Latitude Measurement of Snowfall (HiLaMS). See for example Cooper et al, 2022.
Not shielded
PLATO-2019
Instrument installed in the "Davis" Antarctic base during the PLATO field campaign
Not shielded
POPE-2020
Instrument installed in the "Princess Elizabeth Antarctica" base during the POPE campaign. See for example Ferrone et al, 2023.
Not shielded
Remoray-2022
Instrument installed in the French Jura.
Not shielded
Valais-2016
Instrument installed in the Swiss Alps in a ski resort.
Not shielded
Version
1.0 - Two new campaigns ("Jura-2023", "Norway-2016") added. Added references and list of campaigns.
0.3 - a new campaign is added to the dataset ("Remoray-2022")
0.2 - rename of variables. Variable precision (digits) standardized
0.1 - first upload
Facebook
TwitterAttribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The experiment is based on the common high speed milling data set to verify the robustness of the model to various tool types. The data set contains six sub data sets, corresponding to the wear process of six different types of tools. Three of the sub data sets contain tool wear labels, while the other three sub data sets do not. The tools used are all three edged 6mm ball cemented carbide tools, but their geometry and coating are different. The workpiece is Inconel 718, which is widely used for jet engine blade milling. The spindle speed is 10360rpm, and the cutting depth is 0.25mm. The tool cuts from the upper edge of the workpiece surface to the lower edge in a zigzag manner. In the whole milling process, the cutting length of each tool is about 0.1125m × 315pass = 35.44m. The cutting signal in Experiment 1 includes the cutting force signal collected by the three channel Kistler dynamometer and the vibration signal collected by the three channel Kistler accelerometer at a sampling rate of 50 kHz. Use the microscope LEICA MZ12 to measure the wear of the rear tool surface of the three teeth offline after each tool feeding. In this experiment, a cutting signal is collected every other period of time to predict the wear of the three teeth of the tool.The samples are divided into training set, evaluation set, test set and reconstruction set. The training set and evaluation set samples are from two kinds of tools, including 30000 and 4096 samples respectively; The samples of the test set are from another tool, including 9472 samples; The reconstruction set comes from the unlabeled data generated by the other three tools, including 40832 samples. Each sample contains three channels of cutting force signal and three channels of vibration signal. The sampling points of each channel signal are 2304. The following preprocessing steps are performed:1) Signal clippingSince the feed rate and sampling rate are constant throughout the experiment, the data set of each experiment can be approximately understood as a signal matrix evenly distributed on the workpiece surface, ignoring the slight difference in the number of sampling points for each tool path. The ordinate of the matrix corresponds to the index of the tool path times, and the abscissa corresponds to the index of the sampling point. Because the generation rules of cutting signals are different in uncut, cut in, cut out and stable states, the sampling points close to the edge of the workpiece are removed. Here we simply cut 2% off the two ends of the cutting signal obtained by each tool feed.2) Data amplificationBecause tool wear can only be observed with a microscope after each tool feeding, each wear tag corresponds to a cutting signal containing about 120000 sampling points, and the acquisition of tool wear also takes a lot of time. In this case, the number of tags extracted is not enough to fit the model, nor can the robustness of the algorithm be guaranteed. It is necessary to artificially split the sample and expand the tool wear label. Considering that the tool wear is a slow and continuous process, and there is a certain deviation in the experimental measurement, the linear interpolation method is adopted here. We also tested quadratic interpolation and polynomial fitting methods, but no better results were observed. It needs to be stated here that the essence of prediction is to find a function that maps the sample space to the target space. For any point in the sample space, the model can find the corresponding value in the target space. What sample amplification does is to sample more times in the target space, so as to more comprehensively describe this mapping relationship, rather than redefining this relationship.The task of this study is to monitor the wear of the rear cutter surface of the three teeth according to the six channel sensor signals. On the test set, the mean square error (MSE) and mean absolute percentage error (MAPE) between the predicted value and the observed value of the microscope are 0.0013 and 4%, respectively, and the average and maximum final prediction error (FPE) are 5 μ M and 23 μ m. The training time was 2130s, and the single prediction time was 1.79ms. The accuracy, training time and detection efficiency of tool wear monitoring can meet the current industrial needs. As MPAN realizes the mapping from cutting signal to tool wear, as the gate of control information flow, attention unit retains the importance information of input features. The predicted tool wear curve is basically consistent with the curve observed by the microscope.
Facebook
TwitterAttribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.
The data is organized in a table structure. Code4ML includes several main objects: competitions information, raw code blocks collected form Kaggle and manually marked up snippets. Each table has a .csv format.
Each competition has the text description and metadata, reflecting competition and used dataset characteristics as well as evaluation metrics (competitions.csv). The corresponding datasets can be loaded using Kaggle API and data sources.
The code blocks themselves and their metadata are collected to the data frames concerning the publishing year of the initial kernels. The current version of the corpus includes two code blocks files: snippets from kernels up to the 2020 year (сode_blocks_upto_20.csv) and those from the 2021 year (сode_blocks_21.csv) with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.
Marked up code blocks have the following metadata: anonymized id, the format of the used data (for example, table or audio), the id of the semantic type, a flag for the code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).
As marked up code blocks data contains the numeric id of the code block semantic type, we also provide a mapping from this number to semantic type and subclass (actual_graph_2022-06-01.csv).
The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.