100+ datasets found
  1. Feature Scaling using Python

    • kaggle.com
    zip
    Updated Nov 25, 2020
    Cite
    abhishek (2020). Feature Scaling using Python [Dataset]. https://www.kaggle.com/datasets/abhishekgupta31/feature-scaling-using-python
    Available download formats: zip (135359 bytes)
    Dataset updated
    Nov 25, 2020
    Authors
    abhishek
    Description

    Dataset

    This dataset was created by abhishek

    Contents

  2. Data from: Code4ML: a Large-scale Dataset of annotated Machine Learning Code...

    • zenodo.org
    csv
    Updated Sep 15, 2023
    Cite
    Anonymous authors (2023). Code4ML: a Large-scale Dataset of annotated Machine Learning Code [Dataset]. http://doi.org/10.5281/zenodo.6607065
    Available download formats: csv
    Dataset updated
    Sep 15, 2023
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    We present Code4ML: a Large-scale Dataset of annotated Machine Learning Code, a corpus of Python code snippets, competition summaries, and data descriptions from Kaggle.

    The data is organized in a table structure. Code4ML includes several main objects: competition information, raw code blocks collected from Kaggle, and manually marked-up snippets. Each table is stored as a .csv file.

    Each competition has a text description and metadata reflecting the characteristics of the competition and the dataset used, as well as the evaluation metrics (competitions.csv). The corresponding datasets can be loaded using the Kaggle API and the listed data sources.

    The code blocks and their metadata are collected into data frames according to the publishing year of the initial kernels. The current version of the corpus includes two code block files: snippets from kernels up to 2020 (code_blocks_upto_20.csv) and those from 2021 (code_blocks_21.csv), with corresponding metadata. The corpus consists of 2 743 615 ML code blocks collected from 107 524 Jupyter notebooks.

    Marked-up code blocks have the following metadata: an anonymized id, the format of the data used (for example, table or audio), the id of the semantic type, a flag indicating code errors, the estimated relevance to the semantic class (from 1 to 5), the id of the parent notebook, and the name of the competition. The current version of the corpus has ~12 000 labeled snippets (markup_data_20220415.csv).

    Since the marked-up code block data contains the numeric id of each block's semantic type, we also provide a mapping from this number to the semantic type and subclass (actual_graph_2022-06-01.csv).
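    A minimal sketch for loading the corpus tables with pandas (file names as listed above, assumed to be in the working directory):

    import pandas as pd

    competitions = pd.read_csv("competitions.csv")
    code_blocks_20 = pd.read_csv("code_blocks_upto_20.csv")    # snippets from kernels up to 2020
    markup = pd.read_csv("markup_data_20220415.csv")           # ~12 000 labeled snippets
    vertex_map = pd.read_csv("actual_graph_2022-06-01.csv")    # semantic type id -> type/subclass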

    The dataset can help solve various problems, including code synthesis from a prompt in natural language, code autocompletion, and semantic code classification.

  3. Musical Scale Classification Dataset using Chroma

    • kaggle.com
    zip
    Updated Apr 8, 2025
    Cite
    Om Avashia (2025). Musical Scale Classification Dataset using Chroma [Dataset]. https://www.kaggle.com/datasets/omavashia/synthetic-scale-chromagraph-tensor-dataset
    Available download formats: zip (392580911 bytes)
    Dataset updated
    Apr 8, 2025
    Authors
    Om Avashia
    License

    Community Data License Agreement Sharing 1.0 (CDLA-Sharing-1.0): https://cdla.io/sharing-1-0/

    Description

    Dataset Description

    Musical Scale Dataset: 1900+ Chroma Tensors Labeled by Scale

    This dataset contains 1900+ unique synthetic musical audio samples generated from melodies in each of the 24 Western scales (12 major and 12 minor). Each sample has been converted into a chroma tensor, a 12-dimensional pitch class representation commonly used in music information retrieval (MIR) and deep learning tasks.

    What’s Inside

    • chroma_tensor: A JSON-safe serialization of a PyTorch tensor with shape [1, 12, T], where:
      • 12 = the 12 pitch classes (C, C#, D, ... B)
      • T = time steps
    • scale_index: An integer label from 0–23 identifying the scale the sample belongs to

    Use Cases

    This dataset is ideal for:

    • Training deep learning models (CNNs, MLPs) to classify musical scales
    • Exploring pitch-class distributions in Western tonal music
    • Prototyping models for music key detection, chord prediction, or tonal analysis
    • Teaching or demonstrating chromagram-based ML workflows

    Labels

    0: C major
    1: C# major
    ...
    11: B major
    12: C minor
    ...
    23: B minor
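    The full index-to-name mapping can be reconstructed in a few lines, assuming the indices follow chromatic order (C, C#, D, ..., B); only the endpoints shown above are confirmed by the table:

    # Chromatic ordering is an assumption; the table confirms indices 0, 1, 11, 12 and 23.
    PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    SCALE_NAMES = [f"{p} major" for p in PITCH_CLASSES] + [f"{p} minor" for p in PITCH_CLASSES]
    assert SCALE_NAMES[0] == "C major" and SCALE_NAMES[23] == "B minor"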

    Quick Load Example (PyTorch)

    Chroma tensors are of shape [1, 12, T], where:

    • 1 is the channel dimension (for CNN input)
    • 12 represents the 12 pitch classes (C through B)
    • T is the number of time frames

    import torch
    import pandas as pd
    from tqdm import tqdm
    
    df = pd.read_csv("/content/scale_dataset.csv")
    
    # Reconstruct chroma tensors: each CSV cell holds a stringified nested list
    # (ast.literal_eval is a safer drop-in for eval on untrusted data)
    X = [torch.tensor(eval(row)).reshape(1, 12, -1) for row in tqdm(df['chroma_tensor'])]
    y = df['scale_index'].tolist()
    

    Alternatively, you could directly load the chroma tensors and target scale indices using the .pt file.

    import torch
    import pandas as pd
    
    data = torch.load("chroma_tensors.pt")
    X_pt = data['X'] # list of [1, 12, 302] tensors
    y_pt = data['y'] # list of scale indices
    

    How It Was Built

    • Notes generated from random melodies using music21
    • MIDI converted to WAV via FluidSynth
    • Chromagrams extracted with librosa.feature.chroma_stft
    • Tensors flattened and saved alongside scale index labels

    File Format

    chroma_tensor (str): Flattened 1D chroma tensor [1×12×T]
    scale_index (int): Label from 0 to 23

    Notes

    • Data is synthetic but musically valid and well-balanced
    • Each of the 24 scales appears 300 times
    • All tensors have fixed length (T) for easy batching
  4. Rescaled Fashion-MNIST dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled Fashion-MNIST dataset [Dataset]. http://doi.org/10.5281/zenodo.15187793
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Time period covered
    Apr 10, 2025
    Description

    Motivation

    The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled Fashion-MNIST dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.

    Access and rights

    The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:

    [4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 grayscale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object always centred in the frame. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.

    The h5 files containing the dataset

    The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5

    Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k an integer in the range [-4, 4] (a quick check follows the file list):

    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
    fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
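    As a quick consistency check, the nine scaling factors encoded in the scte... suffixes above can be reproduced in Python:

    # 2**(k/4) for k = -4..4 yields the suffixes scte0p500 ... scte2p000
    factors = [2 ** (k / 4) for k in range(-4, 5)]
    print([f"{f:.3f}" for f in factors])
    # ['0.500', '0.595', '0.707', '0.841', '1.000', '1.189', '1.414', '1.682', '2.000']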

    These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5", "r") as f:  # or any other test file
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/x_test');
    y_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

    There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.

  5. Rescaled CIFAR-10 dataset

    • zenodo.org
    Updated Jun 27, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Dataset updated
    Jun 27, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, 67(29), https://doi.org/10.1007/s10851-025-01245-x.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above.

    The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order for all test images to have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k an integer in the range [-4, 4] (the same nine factors as for the Rescaled Fashion-MNIST dataset above):

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
    ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File("cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5", "r") as f:
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:

    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))

    The test datasets can be loaded in Python as:

    with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5", "r") as f:  # or any other test file
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    The test datasets can be loaded in Matlab as:

    x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/x_test');
    y_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5', '/y_test');

    The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.

  6. Data and Code for: "Universal Adaptive Normalization Scale (AMIS):...

    • search.dataone.org
    • dataverse.harvard.edu
    Updated Nov 15, 2025
    Cite
    Kravtsov, Gennady (2025). Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System" [Dataset]. http://doi.org/10.7910/DVN/BISM0N
    Dataset updated
    Nov 15, 2025
    Dataset provided by
    Harvard Dataverse
    Authors
    Kravtsov, Gennady
    Description

    Dataset Title: Data and Code for: "Universal Adaptive Normalization Scale (AMIS): Integration of Heterogeneous Metrics into a Unified System"

    Description: This dataset contains source data and processing results for validating the Adaptive Multi-Interval Scale (AMIS) normalization method. It includes educational performance data (student grades), economic statistics (World Bank GDP), and a Python implementation of the AMIS algorithm with a graphical interface.

    Contents:
    • Source data: educational grades and GDP statistics
    • AMIS normalization results (3, 5, 9, 17-point models)
    • Comparative analysis with linear normalization
    • Ready-to-use Python code for data processing

    Applications:
    • Educational data normalization and analysis
    • Economic indicators comparison
    • Development of unified metric systems
    • Methodology research in data scaling

    Technical info: Python code with pandas, numpy, scipy, matplotlib dependencies. Data in Excel format.

  7. Cloud Task Scheduling Dataset

    • kaggle.com
    zip
    Updated Oct 16, 2025
    Cite
    Python Developer (2025). Cloud Task Scheduling Dataset [Dataset]. https://www.kaggle.com/datasets/programmer3/cloud-task-scheduling-dataset
    Available download formats: zip (796496 bytes)
    Dataset updated
    Oct 16, 2025
    Authors
    Python Developer
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    The Cloud Task Scheduling Dataset represents large-scale workload management across heterogeneous computing environments, including cloud, fog, and edge systems. It contains realistic data capturing the behavior of distributed tasks and virtual machines under varying computational loads and network conditions.

    The dataset includes over 6000 tasks with parameters such as task length, priority, deadline, memory, bandwidth, execution time, completion time, energy use, and resource utilization metrics. Performance indicators such as makespan, cost, response time, imbalance, storage efficiency, and network path load are also provided.

  8. Data from: Pangeo-Enabled ESM Pattern Scaling (PEEPS): A customizable...

    • data.niaid.nih.gov
    Updated Jan 22, 2023
    Cite
    Kravitz, Ben; Snyder, Abigail (2023). Pangeo-Enabled ESM Pattern Scaling (PEEPS): A customizable dataset of emulated Earth System Model output [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_7139977
    Dataset updated
    Jan 22, 2023
    Dataset provided by
    Indiana University
    Joint Global Change Research Institute
    Authors
    Kravitz, Ben; Snyder, Abigail
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Earth
    Description

    We produce a dataset that uses pattern scaling, a common method of emulating climate models. Our dataset is built on the Pangeo CMIP6 archive, which has the advantage that we don't need to actually download the climate model output. Here we demonstrate the utility of our dataset, called Pangeo-Enabled ESM Pattern Scaling (PEEPS). The dataset, which is encapsulated in a Jupyter notebook (and replicated in a Python file), is flexible and can be extended to multiple scenarios and multiple variables, as long as they are in the Pangeo-accessible archive.

  9. Python package Modin

    • kaggle.com
    zip
    Updated Nov 15, 2020
    Cite
    Kaihua Zhang (2020). Python package Modin [Dataset]. https://www.kaggle.com/zhangkaihua88/pythonmodin
    Available download formats: zip (481120 bytes)
    Dataset updated
    Nov 15, 2020
    Authors
    Kaihua Zhang
    Description

    Dataset

    This dataset was created by Kaihua Zhang

    Contents

  10. Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter...

    • zenodo.org
    bz2
    Updated Mar 15, 2021
    Cite
    João Felipe; Leonardo; Vanessa; Juliana (2021). Dataset of A Large-scale Study about Quality and Reproducibility of Jupyter Notebooks [Dataset]. http://doi.org/10.5281/zenodo.2592524
    Available download formats: bz2
    Dataset updated
    Mar 15, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    João Felipe; Leonardo; Vanessa; Juliana
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The self-documenting aspects and the ability to reproduce results have been touted as significant benefits of Jupyter Notebooks. At the same time, there has been growing criticism that the way notebooks are being used leads to unexpected behavior, encourages poor coding practices, and makes results hard to reproduce. To understand good and bad practices used in the development of real notebooks, we analyzed 1.4 million notebooks from GitHub.

    Paper: https://2019.msrconf.org/event/msr-2019-papers-a-large-scale-study-about-quality-and-reproducibility-of-jupyter-notebooks

    This repository contains two files:

    • dump.tar.bz2
    • jupyter_reproducibility.tar.bz2

    The dump.tar.bz2 file contains a PostgreSQL dump of the database, with all the data we extracted from the notebooks.

    The jupyter_reproducibility.tar.bz2 file contains all the scripts we used to query and download Jupyter Notebooks, extract data from them, and analyze the data. It is organized as follows:

    • analyses: this folder has all the notebooks we use to analyze the data in the PostgreSQL database.
    • archaeology: this folder has all the scripts we use to query, download, and extract data from GitHub notebooks.
    • paper: initially empty; the notebook analyses/N12.To.Paper.ipynb moves data into it.

    In the remainder of this text, we give instructions for reproducing the analyses using the data provided in the dump, and for reproducing the collection by collecting data from GitHub again.

    Reproducing the Analysis

    This section shows how to load the data in the database and run the analyses notebooks. In the analysis, we used the following environment:

    Ubuntu 18.04.1 LTS
    PostgreSQL 10.6
    Conda 4.5.11
    Python 3.7.2
    PdfCrop 2012/11/02 v1.38

    First, download dump.tar.bz2 and extract it:

    tar -xjf dump.tar.bz2

    It extracts the file db2019-03-13.dump. Create a database in PostgreSQL (we call it "jupyter"), and use psql to restore the dump:

    psql jupyter < db2019-03-13.dump

    It populates the database with the dump. Now, configure the connection string for sqlalchemy by setting the environment variable JUP_DB_CONNECTION:

    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter";

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Create a conda environment with Python 3.7:

    conda create -n analyses python=3.7
    conda activate analyses

    Go to the analyses folder and install all the dependencies from requirements.txt:

    cd jupyter_reproducibility/analyses
    pip install -r requirements.txt

    To reproduce the analyses, run Jupyter in this folder:

    jupyter notebook

    Execute the notebooks in this order:

    • Index.ipynb
    • N0.Repository.ipynb
    • N1.Skip.Notebook.ipynb
    • N2.Notebook.ipynb
    • N3.Cell.ipynb
    • N4.Features.ipynb
    • N5.Modules.ipynb
    • N6.AST.ipynb
    • N7.Name.ipynb
    • N8.Execution.ipynb
    • N9.Cell.Execution.Order.ipynb
    • N10.Markdown.ipynb
    • N11.Repository.With.Notebook.Restriction.ipynb
    • N12.To.Paper.ipynb

    Reproducing or Expanding the Collection

    The collection demands more steps to reproduce and takes much longer to run (months). It also involves running arbitrary code on your machine. Proceed with caution.

    Requirements

    This time, we have extra requirements:

    All the analysis requirements
    lbzip2 2.5
    gcc 7.3.0
    Github account
    Gmail account

    Environment

    First, set the following environment variables:

    export JUP_MACHINE="db"; # machine identifier
    export JUP_BASE_DIR="/mnt/jupyter/github"; # place to store the repositories
    export JUP_LOGS_DIR="/home/jupyter/logs"; # log files
    export JUP_COMPRESSION="lbzip2"; # compression program
    export JUP_VERBOSE="5"; # verbose level
    export JUP_DB_CONNECTION="postgresql://user:password@hostname/jupyter"; # sqlalchemy connection
    export JUP_GITHUB_USERNAME="github_username"; # your github username
    export JUP_GITHUB_PASSWORD="github_password"; # your github password
    export JUP_MAX_SIZE="8000.0"; # maximum size of the repositories directory (in GB)
    export JUP_FIRST_DATE="2013-01-01"; # initial date to query github
    export JUP_EMAIL_LOGIN="gmail@gmail.com"; # your gmail address
    export JUP_EMAIL_TO="target@email.com"; # email that receives notifications
    export JUP_OAUTH_FILE="~/oauth2_creds.json" # oauth2 authentication file
    export JUP_NOTEBOOK_INTERVAL=""; # notebook id interval for this machine. Leave it in blank
    export JUP_REPOSITORY_INTERVAL=""; # repository id interval for this machine. Leave it in blank
    export JUP_WITH_EXECUTION="1"; # execute python notebooks
    export JUP_WITH_DEPENDENCY="0"; # run notebooks with and without declared dependencies
    export JUP_EXECUTION_MODE="-1"; # run following the execution order
    export JUP_EXECUTION_DIR="/home/jupyter/execution"; # temporary directory for running notebooks
    export JUP_ANACONDA_PATH="~/anaconda3"; # conda installation path
    export JUP_MOUNT_BASE="/home/jupyter/mount_ghstudy.sh"; # bash script to mount base dir
    export JUP_UMOUNT_BASE="/home/jupyter/umount_ghstudy.sh"; # bash script to umount base dir
    export JUP_NOTEBOOK_TIMEOUT="300"; # timeout the extraction
    
    
    # Frequency of log reports
    export JUP_ASTROID_FREQUENCY="5";
    export JUP_IPYTHON_FREQUENCY="5";
    export JUP_NOTEBOOKS_FREQUENCY="5";
    export JUP_REQUIREMENT_FREQUENCY="5";
    export JUP_CRAWLER_FREQUENCY="1";
    export JUP_CLONE_FREQUENCY="1";
    export JUP_COMPRESS_FREQUENCY="5";
    
    export JUP_DB_IP="localhost"; # postgres database IP

    Then, configure the file ~/oauth2_creds.json, according to yagmail documentation: https://media.readthedocs.org/pdf/yagmail/latest/yagmail.pdf

    Configure the mount_ghstudy.sh and umount_ghstudy.sh scripts. The first one should mount the folder that stores the repositories; the second one should unmount it. You can leave the scripts blank, but that is not advisable, as the reproducibility study runs arbitrary code on your machine and you may lose your data.

    Scripts

    Download and extract jupyter_reproducibility.tar.bz2:

    tar -xjf jupyter_reproducibility.tar.bz2

    Install 5 conda environments and 5 anaconda environments, one for each Python version. In each of them, upgrade pip, install pipenv, and install the archaeology package (note that it is a local package that has not been published to PyPI; make sure to use the -e option):

    Conda 2.7

    conda create -n raw27 python=2.7 -y
    conda activate raw27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 2.7

    conda create -n py27 python=2.7 anaconda -y
    conda activate py27
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    

    Conda 3.4

    It requires a manual jupyter and pathlib2 installation due to some incompatibilities found in the default installation.

    conda create -n raw34 python=3.4 -y
    conda activate raw34
    conda install jupyter -c conda-forge -y
    conda uninstall jupyter -y
    pip install --upgrade pip
    pip install jupyter
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology
    pip install pathlib2

    Anaconda 3.4

    conda create -n py34 python=3.4 anaconda -y
    conda activate py34
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.5

    conda create -n raw35 python=3.5 -y
    conda activate raw35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.5

    It requires the manual installation of other anaconda packages.

    conda create -n py35 python=3.5 anaconda -y
    conda install -y appdirs atomicwrites keyring secretstorage libuuid navigator-updater prometheus_client pyasn1 pyasn1-modules spyder-kernels tqdm jeepney automat constantly anaconda-navigator
    conda activate py35
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.6

    conda create -n raw36 python=3.6 -y
    conda activate raw36
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Anaconda 3.6

    conda create -n py36 python=3.6 anaconda -y
    conda activate py36
    conda install -y anaconda-navigator jupyterlab_server navigator-updater
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

    Conda 3.7

    conda create -n raw37 python=3.7 -y
    conda activate raw37
    pip install --upgrade pip
    pip install pipenv
    pip install -e jupyter_reproducibility/archaeology

  11. CodeRM-UnitTest

    • huggingface.co
    Updated Jan 20, 2025
    Cite
    Zeyao Ma (2025). CodeRM-UnitTest [Dataset]. https://huggingface.co/datasets/KAKA22/CodeRM-UnitTest
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 20, 2025
    Authors
    Zeyao Ma
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    Dataset Description

    The CodeRM-UnitTest dataset originates from the paper Dynamic Scaling of Unit Tests for Code Reward Modeling, available on arXiv. You can visit the homepage to learn more about the paper. It is a curated collection of high-quality synthetic Python unit tests, derived from two prominent code instruction tuning datasets: CodeFeedback-Filtered-Instruction and the training set of TACO. This dataset is used for training CodeRM-8B, a small yet powerful unit test… See the full description on the dataset page: https://huggingface.co/datasets/KAKA22/CodeRM-UnitTest.

  12. Dataset for Large-Scale Analysis of Modern Code Review Practices and...

    • datadryad.org
    zip
    Updated Nov 28, 2017
    Cite
    Christopher Thompson (2017). Dataset for Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software [Dataset]. http://doi.org/10.6078/D13M3J
    Available download formats: zip
    Dataset updated
    Nov 28, 2017
    Dataset provided by
    Dryad
    Authors
    Christopher Thompson
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Nov 28, 2017
    Description

    Dataset for my thesis titled "Large-Scale Analysis of Modern Code Review Practices and Software Security in Open Source Software".

    Contains:

    Labeled issues data used to train and test the quantifier models.
    Post-quantification datasets the analyses were performed on.
    Scripts (R, Python) for all analyses performed.
    
  13. Metaverse Gait Authentication Dataset (MGAD)

    • figshare.com
    csv
    Updated Feb 11, 2025
    Cite
    sandeep ravikanti (2025). Metaverse Gait Authentication Dataset (MGAD) [Dataset]. http://doi.org/10.6084/m9.figshare.28387664.v1
    Available download formats: csv
    Dataset updated
    Feb 11, 2025
    Dataset provided by
    Figshare (http://figshare.com/)
    Authors
    sandeep ravikanti
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    1. Dataset Overview

    The Metaverse Gait Authentication Dataset (MGAD) is a large-scale dataset for gait-based biometric authentication in virtual environments. It consists of gait data from 5,000 simulated users, generated using Unity 3D and processed using OpenPose and MediaPipe. This dataset is ideal for researchers working on biometric authentication, gait analysis, and AI-driven identity verification systems.

    2. Data Structure & Format

    • File Format: CSV
    • Number of Samples: 5,000 users
    • Number of Features: 16 gait-based features
    • Columns: Each row represents a user with corresponding gait feature values
    • Size: Approximately (mention size in MB/GB after upload)

    3. Feature Descriptions

    The dataset includes 16 extracted gait features:

    • Stride Length (m): Average distance covered in one gait cycle.
    • Step Frequency (steps/min): Number of steps taken per minute.
    • Stance Phase Duration (s): Duration of the stance phase in a gait cycle.
    • Swing Phase Duration (s): Duration of the swing phase in a gait cycle.
    • Double Support Phase Duration (s): Time both feet are in contact with the ground.
    • Step Length (m): Distance between consecutive foot placements.
    • Cadence Variability (%): Variability in step rate.
    • Hip Joint Angle (°): Maximum angle variation in the hip joint.
    • Knee Joint Angle (°): Maximum flexion-extension knee angle.
    • Ankle Joint Angle (°): Angle variation at the ankle joint.
    • Avg. Vertical GRF (N): Average vertical ground reaction force.
    • Avg. Anterior-Posterior GRF (N): Ground reaction force in the forward-backward direction.
    • Avg. Medial-Lateral GRF (N): Ground reaction force in the side-to-side direction.
    • Avg. COP Excursion (mm): Center of pressure movement during the stance phase.
    • Foot Clearance during Swing Phase (mm): Minimum height of the foot during the swing phase.
    • Gait Symmetry Index (%): Measure of symmetry between left and right gait cycles.

    4. How to Use the Dataset

    • Load the dataset in Python using pandas (see the sketch below).
    • Use the features for machine learning models in biometric authentication.
    • Apply preprocessing techniques like normalization and feature scaling.
    • Train and evaluate deep learning or ensemble models for gait recognition.

    5. Citation & License

    If you use this dataset, please cite it as follows: Sandeep Ravikanti, "Metaverse Gait Authentication Dataset (MGAD)," IEEE DataPort, 2025. DOI: https://dx.doi.org/10.21227/rvh5-8842

    6. Contact Information

    For inquiries or collaborations, please contact: bitsrmit2023@gmail.com
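    A minimal loading sketch with pandas (the CSV file name below is hypothetical; substitute the actual file from the download):

    import pandas as pd

    df = pd.read_csv("MGAD.csv")   # hypothetical file name
    print(df.shape)                # expected: 5,000 rows of gait features
    print(df.columns.tolist())     # the 16 feature columns listed above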
  14. Multimodal Vision-Audio-Language Dataset

    • data.niaid.nih.gov
    • zenodo.org
    Updated Jul 11, 2024
    Cite
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_10060784
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Goethe University Frankfurt
    Authors
    Schaumlöffel, Timothy; Roig, Gemma; Choksi, Bhavin
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities. Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries. The split into train, validation and test sets follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd

    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    The first record looks like:

    dataset                                      AudioSet
    filename                        train/---2_BBVHAA.mp3
    captions_visual    [a man in a black hat and glasses.]
    captions_auditory    [a man speaks and dishes clank.]
    tags                                         [Speech]

    Description

    The annotation file consists of the following fields:

    • filename: Name of the corresponding file (video or audio file)
    • dataset: Source dataset associated with the data point
    • captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    • captions_auditory: A list of captions related to the auditory content of the video
    • tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues. They must be downloaded from the source. However, due to missing files, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  15. ProVision-10M

    • huggingface.co
    Updated Nov 5, 2025
    Cite
    Salesforce (2025). ProVision-10M [Dataset]. https://huggingface.co/datasets/Salesforce/ProVision-10M
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Nov 5, 2025
    Dataset provided by
    Salesforce Inc (http://salesforce.com/)
    Authors
    Salesforce
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

    ProVision is an extendable data generation engine that produces instruction data for large multimodal language models (MLMs). In particular, it synthesizes instruction data via data generators (Python programs) and scene graphs rather than proprietary models. It also includes a scene graph generation pipeline consisting of various state-of-the-art models (e.g., an object detection model). Thus… See the full description on the dataset page: https://huggingface.co/datasets/Salesforce/ProVision-10M.

  16. facebook/natural_reasoning

    • kaggle.com
    zip
    Updated Feb 27, 2025
    Cite
    Zehra Korkusuz (2025). facebook/natural_reasoning [Dataset]. https://www.kaggle.com/datasets/zehrakorkusuz/natural-reasoning
    Available download formats: zip (1694591016 bytes)
    Dataset updated
    Feb 27, 2025
    Authors
    Zehra Korkusuz
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    Natural Reasoning Dataset

    Source: Huggingface

    Dataset Overview

    Natural Reasoning is a large-scale dataset designed for general reasoning tasks. It consists of high-quality, challenging reasoning questions backtranslated from pretraining corpora DCLM and FineMath. The dataset has been carefully deduplicated and decontaminated from popular reasoning benchmarks including MATH, GPQA, MMLU-Pro, and MMLU-STEM.

    A 1.1 million subset of the Natural Reasoning dataset is released to the research community to foster the development of strong large language model (LLM) reasoners.

    Dataset Information

    File Format: natural_reasoning.parquet


    How to Use

    You can load the dataset directly from Hugging Face as follows:

    from datasets import load_dataset
    
    ds = load_dataset("facebook/natural_reasoning")
    

    Data Collection and Quality

    The dataset was constructed from the pretraining corpora DCLM and FineMath. The questions have been filtered to remove contamination and duplication from widely-used reasoning benchmarks like MATH, GPQA, MMLU-Pro, and MMLU-STEM. For each question, the dataset provides a reference final answer extracted from the original document when available, and also includes a model-generated response from Llama3.3-70B-Instruct.

    Reference Answer Statistics

    In the 1.1 million subset:

    • 18.29% of the questions do not have a reference answer.
    • 9.71% of the questions have a single-word answer.
    • 21.58% of the questions have a short answer.
    • 50.42% of the questions have a long-form reference answer.

    Scaling Curve Performance

    Training on the Natural Reasoning dataset shows superior scaling effects compared to other datasets. When training the Llama3.1-8B-Instruct model, the dataset achieved better performance on average across three key benchmarks: MATH, GPQA, and MMLU-Pro.

    [Figure: scaling curve. Source: https://cdn-uploads.huggingface.co/production/uploads/659a395421a7431643caedda/S6aO-agjRRhc0JLkohZ5z.jpeg]

    Citation

    If you use the Natural Reasoning dataset, please cite it with the following BibTeX entry:

    @misc{yuan2025naturalreasoningreasoningwild28m,
       title={NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions},
       author={Weizhe Yuan and Jane Yu and Song Jiang and Karthik Padthe and Yang Li and Dong Wang and Ilia Kulikov and Kyunghyun Cho and Yuandong Tian and Jason E Weston and Xian Li},
       year={2025},
       eprint={2502.13124},
       archivePrefix={arXiv},
       primaryClass={cs.CL},
       url={https://arxiv.org/abs/2502.13124}
    }
    


  17. Code4ML 2.0

    • zenodo.org
    csv, txt
    Updated May 19, 2025
    Cite
    Anonymous authors (2025). Code4ML 2.0 [Dataset]. http://doi.org/10.5281/zenodo.15465737
    Available download formats: csv, txt
    Dataset updated
    May 19, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Anonymous authors
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This is an enriched version of the Code4ML dataset, a large-scale corpus of annotated Python code snippets, competition summaries, and data descriptions sourced from Kaggle. The initial release includes approximately 2.5 million snippets of machine learning code extracted from around 100,000 Jupyter notebooks. A portion of these snippets has been manually annotated by human assessors through a custom-built, user-friendly interface designed for this task.

    The original dataset is organized into multiple CSV files, each containing structured data on different entities:

    • code_blocks.csv: Contains raw code snippets extracted from Kaggle.
    • kernels_meta.csv: Metadata for the notebooks (kernels) from which the code snippets were derived.
    • competitions_meta.csv: Metadata describing Kaggle competitions, including information about tasks and data.
    • markup_data.csv: Annotated code blocks with semantic types, allowing deeper analysis of code structure.
    • vertices.csv: A mapping from numeric IDs to semantic types and subclasses, used to interpret annotated code blocks.

    Table 1. code_blocks.csv structure

    code_blocks_index: Global index linking code blocks to markup_data.csv.
    kernel_id: Identifier for the Kaggle Jupyter notebook from which the code block was extracted.
    code_block_id: Position of the code block within the notebook.
    code_block: The actual machine learning code snippet.

    Table 2. kernels_meta.csv structure

    kernel_id: Identifier for the Kaggle Jupyter notebook.
    kaggle_score: Performance metric of the notebook.
    kaggle_comments: Number of comments on the notebook.
    kaggle_upvotes: Number of upvotes the notebook received.
    kernel_link: URL to the notebook.
    comp_name: Name of the associated Kaggle competition.

    Table 3. competitions_meta.csv structure

    comp_name: Name of the Kaggle competition.
    description: Overview of the competition task.
    data_type: Type of data used in the competition.
    comp_type: Classification of the competition.
    subtitle: Short description of the task.
    EvaluationAlgorithmAbbreviation: Metric used for assessing competition submissions.
    data_sources: Links to datasets used.
    metric type: Class label for the assessment metric.

    Table 4. markup_data.csv structure

    code_block: Machine learning code block.
    too_long: Flag indicating whether the block spans multiple semantic types.
    marks: Confidence level of the annotation.
    graph_vertex_id: ID of the semantic type.

    The dataset allows mapping between these tables. For example:

    • code_blocks.csv can be linked to kernels_meta.csv via the kernel_id column.
    • kernels_meta.csv is connected to competitions_meta.csv through comp_name. To maintain quality, kernels_meta.csv includes only notebooks with available Kaggle scores.

    In addition, data_with_preds.csv contains automatically classified code blocks, with a mapping back to code_blocks.csv via the code_blocks_index column.
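    A minimal sketch of these joins with pandas (column names as documented above; file paths are assumed):

    import pandas as pd

    code_blocks = pd.read_csv("code_blocks.csv")
    kernels_meta = pd.read_csv("kernels_meta.csv")
    competitions_meta = pd.read_csv("competitions_meta.csv")

    # Link snippets to their notebooks, then notebooks to competitions
    snippets = code_blocks.merge(kernels_meta, on="kernel_id")
    snippets = snippets.merge(competitions_meta, on="comp_name")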

    Code4ML 2.0 Enhancements

    The updated Code4ML 2.0 corpus introduces kernels extracted from Meta Kaggle Code. These kernels correspond to Kaggle competitions launched since 2020. The natural-language descriptions of the competitions are retrieved with the aid of an LLM.

    Notebooks in kernels_meta2.csv may not have a Kaggle score but include a leaderboard ranking (rank), providing additional context for evaluation.

    competitions_meta_2.csv is enriched with data_cards, describing the data used in the competitions.

    Applications

    The Code4ML 2.0 corpus is a versatile resource, enabling training and evaluation of models in areas such as:

    • Code generation
    • Code understanding
    • Natural language processing of code-related tasks
  18. wikihow

    • tensorflow.org
    • opendatalab.com
    Updated Dec 6, 2022
    Cite
    (2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
    Dataset updated
    Dec 6, 2022
    Description

    WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

    There are two features:

    • text: WikiHow answer texts.
    • headline: bold lines as summary.

    There are two separate versions:

    • all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
    • sep: consisting of each paragraph and its summary.

    Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in manual folder https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig. Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikihow', split='train')
    for ex in ds.take(4):
        print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE)...

    • figshare.scilifelab.se
    • researchdata.se
    zip
    Updated Jan 15, 2025
    Cite
    Mikhail Panfilov; Guanzhong Mao; Jianfeng Guo; Javier Aguirre Rivera; Anton Sabantcev; Sebastian Deindl (2025). MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) protocol data and codes [Dataset]. http://doi.org/10.17044/scilifelab.28008872.v1
    Available download formats: zip
    Dataset updated
    Jan 15, 2025
    Dataset provided by
    Uppsala University
    Authors
    Mikhail Panfilov; Guanzhong Mao; Jianfeng Guo; Javier Aguirre Rivera; Anton Sabantcev; Sebastian Deindl
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    A test dataset for MUSCLE (MUltiplexed Single-molecule Characterization at the Library scalE) data analysis. See "\Python codes for MUSCLE data analysis\README.txt" for instructions on running the data analysis code. Use the files in the "Test MUSCLE dataset" folder as input for the code. "Test MUSCLE dataset\Output_tile1" contains the code output for the test dataset. The example dataset corresponds to one MiSeq tile in an experiment analyzing dCas9-induced R-loop formation for a library of 256 different target sequences. The latest version of the Python code for matching single-molecule FRET traces with sequenced clusters is available at https://github.com/deindllab/MUSCLE/.

  20. Data from: Critical Search: A procedure for guided reading in large-scale...

    • dataverse.harvard.edu
    Updated Jan 4, 2019
    Cite
    Jo Guldi (2019). Critical Search: A procedure for guided reading in large-scale textual corpora [Dataset]. http://doi.org/10.7910/DVN/BJNAPD
    Available download formats: Croissant (a format for machine-learning datasets; see mlcommons.org/croissant)
    Dataset updated
    Jan 4, 2019
    Dataset provided by
    Harvard Dataverse
    Authors
    Jo Guldi
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    This dataset contains full-scale visualizations as well as original data and code (in R and Python) to reproduce the figures and tables for "Critical Search." The data includes full-text data for the Hansard debates, and the code employs keyword search, topic modeling, and KL measurement.
