46 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The Python code for splitting a dataset into train, test and validation sets (a hedged sketch of such a split follows this list).
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The Python code for translating the cleaned dataset.
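
    A hedged sketch of the kind of split performed by splitting.py (the 80/10/10 ratios and output file names here are illustrative assumptions, not taken from the dataset):

    # Illustrative train/test/validation split with scikit-learn.
    # Ratios (80/10/10) and file names are assumptions for this sketch.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Cleaned_Dataset.csv")

    # Hold out 20% of the rows, then split that portion evenly
    # into validation and test sets.
    train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
    val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

    train_df.to_csv("train.csv", index=False)
    val_df.to_csv("validation.csv", index=False)
    test_df.to_csv("test.csv", index=False)
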
  2. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset                                         AudioSet
    filename                           train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory       [a man speaks and dishes clank.]
    tags                                            [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, in case of missing files at the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Aug 24, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4571228
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • The dataset was gathered on Sep. 17th, 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs.
    • The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.
    • All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
    • The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding sets is provided in the dataset_split.csv file.
    • Notable changes to each version of the dataset are documented in CHANGELOG.md.
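
    As a quick way to inspect the split (the exact column layout of dataset_split.csv is not documented in this listing, so verify it before relying on it):

    import pandas as pd

    # Column names are unknown here; print the first rows to see the layout.
    splits = pd.read_csv("dataset_split.csv")
    print(splits.head())
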
  4. Street View House Numbers

    • datasets.activeloop.ai
    deeplake
    Updated Feb 3, 2022
    Cite
    Google, Stanford University (2022). Street View House Numbers [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/the-street-view-house-numbers-svhn-dataset/
    Explore at:
    Available download formats: deeplake
    Dataset updated
    Feb 3, 2022
    Dataset authored and provided by
    Google, Stanford University
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Google, Stanford University
    Description

    The Street View House Numbers (SVHN) dataset is a dataset of 604,300 images of house numbers taken from Google Street View. The dataset is split into a training set of 73,257 images, a test set of 26,032 images, and a validation set of 50,113 images. The images in the dataset are all 32 x 32 pixels in size and are in RGB color. The dataset is used to train and evaluate machine learning models for the task of digit recognition.
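
    The deeplake format can be read with Activeloop's deeplake library. A minimal sketch, assuming the dataset is published under the usual hub://activeloop namespace and exposes an images tensor (both are assumptions; check ds.tensors):

    import deeplake

    # Dataset path and tensor name are assumptions based on Activeloop's
    # usual naming conventions; consult the docs page cited above.
    ds = deeplake.load("hub://activeloop/svhn-train")
    print(len(ds))                # number of samples
    image = ds.images[0].numpy()  # first image as a NumPy array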

  5. hate_speech_dataset

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Christina Christodoulou (2024). hate_speech_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/hate_speech_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 27, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    32,579 texts in total: 14,012 NOT hateful texts and 18,567 HATEFUL texts. All duplicate values were removed. The data was split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation sets (stratified by label), giving an 80/10/10 split overall. Train set label distribution: 0 ==> 11,210, 1 ==> 14,853, 26,063 in total. Validation set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. Test set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/hate_speech_dataset.
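
    A minimal loading sketch with the Hugging Face datasets library (the split names are assumed to follow the 80/10/10 description above):

    from datasets import load_dataset

    ds = load_dataset("christinacdl/hate_speech_dataset")
    print(ds)  # shows the available splits and their sizes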

  6. ref_coco

    • tensorflow.org
    • opendatalab.com
    Updated May 31, 2024
    Cite
    (2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco
    Explore at:
    Dataset updated
    May 31, 2024
    Description

    A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

    RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

    Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".

    Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

    dataset   partition  split  refs   images
    refcoco   google     train  40000  19213
    refcoco   google     val     5000   4559
    refcoco   google     test    5000   4527
    refcoco   unc        train  42404  16994
    refcoco   unc        val     3811   1500
    refcoco   unc        testA   1975    750
    refcoco   unc        testB   1810    750
    refcoco+  unc        train  42278  16992
    refcoco+  unc        val     3805   1500
    refcoco+  unc        testA   1975    750
    refcoco+  unc        testB   1798    750
    refcocog  google     train  44822  24698
    refcocog  google     val     5000   4650
    refcocog  umd        train  42226  21899
    refcocog  umd        val     2573   1300
    refcocog  umd        test    5023   2600

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ref_coco', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

  7. Tour Recommendation Model

    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    Available download formats: text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.
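
    As a hedged sketch of the workflow described above (the two column names come from the description and the split ratio matches it, but the model hyperparameters are illustrative assumptions):

    import pandas as pd
    import joblib
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Columns per the description: place_or_event_id, rating.
    # This sketch assumes numeric ids; categorical ids would need encoding.
    df = pd.read_csv("user_ratings_dataset.csv")
    X = df[["place_or_event_id"]]
    y = df["rating"]

    # 80/20 train/test split as described; a validation slice for
    # hyperparameter tuning would be carved out of the training portion.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = DecisionTreeRegressor(max_depth=5)  # depth is an illustrative choice
    model.fit(X_train, y_train)
    print("R^2 on test set:", model.score(X_test, y_test))

    joblib.dump(model, "tour_recommendation_model.pkl")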

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  8. Rescaled CIFAR-10 dataset

    • explore.openaire.eu
    • zenodo.org
    Updated Apr 10, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, to appear.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above. The dataset is made available on request. If you would be interested in trying out this dataset, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship" and "truck". In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File(``, 'r') as f:  # insert the path to the h5 file here
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since Pytorch uses the format [num_samples, channels, width...
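
    The description is truncated above; the NHWC-to-NCHW permutation it alludes to would look like the following sketch (assuming the arrays are stored as [num_samples, height, width, channels], which is an assumption, not stated by the dataset):

    import numpy as np

    # Convert [num_samples, height, width, channels] to PyTorch's
    # [num_samples, channels, height, width].
    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))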

  9. wikihow

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Dec 6, 2022
    Cite
    (2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

    There are two features:
    - text: WikiHow answer texts.
    - headline: bold lines as summary.

    There are two separate versions:
    - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
    - sep: consisting of each paragraph and its summary.

    Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual download folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikihow', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. cinic-10 Lance Dataset

    • kaggle.com
    Updated Apr 25, 2024
    Cite
    Vipul Maheshwari (2024). cinic-10 Lance Dataset [Dataset]. https://www.kaggle.com/datasets/vipulmaheshwarii/cinic-10-lance-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vipul Maheshwari
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🔴 NOTE: USE VERSION 2

    This is the CINIC-10 dataset's train, validation, and test splits saved in the Lance file format for blazing fast and memory-efficient I/O. This dataset only includes data necessary for image classification tasks.

    1. For detailed information on how the dataset was created, refer to the paper describing the CINIC-10 dataset
    2. For instructions on how to create Lance data for any image dataset, check out this detailed blog post which shows how you can use a single script for any image dataset and convert it to lance format

    Instructions for using this dataset

    This dataset is provided as a single zip file containing the Lance-formatted data for the train, validation, and test splits. To use this dataset, follow these steps:

    1. Download the dataset through this page, then move the unzipped files to a relevant folder.
    2. In your code, use the datasets by creating LanceDataset objects and passing the respective paths:

    import lance
    train_lance = lance.dataset('cinic/cinic_train.lance')
    test_lance = lance.dataset('cinic/cinic_test.lance')
    val_lance = lance.dataset('cinic/cinic_val.lance')
    

    Note that the Lance file format provides blazing-fast and memory-efficient I/O, allowing you to work with large datasets without running into memory issues. Refer to the documentation for more information on how to use the Lance library.

  11. Data from: JSON Dataset of Simulated Building Heat Control for System of...

    • researchdata.se
    • gimi9.com
    Updated Mar 21, 2025
    Cite
    Jacob Nilsson (2025). JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. http://doi.org/10.5878/e5hb-ne80
    Explore at:
    Available download formats: (438755370), (110041420), (156812), (5417)
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Luleå University of Technology
    Authors
    Jacob Nilsson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Luleå Municipality
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
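
    Following that recommendation, a minimal loading sketch (the 10% validation fraction is an illustrative choice, since no validation set is prescribed):

    import pandas as pd

    # The files are semicolon-separated, as noted above.
    train = pd.read_csv("training.csv", sep=";")
    test = pd.read_csv("test.csv", sep=";")

    # No dedicated validation file exists; sample one from the training data.
    val = train.sample(frac=0.1, random_state=42)
    train = train.drop(val.index)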

    The data file with temperatures (smhi-july-23-29-2018.csv) acts as input for the thermodynamic building simulation found on Github, where it is used to get the outside temperature and corresponding timestamps. Temperature data for Luleå Summer 2018 were downloaded from SMHI.

  12. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa, Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    # Imports
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect data in your Google Drive
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preop measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size  2. ICL power  3. LV  4. CLR  5. ACD  6. ATA
    # 7. MSE  8. Age  9. Pupil size  10. WTW  11. CCT  12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, we can split the data 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    # In our study, we already defined the training (B&VIIT Eye Center, n=1455)
    # and test (Kitasato University, n=290) datasets, so this code was not
    # necessary to perform our analysis.

    # Optimal parameter search could be performed in this section.
    parameters = {'bootstrap': True,
                  'min_samples_leaf': 3,
                  'n_estimators': 500,
                  'criterion': 'mae',
                  'min_samples_split': 10,
                  'max_features': 'sqrt',
                  'max_depth': 6,
                  'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  13. speech_commands

    • tensorflow.org
    • datasets.activeloop.ai
    • +1more
    Updated Jan 13, 2023
    Cite
    (2023). speech_commands [Dataset]. http://identifiers.org/arxiv:1804.03209
    Explore at:
    Dataset updated
    Jan 13, 2023
    Description

    An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments under the "background_noise" folder. Here we split this background noise into 1-second clips, and also keep one of the files for the validation set.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('speech_commands', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  14. cf-cpp-to-python-code-generation

    • huggingface.co
    Updated Jul 20, 2025
    Cite
    Hesam Haddad (2025). cf-cpp-to-python-code-generation [Dataset]. https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation
    Explore at:
    Dataset updated
    Jul 20, 2025
    Authors
    Hesam Haddad
    Description

    Dataset

    The cf-llm-finetune project uses a synthetic parallel dataset built from Codeforces submissions and problems. C++ ICPC-style solutions are filtered, cleaned, and paired with problem statements to generate Python translations using GPT-4.1, creating a fine-tuning dataset for code translation. The final dataset consists of C++ solutions from 2,000 unique problems and synthetic Python answers, split into train (1,400), validation (300), and test (300) sets. For details on dataset… See the full description on the dataset page: https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation.
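
    A minimal loading sketch with the Hugging Face datasets library (the record fields are not documented in this listing):

    from datasets import load_dataset

    ds = load_dataset("demoversion/cf-cpp-to-python-code-generation")
    print(ds)  # expected per the description: train (1,400), validation (300), test (300)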

  15. Deep learning based 3d reconstruction for phenotyping of wheat seeds:...

    • b2find.eudat.eu
    Updated Aug 23, 2023
    Cite
    (2023). Deep learning based 3d reconstruction for phenotyping of wheat seeds: dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/80efe038-164b-50c8-ac8d-19605da5d4ea
    Explore at:
    Dataset updated
    Aug 23, 2023
    Description

    We present a new data set for 3d wheat seed reconstruction, propose a challenge "Wheat Seed 3d Reconstruction Challenge", and provide baseline methods [1]. The dataset consists of 2964 seeds, split into 2520 seeds for training/validation and 444 for testing. Ground truth data for the test set is not provided; however, test results can be evaluated in the "Wheat Seed 3d Reconstruction Challenge" on https://helmholtz-data-challenges.de/.

    Per seed there are: (1) a point cloud, reconstructed from 36 images from the turntable setup; (2) 36 projection matrices, projecting point cloud (1) back to the corresponding images from the turntable setup; (3) a top view taken from another camera. We provide raw (1) and preprocessed (5) versions of the point clouds, and preprocessed versions of the images from the turntable setup (4). Raw images from the turntable setup are available in the other data record on https://b2share.fz-juelich.de/, "Deep learning based 3d reconstruction for phenotyping of wheat seeds: dataset (with raw images)" (http://doi.org/10.34730/3541bf71388946b3a3a906fef7aed491).

    Detailed content:
    (1) raw_point_clouds/ -- .ply -- N points, N<70000, x-y-z coordinates, ascii format
    (2) projection_matrices/ -- .txt -- 36x4x4 floats
    (3) 2d_station_images/ -- .png -- 700x700, 24 bit, RGB color
    (4) preprocessed_3d_station_images/ -- .tif -- stabilized and cropped to 373x200, 8 bit, grayscale
    (5) preprocessed_gt_point_clouds/ -- .ply -- 2000 points, x-y-z coordinates, ascii format
    (6) general_files/ -- additional files, files for competition submission

    (1) consists of point clouds, one per seed. (5) consists of point clouds in 36 corresponding poses per seed. Raw 3d station images from the turntable setup were preprocessed, stabilized and resampled, which results in (4). Raw point clouds (1) were preprocessed (resampled) such that each of them contains 2000 points, lying in the fixed set of directions /general_files/directions.csv. Preprocessed point clouds (5) are convertible to triangular mesh using the indices of vertex triplets in /general_files/triangles.txt. A sphere sampled with these vertices and triangles is in /general_files/fibo_msh.ply.

    The test set does not include point clouds, and has 3 views per seed: 0, 120 and 240 degrees. Zero-padded integers XXXX are seed indices (here depicted 0000 and 0003). Zero-padded integers YYY in rotation_YYY are rotation angles in degrees: 0, 10, .. 350 degrees. Indices of the train and test sets are provided in the general_files/indices_train.txt and general_files/indices_test.txt files. Volumes of raw point clouds of the train set are located in general_files/train_gt_volumes.csv.

    Point cloud files .ply (1) and (5) were created with the open3d Python library; the header of each file is:

    format ascii 1.0
    element vertex ?????
    property float x
    property float y
    property float z
    end_header

    These .ply files are visualizable with, e.g., MeshLab (https://www.meshlab.net/) in Windows.

    There are two files in this data record: seed_dataset.zip and numpy_arrays.zip. The file seed_dataset.zip contains the data, where each file is presented separately. The file numpy_arrays.zip contains numerical arrays, aggregating these separate files into multidimensional arrays readable with the Python library numpy as np.load('file_name.npy'). Structure of the data after extraction of the numpy_arrays.zip into corresponding directories:
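
    Since the .ply files were written with the open3d library, they can be read back the same way. A minimal sketch (the example file name follows the zero-padded seed-index convention described above, but is an assumption):

    import numpy as np
    import open3d as o3d

    # Seed index 0000 is used as an example of the zero-padded naming scheme.
    pcd = o3d.io.read_point_cloud("raw_point_clouds/0000.ply")
    points = np.asarray(pcd.points)  # N x 3 array of x-y-z coordinates
    print(points.shape)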

  16. codeparrot-valid-more-filtering

    • huggingface.co
    Updated Apr 27, 2022
    Cite
    CodeParrot (2022). codeparrot-valid-more-filtering [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 27, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    CodeParrot 🦜 Dataset Cleaned and filtered (validation)

      Dataset Description
    

    A dataset of Python files from Github. It is a more filtered version of the validation split codeparrot-clean-valid of codeparrot-clean. The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:

    files with a mention of "test file" or "configuration… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering.
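
    A minimal loading sketch with the Hugging Face datasets library (the split name "train" is an assumption; streaming avoids downloading the whole split):

    from datasets import load_dataset

    ds = load_dataset("codeparrot/codeparrot-valid-more-filtering",
             split="train", streaming=True)
    print(next(iter(ds)).keys())  # inspect the record fields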

  17. STEAD subsample 4 CDiffSD

    • zenodo.org
    bin
    Updated Apr 30, 2024
    + more versions
    Cite
    Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele Trappolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    STEAD Subsample Dataset for CDiffSD Training

    Overview

    This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` library to open.

    Dataset Files

    The dataset includes the following files:

    • train: Used for both training and validation phases (the validation set is split off from this training data). Contains earthquake ground truth traces.
    • noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.
    • test: Used for the testing phase, structured similarly to train.
    • noise_test: Used for the testing phase, contains noise data for testing.

    Each file is structured to support the training and evaluation of seismic denoising models.

    Data

    The HDF5 files named noise contain two main datasets:

    • traces: This dataset includes N events, each 6000 samples long (the length of the traces). Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).
    • metadata: This dataset contains the names of the traces for each event.

    Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

    • p_arrival: Contains the arrival indices of P-waves, expressed in counts.
    • s_arrival: Contains the arrival indices of S-waves, also expressed in counts.


    Usage

    To load these files in a Python environment, use the following approach:

    ```python
    import h5py
    import numpy as np

    # Open the HDF5 file in read mode
    with h5py.File('train_noise.hdf5', 'r') as file:
        # Print all the main keys in the file
        print("Keys in the HDF5 file:", list(file.keys()))

        if 'traces' in file:
            # Access the dataset
            data = file['traces'][:10]  # Load the first 10 traces

        if 'metadata' in file:
            # Access the dataset
            trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
    ```

    Ensure that the path to the file is correctly specified relative to your Python script.

    Requirements

    To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:

    ```bash
    pip install numpy
    pip install h5py
    ```

  18. Data from: dtd

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). dtd [Dataset]. https://www.tensorflow.org/datasets/catalog/dtd
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The Describable Textures Dataset (DTD) is an evolving collection of textural images in the wild, annotated with a series of human-centric attributes, inspired by the perceptual properties of textures. This data is made available to the computer vision community for research purposes.

    The "label" of each example is its "key attribute" (see the official website). The official release of the dataset defines a 10-fold cross-validation partition. Our TRAIN/TEST/VALIDATION splits are those of the first fold.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('dtd', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/dtd-3.0.1.png

  19. JSON dataset för simulerad byggnadsvärmekontroll för system-av-system...

    • b2find.eudat.eu
    Updated Apr 19, 2022
    Cite
    (2022). JSON dataset för simulerad byggnadsvärmekontroll för system-av-system interoperabilitet - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/442bc87f-092d-57d9-a2f0-ba1c7e049d36
    Explore at:
    Dataset updated
    Apr 19, 2022
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see the attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.

    The dataset contains simulated service data for system-of-systems interoperability research. For more information, see the attached documentation and the English catalog page. Building temperature simulation.

  20. Astur Apple image Dataset

    • zenodo.org
    • portalinvestigacion.uniovi.es
    • +1more
    bin, txt
    Updated Oct 12, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Menendez Díaz, Agustín; Silverio García-Cortés; José Alberto Oliveira Prendes; Antonio Bello-García (2022). Astur Apple image Dataset [Dataset]. http://doi.org/10.5281/zenodo.7188796
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    Oct 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Menendez Díaz, Agustín; Silverio García-Cortés; José Alberto Oliveira Prendes; Antonio Bello-García
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structure of file: 2022-v2-9classes-AsturApple-Balanced-SIZE224-train-dev-test.hdf5
    -------------------------------------------------------------------------------------------
    This file contains a dataset of 6108 cider apple color images, 224x224 pixels, supporting the results of the article "Transfer learning with convolutional neural networks for classification of cider apple varieties".

    - The images belong to nine apple classes: 'BLANQUINA', 'CARRIO', 'FLORINA', 'FUENTES', 'PRIETA', 'RAXAO', 'REINETA ENCARNADA', 'REINETA PINTA', 'REINETA ROJA DEL CANADA'
    - The full dataset is split into 4886 images for training, 611 for testing and 611 for validation.
    - Class labels are encoded as one-hot binary labels. (e.g. [0 0 0 0 0 1 0] ---> Reineta Pinta)
    - The training image set is stored as tensor "trainX": (4168,224,224,3)
    - Training class labels "trainY": (4886,7)
    - Test image set tensor "testX": (611, 224,224,3)
    - Test image binary class labels "testY": (611, 7)
    - Validation image set tensor "devX": (611, 224,224,3)
    - Validation binary class labels "devY": (611, 7)

    * The file was created with the h5py module in Python.
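
    Since the file was created with h5py, a minimal loading sketch (the dataset key names are taken from the description above; verify them against list(f.keys())):

    import h5py

    with h5py.File("2022-v2-9classes-AsturApple-Balanced-SIZE224-train-dev-test.hdf5", "r") as f:
        print(list(f.keys()))    # verify the actual dataset names
        trainX = f["trainX"][:]  # image tensor, e.g. (N, 224, 224, 3)
        trainY = f["trainY"][:]  # one-hot encoded labels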
