46 datasets found
  1. Data Cleaning, Translation & Split of the Dataset for the Automatic...

    • zenodo.org
    bin, csv +1
    Updated Apr 24, 2025
    Cite
    Juliane Köhler (2025). Data Cleaning, Translation & Split of the Dataset for the Automatic Classification of Documents for the Classification System for the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft [Dataset]. http://doi.org/10.5281/zenodo.6957842
    Explore at:
    Available download formats: text/x-python, csv, bin
    Dataset updated
    Apr 24, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Juliane Köhler
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • Cleaned_Dataset.csv – The combined CSV files of all scraped documents from DABI, e-LiS, o-bib and Springer.
    • Data_Cleaning.ipynb – The Jupyter Notebook with Python code for the analysis and cleaning of the original dataset.
    • ger_train.csv – The German training set as CSV file.
    • ger_validation.csv – The German validation set as CSV file.
    • en_test.csv – The English test set as CSV file.
    • en_train.csv – The English training set as CSV file.
    • en_validation.csv – The English validation set as CSV file.
    • splitting.py – The Python code for splitting a dataset into train, test and validation sets (a hedged sketch of such a split follows this list).
    • DataSetTrans_de.csv – The final German dataset as a CSV file.
    • DataSetTrans_en.csv – The final English dataset as a CSV file.
    • translation.py – The Python code for translating the cleaned dataset.
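
    A hedged sketch of the kind of split performed by splitting.py (the 80/10/10 ratios and output file names here are illustrative assumptions, not taken from the dataset):

    # Illustrative train/test/validation split with scikit-learn.
    # Ratios (80/10/10) and file names are assumptions for this sketch.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("Cleaned_Dataset.csv")

    # Hold out 20% of the rows, then split that portion evenly
    # into validation and test sets.
    train_df, temp_df = train_test_split(df, test_size=0.2, random_state=42)
    val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

    train_df.to_csv("train.csv", index=False)
    val_df.to_csv("validation.csv", index=False)
    test_df.to_csv("test.csv", index=False)
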
  2. Multimodal Vision-Audio-Language Dataset

    • zenodo.org
    • data.niaid.nih.gov
    Updated Jul 11, 2024
    Cite
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi (2024). Multimodal Vision-Audio-Language Dataset [Dataset]. http://doi.org/10.5281/zenodo.10060785
    Explore at:
    Dataset updated
    Jul 11, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Timothy Schaumlöffel; Gemma Roig; Bhavin Choksi
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The Multimodal Vision-Audio-Language Dataset is a large-scale dataset for multimodal learning. It contains 2M video clips with corresponding audio and a textual description of the visual and auditory content. The dataset is an ensemble of existing datasets and fills the gap of missing modalities.

    Details can be found in the attached report.

    Annotation

    The annotation files are provided as Parquet files. They can be read using Python with the pandas and pyarrow libraries.

    The split into train, validation and test set follows the split of the original datasets.

    Installation

    pip install pandas pyarrow

    Example

    import pandas as pd
    df = pd.read_parquet('annotation_train.parquet', engine='pyarrow')
    print(df.iloc[0])

    dataset                                         AudioSet
    filename                           train/---2_BBVHAA.mp3
    captions_visual      [a man in a black hat and glasses.]
    captions_auditory       [a man speaks and dishes clank.]
    tags                                            [Speech]

    Description

    The annotation file consists of the following fields:

    filename: Name of the corresponding file (video or audio file)
    dataset: Source dataset associated with the data point
    captions_visual: A list of captions related to the visual content of the video. Can be NaN in case of no visual content
    captions_auditory: A list of captions related to the auditory content of the video
    tags: A list of tags, classifying the sound of a file. It can be NaN if no tags are provided

    Data files

    The raw data files for most datasets are not released due to licensing issues and must be downloaded from the source. However, in case of missing files at the source, we provide them on request. Please contact us at schaumloeffel@em.uni-frankfurt.de

  3. Data from: ManyTypes4Py: A benchmark Python Dataset for Machine...

    • zenodo.org
    • data.europa.eu
    zip
    Updated Aug 24, 2021
    + more versions
    Cite
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios (2021). ManyTypes4Py: A benchmark Python Dataset for Machine Learning-Based Type Inference [Dataset]. http://doi.org/10.5281/zenodo.4571228
    Explore at:
    Available download formats: zip
    Dataset updated
    Aug 24, 2021
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Amir M. Mir; Evaldas Latoskinas; Georgios Gousios
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description
    • The dataset was gathered on Sep. 17th, 2020. It has more than 5.4K Python repositories that are hosted on GitHub. Check the file ManyTypes4PyDataset.spec for repository URLs and their commit SHAs.
    • The dataset is also de-duplicated using the CD4Py tool. The list of duplicate files is provided in the duplicate_files.txt file.
    • All of its Python projects are processed into JSON-formatted files. They contain a seq2seq representation of each file, type-related hints, and information for machine learning models. The structure of the JSON-formatted files is described in the JSONOutput.md file.
    • The dataset is split into train, validation and test sets by source code files. The list of files and their corresponding sets is provided in the dataset_split.csv file.
    • Notable changes to each version of the dataset are documented in CHANGELOG.md.
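
    As a quick way to inspect the split (the exact column layout of dataset_split.csv is not documented in this listing, so verify it before relying on it):

    import pandas as pd

    # Column names are unknown here; print the first rows to see the layout.
    splits = pd.read_csv("dataset_split.csv")
    print(splits.head())
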
  4. Street View House Numbers

    • datasets.activeloop.ai
    deeplake
    Updated Feb 3, 2022
    Cite
    Google, Stanford University (2022). Street View House Numbers [Dataset]. https://datasets.activeloop.ai/docs/ml/datasets/the-street-view-house-numbers-svhn-dataset/
    Explore at:
    Available download formats: deeplake
    Dataset updated
    Feb 3, 2022
    Dataset authored and provided by
    Google, Stanford University
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Dataset funded by
    Google, Stanford University
    Description

    The Street View House Numbers (SVHN) dataset is a dataset of 604,300 images of house numbers taken from Google Street View. The dataset is split into a training set of 73,257 images, a test set of 26,032 images, and a validation set of 50,113 images. The images in the dataset are all 32 x 32 pixels in size and are in RGB color. The dataset is used to train and evaluate machine learning models for the task of digit recognition.
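
    The deeplake format can be read with Activeloop's deeplake library. A minimal sketch, assuming the dataset is published under the usual hub://activeloop namespace and exposes an images tensor (both are assumptions; check ds.tensors):

    import deeplake

    # Dataset path and tensor name are assumptions based on Activeloop's
    # usual naming conventions; consult the docs page cited above.
    ds = deeplake.load("hub://activeloop/svhn-train")
    print(len(ds))                # number of samples
    image = ds.images[0].numpy()  # first image as a NumPy array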

  5. hate_speech_dataset

    • huggingface.co
    Updated Jul 27, 2024
    + more versions
    Cite
    Christina Christodoulou (2024). hate_speech_dataset [Dataset]. https://huggingface.co/datasets/christinacdl/hate_speech_dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Jul 27, 2024
    Authors
    Christina Christodoulou
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    32,579 texts in total: 14,012 NOT hateful texts and 18,567 HATEFUL texts. All duplicate values were removed. The data was split using sklearn into 80% train and 20% temporary test (stratified by label); the temporary test set was then split 50/50 into test and validation sets (stratified by label), giving an 80/10/10 split overall. Train set label distribution: 0 ==> 11,210, 1 ==> 14,853, 26,063 in total. Validation set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. Test set label distribution: 0 ==> 1,401, 1 ==> 1,857, 3,258 in total. See the full description on the dataset page: https://huggingface.co/datasets/christinacdl/hate_speech_dataset.
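
    A minimal loading sketch with the Hugging Face datasets library (the split names are assumed to follow the 80/10/10 description above):

    from datasets import load_dataset

    ds = load_dataset("christinacdl/hate_speech_dataset")
    print(ds)  # shows the available splits and their sizes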

  6. ref_coco

    • tensorflow.org
    • opendatalab.com
    Updated May 31, 2024
    Cite
    (2024). ref_coco [Dataset]. https://www.tensorflow.org/datasets/catalog/ref_coco
    Explore at:
    Dataset updated
    May 31, 2024
    Description

    A collection of 3 referring expression datasets based on images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.

    RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.

    Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".

    Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):

    dataset   partition  split  refs   images
    refcoco   google     train  40000  19213
    refcoco   google     val     5000   4559
    refcoco   google     test    5000   4527
    refcoco   unc        train  42404  16994
    refcoco   unc        val     3811   1500
    refcoco   unc        testA   1975    750
    refcoco   unc        testB   1810    750
    refcoco+  unc        train  42278  16992
    refcoco+  unc        val     3805   1500
    refcoco+  unc        testA   1975    750
    refcoco+  unc        testB   1798    750
    refcocog  google     train  44822  24698
    refcocog  google     val     5000   4650
    refcocog  umd        train  42226  21899
    refcocog  umd        val     2573   1300
    refcocog  umd        test    5023   2600

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('ref_coco', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png

  7. Tour Recommendation Model

    • test.researchdata.tuwien.at
    bin, png +1
    Updated May 14, 2025
    Cite
    Muhammad Mobeel Akbar (2025). Tour Recommendation Model [Dataset]. http://doi.org/10.70124/akpf6-8p175
    Explore at:
    Available download formats: text/markdown, png, bin
    Dataset updated
    May 14, 2025
    Dataset provided by
    TU Wien
    Authors
    Muhammad Mobeel Akbar
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 28, 2025
    Description

    Dataset Description for Tour Recommendation Model

    Context and Methodology:

    • Research Domain/Project:
      This dataset is part of the Tour Recommendation System project, which focuses on predicting user preferences and ratings for various tourist places and events. It belongs to the field of Machine Learning, specifically applied to Recommender Systems and Predictive Analytics.

    • Purpose:
      The dataset serves as the training and evaluation data for a Decision Tree Regressor model, which predicts ratings (from 1-5) for different tourist destinations based on user preferences. The model can be used to recommend places or events to users based on their predicted ratings.

    • Creation Methodology:
      The dataset was originally collected from a tourism platform where users rated various tourist places and events. The data was preprocessed to remove missing or invalid entries (such as #NAME? in rating columns). It was then split into subsets for training, validation, and testing the model.

    Technical Details:

    • Structure of the Dataset:
      The dataset is stored as a CSV file (user_ratings_dataset.csv) and contains the following columns:

      • place_or_event_id: Unique identifier for each tourist place or event.

      • rating: Rating given by the user, ranging from 1 to 5.

      The data is split into three subsets:

      • Training Set: 80% of the dataset used to train the model.

      • Validation Set: A small portion used for hyperparameter tuning.

      • Test Set: 20% used to evaluate model performance.

    • Folder and File Naming Conventions:
      The dataset files are stored in the following structure:

      • user_ratings_dataset.csv: The original dataset file containing user ratings.

      • tour_recommendation_model.pkl: The saved model after training.

      • actual_vs_predicted_chart.png: A chart comparing actual and predicted ratings.

    • Software Requirements:
      To open and work with this dataset, the following software and libraries are required:

      • Python 3.x

      • Pandas for data manipulation

      • Scikit-learn for training and evaluating machine learning models

      • Matplotlib for chart generation

      • Joblib for saving and loading the trained model

      The dataset can be opened and processed using any Python environment that supports these libraries.
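
    As a hedged sketch of the workflow described above (the two column names come from the description and the split ratio matches it, but the model hyperparameters are illustrative assumptions):

    import pandas as pd
    import joblib
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    # Columns per the description: place_or_event_id, rating.
    # This sketch assumes numeric ids; categorical ids would need encoding.
    df = pd.read_csv("user_ratings_dataset.csv")
    X = df[["place_or_event_id"]]
    y = df["rating"]

    # 80/20 train/test split as described; a validation slice for
    # hyperparameter tuning would be carved out of the training portion.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = DecisionTreeRegressor(max_depth=5)  # depth is an illustrative choice
    model.fit(X_train, y_train)
    print("R^2 on test set:", model.score(X_test, y_test))

    joblib.dump(model, "tour_recommendation_model.pkl")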

    • Additional Resources:

      • The model training code, README file, and performance chart are available in the project repository.

      • For detailed explanation and code, please refer to the GitHub repository (or any other relevant link for the code).

    Further Details:

    • Dataset Reusability:
      The dataset is structured for easy use in training machine learning models for recommendation systems. Researchers and practitioners can utilize it to:

      • Train other types of models (e.g., regression, classification).

      • Experiment with different features or add more metadata to enrich the dataset.

    • Data Integrity:
      The dataset has been cleaned and preprocessed to remove invalid values (such as #NAME? or missing ratings). However, users should ensure they understand the structure and the preprocessing steps taken before reusing it.

    • Licensing:
      The dataset is provided under the CC BY 4.0 license, which allows free usage, distribution, and modification, provided that proper attribution is given.

  8. Rescaled CIFAR-10 dataset

    • explore.openaire.eu
    • zenodo.org
    Updated Apr 10, 2025
    Cite
    Andrzej Perzanowski; Tony Lindeberg (2025). Rescaled CIFAR-10 dataset [Dataset]. http://doi.org/10.5281/zenodo.15188748
    Explore at:
    Dataset updated
    Apr 10, 2025
    Authors
    Andrzej Perzanowski; Tony Lindeberg
    Description

    Motivation

    The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.

    The Rescaled CIFAR-10 dataset was introduced in the paper:

    [1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, to appear.

    with a pre-print available at arXiv:

    [2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", arXiv preprint arXiv:2409.11140.

    Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:

    [3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2

    and is therefore significantly more challenging.

    Access and rights

    The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:

    [4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.

    and also for this new rescaled version, using the reference [1] above. The dataset is made available on request. If you would be interested in trying out this dataset, please make a request in the system below, and we will grant you access as soon as possible.

    The dataset

    The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64×64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].

    There are 10 distinct classes in the dataset: "airplane", "automobile", "bird", "cat", "deer", "dog", "frog", "horse", "ship" and "truck". In the dataset, these are represented by integer labels in the range [0, 9].

    The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.

    The h5 files containing the dataset

    The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:

    cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5

    Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:

    cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
    cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5

    These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].

    Instructions for loading the data set

    The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as ('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.

    The training dataset can be loaded in Python as:

    import h5py
    import numpy as np

    with h5py.File(``, 'r') as f:  # insert the path to the h5 file here
        x_train = np.array(f["/x_train"], dtype=np.float32)
        x_val = np.array(f["/x_val"], dtype=np.float32)
        x_test = np.array(f["/x_test"], dtype=np.float32)
        y_train = np.array(f["/y_train"], dtype=np.int32)
        y_val = np.array(f["/y_val"], dtype=np.int32)
        y_test = np.array(f["/y_test"], dtype=np.int32)

    We also need to permute the data, since Pytorch uses the format [num_samples, channels, width...
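
    The description is truncated above; the NHWC-to-NCHW permutation it alludes to would look like the following sketch (assuming the arrays are stored as [num_samples, height, width, channels], which is an assumption, not stated by the dataset):

    import numpy as np

    # Convert [num_samples, height, width, channels] to PyTorch's
    # [num_samples, channels, height, width].
    x_train = np.transpose(x_train, (0, 3, 1, 2))
    x_val = np.transpose(x_val, (0, 3, 1, 2))
    x_test = np.transpose(x_test, (0, 3, 1, 2))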

  9. wikihow

    • tensorflow.org
    • opendatalab.com
    • +1more
    Updated Dec 6, 2022
    Cite
    (2022). wikihow [Dataset]. https://www.tensorflow.org/datasets/catalog/wikihow
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    WikiHow is a new large-scale dataset using the online WikiHow (http://www.wikihow.com/) knowledge base.

    There are two features:
    - text: WikiHow answer texts.
    - headline: bold lines as summary.

    There are two separate versions:
    - all: consisting of the concatenation of all paragraphs as the articles and the bold lines as the reference summaries.
    - sep: consisting of each paragraph and its summary.

    Download "wikihowAll.csv" and "wikihowSep.csv" from https://github.com/mahnazkoupaee/WikiHow-Dataset and place them in the manual download folder (see https://www.tensorflow.org/datasets/api_docs/python/tfds/download/DownloadConfig). Train/validation/test splits are provided by the authors. Preprocessing is applied to remove short articles (abstract length < 0.75 article length) and clean up extra commas.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikihow', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  10. cinic-10 Lance Dataset

    • kaggle.com
    Updated Apr 25, 2024
    Cite
    Vipul Maheshwari (2024). cinic-10 Lance Dataset [Dataset]. https://www.kaggle.com/datasets/vipulmaheshwarii/cinic-10-lance-dataset
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 25, 2024
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Vipul Maheshwari
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    🔴 NOTE: USE VERSION 2

    This is the CINIC-10 dataset's train, validation, and test splits saved in the Lance file format for blazing fast and memory-efficient I/O. This dataset only includes data necessary for image classification tasks.

    1. For detailed information on how the dataset was created, refer to the paper describing the CINIC-10 dataset
    2. For instructions on how to create Lance data for any image dataset, check out this detailed blog post which shows how you can use a single script for any image dataset and convert it to lance format

    Instructions for using this dataset

    This dataset is provided as a single zip file containing the Lance-formatted data for the train, validation, and test splits. To use this dataset, follow these steps:

    1. Download the dataset through this page, then move the unzipped files to a relevant folder.
    2. In your code, use the datasets by creating LanceDataset objects and passing the respective paths:

    import lance
    train_lance = lance.dataset('cinic/cinic_train.lance')
    test_lance = lance.dataset('cinic/cinic_test.lance')
    val_lance = lance.dataset('cinic/cinic_val.lance')
    

    Note that the Lance file format provides blazing-fast and memory-efficient I/O, allowing you to work with large datasets without running into memory issues. Refer to the documentation for more information on how to use the Lance library.

  11. Data from: JSON Dataset of Simulated Building Heat Control for System of...

    • researchdata.se
    • gimi9.com
    Updated Mar 21, 2025
    Cite
    Jacob Nilsson (2025). JSON Dataset of Simulated Building Heat Control for System of Systems Interoperability [Dataset]. http://doi.org/10.5878/e5hb-ne80
    Explore at:
    Available download formats: (438755370), (110041420), (156812), (5417)
    Dataset updated
    Mar 21, 2025
    Dataset provided by
    Luleå University of Technology
    Authors
    Jacob Nilsson
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Area covered
    Luleå Municipality
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset, the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON-messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software, it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.
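
    Following that recommendation, a minimal loading sketch (the 10% validation fraction is an illustrative choice, since no validation set is prescribed):

    import pandas as pd

    # The files are semicolon-separated, as noted above.
    train = pd.read_csv("training.csv", sep=";")
    test = pd.read_csv("test.csv", sep=";")

    # No dedicated validation file exists; sample one from the training data.
    val = train.sample(frac=0.1, random_state=42)
    train = train.drop(val.index)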

    The data file with temperatures (smhi-july-23-29-2018.csv) acts as input for the thermodynamic building simulation found on Github, where it is used to get the outside temperature and corresponding timestamps. Temperature data for Luleå Summer 2018 were downloaded from SMHI.

  12. Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning...

    • narcis.nl
    • data.mendeley.com
    Updated Jan 11, 2021
    + more versions
    Cite
    Yoo, T (via Mendeley Data) (2021). Data for "Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics" [Dataset]. http://doi.org/10.17632/ffn745r57z.2
    Explore at:
    Dataset updated
    Jan 11, 2021
    Dataset provided by
    Data Archiving and Networked Services (DANS)
    Authors
    Yoo, T (via Mendeley Data)
    Description

    Prediction of Phakic Intraocular Lens Vault Using Machine Learning of Anterior Segment Optical Coherence Tomography Metrics. Authors: Kazutaka Kamiya, MD, PhD, Ik Hee Ryu, MD, MS, Tae Keun Yoo, MD, Jung Sub Kim MD, In Sik Lee, MD, PhD, Jin Kook Kim MD, Wakako Ando CO, Nobuyuki Shoji, MD, PhD, Tomofusa, Yamauchi, MD, PhD, Hitoshi Tabuchi, MD, PhD.

    We hypothesize that machine learning of preoperative biometric data obtained by the As-OCT may be clinically beneficial for predicting the actual ICL vault. Therefore, we built the machine learning model using Random Forest to predict ICL vault after surgery.

    This multicenter study comprised one thousand seven hundred forty-five eyes of 1745 consecutive patients (656 men and 1089 women), who underwent EVO ICL implantation (V4c and V5 Visian ICL with KS-AquaPORT) for the correction of moderate to high myopia and myopic astigmatism, and who completed at least a 1-month follow-up, at Kitasato University Hospital (Kanagawa, Japan), or at B&VIIT Eye Center (Seoul, Korea).

    This data file (RFR_model(feature=12).mat) is the final trained random forest model for MATLAB 2020a.

    Python version:

    # Imports
    from sklearn.model_selection import train_test_split
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import RandomForestRegressor

    # Connect data in your Google Drive
    from google.colab import auth
    auth.authenticate_user()
    from google.colab import drive
    drive.mount('/content/gdrive')

    # Change the path for the custom data.
    # In this case, we used ICL vault prediction using preop measurements.
    dataset = pd.read_csv('gdrive/My Drive/ICL/data_icl.csv')
    dataset.head()

    # Optimal features (sorted by importance):
    # 1. ICL size  2. ICL power  3. LV  4. CLR  5. ACD  6. ATA
    # 7. MSE  8. Age  9. Pupil size  10. WTW  11. CCT  12. ACW
    y = dataset['Vault_1M']
    X = dataset.drop(['Vault_1M'], axis=1)

    # Split the dataset into train and test data, if necessary.
    # For example, we can split the data 8:2 as a simple validation test.
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=0)

    # In our study, we already defined the training (B&VIIT Eye Center, n=1455)
    # and test (Kitasato University, n=290) datasets, so this code was not
    # necessary to perform our analysis.

    # Optimal parameter search could be performed in this section.
    parameters = {'bootstrap': True,
                  'min_samples_leaf': 3,
                  'n_estimators': 500,
                  'criterion': 'mae',
                  'min_samples_split': 10,
                  'max_features': 'sqrt',
                  'max_depth': 6,
                  'max_leaf_nodes': None}

    RF_model = RandomForestRegressor(**parameters)
    RF_model.fit(train_X, train_y)
    RF_predictions = RF_model.predict(test_X)
    importance = RF_model.feature_importances_

  13. speech_commands

    • tensorflow.org
    • datasets.activeloop.ai
    • +1more
    Updated Jan 13, 2023
    Cite
    (2023). speech_commands [Dataset]. http://identifiers.org/arxiv:1804.03209
    Explore at:
    Dataset updated
    Jan 13, 2023
    Description

    An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments under the "background_noise" folder. Here we split this background noise into 1-second clips, and also keep one of the files for the validation set.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('speech_commands', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  14. cf-cpp-to-python-code-generation

    • huggingface.co
    Updated Jul 20, 2025
    Cite
    Hesam Haddad (2025). cf-cpp-to-python-code-generation [Dataset]. https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation
    Explore at:
    Dataset updated
    Jul 20, 2025
    Authors
    Hesam Haddad
    Description

    Dataset

    The cf-llm-finetune project uses a synthetic parallel dataset built from Codeforces submissions and problems. C++ ICPC-style solutions are filtered, cleaned, and paired with problem statements to generate Python translations using GPT-4.1, creating a fine-tuning dataset for code translation. The final dataset consists of C++ solutions from 2,000 unique problems and synthetic Python answers, split into train (1,400), validation (300), and test (300) sets. For details on dataset… See the full description on the dataset page: https://huggingface.co/datasets/demoversion/cf-cpp-to-python-code-generation.
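
    A minimal loading sketch with the Hugging Face datasets library (the record fields are not documented in this listing):

    from datasets import load_dataset

    ds = load_dataset("demoversion/cf-cpp-to-python-code-generation")
    print(ds)  # expected per the description: train (1,400), validation (300), test (300)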

  15. Deep learning based 3d reconstruction for phenotyping of wheat seeds:...

    • b2find.eudat.eu
    Updated Aug 23, 2023
    Cite
    (2023). Deep learning based 3d reconstruction for phenotyping of wheat seeds: dataset - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/80efe038-164b-50c8-ac8d-19605da5d4ea
    Explore at:
    Dataset updated
    Aug 23, 2023
    Description

    We present a new data set for 3d wheat seed reconstruction, propose a challenge "Wheat Seed 3d Reconstruction Challenge", and provide baseline methods [1]. The dataset consists of 2964 seeds, split into 2520 seeds for training/validation and 444 for testing. Ground truth data for the test set is not provided; however, test results can be evaluated in the "Wheat Seed 3d Reconstruction Challenge" on https://helmholtz-data-challenges.de/.

    Per seed there are: (1) a point cloud, reconstructed from 36 images from the turntable setup; (2) 36 projection matrices, projecting point cloud (1) back to the corresponding images from the turntable setup; (3) a top view taken from another camera. We provide raw (1) and preprocessed (5) versions of the point clouds, and preprocessed versions of the images from the turntable setup (4). Raw images from the turntable setup are available in the other data record on https://b2share.fz-juelich.de/, "Deep learning based 3d reconstruction for phenotyping of wheat seeds: dataset (with raw images)" (http://doi.org/10.34730/3541bf71388946b3a3a906fef7aed491).

    Detailed content:
    (1) raw_point_clouds/ -- .ply -- N points, N<70000, x-y-z coordinates, ascii format
    (2) projection_matrices/ -- .txt -- 36x4x4 floats
    (3) 2d_station_images/ -- .png -- 700x700, 24 bit, RGB color
    (4) preprocessed_3d_station_images/ -- .tif -- stabilized and cropped to 373x200, 8 bit, grayscale
    (5) preprocessed_gt_point_clouds/ -- .ply -- 2000 points, x-y-z coordinates, ascii format
    (6) general_files/ -- additional files, files for competition submission

    (1) consists of point clouds, one per seed. (5) consists of point clouds in 36 corresponding poses per seed. Raw 3d station images from the turntable setup were preprocessed, stabilized and resampled, which results in (4). Raw point clouds (1) were preprocessed (resampled) such that each of them contains 2000 points, lying in the fixed set of directions /general_files/directions.csv. Preprocessed point clouds (5) are convertible to triangular mesh using the indices of vertex triplets in /general_files/triangles.txt. A sphere sampled with these vertices and triangles is in /general_files/fibo_msh.ply.

    The test set does not include point clouds, and has 3 views per seed: 0, 120 and 240 degrees. Zero-padded integers XXXX are seed indices (here depicted 0000 and 0003). Zero-padded integers YYY in rotation_YYY are rotation angles in degrees: 0, 10, .. 350 degrees. Indices of the train and test sets are provided in the general_files/indices_train.txt and general_files/indices_test.txt files. Volumes of raw point clouds of the train set are located in general_files/train_gt_volumes.csv.

    Point cloud files .ply (1) and (5) were created with the open3d Python library; the header of each file is:

    format ascii 1.0
    element vertex ?????
    property float x
    property float y
    property float z
    end_header

    These .ply files are visualizable with, e.g., MeshLab (https://www.meshlab.net/) in Windows.

    There are two files in this data record: seed_dataset.zip and numpy_arrays.zip. The file seed_dataset.zip contains the data, where each file is presented separately. The file numpy_arrays.zip contains numerical arrays, aggregating these separate files into multidimensional arrays readable with the Python library numpy as np.load('file_name.npy'). Structure of the data after extraction of the numpy_arrays.zip into corresponding directories:
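
    Since the .ply files were written with the open3d library, they can be read back the same way. A minimal sketch (the example file name follows the zero-padded seed-index convention described above, but is an assumption):

    import numpy as np
    import open3d as o3d

    # Seed index 0000 is used as an example of the zero-padded naming scheme.
    pcd = o3d.io.read_point_cloud("raw_point_clouds/0000.ply")
    points = np.asarray(pcd.points)  # N x 3 array of x-y-z coordinates
    print(points.shape)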

  16. codeparrot-valid-more-filtering

    • huggingface.co
    Updated Apr 27, 2022
    Cite
    CodeParrot (2022). codeparrot-valid-more-filtering [Dataset]. https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering
    Explore at:
    Croissant. Croissant is a format for machine-learning datasets; learn more at mlcommons.org/croissant.
    Dataset updated
    Apr 27, 2022
    Dataset provided by
    Good Engineering, Inc
    Authors
    CodeParrot
    Description

    CodeParrot 🦜 Dataset Cleaned and filtered (validation)

      Dataset Description
    

    A dataset of Python files from Github. It is a more filtered version of the validation split codeparrot-clean-valid of codeparrot-clean. The additional filters aim at detecting configuration and test files, as well as outlier files that are unlikely to help the model learn code. The first three filters are applied with a probability of 0.7:

    files with a mention of "test file" or "configuration… See the full description on the dataset page: https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering.
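
    A minimal loading sketch with the Hugging Face datasets library (the split name "train" is an assumption; streaming avoids downloading the whole split):

    from datasets import load_dataset

    ds = load_dataset("codeparrot/codeparrot-valid-more-filtering",
             split="train", streaming=True)
    print(next(iter(ds)).keys())  # inspect the record fields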

  17. STEAD subsample 4 CDiffSD

    • zenodo.org
    bin
    Updated Apr 30, 2024
    + more versions
    Cite
    Daniele Trappolini (2024). STEAD subsample 4 CDiffSD [Dataset]. http://doi.org/10.5281/zenodo.11094536
    Explore at:
    Available download formats: bin
    Dataset updated
    Apr 30, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Daniele Trappolini
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Time period covered
    Apr 15, 2024
    Description

    STEAD Subsample Dataset for CDiffSD Training

    Overview

    This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` library to open.

    Dataset Files

    The dataset includes the following files:

    • train: Used for both training and validation phases (the validation set is split off from this training data). Contains earthquake ground truth traces.
    • noise_train: Used for both training and validation phases. Contains noise used to contaminate the traces.
    • test: Used for the testing phase, structured similarly to train.
    • noise_test: Used for the testing phase, contains noise data for testing.

    Each file is structured to support the training and evaluation of seismic denoising models.

    Data

    The HDF5 files named noise contain two main datasets:

    • traces: This dataset includes N events, each 6000 samples long (the length of the traces). Each trace is organized into three channels in the following order: E (East-West), N (North-South), Z (Vertical).
    • metadata: This dataset contains the names of the traces for each event.

    Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:

    • p_arrival: Contains the arrival indices of P-waves, expressed in counts.
    • s_arrival: Contains the arrival indices of S-waves, also expressed in counts.


    Usage

    To load these files in a Python environment, use the following approach:

    ```python
    import h5py
    import numpy as np

    # Open the HDF5 file in read mode
    with h5py.File('train_noise.hdf5', 'r') as file:
        # Print all the main keys in the file
        print("Keys in the HDF5 file:", list(file.keys()))

        if 'traces' in file:
            # Access the dataset
            data = file['traces'][:10]  # Load the first 10 traces

        if 'metadata' in file:
            # Access the dataset
            trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
    ```

    Ensure that the path to the file is correctly specified relative to your Python script.

    Requirements

    To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:

    ```bash
    pip install numpy
    pip install h5py
    ```

  18. Data from: dtd

    • tensorflow.org
    Updated Dec 6, 2022
    + more versions
    Cite
    (2022). dtd [Dataset]. https://www.tensorflow.org/datasets/catalog/dtd
    Explore at:
    Dataset updated
    Dec 6, 2022
    Description

    The Describable Textures Dataset (DTD) is an evolving collection of textural images in the wild, annotated with a series of human-centric attributes, inspired by the perceptual properties of textures. This data is made available to the computer vision community for research purposes.

    The "label" of each example is its "key attribute" (see the official website). The official release of the dataset defines a 10-fold cross-validation partition. Our TRAIN/TEST/VALIDATION splits are those of the first fold.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('dtd', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/dtd-3.0.1.png

  19. JSON dataset för simulerad byggnadsvärmekontroll för system-av-system...

    • b2find.eudat.eu
    Updated Apr 19, 2022
    Cite
    (2022). JSON dataset för simulerad byggnadsvärmekontroll för system-av-system interoperabilitet - Dataset - B2FIND [Dataset]. https://b2find.eudat.eu/dataset/442bc87f-092d-57d9-a2f0-ba1c7e049d36
    Explore at:
    Dataset updated
    Apr 19, 2022
    Description

    Interoperability in systems-of-systems is a difficult problem due to the abundance of data standards and formats. Current approaches to interoperability rely on hand-made adapters or methods using ontological metadata. This dataset was created to facilitate research on data-driven interoperability solutions. The data comes from a simulation of a building heating system, and the messages sent within control systems-of-systems. For more information see the attached data documentation.

    The data comes in two semicolon-separated (;) csv files, training.csv and test.csv. The train/test split is not random; training data comes from the first 80% of simulated timesteps, and the test data is the last 20%. There is no specific validation dataset; the validation data should instead be randomly selected from the training data. The simulation runs for as many time steps as there are outside temperature values available. The original SMHI data only samples once every hour, which we linearly interpolate to get one temperature sample every ten seconds. The data saved at each time step consists of 34 JSON messages (four per room and two temperature readings from the outside), 9 temperature values (one per room and outside), 8 setpoint values, and 8 actuator outputs. The data associated with each of those 34 JSON messages is stored as a single row in the tables. This means that much data is duplicated, a choice made to make it easier to use the data.

    The simulation data is not meant to be opened and analyzed in spreadsheet software; it is meant for training machine learning models. It is recommended to open the data with the pandas library for Python, available at https://pypi.org/project/pandas/.

    The dataset contains simulated service data for system-of-systems interoperability research. For more information, see the attached documentation and the English catalog page. Building temperature simulation.

  20. Astur Apple image Dataset

    • zenodo.org
    • portalinvestigacion.uniovi.es
    • +1more
    bin, txt
    Updated Oct 12, 2022
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Menendez Díaz, Agustín; Silverio García-Cortés; José Alberto Oliveira Prendes; Antonio Bello-García (2022). Astur Apple image Dataset [Dataset]. http://doi.org/10.5281/zenodo.7188796
    Explore at:
    Available download formats: txt, bin
    Dataset updated
    Oct 12, 2022
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Menendez Díaz, Agustín; Silverio García-Cortés; José Alberto Oliveira Prendes; Antonio Bello-García
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Structure of file: 2022-v2-9classes-AsturApple-Balanced-SIZE224-train-dev-test.hdf5
    -------------------------------------------------------------------------------------------
    This file contains a dataset of 6108 cider apple color images, 224x224 pixels, supporting the results of the article "Transfer learning with convolutional neural networks for classification of cider apple varieties".

    - The images belong to nine apple classes: 'BLANQUINA', 'CARRIO', 'FLORINA', 'FUENTES', 'PRIETA', 'RAXAO', 'REINETA ENCARNADA', 'REINETA PINTA', 'REINETA ROJA DEL CANADA'
    - The full dataset is split into 4886 images for training, 611 for testing and 611 for validation.
    - Class labels are encoded as one-hot binary labels. (e.g. [0 0 0 0 0 1 0] ---> Reineta Pinta)
    - The training image set is stored as tensor "trainX": (4168,224,224,3)
    - Training class labels "trainY": (4886,7)
    - Test image set tensor "testX": (611, 224,224,3)
    - Test image binary class labels "testY": (611, 7)
    - Validation image set tensor "devX": (611, 224,224,3)
    - Validation binary class labels "devY": (611, 7)

    * The file was created with the h5py module in Python.
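
    Since the file was created with h5py, a minimal loading sketch (the dataset key names are taken from the description above; verify them against list(f.keys())):

    import h5py

    with h5py.File("2022-v2-9classes-AsturApple-Balanced-SIZE224-train-dev-test.hdf5", "r") as f:
        print(list(f.keys()))    # verify the actual dataset names
        trainX = f["trainX"][:]  # image tensor, e.g. (N, 224, 224, 3)
        trainY = f["trainY"][:]  # one-hot encoded labels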
