MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Division:
Unique_query: contains unique queries that are not present in the train, val, or test splits; used for testing how well models handle unseen queries. Train_all: contains the unsplit train, test, and val datapoints. train: train split. val: val split. test: test split.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data for temporal validity change prediction, an NLP task that will be defined in an upcoming publication. The dataset consists of five columns.
The duration labels (context_only_tv, combined_tv) are class indices of the following class distribution:
[no time-sensitive information, less than one minute, 1-5 minutes, 5-15 minutes, 15-45 minutes, 45 minutes - 2 hours, 2-6 hours, more than 6 hours, 1-3 days, 3-7 days, 1-4 weeks, more than one month]
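A minimal sketch (not part of the dataset release) of mapping these class indices back to their duration labels, assuming the classes are ordered exactly as listed above:

DURATION_CLASSES = [
    "no time-sensitive information",
    "less than one minute",
    "1-5 minutes",
    "5-15 minutes",
    "15-45 minutes",
    "45 minutes - 2 hours",
    "2-6 hours",
    "more than 6 hours",
    "1-3 days",
    "3-7 days",
    "1-4 weeks",
    "more than one month",
]

def index_to_label(class_index: int) -> str:
    # Map a context_only_tv / combined_tv class index to its duration label
    return DURATION_CLASSES[class_index]

print(index_to_label(4))  # "15-45 minutes"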
Different dataset splits are provided.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Packet-level classification: classify based on individual packets.
Per-packet-split: mix all packets and split them into train, val, and test sets with an 8:1:1 ratio. Per-flow-split: split the pcap files based on 5-tuples (src_IP, dst_IP, src_port, dst_port, and protocol) using 3-fold validation, so that there is no intersection between the train, val, and test sets.
Flow-level classification: classify based on flows.
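The per-flow split described above can be illustrated with a short sketch; this is not code shipped with the dataset, and the packet record fields (src_ip, dst_ip, src_port, dst_port, protocol) are assumed names. It groups packets by 5-tuple so whole flows, never individual packets, are assigned to train/val/test:

import random
from collections import defaultdict

def five_tuple(pkt):
    # pkt is assumed to be a dict with these (hypothetical) keys
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["protocol"])

def per_flow_split(packets, seed=0):
    # Group packets into flows keyed by their 5-tuple
    flows = defaultdict(list)
    for pkt in packets:
        flows[five_tuple(pkt)].append(pkt)

    keys = list(flows)
    random.Random(seed).shuffle(keys)

    # Simple hold-out variant; the dataset itself uses 3-fold validation,
    # but the key property is the same: flows are partitioned, so no flow
    # appears in more than one of train/val/test.
    n = len(keys)
    train_keys = keys[:int(0.8 * n)]
    val_keys = keys[int(0.8 * n):int(0.9 * n)]
    test_keys = keys[int(0.9 * n):]

    def gather(ks):
        return [p for k in ks for p in flows[k]]

    return gather(train_keys), gather(val_keys), gather(test_keys)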
License: https://choosealicense.com/licenses/unknown/
dataset_info:
  features:
    - name: image
      dtype: image
    - name: question
      dtype: string
    - name: caption
      dtype: string
  splits:
    - name: train
      num_bytes: 1572864
      num_examples: 40
    - name: test
      num_bytes: 764825.6
      num_examples: 20
    - name: val
      num_bytes: 961740.8
      num_examples: 20
configs:
  - data_files:
      - split: train
        path: data/train
      - split: test
        path: data/test
      - split: val
        path: data/val
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707. Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions Funding: These data were collected as part of research funded by:
NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A) This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here.
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx: This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270287
Fields:
filename: Root ID (Field type: id)
camera_trap_site: Site ID for the camera trap location (Field type: location)
taxon: Taxon recorded by camera trap (Field type: taxa)
dist_level: Level of disturbance at site (Field type: ordered categorical)
baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_1 to dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training or test set for the single disturbance level (1-5) named in the field, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_1_2 to dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training set for the pair of disturbance levels named in the field (all pairs drawn from levels 1-5), or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_2_3 to dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training set for the triple of disturbance levels named in the field, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_1_2_3_4 to dist_combined_event_level_quad_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training set for the four disturbance levels named in the field, or not included (NA) (Field type: categorical)
dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance
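As a usage sketch (not part of the dataset), the composition table above can be read with pandas; the worksheet and column names come from the field list, while the 'train' value used for filtering is an assumption about how the split labels are encoded:

import pandas as pd

# Read the Dataset_images worksheet from the metadata workbook
images = pd.read_excel("CT_image_data_info2.xlsx", sheet_name="Dataset_images")

# Select the images assigned to the baseline training set ('train' value assumed)
baseline_train = images[images["baseline"] == "train"]
print(len(baseline_train), "baseline training images")
print(baseline_train[["filename", "camera_trap_site", "taxon"]].head())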
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2-S model with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each with 16 sequenced 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1. Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples and applying 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data). We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, which we called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.
GitHub version: The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.
Folders Structure: In the root directory of the project, we have two folders:
magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes. However, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
Seq_Magnetogram: contains the references to the source images with the corresponding labels for the next 24 h and 48 h
in the M24 and M48 sub-folders, respectively.
M24/M48: both present the following sub-folder structure: Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test.
There are also two files in the root: inst_packages.sh, which installs the packages and dependencies needed to run the models, and download_MViTS.py, which downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.
The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify the head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file contains a sequence of images pointing into the magnetogram_jpg folder.
All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint (sample-FLARE...ckpt) files. Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.
Naming pattern for the files:
magnetogram_jpg follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16 follows the format "hmi.sharp_720s...to.", where:
hmi: the instrument that captured the image;
sharp_720s: the database source of SDO/HMI;
: the identification of the SHARP region, which can contain one or more solar ARs classified by NOAA;
: the date-time the instrument captured the image, in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds);
: the date-time when the sequence starts, following the same format;
: the date-time when the sequence ends, following the same format.
Reference text files in M24 and M48 or inside the SF_MViT... folders follow the format "flare_Mclass_.txt", where:
: Seq16 if it refers to a sequence, or void if it refers directly to images;
: "24h" or "48h";
: "TrainVal" or "Test". The refers to the Train/Val split;
: void, or "_over" after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.
All SF_MViT... folders:
Model training codes: "SF_MViT_M+_", where:
: void, or "oT" (over Train), "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test);
: "24h" or "48h";
: "oneSplit" for a specific split or "allSplits" if it runs all splits;
: void is the default to run on 1 GPU, or "2gpu" to run on 2-GPU systems.
Job submission files: "jobMViT_", where:
: points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).
Temporary inputs: "Seq16_flare_Mclass_.txt", where:
: train or val;
: void, or "_over" after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.
Outputs: "saida_MViT_Adam_10-7", where:
: k0 to k4, the corresponding split of the output, or void if the output is from all splits.
Error files: "err_MViT_Adam_10-7", where:
: k0 to k4, the corresponding split of the error log file, or void if the error file is from all splits.
Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where:
: epoch number of the checkpoint;
: corresponding validation loss;
: 0 to 4.
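As a minimal, hedged sketch (not the authors' training code), the backbone described above can be loaded from torchvision with Kinetics-400 weights and fed a batch shaped like the inputs described here (10 samples × 16 images × 3 channels × 224 × 224, scaled to [0, 1]):

import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# Load MViTv2-S pre-trained on Kinetics-400
model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
model.eval()

# torchvision video models expect (batch, channels, time, height, width)
batch = torch.rand(10, 3, 16, 224, 224)  # dummy data already normalized to [0, 1]

with torch.no_grad():
    logits = model(batch)
print(logits.shape)  # (10, 400) before the head is replaced/fine-tuned for flare classes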
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General description
This dataset contains Zebra Finch decisions about perceptual similarity on song units. All the data and files are used for reproducing the results of the paper 'Bird song comparison using deep learning trained from avian perceptual judgments' by the same authors.
Git repo on Zenodo: https://doi.org/10.5281/zenodo.5545932
Git repo access: https://github.com/veronicamorfi/ml4bl/tree/v1.0.0
Directory organisation:
ML4BL_ZF
|_files
|_Final_probes_20200816.csv - all trials and decisions of the birds (aviary 1 cycle 1 data are removed from experiments)
|_luscinia_triplets_filtered.csv - triplets to use for training
|_mean_std_luscinia_pretraining.pckl - mean and std of luscinia triplets used for training
|_*_cons_* - % side consistency on triplets (train/test) - train set contains both train and val splits
|_*_gt_* - cycle accuracy for triplets of the specific bird (train/test) - train set contains both train and val splits
|_*_trials_* - number of decisions made for a triplet (train/test) - train set contains both train and val splits
|_*_triplets_* - triplet information (aviary_cycle-acc_birdID, POS, NEG, ANC) (train/test) - train set contains both train and val splits
|_*_low*_ - low-margin (ambiguous) triplets (train/val/test)
|_*_high_ - high-margin (unambiguous) triplets (train/val/test)
|_*_cycle_bird_keys_* - unique aviary_cycle-acc_birdID keys (train/test) - train set contains both train and val splits
|_TunedLusciniaV1e.csv - pairwise distances between recordings computed by Luscinia
|_training_setup_1_ordered_acc_single_cons_50_70_trials.pckl - dictionary containing everything needed for training the model (keys: 'train_keys', 'train_triplets', 'val_keys', 'vali_triplets', 'test_triplets', 'test_keys', 'train_mean', 'train_std')
|_melspecs - *.pckl - melspectrograms of recordings
|_wavs - *wav - recordings
|_README.txt
Recordings
887 syllables extracted from zebra finch song recordings, sampled at 48 kHz, high-pass filtered (100 Hz), with a 20 ms intro/outro fade.
Decisions
Triplets were created from the recordings and the birds made side-based decisions about their similarity (see 'Bird song comparison using deep learning trained from avian perceptual judgments' for further information).
Training dictionary Information
Dictionary keys:
'train_keys', 'train_triplets', 'val_keys', 'vali_triplets', 'test_triplets', 'test_keys', 'train_mean', 'train_std'
train_triplets/vali_triplets/test_triplets:
Aviary_Cycle_birdID, POS, NEG, ANC, Decisions, Cycle_ACC(%), Consistency(%)
train_keys/val_keys/test_keys:
Aviary_Cycle_birdID
train_mean/train_std:
shape: (1, mel_bins)
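A minimal sketch (file path and key names taken from this README; the exact array types are assumed) of loading the training dictionary and checking its contents:

import pickle

with open("files/training_setup_1_ordered_acc_single_cons_50_70_trials.pckl", "rb") as f:
    setup = pickle.load(f)

print(sorted(setup.keys()))
# expected: ['test_keys', 'test_triplets', 'train_keys', 'train_mean',
#            'train_std', 'train_triplets', 'val_keys', 'vali_triplets']
print(setup["train_mean"].shape)  # expected (1, mel_bins)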
Open Access
This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Contact info
Please send any questions about the recordings to:
Lies Zandberg: Elisabeth.Zandberg@rhul.ac.uk
Please send any feedback or questions about the code and the rest of the data to:
Veronica Morfi: g.v.morfi@qmul.ac.uk
This is a dataset for the competition named "Pose Bowl: Detection Track" hosted on DrivenData. Please visit the original competition website for detailed information. All rights and credits go to the original authors. This dataset is a compressed version (jpg instead of png) and is already split into train, test, and val. Its purpose is to help Kagglers get started and use it for custom training with YOLO variations.
Split information: train (18399), val (3701), test (3701)
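A minimal sketch of such custom training with the Ultralytics API; the dataset config file name, model choice, and hyperparameters are illustrative assumptions, not part of this dataset:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # any YOLO variant can be substituted
model.train(
    data="pose_bowl.yaml",           # hypothetical config pointing at the train/val/test folders
    epochs=50,
    imgsz=640,
)
metrics = model.val()                # evaluates on the val split defined in the config
print(metrics)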
Dataset Details
Dataset 'Dusha' split into train, val, and test. Half of the original train set was taken, the original test set was split in half to form the val and test sets, and the 'neutral' category was reduced to make the label distribution more balanced.
A collection of 3 referring expression datasets based off images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
dataset | partition | split | refs | images
---|---|---|---|---
refcoco | google | train | 40000 | 19213
refcoco | google | val | 5000 | 4559
refcoco | google | test | 5000 | 4527
refcoco | unc | train | 42404 | 16994
refcoco | unc | val | 3811 | 1500
refcoco | unc | testA | 1975 | 750
refcoco | unc | testB | 1810 | 750
refcoco+ | unc | train | 42278 | 16992
refcoco+ | unc | val | 3805 | 1500
refcoco+ | unc | testA | 1975 | 750
refcoco+ | unc | testB | 1798 | 750
refcocog | google | train | 44822 | 24698
refcocog | google | val | 5000 | 4650
refcocog | umd | train | 42226 | 21899
refcocog | umd | val | 2573 | 1300
refcocog | umd | test | 5023 | 2600
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a large-scale graph dataset of materials science based on the Open Quantum Materials Database (OQMD) v1.5.
Data Loading
A Python code example:
import sys
sys.path.append('/your/path/to/data/OQM9HK_BEL')
import OQM9HK
bel_path='/your/path/to/data/OQM9HK_BEL'
config = OQM9HK.load_config(path=bel_path)
print(config['atomic_numbers'])
split = OQM9HK.load_split(path=bel_path)
print(len(split['train']), len(split['val']), len(split['test']))
graph_data = OQM9HK.load_graph_data(path=bel_path)
name = next(iter(graph_data)) # First entry's name
graph = graph_data[name] # Graph object
print(graph.nodes)
print(graph.edge_sources)
print(graph.edge_targets)
dataset = OQM9HK.load_targets(path=bel_path) # Pandas dataframe
print(dataset)
train_set = dataset.iloc[split['train']]
val_set = dataset.iloc[split['val']]
test_set = dataset.iloc[split['test']]
The Balanced Affectnet Dataset is a uniformly processed, class-balanced, and augmented version of the affect-fer composite dataset. This curated version is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to enhance model performance and comparability.
🎯 Purpose The goal of this dataset is to balance the representation of the eight emotion classes, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics Source: Based on the Affectnet dataset
Image Format: RGB .png
Image Size: 75 × 75 pixels
Emotion 8-Classes: anger, contempt, disgust, fear, happy, neutral, sad, surprise
Total Images: 41,008
Images per Class: 5,126
⚙️ Preprocessing Pipeline Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
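A minimal sketch (not the authors' exact pipeline; the augmentation magnitudes are illustrative) of the preprocessing and augmentation steps listed above, using torchvision:

from torchvision import transforms

preprocess_augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),                     # converted to grayscale
    transforms.Resize((75, 75)),                                     # resized to 75x75 pixels
    transforms.RandomRotation(degrees=15),                           # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                          # horizontal flip
    transforms.ColorJitter(brightness=0.3, contrast=0.3),            # brightness/contrast adjustment
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),   # sharpness modification
    transforms.ToTensor(),
])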
Testing (10%): 4,100 images
Training (80% of remainder): 29,526 images
Validation (20% of remainder): 7,382 images
✅ Advantages ⚖️ Balanced Classes: Equal images across all eight emotion classes
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 164K images.
This is the original version from 2014, made available here for easy access on Kaggle, and because it does not seem to be available on the COCO Dataset website anymore. It has been retrieved from the mirror that Joseph Redmon set up on his own website.
The 2014 version of the COCO dataset is an excellent object detection dataset with 80 classes, 82,783 training images and 40,504 validation images. This dataset contains all of this imagery in two folders, as well as the annotations with the class and location (bounding box) of the objects contained in each image.
The initial split provides training (83K), validation (41K) and test (41K) sets. Since the split between training and validation was not optimal in the original dataset, there are also two text (.part) files with a new split with only 5,000 images for validation and the rest for training. The test set has no labels and can be used for visual validation or pseudo-labelling.
This is mostly inspired by Erik Linder-Norén and [Joseph Redmon](https://pjreddie.com/darknet/yolo).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of 300 instances from the 100 most important classes in Wikidata, for a total of around 30000 entities and 390000 triples. The dataset is geared towards knowledge graph refinement models that leverage edit history information from the graph. There are two versions of the dataset:
The static version (files postfixed with '_static') contains the simple statements of each entity fetched from Wikidata.
The dynamic version (files postfixed with '_dynamic') contains information about the operations and revisions made to these entities, and the triples that were added or removed.
Each version is split into three subsets: train, validation (val), and test. Each split contains every entity from the dataset. The train split contains the first 70% of the revisions made to each entity, the validation split contains the revisions between the 70% and 85% marks, and the test split contains the last 15% of revisions.
This is a sample from the static datasets:
wd:Q217432 a uo:entity ;
  wdt:P1082 1.005904e+06 ;
  wdt:P1296 "0052280" ;
  wdt:P1791 wd:Q18704103 ;
  wdt:P18 "Pitakwa.jpg" ;
  wdt:P244 "n80066826" ;
  wdt:P571 "+1912-00-00T00:00:00Z" ;
  wdt:P6766 "421180027" .
Each entity has the type uo:entity, and contains the statements added during that time period following Wikidata's data model.
In the following code snippet we show an example from the dynamic dataset:
uo:rev703872813 a uo:revision ; uo:timestamp "2018-06-28T22:31:32Z" .
uo:op703872813_0 a uo:operation ; uo:fromRevision uo:rev703872813 ; uo:newObject wd:Q82955 ; uo:opType uo:add ; uo:revProp wdt:P106 ; uo:revSubject wd:Q6097419 .
uo:op703878666_0 a uo:operation ; uo:fromRevision uo:rev703878666 ; uo:opType uo:remove ; uo:prevObject wd:Q1108445 ; uo:revProp wdt:P460 ; uo:revSubject wd:Q1147883 .
This dataset is composed of revisions, which have a timestamp. Each revision is composed of 1 to n operations, in which there is a change to a statement from the entity. There are two types of operations: uo:add and uo:remove. In both cases, the property and the subject being modified are shown with the uo:revProp and uo:revSubject properties. In the case of additions, uo:newObject and uo:prevObject properties are added to show the previous and new objects after the addition. In the case of removals, there is a uo:prevObject property to record the object that was removed.
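A minimal sketch of working with the dynamic example above using rdflib. The wd: and wdt: prefixes are the standard Wikidata namespaces; the uo: namespace URI is not given in this description, so a placeholder is used here:

from rdflib import Graph, Namespace, RDF

UO = Namespace("http://example.org/uo#")  # placeholder URI for the uo: namespace (assumption)

ttl = """
@prefix uo: <http://example.org/uo#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

uo:rev703872813 a uo:revision ; uo:timestamp "2018-06-28T22:31:32Z" .

uo:op703872813_0 a uo:operation ; uo:fromRevision uo:rev703872813 ;
    uo:newObject wd:Q82955 ; uo:opType uo:add ;
    uo:revProp wdt:P106 ; uo:revSubject wd:Q6097419 .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# List every operation with its type, revised property, and revised subject
for op in g.subjects(RDF.type, UO.operation):
    print(op, g.value(op, UO.opType), g.value(op, UO.revProp), g.value(op, UO.revSubject))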
Introduction
The data set is based on 3,004 images collected by the Pancam instruments mounted on the Opportunity and Spirit rovers from NASA's Mars Exploration Rovers (MER) mission. We used rotation, skewing, and shearing augmentation methods to increase the total collection to 70,864 images (see Image Augmentation section for more information). Based on the MER Data Catalog User Survey [1], we identified 25 classes of both scientific (e.g., soil trench, float rocks, etc.) and engineering (e.g., rover deck, Pancam calibration target, etc.) interest (see Classes section for more information). The 3,004 images were labeled on the Zooniverse platform, and each image may be assigned multiple labels. The images are either 512 x 512 or 1024 x 1024 pixels in size (see Image Sampling section for more information).
Classes
There is a total of 25 classes in this data set. The list below gives class names, counts, and percentages (the percentages are computed as count divided by 3,004). Note that the total counts don't sum to 3,004 and the percentages don't sum to 1.0 because each image may be assigned more than one class.
Class name, count, percentage of dataset
Rover Deck, 222, 7.39%
Pancam Calibration Target, 14, 0.47%
Arm Hardware, 4, 0.13%
Other Hardware, 116, 3.86%
Rover Tracks, 301, 10.02%
Soil Trench, 34, 1.13%
RAT Brushed Target, 17, 0.57%
RAT Hole, 30, 1.00%
Rock Outcrop, 1915, 63.75%
Float Rocks, 860, 28.63%
Clasts, 1676, 55.79%
Rocks (misc), 249, 8.29%
Bright Soil, 122, 4.06%
Dunes/Ripples, 1000, 33.29%
Rock (Linear Features), 943, 31.39%
Rock (Round Features), 219, 7.29%
Soil, 2891, 96.24%
Astronomy, 12, 0.40%
Spherules, 868, 28.89%
Distant Vista, 903, 30.23%
Sky, 954, 31.76%
Close-up Rock, 23, 0.77%
Nearby Surface, 2006, 66.78%
Rover Parts, 301, 10.02%
Artifacts, 28, 0.93%
Image Sampling
Images in the MER rover Pancam archive range in size from 64x64 to 1024x1024 pixels. The largest size, 1024x1024, was by far the most common size in the archive. For the deep learning dataset, we elected to sample only 1024x1024 and 512x512 images, as the higher resolution would be beneficial to feature extraction. In order to ensure that the data set is representative of the total image archive of 4.3 million images, we elected to sample via "site code". Each Pancam image has a corresponding two-digit alphanumeric "site code" which is used to track location throughout its mission. Since each site code corresponds to a different general location, sampling a fixed proportion of images taken from each site ensures that the data set contains some images from each location. In this way, we could ensure that a model performing well on this dataset would generalize well to the unlabeled archive data as a whole. We randomly sampled 20% of the images at each site within the subset of Pancam data fitting all other image criteria, applying a floor function to non-whole-number sample sizes, resulting in a dataset of 3,004 images.
Train/validation/test split
The 3,004 images were split into train, validation, and test data sets. The split was done so that roughly 60, 15, and 25 percent of the 3,004 images would end up in the train, validation, and test data sets respectively, while ensuring that images from a given site are not split between the train/validation/test data sets. This resulted in 1,806 train images, 456 validation images, and 742 test images.
Augmentation
To augment the images in the train and validation data sets (images in the test data set were not augmented), three augmentation methods were chosen that best represent transformations that could realistically be seen in Pancam images: rotation, skew, and shear. The augmentation methods were applied with random magnitude, followed by random horizontal flipping, to create 30 augmented images for each image. Since each transformation is followed by a square crop in order to keep the input shape consistent, we had to constrain the magnitude limits of each augmentation to avoid cropping out important features at the edges of input images. Thus, rotations were limited to 15 degrees in either direction, the 3-dimensional skew was limited to 45 degrees in any direction, and shearing was limited to 10 degrees in either direction. Note that augmentation was done only on training and validation images.
Directory Contents
images: contains all 70,864 images
train-set-v1.1.0.txt: label file for the training data set
val-set-v1.1.0.txt: label file for the validation data set
test-set-v1.1.0.txt: label file for the testing data set
Images with relatively short file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg) are original images, and images with long file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg_04140167-5781-49bd-a913-6d4d0a61dab1.jpg) are augmented images. The label files are formatted as "Image name, Class1, Class2, ..., ClassN".
Reference
[1] S.B. Cole, J.C. Aubele, B.A. Cohen, S.M. Milkovich, and S.A...
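A minimal sketch (not code shipped with the dataset) of reading one of the label files in the "Image name, Class1, Class2, ..., ClassN" format described above into a dictionary mapping each image to its class labels:

def read_label_file(path):
    labels = {}
    with open(path) as f:
        for line in f:
            parts = [p.strip() for p in line.strip().split(",")]
            if not parts or not parts[0]:
                continue
            image_name, classes = parts[0], parts[1:]
            labels[image_name] = classes
    return labels

train_labels = read_label_file("train-set-v1.1.0.txt")
print(len(train_labels), "labelled training images")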
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
.db versions of the train/test/val splits as described by https://gitlab.com/matschreiner/Transition1x. For the Fall 2023 CS224W Final Project.
Dataset Card for ncRPI
Summary
The ncRPI dataset is part of the LUCAONE downstream tasks collection for biomolecular interaction prediction. It is structured for binary classification and includes standard splits for training (train.csv), validation (dev.csv → val), and test (test.csv).
Dataset Structure
This dataset includes three splits:
- train
- val (converted from dev.csv)
- test
Each split is in CSV format.
Task
Binary classification of… See the full description on the dataset page: https://huggingface.co/datasets/vladak/ncRPI.
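A minimal sketch of loading the three CSV splits with the datasets library; the file names come from this card, while the assumption that they sit in the working directory (and their column layout) is not specified here:

from datasets import load_dataset

ds = load_dataset(
    "csv",
    data_files={"train": "train.csv", "val": "dev.csv", "test": "test.csv"},
)
print(ds)  # DatasetDict with train / val / test splits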
The Balanced Image-FER Dataset is a uniformly processed, class-balanced, and augmented version of the original FER2013 Emotion Dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.
🎯 Purpose The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics Source: Based on the Image-FER Dataset
Image Format: RGB .png
Image Size: 75 × 75 pixels
Emotion Classes:
anger, contempt, disgust, fear, happiness, neutral, sadness, surprise
⚙️ Preprocessing Pipeline Each image in the dataset has been preprocessed using the following steps:
✅ Converted to RGB
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
Testing (10%): 687 images
Training (80% of remainder): 4,943 images
Validation (20% of remainder): 1,236 images
✅ Advantages ⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This dataset contains 3000+ images generated from an OOC (organ-on-a-chip) setup with different cell types. The images were generated by an automated brightfield microscopy setup; for each image, parameters such as cell type, time after seeding, and class label ('good' or 'bad' sample quality as assessed by a biology expert) are provided. Furthermore, for some images, seeding density and flow rate are given as well. The dataset can be used for training machine learning classifiers for the automated analysis of data generated with an OOC setup, allowing users to create more reliable tissue models and to automate decision-making processes for growing OOC.
The dataset comprises images of OOC samples from the following cell lines:
Structure of the dataset: The dataset is split into three main folders that correspond to the data split for training machine learning models, i.e., 'train', 'val', and 'test'. The train/val/test split is done proportionally with respect to the class labels, cell lines, and time after seeding (see below), yet the data can be split or merged in other ways to suit the needs of prospective users of the dataset. Within each of the main folders, there are a 'bad' and a 'good' folder with the images corresponding to the respective class labels (see 'Overview' above). The images in the 'bad' / 'good' folders are further subdivided into folders corresponding to the respective cell lines, which are in turn subdivided into folders corresponding to the different times after seeding. Therefore, it is easy to find images of interest, e.g., '4+ days' 'good' images of the cell line A549 from the 'train' dataset. Further information about the images is available in the file 'OOC_datasheet.xlsx'.
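A minimal sketch (the root folder name 'OOC_dataset' is an assumption) of collecting image paths together with their class label, cell line, and time-after-seeding folder names from the layout described above:

from pathlib import Path

def collect_split(root, split="train"):
    base = Path(root) / split
    samples = []
    for img in base.rglob("*"):
        if not img.is_file():
            continue
        parts = img.relative_to(base).parts
        if len(parts) < 4:  # expect <good|bad>/<cell line>/<time after seeding>/<image file>
            continue
        label, cell_line, time_after_seeding = parts[0], parts[1], parts[2]
        samples.append({
            "path": str(img),
            "label": label,
            "cell_line": cell_line,
            "time_after_seeding": time_after_seeding,
        })
    return samples

train_samples = collect_split("OOC_dataset", split="train")
print(len(train_samples), "training images")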
Acknowledgement: The work presented in this paper was supported by the project 'AI-improved organ on chip cultivation for personalised medicine (AimOOC)' (contract with Central Finance and Contracting Agency of Republic of Latvia no. 1.1.1.1/21/A/079; the project is co-financed by REACT-EU funding for mitigating the consequences of the pandemic crisis).
The data is based on randomly selected viral and bacterial genomes and the human (GRCh38.p13) reference genome, which were downloaded from GenBank. From each original nucleic acid sequence we created multiple patches of length 300 in all possible reading frames, using a sliding window on the initial sequence and its reverse complement. For the train and val files, the resulting patches are translated to amino acid sequences of length 100, whereas the DNA_test file contains the nucleic acid sequence patches of length 300. The data is stored in the FASTA format according to the following convention:
>{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}
sequence
with
ID: the RefSeq accession of the original sequence in the RefSeq dataset.
sequence: either a nucleic acid sequence patch of length 300 (DNA_test) or an amino acid sequence of length 100 (train, val).
patch index: the starting triplet of the given patch within the original sequence or reverse complemented sequence (i.e. 3*patch_index is the starting index of frame 0 in the original sequence).
class marker: indicates the taxonomic domain (0 - virus, 1 - bacteria, 2 - human/mammal).
frame index: indicates the reading frame (0 - on-frame, 1 - shifted by one, 2 - shifted by two, 3 - reverse complemented, 4 - shifted by one and reverse complemented, 5 - shifted by two and reverse complemented).
The data is split into test, training and validation sets, which contain the following number of patches per frame:
train: 1,700,944
test: 212,618
val: 212,618
The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) in the project deep.Health (project number 13FH770IX6).
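A minimal sketch of parsing a FASTA header in the convention above; the accession used in the example is hypothetical:

CLASS_NAMES = {0: "virus", 1: "bacteria", 2: "human/mammal"}

def parse_header(header):
    # ">{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}"
    head, class_marker, frame_index = header.lstrip(">").split("|")
    seq_id, rest = head.split("_subsequence")
    patch_index, _frame_in_name = rest.split("_frame")  # the frame index also appears after the last '|'
    return {
        "id": seq_id,
        "patch_index": int(patch_index),
        "frame_index": int(frame_index),
        "class": CLASS_NAMES[int(class_marker)],
    }

print(parse_header(">NC_000001_subsequence12_frame3|2|3"))  # hypothetical accession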