MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Dataset Division:
Unique_query: contains unique queries that are not present in the train, val, or test splits; used for testing how well models handle unseen queries. Train_all: contains the unsplit train, test, and val datapoints. train: train split. val: val split. test: test split.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains data for temporal validity change prediction, an NLP task that will be defined in an upcoming publication. The dataset consists of five columns.
The duration labels (context_only_tv, combined_tv) are class indices of the following class distribution:
[no time-sensitive information, less than one minute, 1-5 minutes, 5-15 minutes, 15-45 minutes, 45 minutes - 2 hours, 2-6 hours, more than 6 hours, 1-3 days, 3-7 days, 1-4 weeks, more than one month]
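A minimal sketch (not part of the dataset release) of mapping these class indices back to their duration labels, assuming the classes are ordered exactly as listed above:

DURATION_CLASSES = [
    "no time-sensitive information",
    "less than one minute",
    "1-5 minutes",
    "5-15 minutes",
    "15-45 minutes",
    "45 minutes - 2 hours",
    "2-6 hours",
    "more than 6 hours",
    "1-3 days",
    "3-7 days",
    "1-4 weeks",
    "more than one month",
]

def index_to_label(class_index: int) -> str:
    # Map a context_only_tv / combined_tv class index to its duration label
    return DURATION_CLASSES[class_index]

print(index_to_label(4))  # "15-45 minutes"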
Different dataset splits are provided.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
Packet-level classification: classify based on individual packets.
Per-packet-split: mix all packets and split them into train, val, and test sets with an 8:1:1 ratio. Per-flow-split: split the pcap files based on 5-tuples (src_IP, dst_IP, src_port, dst_port, and protocol) using 3-fold validation, so that there is no intersection between the train, val, and test sets.
Flow-level classification: classify based on flows.
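The per-flow split described above can be illustrated with a short sketch; this is not code shipped with the dataset, and the packet record fields (src_ip, dst_ip, src_port, dst_port, protocol) are assumed names. It groups packets by 5-tuple so whole flows, never individual packets, are assigned to train/val/test:

import random
from collections import defaultdict

def five_tuple(pkt):
    # pkt is assumed to be a dict with these (hypothetical) keys
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"], pkt["protocol"])

def per_flow_split(packets, seed=0):
    # Group packets into flows keyed by their 5-tuple
    flows = defaultdict(list)
    for pkt in packets:
        flows[five_tuple(pkt)].append(pkt)

    keys = list(flows)
    random.Random(seed).shuffle(keys)

    # Simple hold-out variant; the dataset itself uses 3-fold validation,
    # but the key property is the same: flows are partitioned, so no flow
    # appears in more than one of train/val/test.
    n = len(keys)
    train_keys = keys[:int(0.8 * n)]
    val_keys = keys[int(0.8 * n):int(0.9 * n)]
    test_keys = keys[int(0.9 * n):]

    def gather(ks):
        return [p for k in ks for p in flows[k]]

    return gather(train_keys), gather(val_keys), gather(test_keys)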
License: https://choosealicense.com/licenses/unknown/
dataset_info:
  features:
    - name: image
      dtype: image
    - name: question
      dtype: string
    - name: caption
      dtype: string
  splits:
    - name: train
      num_bytes: 1572864
      num_examples: 40
    - name: test
      num_bytes: 764825.6
      num_examples: 20
    - name: val
      num_bytes: 961740.8
      num_examples: 20
configs:
  - data_files:
      - split: train
        path: data/train
      - split: test
        path: data/test
      - split: val
        path: data/val
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description: Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707. Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions Funding: These data were collected as part of research funded by:
NERC (NERC QMEE CDT Studentship, NE/P012345/1, http://gotw.nerc.ac.uk/list_full.asp?pcode=NE%2FP012345%2F1&cookieConsent=A) This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here.
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx: This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270287
Fields:
filename: Root ID (Field type: id)
camera_trap_site: Site ID for the camera trap location (Field type: location)
taxon: Taxon recorded by camera trap (Field type: taxa)
dist_level: Level of disturbance at site (Field type: ordered categorical)
baseline: Label as to whether image is included in the baseline training, validation (val) or test set, or not included (NA) (Field type: categorical)
increased_cap: Label as to whether image is included in the 'increased cap' training, validation (val) or test set, or not included (NA) (Field type: categorical)
dist_individ_event_level: Label as to whether image is included in the 'individual disturbance level datasets split at event level' training, validation (val) or test set, or not included (NA) (Field type: categorical)
dist_combined_event_level_1 to dist_combined_event_level_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training or test set for the single disturbance level (1-5) named in the field, or not included (NA) (Field type: categorical)
dist_combined_event_level_pair_1_2 to dist_combined_event_level_pair_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training set for the pair of disturbance levels named in the field (all pairs drawn from levels 1-5), or not included (NA) (Field type: categorical)
dist_combined_event_level_triple_1_2_3 to dist_combined_event_level_triple_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training set for the triple of disturbance levels named in the field, or not included (NA) (Field type: categorical)
dist_combined_event_level_quad_1_2_3_4 to dist_combined_event_level_quad_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level' training set for the four disturbance levels named in the field, or not included (NA) (Field type: categorical)
dist_combined_event_level_all_1_2_3_4_5: Label as to whether image is included in the 'disturbance level combination analysis split at event level: disturbance levels 1, 2, 3, 4 and 5 (all)' training set, or not included (NA) (Field type: categorical)
dist_camera_level_individ_1: Label as to whether image is included in the 'disturbance level combination analysis split at camera level: disturbance
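As a usage sketch (not part of the dataset), the composition table above can be read with pandas; the worksheet and column names come from the field list, while the 'train' value used for filtering is an assumption about how the split labels are encoded:

import pandas as pd

# Read the Dataset_images worksheet from the metadata workbook
images = pd.read_excel("CT_image_data_info2.xlsx", sheet_name="Dataset_images")

# Select the images assigned to the baseline training set ('train' value assumed)
baseline_train = images[images["baseline"] == "train"]
print(len(baseline_train), "baseline training images")
print(baseline_train[["filename", "camera_trap_site", "taxon"]].head())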
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Source codes and dataset of the research "Solar flare forecasting based on magnetogram sequences learning with MViT and data augmentation". Our work employed PyTorch, a framework for training Deep Learning models with GPU support and automatic back-propagation, to load the MViTv2-S model with Kinetics-400 weights. To simplify the code implementation, eliminating the need for an explicit training loop and automating some hyperparameters, we use the PyTorch Lightning module. The inputs were batches of 10 samples, each with 16 sequenced 3-channel images resized to 224 × 224 pixels and normalized from 0 to 1. Most of the papers in our literature survey split the original dataset chronologically. Some authors also apply k-fold cross-validation to emphasize the evaluation of model stability. However, we adopt a hybrid split, taking the first 50,000 samples and applying 5-fold cross-validation between the training and validation sets (known data), with 40,000 samples for training and 10,000 for validation. Thus, we can evaluate performance and stability by analyzing the mean and standard deviation of all trained models on the test set, composed of the last 9,834 samples, preserving the chronological order (simulating unknown data). We developed three distinct models to evaluate the impact of oversampling magnetogram sequences throughout the dataset. The first model, Solar Flare MViT (SF MViT), was trained only with the original data from our base dataset, without oversampling. In the second model, Solar Flare MViT over Train (SF MViT oT), we only apply oversampling on the training data, maintaining the original validation dataset. In the third model, Solar Flare MViT over Train and Validation (SF MViT oTV), we apply oversampling in both the training and validation sets. We also trained a model oversampling the entire dataset, which we called "SF_MViT_oTV Test", to verify how resampling or adopting a test set with unreal data may bias the results positively.
GitHub version: The .zip hosted here contains all files from the project, including the checkpoint and output files generated by the codes. We have a clean version hosted on GitHub (https://github.com/lfgrim/SFF_MagSeq_MViTs), without the magnetogram_jpg folder (which can be downloaded directly from https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip) and without the output and checkpoint files. Most code files hosted here also contain comments in Portuguese, which are being updated to English in the GitHub version.
Folders Structure: In the root directory of the project, we have two folders:
magnetogram_jpg: holds the source images provided by the Space Environment Artificial Intelligence Early Warning Innovation Workshop through the link https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531804/dataset_ss2sff.zip. It comprises 73,810 samples of high-quality magnetograms captured by HMI/SDO from 2010 May 4 to 2019 January 26. The HMI instrument provides these data (stored in the hmi.sharp_720s dataset), making new samples available every 12 minutes. However, the images in this dataset were collected every 96 minutes. Each image has an associated magnetogram comprising a ready-made snippet of one or more solar ARs. It is essential to notice that the magnetograms cropped by SHARP can contain one or more solar ARs classified by the National Oceanic and Atmospheric Administration (NOAA).
Seq_Magnetogram: contains the references to the source images with the corresponding labels for the next 24 h and 48 h
in the M24 and M48 sub-folders, respectively.
M24/M48: both present the following sub-folder structure: Seqs16; SF_MViT; SF_MViT_oT; SF_MViT_oTV; SF_MViT_oTV_Test.
There are also two files in the root: inst_packages.sh, which installs the packages and dependencies needed to run the models, and download_MViTS.py, which downloads the pre-trained MViTv2_S from PyTorch and stores it in the cache.
The M24 and M48 folders hold reference text files (flare_Mclass...) linking the images in the magnetogram_jpg folder, or the sequences (Seq16_flare_Mclass...) in the Seqs16 folders, with their respective labels. They also hold "cria_seqs.py", which was responsible for creating the sequences, and "test_pandas.py", used to verify the head info and check the number of samples per label in the text files. All the text files with the prefix "Seq16" inside the Seqs16 folder were created by the "cria_seqs.py" code based on the corresponding "flare_Mclass"-prefixed text files. The Seqs16 folder holds reference text files in which each file contains a sequence of images pointing into the magnetogram_jpg folder.
All SF_MViT... folders hold the model training code itself (SF_MViT...py) and the corresponding job submission (jobMViT...), temporary input (Seq16_flare...), output (saida_MVIT... and MViT_S...), error (err_MViT...) and checkpoint (sample-FLARE...ckpt) files. Executed model training codes generate the output, error, and checkpoint files. There is also a folder called "lightning_logs" that stores the logs of trained models.
Naming pattern for the files:
magnetogram_jpg follows the format "hmi.sharp_720s...magnetogram.fits.jpg" and Seqs16 follows the format "hmi.sharp_720s...to.", where:
hmi: the instrument that captured the image;
sharp_720s: the database source of SDO/HMI;
: the identification of the SHARP region, which can contain one or more solar ARs classified by NOAA;
: the date-time the instrument captured the image, in the format yyyymmdd_hhnnss_TAI (y: year, m: month, d: day, h: hours, n: minutes, s: seconds);
: the date-time when the sequence starts, following the same format;
: the date-time when the sequence ends, following the same format.
Reference text files in M24 and M48 or inside the SF_MViT... folders follow the format "flare_Mclass_.txt", where:
: Seq16 if it refers to a sequence, or void if it refers directly to images;
: "24h" or "48h";
: "TrainVal" or "Test". The refers to the Train/Val split;
: void, or "_over" after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.
All SF_MViT... folders:
Model training codes: "SF_MViT_M+_", where:
: void, or "oT" (over Train), "oTV" (over Train and Val) or "oTV_Test" (over Train, Val and Test);
: "24h" or "48h";
: "oneSplit" for a specific split or "allSplits" if it runs all splits;
: void is the default to run on 1 GPU, or "2gpu" to run on 2-GPU systems.
Job submission files: "jobMViT_", where:
: points to the queue in the Lovelace environment hosted at CENAPAD-SP (https://www.cenapad.unicamp.br/parque/jobsLovelace).
Temporary inputs: "Seq16_flare_Mclass_.txt", where:
: train or val;
: void, or "_over" after the extension (...txt_over), meaning a temporary input reference that was over-sampled by a training model.
Outputs: "saida_MViT_Adam_10-7", where:
: k0 to k4, the corresponding split of the output, or void if the output is from all splits.
Error files: "err_MViT_Adam_10-7", where:
: k0 to k4, the corresponding split of the error log file, or void if the error file is from all splits.
Checkpoint files: "sample-FLARE_MViT_S_10-7-epoch=-valid_loss=-Wloss_k=.ckpt", where:
: epoch number of the checkpoint;
: corresponding validation loss;
: 0 to 4.
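As a minimal, hedged sketch (not the authors' training code), the backbone described above can be loaded from torchvision with Kinetics-400 weights and fed a batch shaped like the inputs described here (10 samples × 16 images × 3 channels × 224 × 224, scaled to [0, 1]):

import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

# Load MViTv2-S pre-trained on Kinetics-400
model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1)
model.eval()

# torchvision video models expect (batch, channels, time, height, width)
batch = torch.rand(10, 3, 16, 224, 224)  # dummy data already normalized to [0, 1]

with torch.no_grad():
    logits = model(batch)
print(logits.shape)  # (10, 400) before the head is replaced/fine-tuned for flare classes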
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
General description
This dataset contains Zebra Finch decisions about perceptual similarity on song units. All the data and files are used for reproducing the results of the paper 'Bird song comparison using deep learning trained from avian perceptual judgments' by the same authors.
Git repo on Zenodo: https://doi.org/10.5281/zenodo.5545932
Git repo access: https://github.com/veronicamorfi/ml4bl/tree/v1.0.0
Directory organisation:
ML4BL_ZF
|_files
|_Final_probes_20200816.csv - all trials and decisions of the birds (aviary 1 cycle 1 data are removed from experiments)
|_luscinia_triplets_filtered.csv - triplets to use for training
|_mean_std_luscinia_pretraining.pckl - mean and std of luscinia triplets used for training
|_*_cons_* - % side consistency on triplets (train/test) - train set contains both train and val splits
|_*_gt_* - cycle accuracy for triplets of the specific bird (train/test) - train set contains both train and val splits
|_*_trials_* - number of decisions made for a triplet (train/test) - train set contains both train and val splits
|_*_triplets_* - triplet information (aviary_cycle-acc_birdID, POS, NEG, ANC) (train/test) - train set contains both train and val splits
|_*_low*_ - low-margin (ambiguous) triplets (train/val/test)
|_*_high_ - high-margin (unambiguous) triplets (train/val/test)
|_*_cycle_bird_keys_* - unique aviary_cycle-acc_birdID keys (train/test) - train set contains both train and val splits
|_TunedLusciniaV1e.csv - pairwise distances between recordings computed by Luscinia
|_training_setup_1_ordered_acc_single_cons_50_70_trials.pckl - dictionary containing everything needed for training the model (keys: 'train_keys', 'train_triplets', 'val_keys', 'vali_triplets', 'test_triplets', 'test_keys', 'train_mean', 'train_std')
|_melspecs - *.pckl - melspectrograms of recordings
|_wavs - *wav - recordings
|_README.txt
Recordings
887 syllables extracted from zebra finch song recordings, sampled at 48 kHz, high-pass filtered (100 Hz), with a 20 ms intro/outro fade.
Decisions
Triplets were created from the recordings and the birds made side-based decisions about their similarity (see 'Bird song comparison using deep learning trained from avian perceptual judgments' for further information).
Training dictionary Information
Dictionary keys:
'train_keys', 'train_triplets', 'val_keys', 'vali_triplets', 'test_triplets', 'test_keys', 'train_mean', 'train_std'
train_triplets/vali_triplets/test_triplets:
Aviary_Cycle_birdID, POS, NEG, ANC, Decisions, Cycle_ACC(%), Consistency(%)
train_keys/val_keys/test_keys:
Aviary_Cycle_birdID
train_mean/train_std:
shape: (1, mel_bins)
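A minimal sketch (file path and key names taken from this README; the exact array types are assumed) of loading the training dictionary and checking its contents:

import pickle

with open("files/training_setup_1_ordered_acc_single_cons_50_70_trials.pckl", "rb") as f:
    setup = pickle.load(f)

print(sorted(setup.keys()))
# expected: ['test_keys', 'test_triplets', 'train_keys', 'train_mean',
#            'train_std', 'train_triplets', 'val_keys', 'vali_triplets']
print(setup["train_mean"].shape)  # expected (1, mel_bins)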
Open Access
This dataset is available under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Contact info
Please send any questions about the recordings to:
Lies Zandberg: Elisabeth.Zandberg@rhul.ac.uk
Please send any feedback or questions about the code and the rest of the data to:
Veronica Morfi: g.v.morfi@qmul.ac.uk
This is a dataset for the competition named "Pose Bowl: Detection Track" hosted on DrivenData. Please visit the original competition website for detailed information. All rights and credits go to the original authors. This dataset is a compressed version (jpg instead of png) and is already split into train, test, and val. Its purpose is to help Kagglers get started and use it for custom training with YOLO variations.
Split information: train (18399), val (3701), test (3701)
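A minimal sketch of such custom training with the Ultralytics API; the dataset config file name, model choice, and hyperparameters are illustrative assumptions, not part of this dataset:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")           # any YOLO variant can be substituted
model.train(
    data="pose_bowl.yaml",           # hypothetical config pointing at the train/val/test folders
    epochs=50,
    imgsz=640,
)
metrics = model.val()                # evaluates on the val split defined in the config
print(metrics)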
Dataset Details
Dataset 'Dusha' split into train, val, and test. Half of the original train set was taken, the original test set was split in half to form the val and test sets, and the 'neutral' category was reduced to make the label distribution more balanced.
A collection of 3 referring expression datasets based off images in the COCO dataset. A referring expression is a piece of text that describes a unique object in an image. These datasets are collected by asking human raters to disambiguate objects delineated by bounding boxes in the COCO dataset.
RefCoco and RefCoco+ are from Kazemzadeh et al. 2014. RefCoco+ expressions are strictly appearance-based descriptions, which they enforced by preventing raters from using location-based descriptions (e.g., "person to the right" is not a valid description for RefCoco+). RefCocoG is from Mao et al. 2016, and has richer descriptions of objects compared to RefCoco due to differences in the annotation process. In particular, RefCoco was collected in an interactive game-based setting, while RefCocoG was collected in a non-interactive setting. On average, RefCocoG has 8.4 words per expression while RefCoco has 3.5 words.
Each dataset has different split allocations that are typically all reported in papers. The "testA" and "testB" sets in RefCoco and RefCoco+ contain only people and only non-people respectively. Images are partitioned into the various splits. In the "google" split, objects, not images, are partitioned between the train and non-train splits. This means that the same image can appear in both the train and validation split, but the objects being referred to in the image will be different between the two sets. In contrast, the "unc" and "umd" splits partition images between the train, validation, and test split. In RefCocoG, the "google" split does not have a canonical test set, and the validation set is typically reported in papers as "val*".
Stats for each dataset and split ("refs" is the number of referring expressions, and "images" is the number of images):
dataset | partition | split | refs | images
---|---|---|---|---
refcoco | google | train | 40000 | 19213
refcoco | google | val | 5000 | 4559
refcoco | google | test | 5000 | 4527
refcoco | unc | train | 42404 | 16994
refcoco | unc | val | 3811 | 1500
refcoco | unc | testA | 1975 | 750
refcoco | unc | testB | 1810 | 750
refcoco+ | unc | train | 42278 | 16992
refcoco+ | unc | val | 3805 | 1500
refcoco+ | unc | testA | 1975 | 750
refcoco+ | unc | testB | 1798 | 750
refcocog | google | train | 44822 | 24698
refcocog | google | val | 5000 | 4650
refcocog | umd | train | 42226 | 21899
refcocog | umd | val | 2573 | 1300
refcocog | umd | test | 5023 | 2600
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('ref_coco', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/ref_coco-refcoco_unc-1.1.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a large-scale graph dataset of materials science based on the Open Quantum Materials Database (OQMD) v1.5.
Data Loading
A Python code example:
import sys
sys.path.append('/your/path/to/data/OQM9HK_BEL')
import OQM9HK
bel_path='/your/path/to/data/OQM9HK_BEL'
config = OQM9HK.load_config(path=bel_path)
print(config['atomic_numbers'])
split = OQM9HK.load_split(path=bel_path)
print(len(split['train']), len(split['val']), len(split['test']))
graph_data = OQM9HK.load_graph_data(path=bel_path)
name = next(iter(graph_data)) # First entry's name
graph = graph_data[name] # Graph object
print(graph.nodes)
print(graph.edge_sources)
print(graph.edge_targets)
dataset = OQM9HK.load_targets(path=bel_path) # Pandas dataframe
print(dataset)
train_set = dataset.iloc[split['train']]
val_set = dataset.iloc[split['val']]
test_set = dataset.iloc[split['test']]
The Balanced Affectnet Dataset is a uniformly processed, class-balanced, and augmented version of the affect-fer composite dataset. This curated version is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to enhance model performance and comparability.
🎯 Purpose The goal of this dataset is to balance the representation of the eight emotion classes, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics Source: Based on the Affectnet dataset
Image Format: RGB .png
Image Size: 75 × 75 pixels
Emotion 8-Classes: anger, contempt, disgust, fear, happy, neutral, sad, surprise
Total Images: 41,008
Images per Class: 5,126
⚙️ Preprocessing Pipeline Each image in the dataset has been preprocessed using the following steps:
✅ Converted to grayscale
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
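A minimal sketch (not the authors' exact pipeline; the augmentation magnitudes are illustrative) of the preprocessing and augmentation steps listed above, using torchvision:

from torchvision import transforms

preprocess_augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),                     # converted to grayscale
    transforms.Resize((75, 75)),                                     # resized to 75x75 pixels
    transforms.RandomRotation(degrees=15),                           # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                          # horizontal flip
    transforms.ColorJitter(brightness=0.3, contrast=0.3),            # brightness/contrast adjustment
    transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),   # sharpness modification
    transforms.ToTensor(),
])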
Testing (10%): 4,100 images
Training (80% of remainder): 29,526 images
Validation (20% of remainder): 7,382 images
✅ Advantages ⚖️ Balanced Classes: Equal images across all eight emotion classes
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MS COCO (Microsoft Common Objects in Context) dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 164K images.
This is the original version from 2014, made available here for easy access on Kaggle, and because it does not seem to be available on the COCO Dataset website anymore. It has been retrieved from the mirror that Joseph Redmon set up on his own website.
The 2014 version of the COCO dataset is an excellent object detection dataset with 80 classes, 82,783 training images and 40,504 validation images. This dataset contains all of this imagery in two folders, as well as the annotations with the class and location (bounding box) of the objects contained in each image.
The initial split provides training (83K), validation (41K) and test (41K) sets. Since the split between training and validation was not optimal in the original dataset, there are also two text (.part) files with a new split with only 5,000 images for validation and the rest for training. The test set has no labels and can be used for visual validation or pseudo-labelling.
This is mostly inspired by Erik Linder-Norén and [Joseph Redmon](https://pjreddie.com/darknet/yolo).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is composed of 300 instances from the 100 most important classes in Wikidata, for a total of around 30000 entities and 390000 triples. The dataset is geared towards knowledge graph refinement models that leverage edit history information from the graph. There are two versions of the dataset:
The static version (files postfixed with '_static') contains the simple statements of each entity fetched from Wikidata.
The dynamic version (files postfixed with '_dynamic') contains information about the operations and revisions made to these entities, and the triples that were added or removed.
Each version is split into three subsets: train, validation (val), and test. Each split contains every entity from the dataset. The train split contains the first 70% of the revisions made to each entity, the validation split contains the revisions between the 70% and 85% marks, and the test split contains the last 15% of revisions.
This is a sample from the static datasets:
wd:Q217432 a uo:entity ;
  wdt:P1082 1.005904e+06 ;
  wdt:P1296 "0052280" ;
  wdt:P1791 wd:Q18704103 ;
  wdt:P18 "Pitakwa.jpg" ;
  wdt:P244 "n80066826" ;
  wdt:P571 "+1912-00-00T00:00:00Z" ;
  wdt:P6766 "421180027" .
Each entity has the type uo:entity, and contains the statements added during that time period following Wikidata's data model.
In the following code snippet we show an example from the dynamic dataset:
uo:rev703872813 a uo:revision ; uo:timestamp "2018-06-28T22:31:32Z" .
uo:op703872813_0 a uo:operation ; uo:fromRevision uo:rev703872813 ; uo:newObject wd:Q82955 ; uo:opType uo:add ; uo:revProp wdt:P106 ; uo:revSubject wd:Q6097419 .
uo:op703878666_0 a uo:operation ; uo:fromRevision uo:rev703878666 ; uo:opType uo:remove ; uo:prevObject wd:Q1108445 ; uo:revProp wdt:P460 ; uo:revSubject wd:Q1147883 .
This dataset is composed of revisions, which have a timestamp. Each revision is composed of 1 to n operations, in which there is a change to a statement from the entity. There are two types of operations: uo:add and uo:remove. In both cases, the property and the subject being modified are shown with the uo:revProp and uo:revSubject properties. In the case of additions, uo:newObject and uo:prevObject properties are added to show the previous and new objects after the addition. In the case of removals, there is a uo:prevObject property to record the object that was removed.
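A minimal sketch of working with the dynamic example above using rdflib. The wd: and wdt: prefixes are the standard Wikidata namespaces; the uo: namespace URI is not given in this description, so a placeholder is used here:

from rdflib import Graph, Namespace, RDF

UO = Namespace("http://example.org/uo#")  # placeholder URI for the uo: namespace (assumption)

ttl = """
@prefix uo: <http://example.org/uo#> .
@prefix wd: <http://www.wikidata.org/entity/> .
@prefix wdt: <http://www.wikidata.org/prop/direct/> .

uo:rev703872813 a uo:revision ; uo:timestamp "2018-06-28T22:31:32Z" .

uo:op703872813_0 a uo:operation ; uo:fromRevision uo:rev703872813 ;
    uo:newObject wd:Q82955 ; uo:opType uo:add ;
    uo:revProp wdt:P106 ; uo:revSubject wd:Q6097419 .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# List every operation with its type, revised property, and revised subject
for op in g.subjects(RDF.type, UO.operation):
    print(op, g.value(op, UO.opType), g.value(op, UO.revProp), g.value(op, UO.revSubject))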
Introduction
The data set is based on 3,004 images collected by the Pancam instruments mounted on the Opportunity and Spirit rovers from NASA's Mars Exploration Rovers (MER) mission. We used rotation, skewing, and shearing augmentation methods to increase the total collection to 70,864 images (see Image Augmentation section for more information). Based on the MER Data Catalog User Survey [1], we identified 25 classes of both scientific (e.g., soil trench, float rocks, etc.) and engineering (e.g., rover deck, Pancam calibration target, etc.) interest (see Classes section for more information). The 3,004 images were labeled on the Zooniverse platform, and each image may be assigned multiple labels. The images are either 512 x 512 or 1024 x 1024 pixels in size (see Image Sampling section for more information).
Classes
There is a total of 25 classes in this data set. The list below gives class names, counts, and percentages (the percentages are computed as count divided by 3,004). Note that the total counts don't sum to 3,004 and the percentages don't sum to 1.0 because each image may be assigned more than one class.
Class name, count, percentage of dataset
Rover Deck, 222, 7.39%
Pancam Calibration Target, 14, 0.47%
Arm Hardware, 4, 0.13%
Other Hardware, 116, 3.86%
Rover Tracks, 301, 10.02%
Soil Trench, 34, 1.13%
RAT Brushed Target, 17, 0.57%
RAT Hole, 30, 1.00%
Rock Outcrop, 1915, 63.75%
Float Rocks, 860, 28.63%
Clasts, 1676, 55.79%
Rocks (misc), 249, 8.29%
Bright Soil, 122, 4.06%
Dunes/Ripples, 1000, 33.29%
Rock (Linear Features), 943, 31.39%
Rock (Round Features), 219, 7.29%
Soil, 2891, 96.24%
Astronomy, 12, 0.40%
Spherules, 868, 28.89%
Distant Vista, 903, 30.23%
Sky, 954, 31.76%
Close-up Rock, 23, 0.77%
Nearby Surface, 2006, 66.78%
Rover Parts, 301, 10.02%
Artifacts, 28, 0.93%
Image Sampling
Images in the MER rover Pancam archive range in size from 64x64 to 1024x1024 pixels. The largest size, 1024x1024, was by far the most common size in the archive. For the deep learning dataset, we elected to sample only 1024x1024 and 512x512 images, as the higher resolution would be beneficial to feature extraction. In order to ensure that the data set is representative of the total image archive of 4.3 million images, we elected to sample via "site code". Each Pancam image has a corresponding two-digit alphanumeric "site code" which is used to track location throughout its mission. Since each site code corresponds to a different general location, sampling a fixed proportion of images taken from each site ensures that the data set contains some images from each location. In this way, we could ensure that a model performing well on this dataset would generalize well to the unlabeled archive data as a whole. We randomly sampled 20% of the images at each site within the subset of Pancam data fitting all other image criteria, applying a floor function to non-whole-number sample sizes, resulting in a dataset of 3,004 images.
Train/validation/test split
The 3,004 images were split into train, validation, and test data sets. The split was done so that roughly 60, 15, and 25 percent of the 3,004 images would end up in the train, validation, and test data sets respectively, while ensuring that images from a given site are not split between the train/validation/test data sets. This resulted in 1,806 train images, 456 validation images, and 742 test images.
Augmentation
To augment the images in the train and validation data sets (images in the test data set were not augmented), three augmentation methods were chosen that best represent transformations that could realistically be seen in Pancam images: rotation, skew, and shear. The augmentation methods were applied with random magnitude, followed by random horizontal flipping, to create 30 augmented images for each image. Since each transformation is followed by a square crop in order to keep the input shape consistent, we had to constrain the magnitude limits of each augmentation to avoid cropping out important features at the edges of input images. Thus, rotations were limited to 15 degrees in either direction, the 3-dimensional skew was limited to 45 degrees in any direction, and shearing was limited to 10 degrees in either direction. Note that augmentation was done only on training and validation images.
Directory Contents
images: contains all 70,864 images
train-set-v1.1.0.txt: label file for the training data set
val-set-v1.1.0.txt: label file for the validation data set
test-set-v1.1.0.txt: label file for the testing data set
Images with relatively short file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg) are original images, and images with long file names (e.g., 1p128287181mrd0000p2303l2m1.img.jpg_04140167-5781-49bd-a913-6d4d0a61dab1.jpg) are augmented images. The label files are formatted as "Image name, Class1, Class2, ..., ClassN".
Reference
[1] S.B. Cole, J.C. Aubele, B.A. Cohen, S.M. Milkovich, and S.A...
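A minimal sketch (not code shipped with the dataset) of reading one of the label files in the "Image name, Class1, Class2, ..., ClassN" format described above into a dictionary mapping each image to its class labels:

def read_label_file(path):
    labels = {}
    with open(path) as f:
        for line in f:
            parts = [p.strip() for p in line.strip().split(",")]
            if not parts or not parts[0]:
                continue
            image_name, classes = parts[0], parts[1:]
            labels[image_name] = classes
    return labels

train_labels = read_label_file("train-set-v1.1.0.txt")
print(len(train_labels), "labelled training images")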
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
.db versions of the train/test/val splits as described by https://gitlab.com/matschreiner/Transition1x. For the Fall 2023 CS224W Final Project.
Dataset Card for ncRPI
Summary
The ncRPI dataset is part of the LUCAONE downstream tasks collection for biomolecular interaction prediction. It is structured for binary classification and includes standard splits for training (train.csv), validation (dev.csv → val), and test (test.csv).
Dataset Structure
This dataset includes three splits:
- train
- val (converted from dev.csv)
- test
Each split is in CSV format.
Task
Binary classification of… See the full description on the dataset page: https://huggingface.co/datasets/vladak/ncRPI.
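A minimal sketch of loading the three CSV splits with the datasets library; the file names come from this card, while the assumption that they sit in the working directory (and their column layout) is not specified here:

from datasets import load_dataset

ds = load_dataset(
    "csv",
    data_files={"train": "train.csv", "val": "dev.csv", "test": "test.csv"},
)
print(ds)  # DatasetDict with train / val / test splits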
The Balanced Image-FER Dataset is a uniformly processed, class-balanced, and augmented version of the original FER2013 Emotion Dataset. This dataset is tailored for deep learning and machine learning applications in Facial Emotion Recognition (FER). It addresses class imbalance and standardizes input dimensions to boost model performance and ensure fair evaluation across classes.
🎯 Purpose The goal of this dataset is to balance the representation of seven basic emotions, enabling the training of fairer and more robust FER models. Each emotion class contains an equal number of images, facilitating consistent model learning and evaluation across all classes.
🧾 Dataset Characteristics Source: Based on the Image-FER Dataset
Image Format: RGB .png
Image Size: 75 × 75 pixels
Emotion Classes:
anger, contempt, disgust, fear, happiness, neutral, sadness, surprise
⚙️ Preprocessing Pipeline Each image in the dataset has been preprocessed using the following steps:
✅ Converted to RGB
✅ Resized to 75×75 pixels
✅ Augmented using:
Random rotation
Horizontal flip
Brightness adjustment
Contrast enhancement
Sharpness modification
This results in a clean, uniform, and diverse dataset ideal for FER tasks.
Testing (10%): 687 images
Training (80% of remainder): 4,943 images
Validation (20% of remainder): 1,236 images
✅ Advantages ⚖️ Balanced Classes: Equal images across all seven emotions
🧠 Model-Friendly: Grayscale, resized format reduces preprocessing overhead
🚀 Augmented: Improves model generalization and robustness
📦 Split Ready: Train/Val/Test folders structured per class
📊 Great for Benchmarking: Ideal for training CNNs, Transformers, and ensemble models for FER
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview: This dataset contains 3000+ images generated from an OOC (organ-on-a-chip) setup with different cell types. The images were generated by an automated brightfield microscopy setup; for each image, parameters such as cell type, time after seeding, and class label ('good' or 'bad' sample quality as assessed by a biology expert) are provided. Furthermore, for some images, seeding density and flow rate are given as well. The dataset can be used for training machine learning classifiers for the automated analysis of data generated with an OOC setup, allowing users to create more reliable tissue models and to automate decision-making processes for growing OOC.
The dataset comprises images of OOC samples from the following cell lines:
Structure of the dataset: The dataset is split into three main folders that correspond to the data split for training machine learning models, i.e., 'train', 'val', and 'test'. The train/val/test split is done proportionally with respect to the class labels, cell lines, and time after seeding (see below), yet the data can be split or merged in other ways to suit the needs of prospective users of the dataset. Within each of the main folders, there are a 'bad' and a 'good' folder with the images corresponding to the respective class labels (see 'Overview' above). The images in the 'bad' / 'good' folders are further subdivided into folders corresponding to the respective cell lines, which are in turn subdivided into folders corresponding to the different times after seeding. Therefore, it is easy to find images of interest, e.g., '4+ days' 'good' images of the cell line A549 from the 'train' dataset. Further information about the images is available in the file 'OOC_datasheet.xlsx'.
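A minimal sketch (the root folder name 'OOC_dataset' is an assumption) of collecting image paths together with their class label, cell line, and time-after-seeding folder names from the layout described above:

from pathlib import Path

def collect_split(root, split="train"):
    base = Path(root) / split
    samples = []
    for img in base.rglob("*"):
        if not img.is_file():
            continue
        parts = img.relative_to(base).parts
        if len(parts) < 4:  # expect <good|bad>/<cell line>/<time after seeding>/<image file>
            continue
        label, cell_line, time_after_seeding = parts[0], parts[1], parts[2]
        samples.append({
            "path": str(img),
            "label": label,
            "cell_line": cell_line,
            "time_after_seeding": time_after_seeding,
        })
    return samples

train_samples = collect_split("OOC_dataset", split="train")
print(len(train_samples), "training images")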
Acknowledgement: The work presented in this paper was supported by the project 'AI-improved organ on chip cultivation for personalised medicine (AimOOC)' (contract with Central Finance and Contracting Agency of Republic of Latvia no. 1.1.1.1/21/A/079; the project is co-financed by REACT-EU funding for mitigating the consequences of the pandemic crisis).
The data is based on randomly selected viral and bacterial genomes and the human (GRCh38.p13) reference genome, which were downloaded from GenBank. From each original nucleic acid sequence we created multiple patches of length 300 in all possible reading frames, using a sliding window on the initial sequence and its reverse complement. For the train and val files, the resulting patches are translated to amino acid sequences of length 100, whereas the DNA_test file contains the nucleic acid sequence patches of length 300. The data is stored in the FASTA format according to the following convention:
>{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}
sequence
with
ID: the RefSeq accession of the original sequence in the RefSeq dataset.
sequence: either a nucleic acid sequence patch of length 300 (DNA_test) or an amino acid sequence of length 100 (train, val).
patch index: the starting triplet of the given patch within the original sequence or reverse complemented sequence (i.e. 3*patch_index is the starting index of frame 0 in the original sequence).
class marker: indicates the taxonomic domain (0 - virus, 1 - bacteria, 2 - human/mammal).
frame index: indicates the reading frame (0 - on-frame, 1 - shifted by one, 2 - shifted by two, 3 - reverse complemented, 4 - shifted by one and reverse complemented, 5 - shifted by two and reverse complemented).
The data is split into test, training and validation sets, which contain the following number of patches per frame:
train: 1,700,944
test: 212,618
val: 212,618
The authors acknowledge the financial support by the Federal Ministry of Education and Research of Germany (BMBF) in the project deep.Health (project number 13FH770IX6).
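A minimal sketch of parsing a FASTA header in the convention above; the accession used in the example is hypothetical:

CLASS_NAMES = {0: "virus", 1: "bacteria", 2: "human/mammal"}

def parse_header(header):
    # ">{ID}_subsequence{patch index}_frame{frame index}|{class marker}|{frame index}"
    head, class_marker, frame_index = header.lstrip(">").split("|")
    seq_id, rest = head.split("_subsequence")
    patch_index, _frame_in_name = rest.split("_frame")  # the frame index also appears after the last '|'
    return {
        "id": seq_id,
        "patch_index": int(patch_index),
        "frame_index": int(frame_index),
        "class": CLASS_NAMES[int(class_marker)],
    }

print(parse_header(">NC_000001_subsequence12_frame3|2|3"))  # hypothetical accession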