100+ datasets found
  1. imagenet2012

    • tensorflow.org
    Updated Jun 1, 2024
    Cite
    (2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file, which must then be uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to make up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file contains 100,000 lines, one for each image in the test split. Each line of integers corresponds to the rank-ordered, top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line numbers in the associated labels file; see labels.txt.
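
    For illustration, a minimal sketch of writing such a file might look like the following (the probability array is a placeholder; the only requirement taken from the description above is one line of five 1-indexed, rank-ordered class indices per test image):

    import numpy as np

    # probs: hypothetical array of shape (100000, 1000) with one row of class
    # probabilities per test image, in the evaluation server's image order.
    probs = np.random.rand(100000, 1000)

    # Rank-ordered top-5 predictions, converted from 0-based to 1-based indices
    # to match the line numbers in labels.txt.
    top5 = np.argsort(-probs, axis=1)[:, :5] + 1

    with open("submission_top5.txt", "w") as f:
        for row in top5:
            f.write(" ".join(str(int(c)) for c in row) + "\n")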

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png

  2. Data from: Split Phase Inverter Data

    • data.openei.org
    • gimi9.com
    • +3more
    data
    Updated Mar 23, 2023
    Cite
    Prabakar; Ganguly; Velaga; Vaidhynathan (2023). Split Phase Inverter Data [Dataset]. https://data.openei.org/submissions/8264
    Available download formats: data
    Dataset provided by
    USDOE Office of Energy Efficiency and Renewable Energy (EERE), Multiple Programs (EE)
    Open Energy Data Initiative (OEDI)
    National Renewable Energy Laboratory
    Authors
    Prabakar; Ganguly; Velaga; Vaidhynathan
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    The increase in power-electronics-based generation sources requires accurate modeling of inverters, and accurate modeling requires experimental data over a wide operating range. We used an 8.35 kW off-the-shelf grid-following split-phase PV inverter in the experiments, with a controllable AC supply and a controllable DC supply to emulate the AC- and DC-side characteristics. The experiments were performed at NREL's Energy Systems Integration Facility. The inverter was tested under 100%, 75%, 50%, and 25% load conditions. In the first dataset, for each operating condition, the controllable AC source voltage is varied from 0.9 to 1.1 per unit (p.u.) with a step of 0.025 p.u. while keeping the frequency at 60 Hz. In the second dataset, under the same load conditions (100%, 75%, 50%, 25%), the frequency of the controllable AC source voltage was varied from 59 Hz to 61 Hz with a step of 0.2 Hz. The voltage and frequency ranges were chosen based on the inverter protection limits. Voltages and currents on the DC and AC sides are included in the dataset.
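
    For reference, a small sketch of the operating-point grid implied by this description (the point counts are derived from the stated ranges and step sizes, not read from the dataset files):

    import numpy as np

    loads = [1.00, 0.75, 0.50, 0.25]                  # load conditions (fraction of the 8.35 kW rating)
    voltages = np.arange(0.9, 1.1 + 1e-9, 0.025)      # first dataset: 9 voltage points at 60 Hz, in p.u.
    frequencies = np.arange(59.0, 61.0 + 1e-9, 0.2)   # second dataset: 11 frequency points

    voltage_sweep = [(load, round(v, 3)) for load in loads for v in voltages]
    frequency_sweep = [(load, round(f, 1)) for load in loads for f in frequencies]
    print(len(voltage_sweep), len(frequency_sweep))   # 36 and 44 operating points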

  3. Caltech-101

    • huggingface.co
    Updated Nov 14, 2024
    Cite
    Dong-Hyun Han (2024). Caltech-101 [Dataset]. https://huggingface.co/datasets/Donghyun99/Caltech-101
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Authors
    Dong-Hyun Han
    Description

    Dataset Card for "Caltech-101"

    This is a non-official Caltech-101 dataset for fine-grained image classification. Since there is no official method for separating training and test data, we arbitrarily split the data in a similar way to TensorFlow. If you want to download the official dataset, please refer to the original source.
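
    A minimal sketch of loading this dataset with the Hugging Face datasets library (the repository id comes from the citation above; the split names are assumed to be the usual train/test pair):

    from datasets import load_dataset  # Hugging Face `datasets` library

    # Repository id taken from the citation above.
    ds = load_dataset("Donghyun99/Caltech-101")
    print(ds)
    print(ds["train"][0])  # e.g. {"image": <PIL.Image>, "label": <int>} (field names assumed)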

  4. Simulated datasets for detector and particle flow reconstruction: CLIC...

    • data.niaid.nih.gov
    • nde-dev.biothings.io
    • +1more
    Updated Mar 21, 2025
    Cite
    Pata, Joosep; Mokhtar, Farouk; Zhang, Mengke; Wulff, Eric; Garcia, Dolores; Kagan, Michael; Duarte, Javier (2025). Simulated datasets for detector and particle flow reconstruction: CLIC detector, machine learning format [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8409591
    Dataset provided by
    CERN
    UCSD
    SLAC National Accelerator Laboratory
    KBFI
    Authors
    Pata, Joosep; Mokhtar, Farouk; Zhang, Mengke; Wulff, Eric; Garcia, Dolores; Kagan, Michael; Duarte, Javier
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Synopsis

    Machine-learning friendly format of tracks, clusters and target particles in electron-positron events, simulated with the CLIC detector. Ready to be used with jpata/particleflow:v2.3.0. Derived from the EDM4HEP ROOT files in https://zenodo.org/record/8260741.

    clic_edm_ttbar_pf.zip: e+e- -> ttbar, center of mass energy at 380 GeV

    clic_edm_qq_pf.zip: e+e- -> Z* -> qqbar, center of mass energy at 380 GeV

    clic_edm_ww_fullhad_pf.zip: e+e- -> WW -> W decaying hadronically, center of mass energy at 380 GeV

    clic-tfds.ipynb: an example notebook on how to load the files

    Contents

    Each .zip file contains the dataset in the tensorflow-datasets array_record format. We have split each full dataset into 10 subsets; due to space considerations on Zenodo, two subsets from each dataset are uploaded here. Each dataset contains a train and test split of events.

    Dataset semantics (to be updated)

    Each dataset consists of events that can be iterated over using the tensorflow-datasets library and used in either tensorflow or pytorch. Each event has the following information available:

    X: the reconstruction input features, i.e. tracks and clusters

    ytarget: the ground truth particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    ycand: the baseline Pandora PF particles with the features ["PDG", "charge", "pt", "eta", "sin_phi", "cos_phi", "energy", "jet_idx"], with "jet_idx" corresponding to the gen-jet assignment of this particle

    The full semantics, including the list of features for X, are available at https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/heptfds/clic_pf_edm4hep/utils_edm.py and https://github.com/jpata/particleflow/blob/v2.3.0/mlpf/data/key4hep/postprocessing.py.
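
    A hedged sketch of reading one of the unzipped datasets with tensorflow-datasets (the directory path and the exact per-event feature keys are assumptions based on the description above; the bundled clic-tfds.ipynb notebook shows the intended usage):

    import tensorflow_datasets as tfds

    # Path is assumed to point at the versioned dataset directory extracted from
    # one of the .zip files (the one containing features.json).
    builder = tfds.builder_from_directory("path/to/clic_edm_ttbar_pf/<version_dir>")

    # array_record datasets can be read as a random-access data source.
    ds = builder.as_data_source(split="train")
    event = ds[0]
    # Feature keys X / ytarget / ycand are taken from the description above.
    print(event["X"].shape, event["ytarget"].shape, event["ycand"].shape)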

  5. Caltech-256: Pre-Processed 80/20 Train-Test Split

    • kaggle.com
    zip
    Updated Nov 12, 2025
    Cite
    KUSHAGRA MATHUR (2025). Caltech-256: Pre-Processed 80/20 Train-Test Split [Dataset]. https://www.kaggle.com/datasets/kushubhai/caltech-256-train-test
    Available download formats: zip (1,138,799,273 bytes)
    Authors
    KUSHAGRA MATHUR
    License

    MIT License: https://opensource.org/licenses/MIT
    License information was derived automatically

    Description

    Context The Caltech-256 dataset is a foundational benchmark for object recognition, containing 30,607 images across 257 categories (256 object categories + 1 clutter category).

    The original dataset is typically provided as a collection of directories, one for each category. This version streamlines the machine learning workflow by providing:

    A clean, pre-defined 80/20 train-test split.

    Manifest files (train.csv, test.csv) that map image paths directly to their labels, allowing for easy use with data generators in frameworks like PyTorch and TensorFlow.

    A flat directory structure (train/, test/) for simplified file access.

    File Content The dataset is organized into a single top-level folder and two CSV files:

    train.csv: A CSV file containing two columns: image_path and label. This file lists all images designated for the training set.

    test.csv: A CSV file with the same structure as train.csv, listing all images designated for the testing set.

    Caltech-256_Train_Test/: The primary data folder.

    train/: This directory contains 80% of the images from all 257 categories, intended for model training.

    test/: This directory contains the remaining 20% of the images from all categories, reserved for model evaluation.

    Data Split The dataset has been partitioned into a standard 80% training and 20% testing split. This split is (or should be assumed to be) stratified, meaning that each of the 257 object categories is represented in roughly an 80/20 proportion in the respective sets.
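
    As an illustration, a minimal TensorFlow input pipeline built from the manifest files might look like this (whether image_path is relative to the extraction root, and whether label holds category names, are assumptions):

    import pandas as pd
    import tensorflow as tf

    # train.csv columns (image_path, label) come from the file description above.
    train_df = pd.read_csv("train.csv")
    labels, class_names = pd.factorize(train_df["label"])  # map category names to integer ids

    def load_example(path, label):
        img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        img = tf.image.resize(img, (224, 224)) / 255.0
        return img, label

    train_ds = (tf.data.Dataset.from_tensor_slices((train_df["image_path"].values, labels))
                .map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
                .shuffle(1024)
                .batch(32))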

    Acknowledgements & Original Source This dataset is a derivative work created for convenience. The original data and images belong to the authors of the Caltech-256 dataset.

    Original Dataset Link: https://www.kaggle.com/datasets/jessicali9530/caltech256/data

    Citation: Griffin, G., Holub, A.D., Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

  6. Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A...

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 4, 2024
    Cite
    Gong Chengjuan; Yin Ranyu; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou (2024). Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A Spatial-Temporal Approach and Dataset" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8419699
    Dataset provided by
    Aerospace Information Research Institute, Chinese Academy of Sciences
    Authors
    Gong Chengjuan; Yin Ranyu; Long Tengfei; He Guojin; Jiao Weili; Wang Guizhou
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is built for time-series Sentinel-2 cloud detection and stored in Tensorflow TFRecord (refer to https://www.tensorflow.org/tutorials/load_data/tfrecord).

    Each file is compressed in 7z format and can be decompressed using Bandzip or 7-zip software.

    Dataset Structure:

    Each filename can be split into three parts using underscores. The first part indicates whether it is designated for training or validation ('train' or 'val'); the second part indicates the Sentinel-2 tile name, and the last part indicates the number of samples in this file.

    For each sample, it includes:

    Sample ID;

    Array of time-series 4-band image patches in 10 m resolution, shaped as (n_timestamps, 4, 42, 42);

    Label list indicating cloud cover status for the center 6×6 pixels of each timestamp;

    Ordinal list for each timestamp;

    Sample weight list (reserved);

    Here is a demonstration function for parsing the TFRecord file:

    import tensorflow as tf

    # Initialize a TensorFlow Dataset from a file name
    def parseRecordDirect(fname):
        sep = '/'
        parts = tf.strings.split(fname, sep)
        tn = tf.strings.split(parts[-1], sep='_')[-2]
        nn = tf.strings.to_number(tf.strings.split(parts[-1], sep='_')[-1], tf.dtypes.int64)
        t = tf.data.Dataset.from_tensors(tn).repeat().take(nn)
        t1 = tf.data.TFRecordDataset(fname)
        ds = tf.data.Dataset.zip((t, t1))
        return ds

    keys_to_features_direct = {
        'localid': tf.io.FixedLenFeature([], tf.int64, -1),
        'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''),
        'labels': tf.io.FixedLenFeature((), tf.string, ''),
        'dates': tf.io.FixedLenFeature((), tf.string, ''),
        'weights': tf.io.FixedLenFeature((), tf.string, '')
    }

    The Decoder (Optional)

    # `decoder` is assumed to be the tensorflow_datasets decode module that provides
    # the Decoder base class (the import is not shown in the original).
    class SeriesClassificationDirectDecorder(decoder.Decoder):
        """A tf.Example decoder for tfds classification datasets."""

        def __init__(self) -> None:
            super().__init__()

        def decode(self, tid, ds):
            parsed = tf.io.parse_single_example(ds, keys_to_features_direct)
            encoded = parsed['image_raw_ldseries']
            labels_encoded = parsed['labels']
            decoded = tf.io.decode_raw(encoded, tf.uint16)
            label = tf.io.decode_raw(labels_encoded, tf.int8)
            dates = tf.io.decode_raw(parsed['dates'], tf.int64)
            weight = tf.io.decode_raw(parsed['weights'], tf.float32)
            decoded = tf.reshape(decoded, [-1, 4, 42, 42])
            sample_dict = {
                'tid': tid,                    # tile ID
                'dates': dates,                # date list
                'localid': parsed['localid'],  # sample ID
                'imgs': decoded,               # image array
                'labels': label,               # label list
                'weights': weight
            }
            return sample_dict

    A simple parsing function

    def preprocessDirect(tid, record):
        parsed = tf.io.parse_single_example(record, keys_to_features_direct)
        encoded = parsed['image_raw_ldseries']
        labels_encoded = parsed['labels']
        decoded = tf.io.decode_raw(encoded, tf.uint16)
        label = tf.io.decode_raw(labels_encoded, tf.int8)
        dates = tf.io.decode_raw(parsed['dates'], tf.int64)
        weight = tf.io.decode_raw(parsed['weights'], tf.float32)
        decoded = tf.reshape(decoded, [-1, 4, 42, 42])
        return tid, dates, parsed['localid'], decoded, label, weight

    t1 = parseRecordDirect('filename here')
    dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE)


    Class Definition:

    0: clear

    1: opaque cloud

    2: thin cloud

    3: haze

    4: cloud shadow

    5: snow

    Dataset Construction:

    First, we randomly generate 500 points for each tile, and all these points are aligned to the pixel-grid centers of the sub-datasets in 60 m resolution (e.g. B10) for consistency when comparing with other products. This is because other cloud detection methods may use the cirrus band, which is in 60 m resolution, as a feature.

    Then, time-series image patches of two shapes are cropped with each point as the center. The patches of shape 42 × 42 are cropped from the bands in 10 m resolution (B2, B3, B4, B8) and are used to construct this dataset, while the patches of shape 348 × 348 are cropped from the True Colour Image (TCI; see the Sentinel-2 user guide for details) and are used to interpret the class labels.

    Samples with a large number of timestamps can be time-consuming in the I/O stage, so the time-series patches are divided into groups with at most 100 timestamps per group.

  7. Anime Subtitles

    • kaggle.com
    zip
    Updated Aug 19, 2021
    Cite
    Jess Fan (2021). Anime Subtitles [Dataset]. https://www.kaggle.com/datasets/jef1056/anime-subtitles/code
    Available download formats: zip (103,874,640 bytes)
    Authors
    Jess Fan
    Description

    Content

    The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but kaggle decompressed >:U)

    This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.

    This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with tensorflow datasets. These can be streamed as a TextLineDataset in TSV format.

    V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences more natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to complete cleaning this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)

    Format

    The files are now all compressed to save space and are compatible with tensorflow datasets. You can initialize a dataset function as such:

    import os
    import functools
    import tensorflow as tf

    # nq_tsv_path is assumed to be a dict mapping split names to directories of the
    # compressed TSV files (it is defined elsewhere in the original notebook).
    def dataset_fn_local(split, shuffle_files=False):
        global nq_tsv_path
        del shuffle_files
        # Load lines from the text files as examples.
        files_to_read = [os.path.join(nq_tsv_path[split], filename)
                         for filename in os.listdir(nq_tsv_path[split])
                         if filename.startswith(split)]
        print(f"Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}")
        ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP")
        ds = ds.filter(lambda line: tf.not_equal(tf.strings.length(line), 0))
        ds = ds.shuffle(buffer_size=600000)
        ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                      field_delim="\t", use_quote_delim=False),
                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
        ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
        return ds

    Acknowledgements

    A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.

    This dataset is far from complete! I hope that people who are willing to find, add and clean the data are out there, and could do their best to try and help out in the effort to grow this data

  8. The Expanded Groove MIDI Dataset (E-GMD)

    • kaggle.com
    zip
    Updated Dec 13, 2023
    Cite
    Alex Ignatov (2023). The Expanded Groove MIDI Dataset (E-GMD) [Dataset]. https://www.kaggle.com/datasets/alexignatov/the-expanded-groove-midi-dataset
    Available download formats: zip (107,045,765 bytes)
    Authors
    Alex Ignatov
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    ⚠️ Note! This is the MIDI-only archive. If you need the WAV alternatives for your work, please download the full dataset from their website: https://magenta.tensorflow.org/datasets/e-gmd

    Cited from the original website:

    Overview

    The Expanded Groove MIDI Dataset (E-GMD) is a large dataset of human drum performances, with audio recordings annotated in MIDI. E-GMD contains 444 hours of audio from 43 drum kits and is an order of magnitude larger than similar datasets. It is also the first human-performed drum transcription dataset with annotations of velocity. It is based on our previously released Groove MIDI Dataset.

    Dataset

    This dataset is an expansion of the Groove MIDI Dataset (GMD). GMD is a dataset of human drum performances recorded in MIDI format on a Roland TD-11 electronic drum kit. To make the dataset applicable to automatic drum transcription (ADT), we expanded it by re-recording the GMD sequences on 43 drum kits using a Roland TD-17. The kits range from electronic (e.g., 808, 909) to acoustic sounds. Recording was done at 44.1 kHz and 24 bits and aligned within 2 ms of the original MIDI files.

    We maintained the same train, test and validation splits across sequences that GMD had. Because each kit was recorded for every sequence, all 43 kits appear in the train, test and validation splits.

    Split      | Unique Sequences | Total Sequences | Duration (hours)
    Train      | 819              | 35,217          | 341.4
    Test       | 123              | 5,289           | 50.9
    Validation | 117              | 5,031           | 52.2
    Total      | 1,059            | 45,537          | 444.5

    Given the semi-manual nature of the pipeline, there were some errors in the recording process that resulted in unusable tracks. If your application requires only symbolic drum data, we recommend using the original data from the Groove MIDI Dataset.

    For more information about how the dataset was created and several applications of it, please see the paper where it was introduced: Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset.

    Lee Callender, Curtis Hawthorne, and Jesse Engel. "Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset." 2020. arXiv:2004.00188.

    For citations, please use:

    @misc{callender2020improving,
      title={Improving Perceptual Quality of Drum Transcription with the Expanded Groove MIDI Dataset},
      author={Lee Callender and Curtis Hawthorne and Jesse Engel},
      year={2020},
      eprint={2004.00188},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
    }

    I have no contribution to or affiliation with this work; I just uploaded it and made it available on Kaggle.

  9. reef-cv-strategy-subsequences-dataframes

    • kaggle.com
    zip
    Updated Nov 23, 2021
    Cite
    Julián Peller (dataista0) (2021). reef-cv-strategy-subsequences-dataframes [Dataset]. https://www.kaggle.com/julian3833/reef-cv-strategy-subsequences-dataframes
    Available download formats: zip (2,151,552 bytes)
    Authors
    Julián Peller (dataista0)
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    See this discussion for a high level overview of these dataframes:

    About "the CV problem" - an approach: subsequences

    See this notebook for details about the origin of the data:

    🐠 Reef - CV strategy: subsequences!

    Let's see an example. Consider the sequence A with the following frames:

    * 1-20 - No annotations present
    * 21-30 - Annotations present
    * 31-60 - No annotations
    * 61-80 - Annotations present

    In this case, we say that the sequence A has 4 subsequences (1-20, 21-30, 31-60, 61-80).

    A subsequence seems to me like the minimal atom for ensuring no leaks happen between train and test.
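
    As a rough illustration of this idea (the column names below are assumptions, not the actual schema of these dataframes), consecutive frames can be grouped into subsequences wherever the annotation flag changes value:

    import pandas as pd

    # Hypothetical per-frame table for one sequence: ordered frames plus a flag
    # that is True when the frame has at least one annotation.
    frames = pd.DataFrame({
        "frame": range(1, 81),
        "has_annotations": [False]*20 + [True]*10 + [False]*30 + [True]*20,
    })

    # A new subsequence starts every time has_annotations changes value.
    frames["subsequence_id"] = (
        frames["has_annotations"].ne(frames["has_annotations"].shift()).cumsum()
    )
    print(frames.groupby("subsequence_id")["frame"].agg(["min", "max"]))
    # -> four subsequences: 1-20, 21-30, 31-60, 61-80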

    For the competition: Tensorflow - Help Protect the Great Barrier Reef

  10. wikipedia

    • tensorflow.org
    • huggingface.co
    Cite
    wikipedia [Dataset]. https://www.tensorflow.org/datasets/catalog/wikipedia
    Description

    Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wikipedia', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  11. Data for: Advances and critical assessment of machine learning techniques...

    • zenodo.org
    • dataone.org
    • +2more
    bin, csv
    Updated Sep 5, 2023
    Cite
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč (2023). Data for: Advances and critical assessment of machine learning techniques for prediction of docking scores [Dataset]. http://doi.org/10.5061/dryad.zgmsbccg7
    Available download formats: bin, csv
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Lukas Bucinsky; Marián Gall; Ján Matúška; Michal Pitoňák; Marek Štekláč
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    Semi-flexible docking was performed using AutoDock Vina 1.2.2 software on the SARS-CoV-2 main protease Mpro (PDB ID: 6WQF).

    Two data sets are provided in the xyz format containing the AutoDock Vina docking scores. These files were used as input and/or reference in the machine learning models using TensorFlow, XGBoost, and SchNetPack to study their docking-score prediction capability. The first data set originally contained 60,411 in-vivo labeled compounds selected for the training of ML models. The second data set, denoted as in-vitro-only, originally contained 175,696 compounds active or assumed to be active at 10 μM or less in a direct binding assay. These sets were downloaded on the 10th of December 2021 from the ZINC15 database. Four compounds in the in-vivo set and 12 in the in-vitro-only set were left out of consideration due to the presence of Si atoms. Compounds with no charges assigned in mol2 files were excluded as well (523 compounds in the in-vivo and 1,666 in the in-vitro-only set). Gasteiger charges were reassigned to the remaining compounds using OpenBabel. In addition, four in-vitro-only compounds with docking scores greater than 1 kcal/mol were rejected.

    The provided in-vivo and the in-vitro-only sets contain 59,884 (in-vivo.xyz) and 174,014 (in-vitro-only.xyz) compounds, respectively. Compounds in both sets contain the following elements: H, C, N, O, F, P, S, Cl, Br, and I. The in-vivo compound set was used as the primary data set for the training of the ML models in the referencing study.
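
    As a hedged sketch, one way to iterate over one of the provided .xyz files is with the Atomic Simulation Environment (ase); where exactly the Vina docking score is stored in each frame is not specified above, so the info lookup below is an assumption:

    from ase.io import iread

    # Iterate over the ~59,884 structures in the in-vivo set without loading
    # everything into memory at once.
    for i, atoms in enumerate(iread("in-vivo.xyz", index=":")):
        symbols = atoms.get_chemical_symbols()  # H, C, N, O, F, P, S, Cl, Br, I
        metadata = atoms.info                   # per-frame metadata parsed from the comment line (assumed to hold the docking score)
        if i < 3:
            print(len(symbols), metadata)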

    The file in-vivo-splits-data.csv contains the exact composition of all (random) 80-5-15 train-validation-test splits used in the study, labeled I, II, III, IV, and V. Eight additional random subsets in each of the in-vivo 80-5-15 splits were created to monitor the training process convergence. These subsets were constructed in such a manner that each subset contains all compounds from the previous subset (starting with the 10-5-15 subset) and is enlarged by one eighth of the entire (80-5-15) train set of a given split. These subsets are further referred to as in_vivo_10_(I, II, ..., V), in_vivo_20_(I, II, ..., V), ..., in_vivo_80_(I, II, ..., V).

  12. 200000 Medical Research Paper Abstracts

    • kaggle.com
    zip
    Updated Jan 14, 2022
    Cite
    Anshul Mehta (2022). 200000 Medical Research Paper Abstracts [Dataset]. https://www.kaggle.com/datasets/anshulmehtakaggl/200000-abstracts-for-seq-sentence-classification/code
    Available download formats: zip (251,885,526 bytes)
    Authors
    Anshul Mehta
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Context

    I was working with this dataset as a part of a Project for a Tensorflow course that I was taking. It seemed to be a very interesting problem. You can check the course here.

    Content

    In the training set, each sentence of an abstract is labeled as Objectives, Methods, Results, Conclusions, etc. The aim is for the model to segment the test data, or indeed any other abstract, in the same way, making complicated abstracts much easier to read.

    Acknowledgements

    Inspiration of the Project: https://arxiv.org/abs/1710.06071 Data Belongs to: https://github.com/Franck-Dernoncourt/pubmed-rct

  13. wit_kaggle

    • tensorflow.org
    Updated Dec 22, 2022
    Cite
    (2022). wit_kaggle [Dataset]. https://www.tensorflow.org/datasets/catalog/wit_kaggle
    Description

    Wikipedia - Image/Caption Matching Kaggle Competition.

    This competition is organized by the Research team at the Wikimedia Foundation in collaboration with Google Research and a few external collaborators. This competition is based on the WIT dataset published by Google Research as detailed in this SIGIR paper.

    In this competition, you’ll build a model that automatically retrieves the text closest to an image. Specifically, you'll train your model to associate given images with article titles or complex captions, in multiple languages. The best models will account for the semantic granularity of Wikipedia images. If successful, you'll be contributing to the accessibility of the largest online encyclopedia. The millions of Wikipedia readers and editors will be able to more easily understand, search, and describe media at scale. As a result, you’ll contribute to an open model to improve learning for all.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('wit_kaggle', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/wit_kaggle-train_with_extended_features-1.0.2.png

  14. Graph topological features extracted from expression profiles of...

    • data-staging.niaid.nih.gov
    Updated Jan 24, 2020
    Cite
    Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C (2020). Graph topological features extracted from expression profiles of neuroblastoma patients [Dataset]. https://data-staging.niaid.nih.gov/resources?id=zenodo_3357673
    Dataset provided by
    Luxembourg Institute of Health
    Nanyang Technological University
    Authors
    Tranchevent, Léon-Charles; Azuaje, Francisco; Rajapakse, Jagath C
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Introduction

    This dataset contains the data described in the paper titled "A deep neural network approach to predicting clinical outcomes of neuroblastoma patients." by Tranchevent, Azuaje and Rajapakse. More precisely, this dataset contains the topological features extracted from graphs built from publicly available expression data (see details below). This dataset does not contain the original expression data, which are available elsewhere. We thank the scientists who did generate and share these data (please see below the relevant links and publications).

    Content

    File names start with the name of the publicly available dataset they are built on (among "Fischer", "Maris" and "Versteeg"). This name is followed by a tag representing whether they contain raw data ("raw", which means, in this case, the raw topological features) or TF formatted data ("TF", which stands for TensorFlow). This tag is then followed by a unique identifier representing a unique configuration. The configuration file "Global_configuration.tsv" contains details about these configurations such as which topological features are present and which clinical outcome is considered.

    The code associated with the same manuscript, which uses these data, is at https://gitlab.com/biomodlih/SingalunDeep. The procedure by which the raw data are transformed into the TensorFlow-ready data is described in the paper.

    File format

    All files are TSV files that correspond to matrices with samples as rows and features as columns (or clinical data as columns for clinical data files). The data files contain various sets of topological features that were extracted from the sample graphs (or Patient Similarity Networks - PSN). The clinical files contain relevant clinical outcomes.

    The raw data files only contain the topological data. For instance, the file "Fischer_raw_2d0000_data_tsv" contains 24 values for each sample corresponding to the 12 centralities computed for both the microarray (Fischer-M) and RNA-seq (Fischer-R) datasets. The TensorFlow ready files do not contain the sample identifiers in the first column. However, they contain two extra columns at the end. The first extra column is the sample weights (for the classifiers and because we very often have a dominant class). The second extra column is the class labels (binary), based on the clinical outcome of interest.
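
    A minimal sketch of reading one of the TensorFlow-ready TSV files under the column layout described above (the file name is hypothetical and the absence of a header row is an assumption):

    import pandas as pd

    # TF-ready files: topological features, then sample weight, then binary label.
    df = pd.read_csv("Fischer_TF_2d0000_data.tsv", sep="\t", header=None)

    features = df.iloc[:, :-2].to_numpy()   # topological features
    weights = df.iloc[:, -2].to_numpy()     # sample weights for the classifiers
    labels = df.iloc[:, -1].to_numpy()      # binary clinical-outcome labels
    print(features.shape, weights.shape, labels.shape)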

    Dataset details

    The Fischer dataset is used to train, evaluate and validate the models, so the dataset is split into train / eval / valid files, which contain respectively 249, 125 and 124 rows (samples) of the original 498 samples. In contrast, the other two datasets (Maris and Versteeg) are smaller and are only used for validation (and therefore have no training or evaluation file).

    The Fischer dataset also has more data files because various configurations were tested (see manuscript). In contrast, the validation using the Maris and Versteeg datasets is only done for a single configuration, and there are therefore fewer files.

    For Fischer, a few configurations are listed in the global configuration file but there is no corresponding raw data. This is because these items are derived from concatenations of the original raw data (see global configuration file and manuscript for details).

    References

    This dataset is associated with Tranchevent L., Azuaje F., Rajapakse J.C., A deep neural network approach to predicting clinical outcomes of neuroblastoma patients.

    If you use these data in your research, please do not forget to also cite the researchers who have generated the original expression datasets.

    Fischer dataset:

    Zhang W. et al., Comparison of RNA-seq and microarray-based models for clinical endpoint prediction. Genome Biology 16(1) (2015). doi:10.1186/s13059-015-0694-1

    Wang C. et al., The concordance between RNA-seq and microarray data depends on chemical treatment and transcript abundance. Nat. Biotechnol. 32(9), 926–932. doi:10.1038/nbt.3001

    Versteeg dataset:

    Molenaar J.J. et al., Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. Nature 483(7391), 589–593. doi:10.1038/nature10910

    Maris dataset:

    Wang Q. et al., Integrative genomics identifies distinct molecular classes of neuroblastoma and shows that multiple genes are targeted by regional alterations in DNA copy number. Cancer Res. 66(12), 6050–6062. doi:10.1158/0008-5472.CAN-05-4618

  15. Bone Fracture Detection: Computer Vision Project

    • kaggle.com
    zip
    Updated Feb 25, 2024
    Cite
    Hina Ismail (2024). Bone Fracture Detection: Computer Vision Project [Dataset]. https://www.kaggle.com/datasets/sonialikhan/bone-fracture-detection-computer-vision-project
    Available download formats: zip (43,644,754 bytes)
    Authors
    Hina Ismail
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Building a bone fracture detection system using computer vision involves several steps. Here's a general outline to get you started:

    1. Dataset Collection: Gather a dataset of X-ray images with labeled fractures. You can explore datasets like MURA, NIH Chest X-ray Dataset, or create your own dataset with proper ethical considerations.

    2. Data Preprocessing: Clean and preprocess the X-ray images. This may involve resizing, normalization, and data augmentation to increase the diversity of your dataset.

    3. Model Selection: Choose a suitable pre-trained deep learning model for image classification. Models like ResNet, DenseNet, or custom architectures have shown good performance in medical image analysis tasks.

    4. Transfer Learning: Fine-tune the selected model on your X-ray dataset using transfer learning. This helps leverage the knowledge gained from pre-training on a large dataset.

    5. Model Training: Split your dataset into training, validation, and test sets. Train your model on the training set and validate its performance on the validation set to fine-tune hyperparameters.

    6. Evaluation Metrics: Choose appropriate evaluation metrics such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC) to assess the model's performance.

    7. Post-processing: Implement any necessary post-processing steps, such as non-maximum suppression, to refine the model's output and reduce false positives.

    8. Deployment: Deploy the trained model as part of a computer vision application. This could be a web-based application, mobile app, or integrated into a healthcare system.

    9. Continuous Improvement: Regularly update and improve your model based on new data or advancements in the field. Monitoring its performance in real-world scenarios is crucial.

    10. Ethical Considerations: Ensure that your project follows ethical guidelines and regulations for handling medical data. Implement privacy measures and obtain necessary approvals if you are using patient data.

    Tools and Libraries: Python, TensorFlow, PyTorch, Keras for deep learning implementation. OpenCV for image processing. Flask/Django for building a web application. Docker for containerization. GitHub for version control.
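
    As a sketch of steps 3-5 (the dataset layout, backbone choice, and hyperparameters below are placeholders rather than part of this dataset's description), fine-tuning a pre-trained backbone with Keras could look like this:

    import tensorflow as tf

    # Hypothetical directory of X-ray images arranged one sub-folder per class.
    train_ds = tf.keras.utils.image_dataset_from_directory(
        "xray_dataset/train", image_size=(224, 224), batch_size=32)
    val_ds = tf.keras.utils.image_dataset_from_directory(
        "xray_dataset/val", image_size=(224, 224), batch_size=32)

    base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                          input_shape=(224, 224, 3), pooling="avg")
    base.trainable = False  # transfer learning: freeze the pre-trained backbone first

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 255),
        base,
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # fracture vs. no fracture
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])
    model.fit(train_ds, validation_data=val_ds, epochs=5)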

  16. cifar10

    • tensorflow.org
    • opendatalab.com
    • +3more
    Updated Jun 1, 2024
    Cite
    (2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cifar10', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png

  17. Pre Trained Model For Emotion Detection

    • kaggle.com
    Updated Jan 30, 2024
    Cite
    Abhishek Singh (2024). Pre Trained Model For Emotion Detection [Dataset]. http://doi.org/10.34740/kaggle/ds/4374471
    Available formats: Croissant (learn more at mlcommons.org/croissant)
    Dataset provided by
    Kaggle
    Authors
    Abhishek Singh
    License

    Public Domain Dedication (CC0 1.0): https://creativecommons.org/publicdomain/zero/1.0/

    Description

    FER2013 (Facial Expression Recognition 2013) dataset is a widely used dataset for training and evaluating facial expression recognition models. Here are key details about the FER2013 dataset:

    Overview:

    FER2013 is a dataset designed for facial expression recognition tasks, particularly the classification of facial expressions into seven different emotion categories. The dataset was introduced for the Emotion Recognition in the Wild (EmotiW) Challenge in 2013.

    Emotion Categories:

    The dataset consists of images labeled with seven emotion categories: Angry, Disgust, Fear, Happy, Sad, Surprise, and Neutral.

    Image Size:

    Each image in the FER2013 dataset is grayscale and has a resolution of 48x48 pixels.

    Number of Images:

    The dataset contains a total of 35,887 labeled images, with approximately 5,000 images per emotion category.

    Partitioning:

    FER2013 is often divided into training, validation, and test sets. The original split has 28,709 images for training, 3,589 images for validation, and 3,589 images for testing.

    Usage in Research:

    FER2013 has been widely used in research for benchmarking and training facial expression recognition models, particularly deep learning models. It provides a standard dataset for evaluating the performance of models on real-world facial expressions.

    Challenges:

    The FER2013 dataset is known for its relatively simple and posed facial expressions. In real-world scenarios, facial expressions can be more complex and spontaneous, and there are datasets addressing these challenges.

    Challenges and Criticisms:

    Some criticisms of the dataset include its relatively small size, limited diversity in facial expressions, and the fact that some expressions (e.g., "Disgust") are challenging to recognize accurately.

    This pre-trained model implements a Convolutional Neural Network (CNN) for emotion detection using the TensorFlow and Keras frameworks. The model architecture includes convolutional layers, batch normalization, and dropout for effective feature extraction and classification. The training process utilizes an ImageDataGenerator for data augmentation, enhancing the model's ability to generalize to various facial expressions.

    Key Steps:

    Model Training: The CNN model is trained on an emotion dataset using an ImageDataGenerator for dynamic data augmentation. Training is performed over a specified number of epochs with a reduced batch size for efficient learning.

    Model Checkpoint: ModelCheckpoint is employed to save the best-performing model during training, ensuring that the most accurate model is retained.

    Save Model and Memory Cleanup: The trained model is saved in both HDF5 and JSON formats. Memory is efficiently managed by deallocating resources, clearing the Keras session, and performing garbage collection.
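
    A hedged sketch of this workflow (the directory layout, architecture details, and epoch count are assumptions; only the overall pattern of augmentation, checkpointing, and HDF5/JSON export comes from the description above):

    import tensorflow as tf
    from tensorflow.keras import layers, models
    from tensorflow.keras.callbacks import ModelCheckpoint
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Dynamic data augmentation over a hypothetical fer2013/train directory
    # with one sub-folder per emotion class (48x48 grayscale images).
    datagen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=10,
                                 width_shift_range=0.1, height_shift_range=0.1,
                                 horizontal_flip=True, validation_split=0.1)
    train_gen = datagen.flow_from_directory("fer2013/train", target_size=(48, 48),
                                            color_mode="grayscale", batch_size=64,
                                            class_mode="categorical", subset="training")
    val_gen = datagen.flow_from_directory("fer2013/train", target_size=(48, 48),
                                          color_mode="grayscale", batch_size=64,
                                          class_mode="categorical", subset="validation")

    # Small CNN with batch normalization and dropout, as described above.
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", input_shape=(48, 48, 1)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(7, activation="softmax"),  # seven emotion categories
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

    # Keep only the best-performing weights seen during training.
    checkpoint = ModelCheckpoint("best_emotion_model.h5", monitor="val_accuracy",
                                 save_best_only=True)
    model.fit(train_gen, validation_data=val_gen, epochs=20, callbacks=[checkpoint])

    # Save the final model in both HDF5 and JSON (architecture only) formats.
    model.save("emotion_model.h5")
    with open("emotion_model.json", "w") as f:
        f.write(model.to_json())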

  18. speech_commands

    • tensorflow.org
    • datasets.activeloop.ai
    • +1more
    Updated Jan 13, 2023
    Cite
    (2023). speech_commands [Dataset]. http://identifiers.org/arxiv:1804.03209
    Description

    An audio dataset of spoken words designed to help train and evaluate keyword spotting systems. Its primary goal is to provide a way to build and test small models that detect when a single word is spoken, from a set of ten target words, with as few false positives as possible from background noise or unrelated speech. Note that in the train and validation sets, the label "unknown" is much more prevalent than the labels of the target words or background noise. One difference from the release version is the handling of silent segments. While in the test set the silence segments are regular 1-second files, in the training set they are provided as long segments under the "background_noise" folder. Here we split this background noise into 1-second clips and also keep one of the files for the validation set.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('speech_commands', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. rlu_atari_checkpoints_ordered

    • tensorflow.org
    Updated Dec 9, 2021
    Cite
    (2021). rlu_atari_checkpoints_ordered [Dataset]. https://www.tensorflow.org/datasets/catalog/rlu_atari_checkpoints_ordered
    Description

    RL Unplugged is a suite of benchmarks for offline reinforcement learning. RL Unplugged is designed around the following considerations: to facilitate ease of use, we provide the datasets with a unified API, which makes it easy for the practitioner to work with all data in the suite once a general pipeline has been established.

    The datasets follow the RLDS format to represent steps and episodes.

    We are releasing a large and diverse dataset of gameplay following the protocol described by Agarwal et al., 2020, which can be used to evaluate several discrete offline RL algorithms. The dataset is generated by running an online DQN agent and recording transitions from its replay during training with sticky actions Machado et al., 2018. As stated in Agarwal et al., 2020, for each game we use data from five runs with 50 million transitions each. We release datasets for 46 Atari games. For details on how the dataset was generated, please refer to the paper. Please see this note about the ROM versions used to generate the datasets.

    Atari is a standard RL benchmark. We recommend you to try offline RL methods on Atari if you are interested in comparing your approach to other state of the art offline RL methods with discrete actions.

    The reward of each step is clipped to [-1, 1], and each episode includes the sum of the clipped rewards over that episode.

    Each of the configurations is broken into splits. Splits correspond to checkpoints of 1M steps (note that the number of episodes may differ). Checkpoints are ordered in time (so checkpoint 0 ran before checkpoint 1).

    Episodes within each split are ordered. Check https://www.tensorflow.org/datasets/determinism if you want to ensure that you read episodes in order.

    This dataset corresponds to the one used in the DQN replay paper. https://research.google/tools/datasets/dqn-replay/

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('rlu_atari_checkpoints_ordered', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  20. Data from: dolma

    • tensorflow.org
    Updated Mar 14, 2025
    Cite
    (2025). dolma [Dataset]. https://www.tensorflow.org/datasets/catalog/dolma
    Description

    Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('dolma', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.
