Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
This dataset is built for time-series Sentinel-2 cloud detection and stored in TensorFlow TFRecord format (see https://www.tensorflow.org/tutorials/load_data/tfrecord).
Each file is compressed in 7z format and can be decompressed with Bandizip or 7-Zip.
Dataset Structure:
Each filename can be split into three parts using underscores. The first part indicates whether it is designated for training or validation ('train' or 'val'); the second part indicates the Sentinel-2 tile name, and the last part indicates the number of samples in this file.
Each sample includes:
Sample ID;
Array of time-series 4-band image patches at 10 m resolution, shaped (n_timestamps, 4, 42, 42);
Label list indicating the cloud-cover status of the center 6×6 pixels at each timestamp;
Ordinal (date) list for each timestamp;
Sample weight list (reserved).
Here is a demonstration function for parsing the TFRecord file:
import tensorflow as tf

def parseRecordDirect(fname):
    # Derive the tile name and the number of samples from the filename,
    # then pair each serialized record with its tile name.
    sep = '/'
    parts = tf.strings.split(fname, sep)
    tn = tf.strings.split(parts[-1], sep='_')[-2]
    nn = tf.strings.to_number(tf.strings.split(parts[-1], sep='_')[-1], tf.dtypes.int64)
    t = tf.data.Dataset.from_tensors(tn).repeat().take(nn)
    t1 = tf.data.TFRecordDataset(fname)
    ds = tf.data.Dataset.zip((t, t1))
    return ds

keys_to_features_direct = {
    'localid': tf.io.FixedLenFeature([], tf.int64, -1),
    'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''),
    'labels': tf.io.FixedLenFeature((), tf.string, ''),
    'dates': tf.io.FixedLenFeature((), tf.string, ''),
    'weights': tf.io.FixedLenFeature((), tf.string, '')
}

# `decoder` is assumed to be the tensorflow_datasets decode module (tfds.decode).
class SeriesClassificationDirectDecoder(decoder.Decoder):
    """A tf.Example decoder for tfds classification datasets."""
    def __init__(self) -> None:
        super().__init__()

    def decode(self, tid, ds):
        parsed = tf.io.parse_single_example(ds, keys_to_features_direct)
        decoded = tf.io.decode_raw(parsed['image_raw_ldseries'], tf.uint16)
        label = tf.io.decode_raw(parsed['labels'], tf.int8)
        dates = tf.io.decode_raw(parsed['dates'], tf.int64)
        weight = tf.io.decode_raw(parsed['weights'], tf.float32)
        decoded = tf.reshape(decoded, [-1, 4, 42, 42])
        sample_dict = {
            'tid': tid,                    # tile ID
            'dates': dates,                # date list
            'localid': parsed['localid'],  # sample ID
            'imgs': decoded,               # image array
            'labels': label,               # label list
            'weights': weight
        }
        return sample_dict

def preprocessDirect(tid, record):
    parsed = tf.io.parse_single_example(record, keys_to_features_direct)
    decoded = tf.io.decode_raw(parsed['image_raw_ldseries'], tf.uint16)
    label = tf.io.decode_raw(parsed['labels'], tf.int8)
    dates = tf.io.decode_raw(parsed['dates'], tf.int64)
    weight = tf.io.decode_raw(parsed['weights'], tf.float32)
    decoded = tf.reshape(decoded, [-1, 4, 42, 42])
    return tid, dates, parsed['localid'], decoded, label, weight

t1 = parseRecordDirect('filename here')
dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE)
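To sanity-check the parsed dataset, one can iterate over a single sample (a minimal sketch that only uses the objects defined above; the expected shapes follow the dataset description):

for tid, dates, localid, imgs, labels, weights in dataset.take(1):
    print(tid.numpy(), localid.numpy())
    print(imgs.shape)    # expected: (n_timestamps, 4, 42, 42)
    print(dates.shape, labels.shape, weights.shape)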
Class Definition:
0: clear
1: opaque cloud
2: thin cloud
3: haze
4: cloud shadow
5: snow
Dataset Construction:
First, we randomly generate 500 points for each tile. All points are aligned to the pixel-grid centers of the 60 m resolution subdatasets (e.g. B10), so that the dataset stays consistent when compared with other products, since other cloud detection methods may use the cirrus band, which is provided at 60 m resolution, as a feature.
Then, time-series image patches of two sizes are cropped around each point. Patches of shape 42×42 are cropped from the 10 m resolution bands (B2, B3, B4, B8) and are used to construct this dataset, while patches of shape 348×348 are cropped from the True Colour Image (TCI; see the Sentinel-2 user guide) and are used to interpret the class labels.
Samples with a large number of timestamps can be slow to read, so the time-series patches are divided into groups of at most 100 timestamps each.
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('cifar10', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png
Here is a demo of how to parse the examples and how to convert the RLEs to Masks.
All labeled TFRecord files, even those for examples with no defects, contain the rle feature. The rle feature is a list of 4 strings. Empty masks (i.e. mask channels) are encoded as '1 0'
so that they can be processed the same way as non-empty masks if needed; otherwise the encodings are the same as in the original competition dataset.
features = {
'image': tf.io.FixedLenFeature([], tf.string),
'img_id': tf.io.FixedLenFeature([], tf.string),
'height': tf.io.FixedLenFeature([], tf.int64),
'width': tf.io.FixedLenFeature([], tf.int64),
}
if labeled:
features['rle'] = tf.io.FixedLenFeature([4], tf.string)
features['label'] = tf.io.FixedLenFeature([], tf.int64)
IMAGE_SIZE = (256, 1600) # original image & mask size
N_CHANNELS = 3 # original image channels
N_CLASSES = 4 # channels of the mask
def rle2mask(rle, image_size = IMAGE_SIZE):
    # Decode a run-length-encoded string into a binary mask of shape image_size.
    size = tf.math.reduce_prod(image_size)
    s = tf.strings.split(rle)
    s = tf.strings.to_number(s, tf.int32)
    starts = s[0::2] - 1          # RLE starts are 1-indexed; convert to 0-indexed
    lens = s[1::2]                # run lengths
    total_ones = tf.reduce_sum(lens)
    ones = tf.ones([total_ones], tf.int32)
    r = tf.range(total_ones)
    lens_cum = tf.math.cumsum(lens)
    s = tf.searchsorted(lens_cum, r, 'right')   # which run each "one" belongs to
    idx = r + tf.gather(starts - tf.pad(lens_cum[:-1], [(1, 0)]), s)  # flat pixel indices
    mask_flat = tf.scatter_nd(tf.expand_dims(idx, 1), ones, [size])
    # RLE runs are column-major, so reshape as (width, height) and transpose.
    mask = tf.reshape(mask_flat, (image_size[1], image_size[0]))
    return tf.transpose(mask)
def build_masks(rles, input_shape = IMAGE_SIZE, n_classes=N_CLASSES):
    # Decode one RLE string per defect class and stack them into (H, W, n_classes).
    masks = [rle2mask(rles[i], input_shape) for i in range(n_classes)]
    masks = tf.stack(masks, axis = -1)
    return masks
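A minimal end-to-end sketch for labeled files (assumes the feature dict above was built with labeled=True, that the image bytes are JPEG-encoded, and a hypothetical filename):

def parse_labeled_example(serialized):
    example = tf.io.parse_single_example(serialized, features)
    image = tf.io.decode_jpeg(example['image'], channels=N_CHANNELS)  # assumed JPEG-encoded
    masks = build_masks(example['rle'])
    return image, masks

ds = tf.data.TFRecordDataset('train-00.tfrec')   # hypothetical filename
ds = ds.map(parse_labeled_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)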
ILSVRC 2012, commonly known as 'ImageNet' is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, majority of them are nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:
The resulting tar-ball may then be processed by TFDS.
To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file that is then uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.
To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:
771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168
The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file contains 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the corresponding labels file. See labels.txt.
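As an illustration, a minimal sketch of writing such a file (model and test_ds are assumed placeholders that yield images in the official test-split order; the model's 0-based class indices are shifted to the 1-based indices of labels.txt):

import numpy as np

with open('submission.txt', 'w') as f:
    for batch in test_ds:                              # placeholder dataset
        probs = model(batch, training=False).numpy()   # placeholder model
        top5 = np.argsort(-probs, axis=1)[:, :5] + 1   # rank-ordered, 1-indexed
        for row in top5:
            f.write(' '.join(str(int(i)) for i in row) + '\n')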
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png
The MNIST database of handwritten digits.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In this study, we investigate the feasibility of improving the imaging quality of low-dose multislice helical computed tomography (CT) via iterative reconstruction with tensor framelet (TF) regularization. The TF-based algorithm is a high-order generalization of isotropic total variation regularization. It is implemented on a GPU platform for fast parallel X-ray forward and backward projections, taking the flying focal spot into account. The solution algorithm for image reconstruction is based on the alternating direction method of multipliers, also known as the split Bregman method. The proposed method is validated using experimental data from a Siemens SOMATOM Definition 64-slice helical CT scanner, in comparison with the FDK, Katsevich, and total variation (TV) algorithms. To test the algorithm's performance with low-dose data, ACR and Rando phantoms were scanned with different dosages and the data were equally undersampled with various factors. The proposed method is robust for low-dose data with a 25% undersampling factor. Quantitative metrics demonstrate that the proposed algorithm achieves superior results over other existing methods.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.
The directory is organized into the following subfolders, each archived (tar.gz or zip):
data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage
baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage
chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage
modisco_reports.zip - TF-MoDISco reports from running on the fine-tuned ChromBPNet models
mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves
The Oxford-IIIT pet dataset is a 37-category pet image dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground-truth annotation of breed and species. Additionally, head bounding boxes are provided for the training split, allowing this dataset to be used for simple object detection tasks. In the test split, the bounding boxes are empty.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('oxford_iiit_pet', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but kaggle decompressed >:U)
This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.
This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in TSV format.
V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences read more like natural language than formatting. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)
The files are now all compressed to save space and are compatible with TensorFlow datasets. You can initialize a dataset function as such:
import functools
import os
import tensorflow as tf

def dataset_fn_local(split, shuffle_files=False):
    global nq_tsv_path
    del shuffle_files
    # Load lines from the compressed text files as examples.
    files_to_read = [os.path.join(nq_tsv_path[split], filename)
                     for filename in os.listdir(nq_tsv_path[split])
                     if filename.startswith(split)]
    print(f"Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}")
    ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP").filter(
        lambda line: tf.not_equal(tf.strings.length(line), 0))
    ds = ds.shuffle(buffer_size=600000)
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
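A short usage sketch (the path below is hypothetical; nq_tsv_path is assumed to be a dict mapping split names to directories holding the compressed TSV files):

nq_tsv_path = {"train": "/path/to/anime-subs/train"}   # hypothetical location
train_ds = dataset_fn_local("train")
for ex in train_ds.take(1):
    print(ex["question"].numpy(), ex["answer"].numpy())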
A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.
This dataset is far from complete! I hope that people who are willing to find, add, and clean the data are out there and will do their best to help grow this dataset.
This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('food101', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Dataset of TFRecords files made from Plant Pathology 2021 original competition data. Changes:
* labels column of the initial train.csv DataFrame was binarized to multi-label format columns: complex, frog_eye_leaf_spot, healthy, powdery_mildew, rust, and scab
* images were scaled to 512x512
* 77 duplicate images having different labels were removed (see the context in this notebook)
* samples were stratified and split into 5 folds (see corresponding folders fold_0:fold_4)
* each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)
I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.
For a complete example see my TPU Training Notebook
* train.csv
* folds.csv
* fold_0:fold_4 folders, each containing 64 .tfrec files, with the feature map shown below:
feature_map = {
'image': tf.io.FixedLenFeature([], tf.string),
'name': tf.io.FixedLenFeature([], tf.string),
'complex': tf.io.FixedLenFeature([], tf.int64),
'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
'healthy': tf.io.FixedLenFeature([], tf.int64),
'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
'rust': tf.io.FixedLenFeature([], tf.int64),
'scab': tf.io.FixedLenFeature([], tf.int64)}
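A minimal parsing sketch (the filename is hypothetical, and JPEG encoding of the 'image' bytes is an assumption based on the scaled 512x512 images):

import tensorflow as tf

label_keys = ['complex', 'frog_eye_leaf_spot', 'healthy', 'powdery_mildew', 'rust', 'scab']

def parse_plant_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_map)
    image = tf.io.decode_jpeg(example['image'], channels=3)   # assumed JPEG-encoded
    labels = tf.stack([tf.cast(example[k], tf.float32) for k in label_keys])
    return image, labels

ds = tf.data.TFRecordDataset('fold_0/train00-512x512.tfrec')  # hypothetical filename
ds = ds.map(parse_plant_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)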
### Acknowledgements
CC0 1.0 (Public Domain): https://creativecommons.org/publicdomain/zero/1.0/
Dataset of TFRecords files made from Plant Pathology 2021 original competition data. Changes:
* labels column of the initial train.csv DataFrame was binarized to multi-label format columns: complex, frog_eye_leaf_spot, healthy, powdery_mildew, rust, and scab
* images were scaled to 512x512
* 77 duplicate images having different labels were removed (see the context in this notebook)
* samples were stratified and split into 5 folds (see corresponding folders fold_0:fold_4)
* images were heavily augmented with the albumentations library (for raw images see this dataset)
* each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)
I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.
For a complete example see my TPU Training Notebook
* train.csv
* folds.csv
* fold_0:fold_4 folders, each containing 64 .tfrec files, with the feature map shown below:
feature_map = {
'image': tf.io.FixedLenFeature([], tf.string),
'name': tf.io.FixedLenFeature([], tf.string),
'complex': tf.io.FixedLenFeature([], tf.int64),
'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
'healthy': tf.io.FixedLenFeature([], tf.int64),
'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
'rust': tf.io.FixedLenFeature([], tf.int64),
'scab': tf.io.FixedLenFeature([], tf.int64)}
### Acknowledgements
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is derived from the Leaf repository (https://github.com/TalwalkarLab/leaf) pre-processing of the Extended MNIST dataset, grouping examples by writer. Details about Leaf were published in "LEAF: A Benchmark for Federated Settings" (https://arxiv.org/abs/1812.01097).
Note: This dataset does not include some additional preprocessing that MNIST includes, such as size-normalization and centering. In the Federated EMNIST data, the value 1.0 corresponds to the background, and 0.0 corresponds to the color of the digits themselves; this is the inverse of some MNIST representations, e.g. in tensorflow_datasets, where 0 corresponds to the background color and 255 represents the color of the digit.
Data set sizes:
only_digits=True: 3,383 users, 10 label classes; train: 341,873 examples; test: 40,832 examples
only_digits=False: 3,400 users, 62 label classes; train: 671,585 examples; test: 77,483 examples
Rather than holding out specific users, each user's examples are split across train and test so that all users have at least one example in train and one example in test. Writers that had less than 2 examples are excluded from the data set.
The tf.data.Datasets returned by tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration, with the following keys and values, in lexicographic order by key:
'label': a tf.Tensor with dtype=tf.int32 and shape [1], the class label of the corresponding pixels. Labels [0-9] correspond to the digit classes, labels [10-35] correspond to the uppercase classes (e.g., label 11 is 'B'), and labels [36-61] correspond to the lowercase classes (e.g., label 37 is 'b').
'pixels': a tf.Tensor with dtype=tf.float32 and shape [28, 28], containing the pixels of the handwritten digit, with values in the range [0.0, 1.0].
Args:
only_digits: (Optional) whether to only include examples that are from the digit [0-9] classes. If False, includes lower- and uppercase characters, for a total of 62 class labels.
cache_dir: (Optional) directory to cache the downloaded file. If None, caches in Keras' default cache directory.
Returns:
Tuple of (train, test) where the tuple elements are tff.simulation.datasets.ClientData objects.
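A brief usage sketch with TensorFlow Federated (assuming the tensorflow_federated package is installed):

import tensorflow_federated as tff

# Download and cache the federated EMNIST split (digits only here).
emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data(only_digits=True)

# Build the tf.data.Dataset for a single client and inspect one example.
client_id = emnist_train.client_ids[0]
client_ds = emnist_train.create_tf_dataset_for_client(client_id)
for example in client_ds.take(1):
    print(example['label'], example['pixels'].shape)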
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Cell-type specific gene expression is regulated by the combinatorial action of transcription factors (TFs). In this study, we predict transcription factor (TF) combinations that cooperatively bind in a cell-type specific manner. We first divide DNase hypersensitive sites into cell-type specifically open vs. ubiquitously open sites in 64 cell types to describe possible cell-type specific enhancers. Based on the pattern contrast between these two groups of sequences we develop “co-occurring TF predictor on Cell-Type specific Enhancers” (coTRaCTE) - a novel statistical method to determine regulatory TF co-occurrences. Contrasting the co-binding of TF pairs between cell-type specific and ubiquitously open chromatin guarantees the high cell-type specificity of the predictions. coTRaCTE predicts more than 2000 co-occurring TF pairs in 64 cell types. The large majority (70%) of these TF pairs is highly cell-type specific and overlaps in TF pair co-occurrence are highly consistent among related cell types. Furthermore, independently validated co-occurring and directly interacting TFs are significantly enriched in our predictions. Focusing on the regulatory network derived from the predicted co-occurring TF pairs in embryonic stem cells (ESCs) we find that it consists of three subnetworks with distinct functions: maintenance of pluripotency governed by OCT4, SOX2 and NANOG, regulation of early development governed by KLF4, STAT3, ZIC3 and ZNF148 and general functions governed by MYC, TCF3 and YY1. In summary, coTRaCTE predicts highly cell-type specific co-occurring TFs which reveal new insights into transcriptional regulatory mechanisms.
40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.
To use for e.g. character modelling:
d = tfds.load(name='tiny_shakespeare')['train']
d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
# train split includes vocabulary for other splits
vocabulary = sorted(set(next(iter(d)).numpy()))
d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
d = d.unbatch()
seq_len = 100
batch_size = 2
d = d.batch(seq_len)
d = d.batch(batch_size)
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('tiny_shakespeare', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For correlating eigenvectors, we interpolated expression values in both cases to 11 equidistant points, concatenated the 2 sets of profiles, and performed PCA. The 22-point eigenvectors were then split into 2 parts, corresponding to the mESC and hESC points, and compared. The first 2 eigenvectors were semi-quantitatively similar; therefore eigenvector 1 was also compared to eigenvector 2. Correlation (quantified by the Pearson correlation coefficient) of eigenvectors from Principal Component Analysis (PCA) at different time intervals of hESC development with eigenvectors of 0–10 hour mESC development.
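As an illustration only, a rough sketch of that procedure (variable names, input shapes, and library choices are assumptions, not part of the original analysis):

import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

def eigenvector_correlation(expr_m, t_m, expr_h, t_h, n_points=11):
    # expr_m, expr_h: gene-by-timepoint expression matrices for mESC and hESC (assumed inputs)
    grid_m = np.linspace(t_m.min(), t_m.max(), n_points)
    grid_h = np.linspace(t_h.min(), t_h.max(), n_points)
    interp_m = interp1d(t_m, expr_m, axis=1)(grid_m)   # interpolate to 11 equidistant points
    interp_h = interp1d(t_h, expr_h, axis=1)(grid_h)
    profiles = np.hstack([interp_m, interp_h])         # concatenated 22-point profiles
    pca = PCA(n_components=2).fit(profiles)
    ev = pca.components_                               # 22-point eigenvectors
    ev_m, ev_h = ev[:, :n_points], ev[:, n_points:]    # split into mESC / hESC parts
    r, _ = pearsonr(ev_m[0], ev_h[0])                  # compare e.g. eigenvector 1 vs 1
    return r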
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Electromyography (EMG) measures muscle response as electrical activity evoked by neural stimulation, and it can be used to diagnose certain muscular dystrophies and neuropathies. The dataset consists of single-channel EMG recordings from the tibialis anterior muscle of three volunteers who are healthy, suffering from neuropathy, and suffering from myopathy, respectively. The recordings are sampled at a frequency of 4 kHz. Each patient, i.e., their disorder, is a separate classification category. The recordings are then split into time-series samples using a fixed-length window of 1,500 observations.
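A minimal sketch of the fixed-length windowing step (the recording below is synthetic; only the 1,500-observation window size comes from the description):

import numpy as np

def split_into_windows(signal, window=1500):
    # Drop the tail that does not fill a complete window, then reshape into samples.
    n_windows = len(signal) // window
    return signal[:n_windows * window].reshape(n_windows, window)

recording = np.random.randn(4000 * 60)      # 60 s of synthetic data at 4 kHz
samples = split_into_windows(recording)     # shape (160, 1500)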
Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imdb_reviews', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Scan parameters with different voltages.
Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png