32 datasets found
  1. Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A Spatial-Temporal Approach and Dataset"

    • data.niaid.nih.gov
    • zenodo.org
    Updated Feb 4, 2024
    Cite
    Wang Guizhou (2024). Dataset for "Enhancing Cloud Detection in Sentinel-2 Imagery: A Spatial-Temporal Approach and Dataset" [Dataset]. https://data.niaid.nih.gov/resources?id=zenodo_8419699
    Dataset updated
    Feb 4, 2024
    Dataset provided by
    Yin Ranyu
    Long Tengfei
    Jiao Weili
    Gong Chengjuan
    Wang Guizhou
    He Guojin
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    This dataset is built for time-series Sentinel-2 cloud detection and stored as TensorFlow TFRecord files (see https://www.tensorflow.org/tutorials/load_data/tfrecord).

    Each file is compressed in 7z format and can be decompressed with Bandizip or 7-Zip.

    Dataset Structure:

    Each filename can be split into three parts by underscores: the first part indicates whether the file is designated for training or validation ('train' or 'val'), the second part is the Sentinel-2 tile name, and the last part is the number of samples in the file.
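
    As a quick sketch of that split (the tile name below is hypothetical, and the filename is assumed to carry no extension, matching the parsing code further down):

    fname = 'train_T50HNH_1000'
    split_name, tile, n_samples = fname.split('_')
    # split_name == 'train', tile == 'T50HNH', n_samples == '1000'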

    Each sample includes:

    Sample ID;

    Array of time series 4 band image patches in 10m resolution, shaped as (n_timestamps, 4, 42, 42);

    Label list indicating cloud cover status for the center 6×6 pixels of each timestamp;

    Ordinal list for each timestamp;

    Sample weight list (reserved);

    Here is a demonstration function for parsing the TFRecord file:

    import tensorflow as tf

    # Initialize a TensorFlow Dataset from a file name
    def parseRecordDirect(fname):
      sep = '/'
      parts = tf.strings.split(fname, sep)
      tn = tf.strings.split(parts[-1], sep='_')[-2]
      nn = tf.strings.to_number(tf.strings.split(parts[-1], sep='_')[-1], tf.dtypes.int64)
      t = tf.data.Dataset.from_tensors(tn).repeat().take(nn)
      t1 = tf.data.TFRecordDataset(fname)
      ds = tf.data.Dataset.zip((t, t1))
      return ds

    keys_to_features_direct = {
      'localid': tf.io.FixedLenFeature([], tf.int64, -1),
      'image_raw_ldseries': tf.io.FixedLenFeature((), tf.string, ''),
      'labels': tf.io.FixedLenFeature((), tf.string, ''),
      'dates': tf.io.FixedLenFeature((), tf.string, ''),
      'weights': tf.io.FixedLenFeature((), tf.string, '')
    }

    The Decoder (Optional; the decoder base class below is assumed to come from the tfds decode module, e.g. tfds.decode)

    class SeriesClassificationDirectDecorder(decoder.Decoder):
      """A tf.Example decoder for tfds classification datasets."""

      def __init__(self) -> None:
        super().__init__()

      def decode(self, tid, ds):
        parsed = tf.io.parse_single_example(ds, keys_to_features_direct)
        encoded = parsed['image_raw_ldseries']
        labels_encoded = parsed['labels']
        decoded = tf.io.decode_raw(encoded, tf.uint16)
        label = tf.io.decode_raw(labels_encoded, tf.int8)
        dates = tf.io.decode_raw(parsed['dates'], tf.int64)
        weight = tf.io.decode_raw(parsed['weights'], tf.float32)
        decoded = tf.reshape(decoded, [-1, 4, 42, 42])
        sample_dict = {
          'tid': tid,                    # tile ID
          'dates': dates,                # date list
          'localid': parsed['localid'],  # sample ID
          'imgs': decoded,               # image array
          'labels': label,               # label list
          'weights': weight              # sample weight list (reserved)
        }
        return sample_dict

    A simple parsing function:

    def preprocessDirect(tid, record):
      parsed = tf.io.parse_single_example(record, keys_to_features_direct)
      encoded = parsed['image_raw_ldseries']
      labels_encoded = parsed['labels']
      decoded = tf.io.decode_raw(encoded, tf.uint16)
      label = tf.io.decode_raw(labels_encoded, tf.int8)
      dates = tf.io.decode_raw(parsed['dates'], tf.int64)
      weight = tf.io.decode_raw(parsed['weights'], tf.float32)
      decoded = tf.reshape(decoded, [-1, 4, 42, 42])
      return tid, dates, parsed['localid'], decoded, label, weight

    t1 = parseRecordDirect('filename here')
    dataset = t1.map(preprocessDirect, num_parallel_calls=tf.data.experimental.AUTOTUNE)


    Class Definition:

    0: clear

    1: opaque cloud

    2: thin cloud

    3: haze

    4: cloud shadow

    5: snow

    Dataset Construction:

    First, we randomly generate 500 points for each tile; all of these points are aligned to the pixel-grid centers of the 60 m resolution subdatasets (e.g. B10) for consistency when comparing with other products, since other cloud detection methods may use the cirrus band, which is at 60 m resolution, as a feature.
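
    A one-dimensional sketch of that alignment (my illustration, not the authors' code; coordinates are assumed to be in the tile's projected frame with the grid origin at 0):

    GRID = 60  # meters; the B10 pixel size

    def snap_to_grid_center(x):
      # Snap a coordinate to the center of the 60 m pixel that contains it.
      return (x // GRID) * GRID + GRID / 2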

    Then, time-series image patches of two shapes are cropped with each point as the center. The patches of shape 42 × 42 are cropped from the 10 m resolution bands (B2, B3, B4, B8) and are used to construct this dataset. The patches of shape 348 × 348 are cropped from the True Colour Image (TCI; see the Sentinel-2 User Guide for details) file and are used for interpreting class labels.

    Samples with a large number of timestamps can be time-consuming at the I/O stage, so the time-series patches are divided into groups of at most 100 timestamps each.

  2. cifar10

    • tensorflow.org
    • opendatalab.com
    Updated Jun 1, 2024
    Cite
    (2024). cifar10 [Dataset]. https://www.tensorflow.org/datasets/catalog/cifar10
    Dataset updated
    Jun 1, 2024
    Description

    The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('cifar10', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/cifar10-3.0.2.png

  3. TFRecs for Severstal: Steel Defect Detection

    • kaggle.com
    Updated Mar 2, 2021
    Cite
    reyvaz (2021). TFRecs for Severstal: Steel Defect Detection [Dataset]. https://www.kaggle.com/datasets/reyvaz/tfrecs-for-severstal-steel-defect-detection/versions/1
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Mar 2, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    reyvaz
    Description

    TFRecs for Severstal: Steel Defect Detection Dataset

    Here is a demo of how to parse the examples and how to convert the RLEs to Masks.

    All labeled TFRecs, even those for examples with no defects, contain the rle feature. The rle feature is a list of 4 strings. Empty masks (i.e. mask channels) have been encoded as '1 0' so that they can be processed the same way as non-empty masks if needed; otherwise, encodings are the same as in the original competition dataset.

    The TFRec Features are:

    features = {
      'image': tf.io.FixedLenFeature([], tf.string),
      'img_id': tf.io.FixedLenFeature([], tf.string),
      'height': tf.io.FixedLenFeature([], tf.int64),
      'width': tf.io.FixedLenFeature([], tf.int64),
        }
    
    if labeled:
      features['rle'] = tf.io.FixedLenFeature([4], tf.string)
      features['label'] = tf.io.FixedLenFeature([], tf.int64)
    

    Decoding the RLEs using TensorFlow

    IMAGE_SIZE = (256, 1600) # original image & mask size
    N_CHANNELS = 3 # original image channels
    N_CLASSES = 4 # channels of the mask
    
    def rle2mask(rle, image_size = IMAGE_SIZE):
      size = tf.math.reduce_prod(image_size)
    
      s = tf.strings.split(rle)
      s = tf.strings.to_number(s, tf.int32)
    
      starts = s[0::2] - 1
      lens = s[1::2]
    
      total_ones = tf.reduce_sum(lens)
      ones = tf.ones([total_ones], tf.int32)
    
      r = tf.range(total_ones)
      lens_cum = tf.math.cumsum(lens)
      s = tf.searchsorted(lens_cum, r, 'right')
      idx = r + tf.gather(starts - tf.pad(lens_cum[:-1], [(1, 0)]), s)
    
      mask_flat = tf.scatter_nd(tf.expand_dims(idx, 1), ones, [size])
      mask = tf.reshape(mask_flat, (image_size[1], image_size[0]))
      return tf.transpose(mask)
    
    def build_masks(rles, input_shape = IMAGE_SIZE, n_classes=N_CLASSES):
      masks = [rle2mask(rles[i]) for i in range(N_CLASSES)]
      masks = tf.stack(masks, axis = -1)
      return masks
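
    Putting the pieces together, a minimal end-to-end parsing sketch (assuming the image feature holds a JPEG-encoded string, as in the original competition images, and that features includes the labeled keys from the if labeled block above; the filename pattern is hypothetical):

    import tensorflow as tf

    def parse_labeled_example(serialized):
      parsed = tf.io.parse_single_example(serialized, features)
      image = tf.io.decode_jpeg(parsed['image'], channels=N_CHANNELS)  # assumption: JPEG encoding
      masks = build_masks(parsed['rle'])  # the '1 0' placeholder decodes to an all-zero channel
      return image, masks

    ds = tf.data.TFRecordDataset(tf.io.gfile.glob('train*.tfrec'))  # hypothetical pattern
    ds = ds.map(parse_labeled_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)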
    
    
  4. imagenet2012

    • tensorflow.org
    Updated Jun 1, 2024
    Cite
    (2024). imagenet2012 [Dataset]. https://www.tensorflow.org/datasets/catalog/imagenet2012
    Dataset updated
    Jun 1, 2024
    Description

    ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of which (80,000+) are nouns. In ImageNet, we aim to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.

    The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:

    1. Download the 2012 test split available here.
    2. Download the October 10, 2019 patch. There is a Google Drive link to the patch provided on the same page.
    3. Combine the two tar-balls, manually overwriting any images in the original archive with images from the patch. According to the instructions on image-net.org, this procedure overwrites just a few images.

    The resulting tar-ball may then be processed by TFDS.
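
    As a sketch of that step (the manual directory path is hypothetical; TFDS picks up manually downloaded archives from a manual_dir):

    import tensorflow_datasets as tfds

    builder = tfds.builder('imagenet2012')
    # Place the train/val archives and the merged test tar-ball in manual_dir first.
    config = tfds.download.DownloadConfig(manual_dir='/path/to/manual_dir')
    builder.download_and_prepare(download_config=config)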

    To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file for upload to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user up to 2 submissions per week in order to prevent overfitting.

    To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:

    771 778 794 387 650
    363 691 764 923 427
    737 369 430 531 124
    755 930 755 59 168
    

    The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file consists of 100,000 lines, one per image in the test split. Each line of integers corresponds to the rank-ordered, top-5 predictions for one test image. The integers are 1-indexed, corresponding to line numbers in the corresponding labels file. See labels.txt.
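
    For illustration, a minimal sketch of writing such a file (preds is hypothetical: 100,000 rows of five 1-indexed label IDs, most confident first):

    # preds: iterable of 100,000 rows, each a sequence of five 1-indexed label IDs
    with open('submission.txt', 'w') as f:
      for row in preds:
        f.write(' '.join(str(int(label_id)) for label_id in row) + '\n')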

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imagenet2012', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png

  5. mnist

    • tensorflow.org
    • universe.roboflow.com
    Updated Jun 1, 2024
    Cite
    (2024). mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/mnist
    Dataset updated
    Jun 1, 2024
    Description

    The MNIST database of handwritten digits.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('mnist', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/mnist-3.0.1.png

  6. Tensor framelet based iterative image reconstruction algorithm for low-dose multislice helical CT

    • plos.figshare.com
    Updated Jun 7, 2023
    Cite
    Haewon Nam; Minghao Guo; Hengyong Yu; Keumsil Lee; Ruijiang Li; Bin Han; Lei Xing; Rena Lee; Hao Gao (2023). Tensor framelet based iterative image reconstruction algorithm for low-dose multislice helical CT [Dataset]. http://doi.org/10.1371/journal.pone.0210410
    Available download formats: bin
    Dataset updated
    Jun 7, 2023
    Dataset provided by
    PLOS (http://plos.org/)
    Authors
    Haewon Nam; Minghao Guo; Hengyong Yu; Keumsil Lee; Ruijiang Li; Bin Han; Lei Xing; Rena Lee; Hao Gao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    In this study, we investigate the feasibility of improving the imaging quality for low-dose multislice helical computed tomography (CT) via iterative reconstruction with tensor framelet (TF) regularization. The TF-based algorithm is a high-order generalization of isotropic total variation (TV) regularization. It is implemented on a GPU platform as a fast parallel algorithm of X-ray forward and backward projections, taking the flying focal spot into account. The solution algorithm for image reconstruction is based on the alternating direction method of multipliers (ADMM), the so-called split Bregman method. The proposed method is validated using experimental data from a Siemens SOMATOM Definition 64-slice helical CT scanner, in comparison with the FDK, Katsevich, and TV algorithms. To test the algorithm's performance with low-dose data, ACR and Rando phantoms were scanned with different dosages and the data were equally undersampled with various factors. The proposed method is robust for low-dose data with a 25% undersampling factor. Quantitative metrics demonstrate that the proposed algorithm achieves superior results over the other existing methods.
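
    In outline (a generic form of such objectives, not necessarily the paper's exact notation), the regularized reconstruction solves

    \min_x \; \tfrac{1}{2}\|Ax - y\|_2^2 + \lambda\|Wx\|_1

    where A is the X-ray projection operator, y the measured data, W the tensor framelet transform, and \lambda the regularization weight; ADMM/split Bregman alternates between a data-fidelity update and a sparsity (shrinkage) update on Wx.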

  7. Data from: Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage

    • zenodo.org
    Updated May 29, 2024
    Cite
    Sahin Naqvi; Sahin Naqvi (2024). Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage [Dataset]. http://doi.org/10.5281/zenodo.11224809
    Available download formats: application/gzip, zip
    Dataset updated
    May 29, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Sahin Naqvi; Sahin Naqvi
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.

    The directory is organized into five subfolders: four are tar'ed and gzipped, and one (modisco_reports) is zipped:

    data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage

    • atac_design.txt - design matrix for ATAC-seq TWIST1 titration samples
    • all.sub.150bpclust.greater2.500bp.merge.TWIST1.titr.ATAC.counts.txt - ATAC-seq counts from all samples over all reproducible ATAC-seq peak regions, as defined in Naqvi et al 2023
    • atac_deseq_fitmodels_moded50.R - R code for calculating new version of ED50 and response to full depletion from TWIST1 titration data (note, uses drm.R function from 10.5281/zenodo.7689948, install drc() with this version to avoid errors)

    baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage

    • {sox9|twist1}.{0v100|ed50}.{train|valid|test}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds
    • HOCOMOCOv11_core_HUMAN_mono_jaspar_format.all.sub.150bpclust.greater2.500bp.merge.minus300bp.p01.maxscore.mat.cpg.gc.basemean.txt.gz - matrix of predictors for all REs. Quantitative encoding of PWM match for all HOCOMOCO motifs + CpG + GC content, plus unperturbed ATAC-seq signal
    • train_baseline.R - R code to train baseline (LASSO regression or random forest) models using predictor matrix and the provided training data.
      • Note: training the random forest to predict full TF depletion is computationally intensive because it runs across all REs; if doing this, expect roughly 6 hours on CPU.

    chrombpnet_models.tar.gz - Remainder of the code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage

    • Fine-tuning code, data, models
      • {all|sox9.direct|twist1.bound.down}.{train|valid|test}.{ed50|0v100.log2fc}.txt - Training/testing/validation data (ED50 or full TF depletion effect for SOX9 or TWIST1), split into train/test/validation folds
      • pretrained.unperturbed.chrombpnet.h5 - Pretrained model of unperturbed ATAC-seq signal in CNCCs, obtained by running ChromBPNet (https://github.com/kundajelab/chrombpnet) on DMSO-treated SOX9/TWIST1-tagged ATAC-seq data
      • finetune_chrombpnet.py - code for fine-tuning the pretrained model for any of the relevant prediction tasks (ED50/ effect of full TF depletion for SOX9/TWIST1)
      • best.model.chrombpnet.{0v100|ed50}.{sox9|twist1}.h5 - output of finetune_chrombpnet.py, best model after 10 training epochs for the indicated task
      • chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.{h5|bw} - contribution scores for the indicated predictive model, obtained by running chrombpnet contribs_bw on the corresponding model h5 file.
      • chrombpnet.{0v100|ed50}.{sox9|twist1}.contrib.modisco.{h5|bw} - TF-MoDIsCo output from the corresponding contribution score file
    • Interpretation code, data, models
      • contrib_h5_to_projshap_npy.py - code to convert contrib .h5 files into .npy files containing projected SHAP scores (required because the CWM matching code takes this format of contribution scores)
      • sox9.direct.10col.bed, twist1.bound.down.10col.uniq.bed - regions over which CWMs will be matched (likely direct targets of each TF)
      • match_cwms.py - Python code to match individual CWM instances. Takes as input: modisco .h5 file, SHAP .npy file, bed file of regions to be matched. Output is a bed file of all CWM matches (not pruned, contains many redundant matches).
      • chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.bed - output of match_cwms.py
      • take_max_overlap.py - code to merge output of match_cwms.py into clusters, and then take the maximum (length-normalized) match score in each cluster as the representative CWM match of that cluster. Requires upstream bedtools commands to be piped in, see example usage in file.
      • chrombpnet.ed50.{sox9|twist1}.contrib.perc05.matchperc10.allmatch.maxoverlap.bed - output of take_max_overlap.py. These CWM instances are the ones used throughout the paper.

    modisco_reports.zip - TF-MoDIsCo reports from running on the fine-tuned ChromBPNet models

    • modisco_report_{sox9|twist1}_{0v100|ed50}: folders containing images of discovered CWMs and HTMLs/PDFs of summarized reports from running TF-MoDisCo on the indicated fine-tuned ChromBPNet model

    mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves

    • twist1.strong.multi.only.ed50.cutoff.true.hill.txt - ED50 and signed hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
    • twist1.strong.weak{1|2|3}.ed50.cutoff.true.hill.txt - ED50 and signed hill coefficients for all TWIST1-dependent REs with only buffering Coordinators (mostly one or two) and the indicated number of sensitizing (weak) Coordinators and no other TFs' binding sites. "ed50_new" is the ED50 calculation used in this paper.
    • MirnyModelAnalysis.py - Python code for analysis of Mirny model of TF-nucleosome competition. Contains implementations of analytic solutions, as well as code to fit model to observed ED50 and hill coefficients in the provided data files.
  8. oxford_iiit_pet

    • tensorflow.org
    • opendatalab.com
    Updated Mar 14, 2025
    Cite
    (2025). oxford_iiit_pet [Dataset]. https://www.tensorflow.org/datasets/catalog/oxford_iiit_pet
    Dataset updated
    Mar 14, 2025
    Description

    The Oxford-IIIT pet dataset is a 37-category pet image dataset with roughly 200 images for each class. The images have large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed and species. Additionally, head bounding boxes are provided for the training split, allowing this dataset to be used for simple object detection tasks. In the test split, the bounding boxes are empty.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('oxford_iiit_pet', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  9. Anime Subtitles

    • kaggle.com
    Updated Aug 19, 2021
    Cite
    Jess Fan (2021). Anime Subtitles [Dataset]. https://www.kaggle.com/datasets/jef1056/anime-subtitles/discussion
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Aug 19, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Jess Fan
    Description

    Content

    The original extracted versions (in .srt and .ass format) are also included in this release (which, idk why, but kaggle decompressed >:U)

    This dataset contains 1,497,770 messages across 3,836 episodes of anime. The raw dataset contains 1,563,442 messages, some of which were removed during cleaning.

    This version (V4) adapts the original (frankly, terrible) format into the newer format I developed, which is used in https://github.com/JEF1056/clean-discord. The Dataset folder contains compressed text files, which are compatible with TensorFlow datasets. These can be streamed as a TextLineDataset in TSV format.

    V4 also fixes many (but not all) issues that the original cleaning script was too simple to realistically take care of. It also uses the clean-discord cleaner algorithms to make sentences read as natural language rather than formatting artifacts. The script has also been optimized to run on multi-core systems, allowing it to clean this entire dataset in under 30 seconds on a 4-core machine. See the new and improved script here: https://github.com/JEF1056/clean-discord/blob/v1.2/misc/anime.py (no longer bundled in the dataset files)

    Format

    The files are now all compressed to save space, and are compatible with TensorFlow datasets. You can initialize a dataset function as such:

    import functools
    import os
    import tensorflow as tf

    def dataset_fn_local(split, shuffle_files=False):
      global nq_tsv_path
      del shuffle_files
      # Load lines from the text files as examples.
      files_to_read = [os.path.join(nq_tsv_path[split], filename)
                       for filename in os.listdir(nq_tsv_path[split])
                       if filename.startswith(split)]
      print(f"Split {split} contains {len(files_to_read)} files. First 10: {files_to_read[0:10]}")
      ds = tf.data.TextLineDataset(files_to_read, compression_type="GZIP")
      ds = ds.filter(lambda line: tf.not_equal(tf.strings.length(line), 0))
      ds = ds.shuffle(buffer_size=600000)
      ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                    field_delim="\t", use_quote_delim=False),
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
      ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
      return ds

    Acknowledgements

    A sincere thanks to all of my friends for helping me come up with anime titles, a shoutout to the talented and dedicated people translating Japanese anime, and an even bigger thanks to Leen Chan for compiling the actual subtitles.

    This dataset is far from complete! I hope that people who are willing to find, add, and clean the data are out there, and will do their best to help grow it.

  10. food101

    • tensorflow.org
    • paperswithcode.com
    Updated Nov 23, 2022
    Cite
    (2022). food101 [Dataset]. https://www.tensorflow.org/datasets/catalog/food101
    Dataset updated
    Nov 23, 2022
    Description

    This dataset consists of 101 food categories, with 101'000 images. For each class, 250 manually reviewed test images are provided as well as 750 training images. On purpose, the training images were not cleaned, and thus still contain some amount of noise. This comes mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('food101', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/food101-2.0.0.png

  11. PP2021 - KFold TFRecords

    • kaggle.com
    Updated Apr 13, 2021
    Cite
    Nick Kuzmenkov (2021). PP2021 - KFold TFRecords [Dataset]. https://www.kaggle.com/nickuzmenkov/pp2021-kfold-tfrecords-0/activity
    Available formats: Croissant (a format for machine-learning datasets; learn more at mlcommons.org/croissant)
    Dataset updated
    Apr 13, 2021
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    Nick Kuzmenkov
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset of TFRecords files made from Plant Pathology 2021 original competition data. Changes:

    • labels column of the initial train.csv DataFrame was binarized to multi-label format columns: complex, frog_eye_leaf_spot, healthy, powdery_mildew, rust, and scab
    • images were scaled to 512x512
    • 77 duplicate images having different labels were removed (see the context in this notebook)
    • samples were stratified and split into 5 folds (see corresponding folders fold_0:fold_4)
    • each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)

    I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.

    For a complete example see my TPU Training Notebook

    Contents:

    • preprocessed DataFrame train.csv
    • fold indexes DataFrame folds.csv
    • fold_0:fold_4 folders containing 64 .tfrec files each, with the feature map shown below (a parsing sketch follows this list):

    feature_map = {
      'image': tf.io.FixedLenFeature([], tf.string),
      'name': tf.io.FixedLenFeature([], tf.string),
      'complex': tf.io.FixedLenFeature([], tf.int64),
      'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
      'healthy': tf.io.FixedLenFeature([], tf.int64),
      'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
      'rust': tf.io.FixedLenFeature([], tf.int64),
      'scab': tf.io.FixedLenFeature([], tf.int64)}

    Acknowledgements
    • photo from Unsplash here
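
    As referenced above, a minimal sketch for reading these TFRecords (assuming the image feature is a JPEG-encoded string, which the dataset page does not state; the fold path is hypothetical):

    import tensorflow as tf

    def parse_example(serialized):
      parsed = tf.io.parse_single_example(serialized, feature_map)
      image = tf.io.decode_jpeg(parsed['image'], channels=3)  # assumption: JPEG encoding
      labels = tf.stack([parsed[k] for k in
                         ('complex', 'frog_eye_leaf_spot', 'healthy',
                          'powdery_mildew', 'rust', 'scab')])
      return image, labels

    ds = tf.data.TFRecordDataset(tf.io.gfile.glob('fold_0/*.tfrec'))  # hypothetical path
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
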
  12. PP2021 - Augmented KFold TFRecords (2/4)

    • kaggle.com
    Updated Apr 13, 2021
    Cite
    Nick Kuzmenkov (2021). PP2021 - Augmented KFold TFRecords (2/4) [Dataset]. https://www.kaggle.com/nickuzmenkov/pp2021-kfold-tfrecords-1
    Available download formats: zip (13,051,777,813 bytes)
    Dataset updated
    Apr 13, 2021
    Authors
    Nick Kuzmenkov
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/

    Description

    Dataset of TFRecords files made from Plant Pathology 2021 original competition data. Changes:

    • labels column of the initial train.csv DataFrame was binarized to multi-label format columns: complex, frog_eye_leaf_spot, healthy, powdery_mildew, rust, and scab
    • images were scaled to 512x512
    • 77 duplicate images having different labels were removed (see the context in this notebook)
    • samples were stratified and split into 5 folds (see corresponding folders fold_0:fold_4)
    • images were heavily augmented with the albumentations library (for raw images see this dataset)
    • each folder contains 5 copies of randomly augmented initial images (so that the model never meets the same images)

    I suggest adding all 5 datasets to your notebook: 4 augmented datasets = 20 epochs of unique images (1, 2, 3, 4) + 1 raw dataset for validation here.

    For a complete example see my TPU Training Notebook

    Contents:

    • preprocessed DataFrame train.csv
    • fold indexes DataFrame folds.csv
    • fold_0:fold_4 folders containing 64 .tfrec files each, with the feature map shown below:

    feature_map = {
      'image': tf.io.FixedLenFeature([], tf.string),
      'name': tf.io.FixedLenFeature([], tf.string),
      'complex': tf.io.FixedLenFeature([], tf.int64),
      'frog_eye_leaf_spot': tf.io.FixedLenFeature([], tf.int64),
      'healthy': tf.io.FixedLenFeature([], tf.int64),
      'powdery_mildew': tf.io.FixedLenFeature([], tf.int64),
      'rust': tf.io.FixedLenFeature([], tf.int64),
      'scab': tf.io.FixedLenFeature([], tf.int64)}

    Acknowledgements
    • photo from Unsplash here
  13. Federated EMNIST Dataset

    • figshare.com
    Updated Jul 16, 2024
    Cite
    Saroj Mali (2024). Federated EMNIST Dataset [Dataset]. http://doi.org/10.6084/m9.figshare.26308777.v1
    Available download formats: xz
    Dataset updated
    Jul 16, 2024
    Dataset provided by
    figshare
    Authors
    Saroj Mali
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset is derived from the Leaf repository (https://github.com/TalwalkarLab/leaf) pre-processing of the Extended MNIST dataset, grouping examples by writer. Details about Leaf were published in "LEAF: A Benchmark for Federated Settings" (https://arxiv.org/abs/1812.01097).

    Note: This dataset does not include some additional preprocessing that MNIST includes, such as size-normalization and centering. In the Federated EMNIST data, the value of 1.0 corresponds to the background, and 0.0 corresponds to the color of the digits themselves; this is the inverse of some MNIST representations, e.g. in tensorflow_datasets, where 0 corresponds to the background color, and 255 represents the color of the digit.

    Data set sizes:

    • only_digits=True: 3,383 users, 10 label classes; train: 341,873 examples; test: 40,832 examples
    • only_digits=False: 3,400 users, 62 label classes; train: 671,585 examples; test: 77,483 examples

    Rather than holding out specific users, each user's examples are split across train and test so that all users have at least one example in train and one example in test. Writers that had less than 2 examples are excluded from the data set.

    The tf.data.Datasets returned by tff.simulation.datasets.ClientData.create_tf_dataset_for_client will yield collections.OrderedDict objects at each iteration, with the following keys and values, in lexicographic order by key:

    • 'label': a tf.Tensor with dtype=tf.int32 and shape [1], the class label of the corresponding pixels. Labels [0-9] correspond to the digits classes, labels [10-35] correspond to the uppercase classes (e.g., label 11 is 'B'), and labels [36-61] correspond to the lowercase classes (e.g., label 37 is 'b').
    • 'pixels': a tf.Tensor with dtype=tf.float32 and shape [28, 28], containing the pixels of the handwritten digit, with values in the range [0.0, 1.0].

    Args:

    • only_digits: (Optional) whether to only include examples that are from the digits [0-9] classes. If False, includes lower and upper case characters, for a total of 62 class labels.
    • cache_dir: (Optional) directory to cache the downloaded file. If None, caches in Keras' default cache directory.

    Returns: a tuple of (train, test), where the tuple elements are tff.simulation.datasets.ClientData objects.
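
    For context, a minimal loading sketch via TensorFlow Federated, whose federated EMNIST API this description mirrors (assuming tff.simulation.datasets.emnist.load_data is the intended entry point):

    import tensorflow_federated as tff

    # Returns (train, test) as tff.simulation.datasets.ClientData objects.
    emnist_train, emnist_test = tff.simulation.datasets.emnist.load_data(only_digits=False)

    # Per-client tf.data.Dataset; elements are OrderedDicts with 'label' and 'pixels'.
    ds = emnist_train.create_tf_dataset_for_client(emnist_train.client_ids[0])
    for example in ds.take(1):
      print(example['label'], example['pixels'].shape)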

  14. coTRaCTE predicts co-occurring transcription factors within cell-type specific enhancers

    • plos.figshare.com
    Updated May 30, 2023
    Cite
    Alena van Bömmel; Michael I. Love; Ho-Ryun Chung; Martin Vingron (2023). coTRaCTE predicts co-occurring transcription factors within cell-type specific enhancers [Dataset]. http://doi.org/10.1371/journal.pcbi.1006372
    Available download formats: pdf
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS Computational Biology
    Authors
    Alena van Bömmel; Michael I. Love; Ho-Ryun Chung; Martin Vingron
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Cell-type specific gene expression is regulated by the combinatorial action of transcription factors (TFs). In this study, we predict transcription factor (TF) combinations that cooperatively bind in a cell-type specific manner. We first divide DNase hypersensitive sites into cell-type specifically open vs. ubiquitously open sites in 64 cell types to describe possible cell-type specific enhancers. Based on the pattern contrast between these two groups of sequences we develop “co-occurring TF predictor on Cell-Type specific Enhancers” (coTRaCTE) - a novel statistical method to determine regulatory TF co-occurrences. Contrasting the co-binding of TF pairs between cell-type specific and ubiquitously open chromatin guarantees the high cell-type specificity of the predictions. coTRaCTE predicts more than 2000 co-occurring TF pairs in 64 cell types. The large majority (70%) of these TF pairs is highly cell-type specific and overlaps in TF pair co-occurrence are highly consistent among related cell types. Furthermore, independently validated co-occurring and directly interacting TFs are significantly enriched in our predictions. Focusing on the regulatory network derived from the predicted co-occurring TF pairs in embryonic stem cells (ESCs) we find that it consists of three subnetworks with distinct functions: maintenance of pluripotency governed by OCT4, SOX2 and NANOG, regulation of early development governed by KLF4, STAT3, ZIC3 and ZNF148 and general functions governed by MYC, TCF3 and YY1. In summary, coTRaCTE predicts highly cell-type specific co-occurring TFs which reveal new insights into transcriptional regulatory mechanisms.

  15. tiny_shakespeare

    • tensorflow.org
    • huggingface.co
    Updated Feb 11, 2023
    Cite
    (2023). tiny_shakespeare [Dataset]. https://www.tensorflow.org/datasets/catalog/tiny_shakespeare
    Dataset updated
    Feb 11, 2023
    Description

    40,000 lines of Shakespeare from a variety of Shakespeare's plays. Featured in Andrej Karpathy's blog post 'The Unreasonable Effectiveness of Recurrent Neural Networks': http://karpathy.github.io/2015/05/21/rnn-effectiveness/.

    To use for e.g. character modelling:

    d = tfds.load(name='tiny_shakespeare')['train']
    d = d.map(lambda x: tf.strings.unicode_split(x['text'], 'UTF-8'))
    # train split includes vocabulary for other splits
    vocabulary = sorted(set(next(iter(d)).numpy()))
    d = d.map(lambda x: {'cur_char': x[:-1], 'next_char': x[1:]})
    d = d.unbatch()
    seq_len = 100
    batch_size = 2
    d = d.batch(seq_len)
    d = d.batch(batch_size)
    

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('tiny_shakespeare', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  16. Correlation (quantified by Pearson correlation coefficient) of eigenvectors from Principal Component Analysis (PCA) at different time intervals of hESC development with eigenvectors of 0–10 hour mESC development

    • plos.figshare.com
    Updated May 31, 2023
    Cite
    R. Gabdoulline; W. Kaisers; A. Gaspar; K. Meganathan; M. X. Doss; S. Jagtap; J. Hescheler; A. Sachinidis; H. Schwender (2023). Correlation (quantified by Pearson correlation coefficient) of eigenvectors from Principal Component Analysis (PCA) at different time intervals of hESC development with eigenvectors of 0–10 hour mESC development. [Dataset]. http://doi.org/10.1371/journal.pone.0140803.t003
    Available download formats: xls
    Dataset updated
    May 31, 2023
    Dataset provided by
    PLOS ONE
    Authors
    R. Gabdoulline; W. Kaisers; A. Gaspar; K. Meganathan; M. X. Doss; S. Jagtap; J. Hescheler; A. Sachinidis; H. Schwender
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    For correlating eigenvectors, we interpolated expression values in both cases to 11 equidistant points, concatenated the 2 sets of profiles, and performed PCA. The 22-point eigenvectors were then split into 2 parts, corresponding to the mESC and hESC points, and compared. The first 2 eigenvectors were semi-quantitatively similar; therefore eigenvector 1 was also compared to eigenvector 2. Correlation (quantified by Pearson correlation coefficient) of eigenvectors from Principal Component Analysis (PCA) at different time intervals of hESC development with eigenvectors of 0–10 hour mESC development.

  17. TF-C Pretrain EMG

    • figshare.com
    Updated May 30, 2023
    Cite
    Xiang Zhang; Ziyuan Zhao; Theodoros Tsiligkaridis; Marinka Zitnik (2023). TF-C Pretrain EMG [Dataset]. http://doi.org/10.6084/m9.figshare.19930250.v1
    Available download formats: zip
    Dataset updated
    May 30, 2023
    Dataset provided by
    figshare
    Authors
    Xiang Zhang; Ziyuan Zhao; Theodoros Tsiligkaridis; Marinka Zitnik
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Electromyograms (EMG) measure muscle responses as electrical activity to neural stimulation, and they can be used to diagnose certain muscular dystrophies and neuropathies. EMG consists of single-channel EMG recordings from the tibialis anterior muscle of three volunteers who are healthy, suffering from neuropathy, and suffering from myopathy, respectively. The recordings are sampled at a frequency of 4 kHz. Each patient, i.e., their disorder, is a separate classification category. The recordings are then split into time-series samples using a fixed-length window of 1,500 observations.
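
    A sketch of that fixed-length windowing (my illustration; the recording array is hypothetical):

    import numpy as np

    WINDOW = 1500  # observations per sample

    def split_into_windows(recording: np.ndarray) -> np.ndarray:
      # Drop the trailing remainder and reshape into (n_windows, WINDOW).
      n = len(recording) // WINDOW
      return recording[:n * WINDOW].reshape(n, WINDOW)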

  18. imdb_reviews

    • tensorflow.org
    • kaggle.com
    Updated Sep 20, 2024
    Cite
    (2024). imdb_reviews [Dataset]. https://www.tensorflow.org/datasets/catalog/imdb_reviews
    Dataset updated
    Sep 20, 2024
    Description

    Large Movie Review Dataset. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('imdb_reviews', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

  19. Scan parameters with different voltages.

    • plos.figshare.com
    Updated May 30, 2023
    Share
    FacebookFacebook
    TwitterTwitter
    Email
    Click to copy link
    Link copied
    Close
    Cite
    Haewon Nam; Minghao Guo; Hengyong Yu; Keumsil Lee; Ruijiang Li; Bin Han; Lei Xing; Rena Lee; Hao Gao (2023). Scan parameters with different voltages. [Dataset]. http://doi.org/10.1371/journal.pone.0210410.t001
    Available download formats: xls
    Dataset updated
    May 30, 2023
    Dataset provided by
    PLOS ONE
    Authors
    Haewon Nam; Minghao Guo; Hengyong Yu; Keumsil Lee; Ruijiang Li; Bin Han; Lei Xing; Rena Lee; Hao Gao
    License

    Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    Scan parameters with different voltages.

  20. fashion_mnist

    • tensorflow.org
    • opendatalab.com
    Updated Jun 1, 2024
    Cite
    (2024). fashion_mnist [Dataset]. https://www.tensorflow.org/datasets/catalog/fashion_mnist
    Dataset updated
    Jun 1, 2024
    Description

    Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.

    To use this dataset:

    import tensorflow_datasets as tfds
    
    ds = tfds.load('fashion_mnist', split='train')
    for ex in ds.take(4):
     print(ex)
    

    See the guide for more information on tensorflow_datasets.

    Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/fashion_mnist-3.0.1.png
