This is a subset of the Zenodo-ML Dinosaur Dataset [Github] that has been converted to small png files and organized in folders by language, so you can jump right in to using machine learning methods that assume image input.
Included are .tar.gz files, each named after a file extension; when extracted, each produces a folder of the same name.
tree -L 1
.
├── c
├── cc
├── cpp
├── cs
├── css
├── csv
├── cxx
├── data
├── f90
├── go
├── html
├── java
├── js
├── json
├── m
├── map
├── md
├── txt
└── xml
We can peek inside one of the (somewhat smaller) folders of the set to see that the subfolders are zenodo identifiers. A zenodo identifier corresponds to a single Github repository, so the png files are chunks of code of the extension type from that particular repository.
$ tree map -L 1
map
├── 1001104
├── 1001659
├── 1001793
├── 1008839
├── 1009700
├── 1033697
├── 1034342
...
├── 836482
├── 838329
├── 838961
├── 840877
├── 840881
├── 844050
├── 845960
├── 848163
├── 888395
├── 891478
└── 893858
154 directories, 0 files
Within each folder (zenodo id) the files are prefixed by the zenodo id, followed by the index into the original image set array that is provided with the full dinosaur dataset archive.
$ tree m/891531/ -L 1
m/891531/
├── 891531_0.png
├── 891531_10.png
├── 891531_11.png
├── 891531_12.png
├── 891531_13.png
├── 891531_14.png
├── 891531_15.png
├── 891531_16.png
├── 891531_17.png
├── 891531_18.png
├── 891531_19.png
├── 891531_1.png
├── 891531_20.png
├── 891531_21.png
├── 891531_22.png
├── 891531_23.png
├── 891531_24.png
├── 891531_25.png
├── 891531_26.png
├── 891531_27.png
├── 891531_28.png
├── 891531_29.png
├── 891531_2.png
├── 891531_30.png
├── 891531_3.png
├── 891531_4.png
├── 891531_5.png
├── 891531_6.png
├── 891531_7.png
├── 891531_8.png
└── 891531_9.png
0 directories, 31 files
So what's the difference?
The difference is that these files are organized by extension type, and provided as actual png images. The original data is provided as numpy data frames, and is organized by zenodo ID. Both are useful for different things - this particular version is cool because we can actually see what a code image looks like.
How many images total?
We can count the number of total images:
find "." -type f -name *.png | wc -l
3,026,993
The script to create the dataset is provided here. Essentially, we start with the top extensions as identified by this work (excluding actual image files) and then write each 80x80 image to an actual png image, organizing by extension and then zenodo id (as shown above).
I tested a few methods to write the single channel 80x80 data frames as png images, and wound up liking cv2's imwrite function because it would save and then load the exact same content.
import cv2
cv2.imwrite(image_path, image)
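To double-check that round trip yourself, here is a minimal sketch (the temporary path is just an example): write a random single-channel 80x80 array with cv2.imwrite, read it back in grayscale mode, and confirm the arrays are identical.

import numpy as np
import cv2

# create a fake single-channel 80x80 "code image"
image = np.random.randint(0, 256, size=(80, 80), dtype=np.uint8)
cv2.imwrite('/tmp/roundtrip_check.png', image)

# PNG is lossless, so the reloaded grayscale array should match exactly
reloaded = cv2.imread('/tmp/roundtrip_check.png', cv2.IMREAD_GRAYSCALE)
assert np.array_equal(image, reloaded)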
Given the above, it's pretty easy to load an image! Here is an example using imageio, followed by the older scipy approach for reference (scipy.misc.imread is deprecated in newer versions).
image_path = '/tmp/data1/data/csv/1009185/1009185_0.png'
from imageio import imread
image = imread(image_path)
array([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
image.shape
(80,80)
# Deprecated
from scipy import misc
misc.imread(image_path)
Image([[116, 105, 109, ..., 32, 32, 32],
[ 48, 44, 48, ..., 32, 32, 32],
[ 48, 46, 49, ..., 32, 32, 32],
...,
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32],
[ 32, 32, 32, ..., 32, 32, 32]], dtype=uint8)
Remember that the values in the data are characters that have been converted to ordinal. Can you guess what 32 is?
ord(' ')
32
# And thus if you wanted to convert it back...
chr(32)
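chr(32) gives back the space character. Putting that together, here is a small sketch (the helper function is ours, not part of the dataset tooling) that decodes one of these 80x80 code images back into text:

from imageio import imread

def image_to_text(image_path):
    image = imread(image_path)  # uint8 array of shape (80, 80); each value is ord(character)
    rows = ("".join(chr(v) for v in row) for row in image)
    return "\n".join(row.rstrip() for row in rows)  # drop the trailing space (ord 32) padding

print(image_to_text('/tmp/data1/data/csv/1009185/1009185_0.png'))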
So how t...
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Compilation of Python code for data preprocessing and VegeNet building, as well as image datasets (zip files).
Image datasets:
Open Images is a dataset of ~9M images that have been annotated with image-level labels and object bounding boxes.
The training set of V4 contains 14.6M bounding boxes for 600 object classes on 1.74M images, making it the largest existing dataset with object location annotations. The boxes have been largely manually drawn by professional annotators to ensure accuracy and consistency. The images are very diverse and often contain complex scenes with several objects (8.4 per image on average). Moreover, the dataset is annotated with image-level labels spanning thousands of classes.
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('open_images_v4', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/open_images_v4-original-2.0.0.png
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Raspberry Turk logo: http://www.raspberryturk.com/assets/img/logo.png
This dataset was created as part of the Raspberry Turk project. The Raspberry Turk is a robot that can play chess—it's entirely open source, based on Raspberry Pi, and inspired by the 18th century chess playing machine, the Mechanical Turk. The dataset was used to train models for the vision portion of the project.
Raw chessboard image: http://www.raspberryturk.com/assets/img/rawcapture.png
In its raw form, the dataset contains 312 480x480 images of chessboards with their associated board FENs. Each chessboard contains 30 empty squares, 8 orange pawns, 2 orange knights, 2 orange bishops, 2 orange rooks, 2 orange queens, 1 orange king, 8 green pawns, 2 green knights, 2 green bishops, 2 green rooks, 2 green queens, and 1 green king arranged in different random positions.
The Raspberry Turk source code includes several scripts for converting this raw data to a more usable form.
To get started download the raw.zip file below and then:
$ git clone git@github.com:joeymeyer/raspberryturk.git
$ cd raspberryturk
$ unzip ~/Downloads/raw.zip -d data
$ conda env create -f data/environment.yml
$ source activate raspberryturk
From this point there are two scripts you will need to run. First, convert the raw data to an interim form (individual 60x60 rgb/grayscale images) using process_raw.py like this:
$ python -m raspberryturk.core.data.process_raw data/raw/ data/interim/
This will split the raw images into individual squares and put them in labeled folders inside the interim folder. The final step is to convert the images into a dataset that can be loaded into a numpy array for training/validation. The create_dataset.py utility accomplishes this. The tool takes a number of parameters that can be used to customize the dataset (e.g., choose the labels, rgb/grayscale, ZCA-whiten the images first, include rotated images, etc.). Below is the documentation for create_dataset.py.
$ python -m raspberryturk.core.data.create_dataset --help
usage: raspberryturk/core/data/create_dataset.py [-h] [-g] [-r] [-s SAMPLE]
[-o] [-t TEST_SIZE] [-e] [-z]
base_path
{empty_or_not,white_or_black,color_piece,color_piece_noempty,piece,piece_noempty}
filename
Utility used to create a dataset from processed images.
positional arguments:
base_path Base path for data processing.
{empty_or_not,white_or_black,color_piece,color_piece_noempty,piece,piece_noempty}
Encoding function to use for piece classification. See
class_encoding.py for possible values.
filename Output filename for dataset. Should be .npz
optional arguments:
-h, --help show this help message and exit
-g, --grayscale Dataset should use grayscale images.
-r, --rotation Dataset should use rotated images.
-s SAMPLE, --sample SAMPLE
Dataset should be created by only a sample of images.
Must be value between 0 and 1.
-o, --one_hot Dataset should use one hot encoding for labels.
-t TEST_SIZE, --test_size TEST_SIZE
Test set partition size. Must be value between 0 and
1.
-e, --equalize_classes
Equalize class distributions.
-z, --zca ZCA whiten dataset.
Example of how it can be used:
$ python -m raspberryturk.core.data.create_dataset data/interim/ promotable_piece data/processed/example_dataset.npz --rotation --grayscale --one_hot --sample=0.3 --zca
Finally, the dataset is created and can be easily loaded into Python either using raspberryturk.core.data.dataset.Dataset or simply np.load.
In [1]: from raspberryturk.core.data.dataset import Dataset
In [2]: d = Dataset.load_file('data/processed/example_dataset.npz')
or
In [1]: import numpy as np
In [2]: with open('data/processed/example_dataset.npz', 'rb') as f:
   ...:     data = np.load(f)
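If you go the plain numpy route, a quick way to see what the archive contains (the exact array names depend on how create_dataset.py was invoked, so treat this as a sketch):

import numpy as np

with np.load('data/processed/example_dataset.npz') as data:
    print(data.files)   # names of the arrays stored in the archive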
Visit the data collection page of the Raspberry Turk website for more details.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Fashion-MNIST is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. We intend Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.
* Source
Here's an example of how the data looks (each class takes three rows):
Visualized Fashion MNIST dataset: https://github.com/zalandoresearch/fashion-mnist/raw/master/doc/img/fashion-mnist-sprite.png
The data is provided as a train set (86% of images; 60,000 images) and a test set (14% of images; 10,000 images) only. The train set was further split to provide 80% of its images to the training set and 20% of its images to the validation set.
@online{xiao2017/online,
author = {Han Xiao and Kashif Rasul and Roland Vollgraf},
title = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms},
date = {2017-08-28},
year = {2017},
eprintclass = {cs.LG},
eprinttype = {arXiv},
eprint = {cs.LG/1708.07747},
}
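If you want to load it programmatically, Fashion-MNIST is also available through tensorflow_datasets under the name 'fashion_mnist' (a minimal sketch, mirroring the TFDS snippets used for other datasets on this page):

import tensorflow_datasets as tfds

ds = tfds.load('fashion_mnist', split='train')
for ex in ds.take(4):
    print(ex)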
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This is the readme for the supplemental data for our ICDAR 2019 paper.
You can read our paper via IEEE here: https://ieeexplore.ieee.org/document/8978202
If you found this dataset useful, please consider citing our paper:
@inproceedings{DBLP:conf/icdar/MorrisTE19,
author = {David Morris and
Peichen Tang and
Ralph Ewerth},
title = {A Neural Approach for Text Extraction from Scholarly Figures},
booktitle = {2019 International Conference on Document Analysis and Recognition,
{ICDAR} 2019, Sydney, Australia, September 20-25, 2019},
pages = {1438--1443},
publisher = {{IEEE}},
year = {2019},
url = {https://doi.org/10.1109/ICDAR.2019.00231},
doi = {10.1109/ICDAR.2019.00231},
timestamp = {Tue, 04 Feb 2020 13:28:39 +0100},
biburl = {https://dblp.org/rec/conf/icdar/MorrisTE19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).
We used different sources of data for testing, validation, and training. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from it and used it as our validation dataset.
These datasets contain a readme with license information. Further information about the associated project can be found in the authors' published work we cited: https://doi.org/10.1007/978-3-319-51811-4_2
The DeGruyter dataset does not include the labeled images due to license restrictions. As of writing, the images can still be downloaded from DeGruyter via the links in the readme. Note that depending on what program you use to strip the images out of the PDF they are provided in, you may have to re-number the images.
We used label_generator's generated dataset, which the author made available on a requester-pays amazon s3 bucket. We also used the Multi-Type Web Images dataset, which is mirrored here.
We have made our code available in code.zip. We will upload code, announce further news, and field questions via the github repo.
Our text detection network is adapted from Argman's EAST implementation. The EAST/checkpoints/ours subdirectory contains the trained weights we used in the paper.
We used a Tesseract script to run text extraction on the detected text rows. It is included in our code archive as text_recognition_multipro.py.
We used a Java tool provided by Falk Böschen, adapted it to our file structure, and included it as evaluator.jar.
Parameter sweeps are automated by param_sweep.rb. This file also shows how to invoke all of these components.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
In the article, we trained and evaluated models on the Image Privacy Dataset (IPD) and the PrivacyAlert dataset. The datasets are originally provided by other sources and have been re-organised and curated for this work.
Our curation organises the datasets in a common structure. We updated the annotations and labelled the splits of the data in the annotation file. This avoids having separate folders of images for each data split (training, validation, testing) and allows flexible handling of new splits, e.g. those created with a stratified K-Fold cross-validation procedure. As with the original datasets (PicAlert and PrivacyAlert), we provide the links to the images in bash scripts that download them. Another bash script re-organises the images into sub-folders with at most 1,000 images each; a sketch of this step is shown below.
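A minimal Python sketch of that re-organisation step (this is not the authors' bash script; the sub-folder naming and image extensions are assumptions):

import os
import shutil

def batch_images(src_dir, dst_dir, batch_size=1000):
    # move images into numbered sub-folders holding at most `batch_size` files each
    images = sorted(f for f in os.listdir(src_dir) if f.lower().endswith(('.jpg', '.png')))
    for i, name in enumerate(images):
        sub = os.path.join(dst_dir, f"batch_{i // batch_size:03d}")
        os.makedirs(sub, exist_ok=True)
        shutil.move(os.path.join(src_dir, name), os.path.join(sub, name))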
Both datasets refer to images publicly available on Flickr. These images have a large variety of content, including sensitive content, seminude people, vehicle plates, documents, and private events. Images were annotated with a binary label denoting whether the content was deemed public or private. As the images are publicly available, their label is mostly public, so these datasets have a high imbalance towards the public class. Note that IPD combines two other existing datasets, PicAlert and part of VISPR, to increase the number of private images, which is already limited in PicAlert. Further details can be found in our corresponding publication: https://doi.org/10.48550/arXiv.2503.12464
List of datasets and their original source:
Notes:
Some of the models run their pipeline end-to-end with the images as input, whereas other models require different or additional inputs. These inputs include the pre-computed visual entities (scene types and object types) represented in a graph format, e.g. for a Graph Neural Network. Re-using these pre-computed visual entities allows other researchers to build new models based on these features while avoiding re-computing them on their own or for each epoch during the training of a model (faster training).
For each image of each dataset, namely PrivacyAlert, PicAlert, and VISPR, we provide the predicted scene probabilities as a .csv file, the detected objects as a .json file in COCO data format, and the node features (visual entities already organised in graph format with their features) as a .json file. For consistency, all the files are already organised in batches following the structure of the images in the datasets folder. For each dataset, we also provide the pre-computed adjacency matrix for the graph data.
Note: IPD is based on PicAlert and VISPR and therefore IPD refers to the scene probabilities and object detections of the other two datasets. Both PicAlert and VISPR must be downloaded and prepared to use IPD for training and testing.
Further details on downloading and organising data can be found in our GitHub repository: https://github.com/graphnex/privacy-from-visual-entities (see ARTIFACT-EVALUATION.md#pre-computed-visual-entitities-)
If you have any enquiries, questions, or comments, or you would like to file a bug report or a feature request, use the issue tracker of our GitHub repository.
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
This dataset contains a collection of images featuring individual Lao characters, specifically designed for image classification tasks. The dataset is organized into folders, where each folder is named directly with the Lao character it represents (e.g., a folder named "ກ", a folder named "ຂ", and so on) and contains 100 images of that character.
The dataset comprises images of 44 distinct Lao characters, including consonants, vowels, and tone marks.
- The dataset is divided into 44 folders.
- Each folder is named with the actual Lao character it contains.
- Each folder contains 100 images of the corresponding Lao character.
- This results in a total of 4400 images in the dataset.
- Training and evaluating image classification models for Lao character recognition.
- Developing Optical Character Recognition (OCR) systems for the Lao language.
- Research in computer vision and pattern recognition for Southeast Asian scripts.
The nature of these images (white characters on a black background) lends itself well to various data augmentation techniques to improve model robustness and performance. Consider applying augmentations such as:
- Geometric Transformations:
- Zoom (in/out)
- Height and width shifts
- Rotation
- Perspective transforms
- Blurring Effects:
- Standard blur
- Motion blur
- Noise Injection:
- Gaussian noise
Applying these augmentations can help create a more diverse training set and potentially lead to better generalization on unseen data.
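As a concrete starting point, here is a hedged sketch of those augmentations using torchvision transforms (the library choice and parameter values are ours, not part of the dataset):

import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.8, 1.2)),  # rotation, shifts, zoom
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),                    # perspective transform
    transforms.GaussianBlur(kernel_size=3),                                       # blurring
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1)),    # Gaussian noise
])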
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
The data is provided as a train set (86% of images; 60,000 images) and a test set (14% of images; 10,000 images) only. The train set was further split to provide 80% of its images to the training set and 20% of its images to the validation set. The class labels 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 were renamed to one, two, three, four, five, six, seven, eight, nine.
@article{lecun2010mnist,
title={MNIST handwritten digit database},
author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
volume={2},
year={2010}
}
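The 80/20 train/validation split described above maps naturally onto TFDS split slicing (a minimal sketch; 'mnist' is the standard tensorflow_datasets identifier):

import tensorflow_datasets as tfds

# 80% of the original train set for training, 20% for validation, plus the test set
ds_train, ds_val, ds_test = tfds.load(
    'mnist', split=['train[:80%]', 'train[80%:]', 'test'])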
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Original paper: Miyawaki Y, Uchida H, Yamashita O, Sato M, Morito Y, Tanabe HC, Sadato N & Kamitani Y (2008) Visual Image Reconstruction from Human Brain Activity using a Combination of Multiscale Local Image Decoders. Neuron 60:915-929.
This is the fMRI data from Miyawaki et al. (2008) "Visual image reconstruction from human brain activity using a combination of multiscale local image decoders". Neuron 60:915-29. In this study, we collected fMRI activity from subjects viewing images, and constructed decoders predicting local image contrast at multiple spatial scales. The combined decoders based on a linear model successfully reconstructed presented stimuli from fMRI activity.
The experiment consisted of human subjects viewing contrast-based images of 12 x 12 flickering patches. There were two types of image viewing tasks: (1) random image viewing and (2) figure image (geometric shape or alphabet letter) viewing. For image presentation, a block design was used with rest periods between the presentation of each image. For random image patch presentation, images were presented for 6 s, followed by 6 s rest. For figure image presentation, images were presented for 12 s, followed by 12 s rest. The data from random image viewing runs were used to train the decoding models, and the trained model were evaluated with the data from figure image viewing runs.
This dataset contains two subjects ('sub-01' and 'sub-02'). The subjects performed two sessions of fMRI experiments ('ses-01' and 'ses-02'). Each session is composed of several EPI runs (TR, 2000 ms; TE, 30 ms; flip angle, 80°; voxel size, 3 × 3 × 3 mm; FOV, 192 × 192 mm; number of slices, 30; slice gap, 0 mm) and inplane T2-weighted imaging (TR, 6000 ms; TE, 57 ms; flip angle, 90°; voxel size, 0.75 × 0.75 × 3.0 mm; FOV, 192 × 192 mm). The EPI images covered the entire occipital lobe. The dataset also includes a T1-weighted anatomical reference image for each subject (TR, 2250 ms; TE, 2.98 ms for sub-01 and 3.06 ms for sub-02; TI, 900 ms; flip angle, 9°; voxel size, 1.0 × 1.0 × 1.0 mm; FOV, 256 × 256 mm). The T1w images were obtained in sessions different from the fMRI experiment sessions and stored in 'ses-anat' directories. The T1w images were defaced by pydeface (https://pypi.python.org/pypi/pydeface). All DICOM files were converted to Nifti-1 files by mri_convert in FreeSurfer. In addition, the dataset contains mask images of manually defined ROIs for each subject in the sourcedata directory (see README in sourcedata for more details).
During fMRI runs, the subject viewed contrast-based images of 12 × 12 flickering image patches. Two types of runs ('viewRandom' and 'viewFigure') were included in the experiment. In 'viewRandom' runs, random images were presented as visual stimuli. Each 'viewRandom' run consisted of 22 stimulus presentation trials and lasted for 298 s (149 volumes). The two subjects performed 20 'viewRandom' runs. In 'viewFigure' runs, either a geometric shape pattern (square, small frame, large frame, plus, X) or an alphabet letter pattern (n, e, u, r, o) was presented in each trial. In addition, data recorded while the subject viewed thin and large alphabet letter patterns (n, e, u, r, o) are included in the dataset (they are not included in the results of the original study). Each 'viewFigure' run consisted of 10 stimulus presentation trials and lasted for 268 s (134 volumes). Subjects 'sub-01' and 'sub-02' performed 12 and 10 'viewFigure' runs, respectively.
To help subjects suppress eye blinks and firmly fixate the eyes, the color of the fixation spot changed from white to red 2 s before each stimulus block started. To ensure alertness, subjects were instructed to detect the color change of the fixation (red to green, 100 ms) that occurred after a random interval of 3–5 s from the beginning of each stimulus block. Subject performance was monitored online during the experiments but was not recorded, so it is omitted from the dataset.
The value of trial_type in the task event files (*_events.tsv) indicates the type of each trial (block) as below.
rest: Rest trial (no visual stimulus).
stimulus_random: Random pattern.
stimulus_shape: Geometric shape pattern (square, small frame, large frame, plus, X).
stimulus_alphabet: Alphabet pattern (n, e, u, r, o).
stimulus_alphabet_thin: Thin alphabet pattern (n, e, u, r, o).
stimulus_alphabet_long: Long alphabet pattern (n, e, u, r, o).
Note that the results from the thin and long alphabet patterns are not included in the original paper, although the data were obtained in the same sessions.
The additional column stimulus_pattern contains the pattern of stimuli (12 × 12) presented in each stimulus trial. It is vectorized in row-major order. Each element in the vector corresponds to a patch (1.15° × 1.15°) in a stimulus pattern. 1 and 0 represent a flickering checkerboard and a gray area, respectively. For example, the stimulus pattern
000000000000000000000000000000000000000111111000000111111000000110011000000110011000000110011000000110011000000000000000000000000000000000000000
represents the following stimulus.
000000000000
000000000000
000000000000
000111111000
000111111000
000110011000
000110011000
000110011000
000110011000
000000000000
000000000000
000000000000
The column holds 'null' for rest trials.
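To recover the 12 × 12 layout from the flattened stimulus_pattern string, a single reshape is enough (a minimal sketch using the example pattern above):

import numpy as np

pattern = ('000000000000'
           '000000000000'
           '000000000000'
           '000111111000'
           '000111111000'
           '000110011000'
           '000110011000'
           '000110011000'
           '000110011000'
           '000000000000'
           '000000000000'
           '000000000000')
grid = np.array(list(pattern), dtype=int).reshape(12, 12)  # 1 = flickering checkerboard patch, 0 = gray
print(grid)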
===========================================
Pydeface was used on all anatomical images to ensure de-identification of subjects. The code can be found at https://github.com/poldracklab/pydeface
MRIQC was run on the dataset. Results are located in derivatives/mriqc. Learn more about it here: https://mriqc.readthedocs.io/en/stable/
1) www.openfmri.org/dataset/ds******/ See the comments section at the bottom of the dataset page.
2) www.neurostars.org Please tag any discussion topics with the tags openfmri and dsXXXXXX.
3) Send an email to submissions@openfmri.org. Please include the accession number in your email.
- Behavioral performance data does not accompany this dataset, as it was not submitted.
Dataset Card for aedupuga/cards-image-dataset
Dataset Description
This dataset consists of images of some of the cards from 2 different card decks, labelled as Face (0) or Value (1).
Curated by: Anuhya Edupuganti
Uses
Direct Use
- Training and evaluating image classification models
- Experimenting with image preprocessing (resizing and augmentation)
Dataset Structure
This dataset contains two splits:
original: 30 samples of cards from… See the full description on the dataset page: https://huggingface.co/datasets/aedupuga/cards-image-dataset.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Project Page
Paper
https://arxiv.org/abs/2210.10732
Overview
OpenEarthMap is a benchmark dataset for global high-resolution land cover mapping. OpenEarthMap consists of 5000 aerial and satellite images with manually annotated 8-class land cover labels and 2.2 million segments at a 0.25-0.5m ground sampling distance, covering 97 regions from 44 countries across 6 continents. OpenEarthMap fosters research including but not limited to semantic segmentation and domain adaptation. Land cover mapping models trained on OpenEarthMap generalize worldwide and can be used as off-the-shelf models in a variety of applications.
Reference
@inproceedings{xia_2023_openearthmap,
  title     = {OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping},
  author    = {Junshi Xia and Naoto Yokoya and Bruno Adriano and Clifford Broni-Bediako},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {January},
  year      = {2023},
  pages     = {6254-6264}
}
License
Label data of OpenEarthMap are provided under the same license as the original RGB images, which varies with each source dataset. For more details, please see the attribution of source data here. Label data for regions where the original RGB images are in the public domain or where the license is not explicitly stated are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Note for xBD data
The RGB images of xBD dataset are not included in the OpenEarthMap dataset. Please download the xBD RGB images from https://xview2.org/dataset and add them to the corresponding folders. The "xbd_files.csv" contains information about how to prepare the xBD RGB images and add them to the corresponding folders.
Code
Sample code to add the xBD RGB images to the distributed OpenEarthMap dataset and to train baseline models is available here.
Leaderboard
Performance on the test set can be evaluated on the Codalab webpage.
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Horikawa, T. & Kamitani, Y. (2017) Generic decoding of seen and imagined objects using hierarchical visual features. Nature Communications 8:15037. https://www.nature.com/articles/ncomms15037
In this study, fMRI data was recorded while subjects were viewing object images (image presentation experiment) or were imagining object images (imagery experiment). The image presentation experiment consisted of two distinct types of sessions: training image sessions and test image sessions. In the training image session, a total of 1,200 images from 150 object categories (8 images from each category) were each presented only once (24 runs). In the test image session, a total of 50 images from 50 object categories (1 image from each category) were presented 35 times each (35 runs). All images were taken from ImageNet (http://www.image-net.org/, Fall 2011 release), a large-scale hierarchical image database. During the image presentation experiment, subjects performed a one-back image repetition task (5 trials in each run). In the imagery experiment, subjects were required to visually imagine images from 1 of the 50 categories (20 runs; 25 categories in each run; 10 samples for each category) that were presented in the test image session of the image presentation experiment. fMRI data in the training image sessions were used to train models (decoders) which predict visual features from fMRI patterns, and those in the test image sessions and the imagery experiment were used to evaluate the model performance. Predicted features for the test image sessions and imagery experiment are used to identify seen/imagined object categories from a set of computed features for numerous object images.
Analysis demo code is available at GitHub (KamitaniLab/GenericObjectDecoding).
The present dataset contains fMRI data from five subjects ('sub-01', 'sub-02', 'sub-03', 'sub-04', and 'sub-05'). Each subject data contains three types of MRI data each of which was collected over multiple scanning sessions.
Each scanning session consisted of functional (EPI) and anatomical (inplane T2) data. The functional EPI images covered the entire brain (TR, 3000 ms; TE, 30 ms; flip angle, 80°; voxel size, 3 × 3 × 3 mm; FOV, 192 × 192 mm; number of slices, 50, slice gap, 0 mm) and inplane T2-weighted anatomical images were acquired with the same slices used for the EPI (TR, 7020 ms; TE, 69 ms; flip angle, 160°; voxel size, 0.75 × 0.75 × 3.0 mm; FOV, 192 × 192 mm). The dataset also includes a T1-weighted anatomical reference image for each subject (TR, 2250 ms; TE, 3.06 ms; TI, 900 ms; flip angle, 9°; voxel size, 1.0 × 1.0 × 1.0 mm; FOV, 256 × 256 mm). The T1-weighted images were scanned only once for each subject in a separate scanning session and are stored in 'ses-anatomy' directories. The T1-weighted images were defaced by pydeface (https://pypi.python.org/pypi/pydeface). All DICOM files are converted to Nifti-1 files by mri_convert in FreeSurfer. In addition, the dataset contains mask images of manually defined ROIs for each subject in 'sourcedata' directory (See 'README' in 'sourcedata' for more details).
Preprocessed fMRI data are available in derivatives/preproc-spm. See the original paper (Horikawa & Kamitani, 2017) for the details of preprocessing.
Task event files (‘sub-*_ses-*_task-*_run-*_events.tsv’) contain events (stimulus presentation, subject responses, etc.) recorded during fMRI runs. In the task event files for the perception task (‘ses-perceptionTraining' and 'ses-perceptionTest'), each column represents:
In task event files for imagery task ('ses-imageryTest'), each column represents:
The stimulus images are named like 'n03626115_19498', where 'n03626115' is the ImageNet/WordNet ID for a synset (category) and '19498' is the image ID. The categories are named by their ImageNet/WordNet synset ID (e.g., 'n03626115'). The stimulus and category names are included in the task event files as 'stimulus_name' and 'category_name', respectively. For use in analysis code, the task event files also contain 'stimulus_id' and 'category_id', which are float numbers generated from the stimulus or category names (e.g., 'n03626115_19498' --> 3626115.019498).
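For illustration, a small sketch of that name-to-ID conversion (zero-padding the image ID to six decimal digits is inferred from the example above and should be treated as an assumption):

def stimulus_name_to_id(name):
    # e.g., 'n03626115_19498' -> 3626115.019498
    synset, image = name.lstrip('n').split('_')
    return float(f"{int(synset)}.{int(image):06d}")

print(stimulus_name_to_id('n03626115_19498'))  # 3626115.019498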
The mapping between stimulus/category names and IDs:
Because of licensing issues, we do not include the stimulus images in the dataset. A script downloading the images from ImageNet is available at https://github.com/KamitaniLab/GenericObjectDecoding. Image features (CNN unit responses, HMAX, GIST, and SIFT) used in the original study are available at https://figshare.com/articles/Generic_Object_Decoding/7387130.
Introduction
Vessel segmentation in fundus images is essential in the diagnosis and prognosis of retinal diseases and the identification of image-based biomarkers. However, creating a vessel segmentation map can be a tedious and time-consuming process, requiring careful delineation of the vasculature, which is especially hard for microcapillary plexi in fundus images. Optical coherence tomography angiography (OCT-A) is a relatively novel modality visualizing blood flow and microcapillary plexi not clearly observed in fundus photography. Unfortunately, current commercial OCT-A cameras have various limitations due to their complex optics, which make them more expensive, less portable, and limited to a smaller field of view (FOV) than fundus cameras. Moreover, the vast majority of population health data collection efforts do not include OCT-A data.
We believe that strategies able to map fundus images to en-face OCT-A can create precise vascular vessel segmentation with less effort.
In this dataset, called UTHealth - Fundus and Synthetic OCT-A Dataset (UT-FSOCTA), we include fundus images and en-face OCT-A images for 112 subjects. The two modalities have been manually aligned to allow for training of medical imaging machine learning pipelines. This dataset is accompanied by a manuscript that describes an approach to generate fundus vessel segmentations using OCT-A for training (Coronado et al., 2022). We refer to this approach as "Synthetic OCT-A".
Fundus Imaging
We include 45 degree macula-centered fundus images that cover both the macula and the optic disc. All images were acquired using an OptoVue iVue fundus camera without pupil dilation.
The full images are available at the fov45/fundus directory. In addition, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/fundus/disc and cropped/fundus/macula.
Enface OCT-A
We include the en-face OCT-A images of the superficial capillary plexus. All images were acquired using an OptoVue Avanti OCT camera with OCT-A reconstruction software (AngioVue). Low quality images with errors in the retina layer segmentations were not included.
En-face OCT-A images are located in cropped/octa/disc and cropped/octa/macula. In addition, we include a denoised version of these images in which only vessels are included. This was performed automatically using the ROSE algorithm (Ma et al. 2021). These can be found in cropped/GT_OCT_net/noThresh and cropped/GT_OCT_net/Thresh; the former contains the probability maps from the ROSE algorithm, the latter a binary map.
Synthetic OCT-A
We train a custom conditional generative adversarial network (cGAN) to map a fundus image to an en face OCT-A image. Our model consists of a generator synthesizing en face OCT-A images from corresponding areas in fundus photographs and a discriminator judging the resemblance of the synthesized images to the real en face OCT-A samples. This allows us to avoid the use of manual vessel segmentation maps altogether.
The full images are available in the fov45/synthetic_octa directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images, collected in cropped/synthetic_octa/disc and cropped/synthetic_octa/macula. In addition, we applied the same ROSE denoising algorithm (Ma et al. 2021) used for the original en-face OCT-A images; the results are available in cropped/denoised_synthetic_octa/noThresh and cropped/denoised_synthetic_octa/Thresh, where the former contains the probability maps from the ROSE algorithm and the latter a binary map.
Other Fundus Vessel Segmentations Included
In this dataset, we have also included the output of two recent vessel segmentation algorithms trained on external datasets with manual vessel segmentations: SA-UNet (Guo et al., 2021) and IterNet (Li et al., 2020).
SA-Unet. The full images are available at the fov45/SA_Unet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/SA_Unet/disc and cropped/SA_Unet/macula.
IterNet. The full images are available at the fov45/Iternet directory. Then, we extracted the FOVs corresponding to the en-face OCT-A images collected in cropped/Iternet/disc and cropped/Iternet/macula.
Train/Validation/Test Replication
In order to replicate or compare your model to the results of our paper, we report below the data split used.
Training subjects IDs: 1 - 25
Validation subjects IDs: 26 - 30
Testing subjects IDs: 31 - 112
Data Acquisition
This dataset was acquired at the Texas Medical Center - Memorial Hermann Hospital in accordance with the guidelines from the Helsinki Declaration and it was approved by the UTHealth IRB with protocol HSC-MS-19-0352.
User Agreement
The UT-FSOCTA dataset is free to use for non-commercial scientific research only. In case of any publication, the following paper needs to be cited:
Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
Funding
This work is supported by the Translational Research Institute for Space Health through NASA Cooperative Agreement NNX16AO69A.
Research Team and Acknowledgements
Here are the people behind this data acquisition effort:
Ivan Coronado, Samiksha Pachade, Rania Abdelkhaleq, Juntao Yan, Sergio Salazar-Marioni, Amanda Jagolino, Mozhdeh Bahrainian, Roomasa Channa, Sunil Sheth, Luca Giancardo
We would also like to acknowledge, for their support: the Institute for Stroke and Cerebrovascular Diseases at UTHealth, the VAMPIRE team at the University of Dundee, UK, and the Memorial Hermann Hospital System.
References
Coronado I, Pachade S, Trucco E, Abdelkhaleq R, Yan J, Salazar-Marioni S, Jagolino-Cole A, Bahrainian M, Channa R, Sheth SA, Giancardo L. Synthetic OCT-A blood vessel maps using fundus images and generative adversarial networks. Sci Rep 2023;13:15325. https://doi.org/10.1038/s41598-023-42062-9.
C. Guo, M. Szemenyei, Y. Yi, W. Wang, B. Chen, and C. Fan, "SA-UNet: Spatial Attention U-Net for Retinal Vessel Segmentation," in 2020 25th International Conference on Pattern Recognition (ICPR), Jan. 2021, pp. 1236–1242. doi: 10.1109/ICPR48806.2021.9413346.
L. Li, M. Verma, Y. Nakashima, H. Nagahara, and R. Kawasaki, "IterNet: Retinal Image Segmentation Utilizing Structural Redundancy in Vessel Networks," 2020 IEEE Winter Conf. Appl. Comput. Vis. WACV, 2020, doi: 10.1109/WACV45572.2020.9093621.
Y. Ma et al., "ROSE: A Retinal OCT-Angiography Vessel Segmentation Dataset and New Model," IEEE Trans. Med. Imaging, vol. 40, no. 3, pp. 928–939, Mar. 2021, doi: 10.1109/TMI.2020.3042802.
Color pattern variation provides biological information in fields ranging from disease ecology to speciation dynamics. Comparing color pattern geometries across images requires color segmentation, where pixels in an image are assigned to one of a set of color classes shared by all images. Manual methods for color segmentation are slow and subjective, while automated methods can struggle with high technical variation in aggregate image sets. We present recolorize, an R package toolbox for human-subjective color segmentation with functions for batch-processing low-variation image sets and additional tools for handling images from diverse (high variation) sources. The package also includes export options for a variety of formats and color analysis packages. This paper illustrates recolorize for three example datasets, including high variation, batch processing, and combining with reflectance spectra, and demonstrates the downstream use of methods that rely on this output.
The included dataset is a copy of the code and images found in the recolorize_examples GitHub repository as of January 2024: https://github.com/hiweller/recolorize_examples The code includes all steps necessary to recreate the analyses presented in the paper. Images were sourced as follows:
Figure 1: Images of Chrysochroa beetles by Nathan P. Lord (coauthor).
Figure 2: Pygoplites diacanthus image from John E. Randall/Bishop Museum (http://pbs.bishopmuseum.org/images/JER/detail.asp?size=i&cols=10&ID=1432937402).
Figure 3: Images of Neolamprologus fishes taken by Ad Konings, used with his kind permission.
Figure 4: Images of Polistes fuscatus wasps taken by James Tumulty, a subset of the images used for analysis in Tumulty et al. (2023): https://doi.org/10.1016/j.cub.2023.11.032
Figure 5: Images of Diglossa birds taken by Anna E. Hiller (coauthor).
# Code and example images from 'recolorize: An R package for flexible color segmentation of biological images'
This repository contains example files and code from the methods paper (Weller, Hiller, Lord, and Van Belleghem, 2024) describing how to use the recolorize R package.
All examples were developed and tested using R version 4.3.2 (2023-10-31, "Eye Holes"), and R version >3.5.0 is recommended.
The main directory contains folders for each of the examples demonstrated in the paper (01_beetles, 03_cichlids, 04_wasps, and 05_birds), as well as two R scripts (patPCA_total.R and 00_installation_RUN_ME_FIRST.R) and an R project file (recolorize_examples.Rproj).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.
We release our dataset as a set of folders indicating the patch target label (e.g., banana), each containing 1000 subfolders corresponding to the ImageNet output classes.
An example showing how to use the dataset is shown below.
import os.path

import torch.utils.data
from torchvision import datasets, transforms, models


class ImageFolderWithEmptyDirs(datasets.ImageFolder):
    """
    This is required for handling empty folders from the ImageFolder class.
    """

    def find_classes(self, directory):
        classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
        if not classes:
            raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
        class_to_idx = {cls_name: i for i, cls_name in enumerate(classes) if
                        len(os.listdir(os.path.join(directory, cls_name))) > 0}
        return classes, class_to_idx


dataset_folder = 'data/ImageNet-Patch'
available_labels = {
    487: 'cellular telephone',
    513: 'cornet',
    546: 'electric guitar',
    585: 'hair spray',
    804: 'soap dispenser',
    806: 'sock',
    878: 'typewriter keyboard',
    923: 'plate',
    954: 'banana',
    968: 'cup'
}
target_label = 954

dataset_folder = os.path.join(dataset_folder, str(target_label))
normalizer = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                  std=[0.229, 0.224, 0.225])
transforms = transforms.Compose([
    transforms.ToTensor(),
    normalizer
])

dataset = ImageFolderWithEmptyDirs(dataset_folder, transform=transforms)
model = models.resnet50(pretrained=True)
loader = torch.utils.data.DataLoader(dataset, shuffle=True, batch_size=5)
model.eval()

batches = 10
correct, attack_success, total = 0, 0, 0
for batch_idx, (images, labels) in enumerate(loader):
    if batch_idx == batches:
        break
    pred = model(images).argmax(dim=1)
    correct += (pred == labels).sum()
    attack_success += sum(pred == target_label)
    total += pred.shape[0]

accuracy = correct / total
attack_sr = attack_success / total

print("Robust Accuracy: ", accuracy)
print("Attack Success: ", attack_sr)
ILSVRC 2012, commonly known as 'ImageNet', is an image dataset organized according to the WordNet hierarchy. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are more than 100,000 synsets in WordNet, the majority of them nouns (80,000+). In ImageNet, we aim to provide on average 1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. In its completion, we hope ImageNet will offer tens of millions of cleanly sorted images for most of the concepts in the WordNet hierarchy.
The test split contains 100K images but no labels because no labels have been publicly released. We provide support for the test split from 2012 with the minor patch released on October 10, 2019. In order to manually download this data, a user must perform the following operations:
The resulting tar-ball may then be processed by TFDS.
To assess the accuracy of a model on the ImageNet test split, one must run inference on all images in the split and export those results to a text file that is then uploaded to the ImageNet evaluation server. The maintainers of the ImageNet evaluation server permit a single user to submit up to 2 submissions per week in order to prevent overfitting.
To evaluate the accuracy on the test split, one must first create an account at image-net.org. This account must be approved by the site administrator. After the account is created, one can submit the results to the test server at https://image-net.org/challenges/LSVRC/eval_server.php. The submission consists of several ASCII text files corresponding to multiple tasks. The task of interest is "Classification submission (top-5 cls error)". A sample of an exported text file looks like the following:
771 778 794 387 650
363 691 764 923 427
737 369 430 531 124
755 930 755 59 168
The export format is described in full in "readme.txt" within the 2013 development kit available here: https://image-net.org/data/ILSVRC/2013/ILSVRC2013_devkit.tgz. Please see the section entitled "3.3 CLS-LOC submission format". Briefly, the text file contains 100,000 lines, one for each image in the test split. Each line of integers corresponds to the rank-ordered top-5 predictions for that test image. The integers are 1-indexed, corresponding to the line number in the labels file (see labels.txt).
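As a sketch of that export format (the prediction values below are placeholders): one line per test image, each holding five space-separated, 1-indexed class numbers.

import numpy as np

predictions = np.random.randint(1, 1001, size=(100000, 5))  # placeholder rank-ordered top-5 labels
with open('submission.txt', 'w') as f:
    for row in predictions:
        f.write(' '.join(str(int(v)) for v in row) + '\n')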
To use this dataset:
import tensorflow_datasets as tfds
ds = tfds.load('imagenet2012', split='train')
for ex in ds.take(4):
  print(ex)
See the guide for more information on tensorflow_datasets.
Visualization: https://storage.googleapis.com/tfds-data/visualization/fig/imagenet2012-5.1.0.png
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Dataset Card for Car Images Dataset
Dataset Description
This dataset contains a collection of car images designed to support tasks such as image classification, car detection, and autonomous driving research. The images feature cars in various settings, including streets and other environments, captured to provide diverse training data for computer vision models. The dataset aims to enable the development and benchmarking of models that can:
Identify different types… See the full description on the dataset page: https://huggingface.co/datasets/KAI-KratosAI/car-images.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The CIFAR-10 and CIFAR-100 datasets contain labeled subsets of the 80 Million Tiny Images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
* More info on CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
* TensorFlow listing of the dataset: https://www.tensorflow.org/datasets/catalog/cifar100
* GitHub repo for converting CIFAR-100 tarball files to png format: https://github.com/knjcode/cifar2png
The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images [in the original dataset].
This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs). However, this project does not contain the superclasses.
* Superclasses version: https://universe.roboflow.com/popular-benchmarks/cifar100-with-superclasses/
More background on the dataset:
CIFAR-100 Dataset Classes and Superclasses: https://i.imgur.com/5w8A0Vm.png
The data is provided as a train set (83.33% of images; 50,000 images) and a test set (16.67% of images; 10,000 images) only. The train set was further split to provide 80% of its images to the training set (approximately 40,000 images) and 20% of its images to the validation set (approximately 10,000 images).
@TECHREPORT{Krizhevsky09learningmultiple,
author = {Alex Krizhevsky},
title = {Learning multiple layers of features from tiny images},
institution = {},
year = {2009}
}
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
For more details and the most up-to-date information please consult our project page: https://kainmueller-lab.github.io/fisbe.
Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging, thin filamentous and widely branching morphologies, multiple neurons are tightly inter-weaved, and partial volume effects, uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research, namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing, to date methods are typically benchmarked on synthetic datasets. To address this gap, we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset, the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition, we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly, we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies, and facilitate scientific discovery in basic neuroscience.
We provide a detailed documentation of our dataset, following the Datasheet for Datasets questionnaire:
Our dataset originates from the FlyLight project, where the authors released a large image collection of nervous systems of ~74,000 flies, available for download under CC BY 4.0 license.
Each sample consists of a single 3d MCFO image of neurons of the fruit fly.
For each image, we provide a pixel-wise instance segmentation for all separable neurons.
Each sample is stored as a separate zarr file (zarr is a file storage format for chunked, compressed, N-dimensional arrays based on an open-source specification).
The image data ("raw") and the segmentation ("gt_instances") are stored as two arrays within a single zarr file.
The segmentation mask for each neuron is stored in a separate channel.
The order of dimensions is CZYX.
We recommend working in a virtual environment, e.g., by using conda:
conda create -y -n flylight-env -c conda-forge python=3.9
conda activate flylight-env
pip install zarr
import zarr
# replace "path/to/sample.zarr" with the path to a sample file
raw = zarr.open("path/to/sample.zarr", mode='r', path="volumes/raw")
seg = zarr.open("path/to/sample.zarr", mode='r', path="volumes/gt_instances")

# optional:
import numpy as np
raw_np = np.array(raw)
Zarr arrays are read lazily on-demand.
Many functions that expect numpy arrays also work with zarr arrays.
Optionally, the arrays can also explicitly be converted to numpy arrays.
We recommend using napari to view the image data.
pip install "napari[all]"
import zarr, sys, napari
raw = zarr.load(sys.argv[1], mode='r', path="volumes/raw")
gts = zarr.load(sys.argv[1], mode='r', path="volumes/gt_instances")

viewer = napari.Viewer(ndisplay=3)
for idx, gt in enumerate(gts):
    viewer.add_labels(gt, rendering='translucent', blending='additive', name=f'gt_{idx}')
viewer.add_image(raw[0], colormap="red", name='raw_r', blending='additive')
viewer.add_image(raw[1], colormap="green", name='raw_g', blending='additive')
viewer.add_image(raw[2], colormap="blue", name='raw_b', blending='additive')
napari.run()
python view_data.py <path-to-zarr-file>
For more information on our selected metrics and formal definitions please see our paper.
To showcase the FISBe dataset together with our selection of metrics, we provide evaluation results for three baseline methods, namely PatchPerPix (ppp), Flood Filling Networks (FFN) and a non-learnt application-specific color clustering from Duan et al.
For detailed information on the methods and the quantitative results please see our paper.
The FlyLight Instance Segmentation Benchmark (FISBe) dataset is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
If you use FISBe in your research, please use the following BibTeX entry:
@misc{mais2024fisbe,
title = {FISBe: A real-world benchmark dataset for instance
segmentation of long-range thin filamentous structures},
author = {Lisa Mais and Peter Hirsch and Claire Managan and Ramya
Kandarpa and Josef Lorenz Rumberger and Annika Reinke and Lena
Maier-Hein and Gudrun Ihrke and Dagmar Kainmueller},
year = 2024,
eprint = {2404.00130},
archivePrefix ={arXiv},
primaryClass = {cs.CV}
}
We thank Aljoscha Nern for providing unpublished MCFO images as well as Geoffrey W. Meissner and the entire FlyLight Project Team for valuable
discussions.
P.H., L.M. and D.K. were supported by the HHMI Janelia Visiting Scientist Program.
This work was co-funded by Helmholtz Imaging.
There have been no changes to the dataset so far.
All future changes will be listed on the changelog page.
If you would like to contribute, have encountered any issues or have any suggestions, please open an issue for the FISBe dataset in the accompanying github repository.
All contributions are welcome!