Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions
Funding: These data were collected as part of research funded by:
This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx
This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270287
Fields:
This dataset contains a collection of posts from Reddit. The posts have been collected from 3 subreddits: r/teenagers, r/SuicideWatch, and r/depression. There are 140,000 labeled posts for training and 60,000 labeled posts for testing. Both training and testing datasets have an equal split of labels. This dataset is not mine. The original dataset is on Kaggle: https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch/versions/13
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data used in this paper are from the 16th data release of SDSS. SDSS-DR16 contains a total of 930,268 photometric images, with 1.2 billion observed sources and tens of millions of spectra. The data used in this paper were downloaded from the official SDSS website; specifically, they were obtained through the SkyServer API by running SQL queries in the CasJobs sub-site. Because the current SDSS photometric table PhotoObj can only classify observed sources as point sources or extended sources, the target sources are better classified into galaxies, stars and quasars through their spectra. We therefore obtained calibrated sources in CasJobs by cross-matching SpecPhoto with the PhotoObj catalogue, and retrieved the target position information (right ascension and declination). Calibrated sources can be distinguished precisely and quickly: each calibrated source is labelled with the parameter "Class" as "galaxy", "star", or "quasar". Observation areas 3462, 3478, 3530 and four other areas in SDSS-DR16 were selected as experimental data, because a large number of sources can be obtained in these areas, providing rich sample data for the experiment. For example, there are 9891 sources in area 3462, including 2790 galaxy sources, 2378 stellar sources and 4723 quasar sources, and 3862 sources in area 3478, including 1759 galaxy sources, 577 stellar sources and 1526 quasar sources. FITS is a data format commonly used in the astronomical community. By cross-matching the catalogue with the FITS files of the corresponding sky regions, we obtained images in the five bands u, g, r, i and z for 12499 galaxy sources, 16914 quasar sources and 16908 star sources as training and testing data.
1.1 Image synthesis. SDSS photometric data comprise images in the five bands u, g, r, i and z, packaged as single-band FITS files. Images in different bands contain different information. Since the g, r and i bands contain more feature information and less noise, astronomical researchers typically map the g, r and i bands to the R, G and B channels to synthesize colour photometric images. In general, different bands cannot be combined directly; if three bands are combined directly, the images from the different bands may not be aligned. This paper therefore uses the RGB multi-band image synthesis software written by He Zhendong et al. to synthesize images from the g, r and i bands, which effectively avoids the alignment problem. Each photometric image in this paper is 2048×1489 pixels.
1.2 Data cropping. We first crop out each target from its image; this can be done with image segmentation tools, and here the process is implemented in Python. During cropping, the right ascension and declination of each source in the catalogue are converted into pixel coordinates on the photometric image through a coordinate conversion formula, and the pixel coordinates determine the exact position of the source. These coordinates are taken as the centre point and the cutout is made as a rectangular box. We found that the input image size affects the experimental results, so, according to the apparent size of the sources, we tested three crop sizes: 40×40, 60×60 and 80×80.
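As an illustration of the cropping step in 1.2, the sketch below converts a catalogued right ascension and declination into pixel coordinates via the frame's WCS and extracts a 40×40 cutout. The file name, source position, and the use of astropy are placeholders and assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch of the RA/Dec -> pixel conversion and 40x40 cropping step.
# File name and catalogue position are placeholders, not the authors' exact pipeline.
import astropy.units as u
from astropy.io import fits
from astropy.wcs import WCS
from astropy.coordinates import SkyCoord
from astropy.nddata import Cutout2D

with fits.open("frame-r-003462-example.fits") as hdul:   # hypothetical frame file
    image = hdul[0].data
    wcs = WCS(hdul[0].header)

# Position of one calibrated source from the cross-matched SpecPhoto/PhotoObj catalogue
source = SkyCoord(ra=180.123 * u.deg, dec=0.456 * u.deg)

# Centre a 40x40-pixel rectangular box on the source, as described above
cutout = Cutout2D(image, position=source, size=(40, 40), wcs=wcs)
print(cutout.data.shape)  # (40, 40)
```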
Through experiment and analysis, we found that the convolutional neural network learns better and achieves higher accuracy on data with a small image size. In the end, we cropped the extended-source galaxies and the point-source quasars and stars at 40×40.
1.3 Division of training and test data. For the algorithm to achieve accurate recognition performance, enough image samples are needed, and the selection of the training, validation and test sets is an important factor affecting the final recognition accuracy. In this paper, the training, validation and test sets are split in the ratio 8:1:1. The validation set is used to tune the algorithm, and the test set is used to evaluate the generalization ability of the final algorithm. Table 1 shows the specific data partitioning. The total sample size is 34,000 source images, including 11543 galaxy sources, 11967 star sources and 10490 quasar sources.
1.4 Data preprocessing. In this experiment, the training and test sets are used as the training and test inputs of the algorithm after data preprocessing. The quantity and quality of the data largely determine the recognition performance of the algorithm. Preprocessing differs between the training and test sets. For the training set, we apply vertical flips, horizontal flips and scaling to the cropped images to enrich the data samples and enhance the generalization ability of the algorithm; since the features of celestial sources are flip-invariant, the labels of galaxies, stars and quasars do not change after these transformations. For the test set, preprocessing is simpler: the input images are only scaled before being used as test input.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict the physiographic region in which a hillshade occurs).
Purpose: Datasets for geomorphic deep learning. Predict the physiographic region of an area based on a hillshade image. Terrain data were derived from the 30 m (1 arc-second) 3DEP product across the entirety of CONUS. Each chip has a spatial resolution of 30 m and 256 rows and columns of pixels; as a result, each chip measures 7,680 meters-by-7,680 meters. Two datasets are provided: chips in the hs folder represent a multidirectional hillshade, while chips in the ths folder represent a tinted multidirectional hillshade. Data are represented in 8-bit (0 to 255 scale, integer values). Data are projected to the Web Mercator projection relative to the WGS84 datum. Data were split into training, test, and validation partitions using stratified random sampling by region: 70% of the samples per region were selected for training, 15% for testing, and 15% for validation. There are a total of 16,325 chips. The following 22 physiographic regions are represented: "ADIRONDACK", "APPALACHIAN PLATEAUS", "BASIN AND RANGE", "BLUE RIDGE", "CASCADE-SIERRA MOUNTAINS", "CENTRAL LOWLAND", "COASTAL PLAIN", "COLORADO PLATEAUS", "COLUMBIA PLATEAU", "GREAT PLAINS", "INTERIOR LOW PLATEAUS", "MIDDLE ROCKY MOUNTAINS", "NEW ENGLAND", "NORTHERN ROCKY MOUNTAINS", "OUACHITA", "OZARK PLATEAUS", "PACIFIC BORDER", "PIEDMONT", "SOUTHERN ROCKY MOUNTAINS", "SUPERIOR UPLAND", "VALLEY AND RIDGE", and "WYOMING BASIN". Input digital terrain models and hillshades are not provided due to the large file size (> 100GB).
Files
physioDL.csv: Table listing all image chips and associated physiographic region (id = unique ID for each chip; region = physiographic region; fnameHS = file name of associated chip in hs folder; fnameTHS = file name of associated chip in ths folder; set = data split (train, test, or validation)).
chipCounts.csv: Number of chips in each data partition per physiographic province.
map.png: Map of data.
makeChips.R: R script used to process the data into image chips and create CSV files.
inputVectors:
chipBounds.shp = square extent of each chip
chipCenters.shp = center coordinate of each chip
provinces.shp = physiographic provinces
provinces10km.shp = physiographic provinces with a 10 km negative buffer
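As a quick sanity check of the stratified 70/15/15 split described above, physioDL.csv can be tabulated with pandas; the column names used below (id, region, set) follow the field list given in the file description.

```python
# Sketch: tabulate chips per physiographic region and data split from physioDL.csv.
# Column names (id, region, set) follow the field list in the file description above.
import pandas as pd

chips = pd.read_csv("physioDL.csv")
counts = chips.pivot_table(index="region", columns="set", values="id", aggfunc="count")
print(counts)              # chips per region per train/test/validation partition
print(counts.sum().sum())  # should total 16,325 chips
```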
Description:
This dataset was created to serve as an easy-to-use image dataset, perfect for experimenting with object detection algorithms. The main goal was to provide a simplified dataset that allows for quick setup and minimal effort in exploratory data analysis (EDA). This dataset is ideal for users who want to test and compare object detection models without spending too much time navigating complex data structures. Unlike datasets like chest x-rays, which require expert interpretation to evaluate model performance, the simplicity of balloon detection enables users to visually verify predictions without domain expertise.
The original Balloon dataset was more complex, as it was split into separate training and testing sets, with annotations stored in two separate JSON files. To streamline the experience, this updated version of the dataset merges all images into a single folder and replaces the JSON annotations with a single, easy-to-use CSV file. This new format ensures that the dataset can be loaded seamlessly with tools like Pandas, simplifying the workflow for researchers and developers.
The dataset contains a total of 74 high-quality JPG images, each featuring one or more balloons in different scenes and contexts. Accompanying the images is a CSV file that provides annotation data, such as bounding box coordinates and labels for each balloon within the images. This structure makes the dataset easily accessible for a range of machine learning and computer vision tasks, including object detection and image classification. The dataset is versatile and can be used to test algorithms like YOLO, Faster R-CNN, SSD, or other popular object detection models.
Key Features:
Image Format: 74 JPG images, ensuring high compatibility with most machine learning frameworks.
Annotations: A single CSV file that contains structured data, including bounding box coordinates, class labels, and image file names, which can be loaded with Python libraries like Pandas (see the sketch after this list).
Simplicity: Designed so that users can quickly start training object detection models without needing to preprocess or deeply explore the dataset.
Variety: The images feature balloons in various sizes, colors, and scenes, making it suitable for testing the robustness of detection models.
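As referenced in the Annotations item, a minimal sketch of loading the CSV and drawing a single bounding box is shown below; the CSV file name and the column names (filename, xmin, ymin, xmax, ymax, label) are assumptions, so adjust them to the actual header of the provided file.

```python
# Minimal sketch: load the annotation CSV and draw one bounding box.
# File and column names below are assumed, not confirmed; check the CSV header first.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

annotations = pd.read_csv("balloon_annotations.csv")   # hypothetical file name
row = annotations.iloc[0]

image = Image.open(row["filename"])
fig, ax = plt.subplots()
ax.imshow(image)
ax.add_patch(patches.Rectangle(
    (row["xmin"], row["ymin"]),
    row["xmax"] - row["xmin"],
    row["ymax"] - row["ymin"],
    fill=False, edgecolor="red", linewidth=2,
))
ax.set_title(row.get("label", "balloon"))
plt.show()
```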
This dataset is sourced from Kaggle.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Welcome to the resting state EEG dataset collected at the University of San Diego and curated by Alex Rockhill at the University of Oregon.
Please email arockhil@uoregon.edu before submitting a manuscript for publication in a peer-reviewed journal using this data. We wish to ensure that the data are analyzed and interpreted with scientific integrity, so as not to mislead the public about findings that may have clinical relevance. The purpose of this is to be responsible stewards of the data without an "available upon reasonable request" clause, which we feel doesn't fully represent the open-source, reproducible ethos. The data is freely available to download, so we cannot stop your publication if we don't support your methods and interpretation of findings; however, in being good data stewards, we would like to offer suggestions in the pre-publication stage so as to reduce conflict in the published scientific literature. As far as credit, there is precedent for receiving a mention in the acknowledgements section for reading and providing feedback on the paper or, for more involved consulting, being included as an author may be warranted. The purpose of asking for this is not to inflate our number of authorships; we take ethical considerations of the best way to handle intellectual property in the form of manuscripts very seriously, and, again, sharing is at the discretion of the author, although we strongly recommend it. Please be ethical and considerate in your use of this data and all open-source data, and be sure to credit authors by citing them.
An example of an analysis that we would consider problematic, and would strongly advise be corrected before submission for publication, would be using machine learning to classify Parkinson's patients versus healthy controls using this dataset. This is because there are far too few patients for proper statistics. Parkinson's disease presents heterogeneously across patients, and, with a proper test-training split, there would be fewer than 8 patients in the testing set. Statistics on 8 or fewer patients for such a complicated disease would be inaccurate due to too small a sample size. Furthermore, if multiple machine learning algorithms were to be tested, a third split would be required to choose the best method, further lowering the number of patients in the testing set. We strongly advise against any such approach because it would mislead patients and people who are interested in knowing whether they have Parkinson's disease.
Note that UPDRS rating scales were collected by laboratory personnel who had completed online training, not by a board-certified neurologist. Results should be interpreted accordingly; in particular, analyses based largely on these ratings should be treated with the appropriate amount of uncertainty.
In addition to contacting the aforementioned email, please cite the following papers:
Nicko Jackson, Scott R. Cole, Bradley Voytek, Nicole C. Swann. Characteristics of Waveform Shape in Parkinson's Disease Detected with Scalp Electroencephalography. eNeuro 20 May 2019, 6 (3) ENEURO.0151-19.2019; DOI: 10.1523/ENEURO.0151-19.2019.
Swann NC, de Hemptinne C, Aron AR, Ostrem JL, Knight RT, Starr PA. Elevated synchrony in Parkinson disease detected with electroencephalography. Ann Neurol. 2015 Nov;78(5):742-50. doi: 10.1002/ana.24507. Epub 2015 Sep 2. PMID: 26290353; PMCID: PMC4623949.
George JS, Strunk J, Mak-McCully R, Houser M, Poizner H, Aron AR. Dopaminergic therapy in Parkinson's disease decreases cortical beta band coherence in the resting state and increases cortical beta band power during executive control. Neuroimage Clin. 2013 Aug 8;3:261-70. doi: 10.1016/j.nicl.2013.07.013. PMID: 24273711; PMCID: PMC3814961.
Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896).
Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8.
Note: see this discussion on the structure of the JSON files, which is sufficient but not optimal and will hopefully be changed in future versions of BIDS: https://neurostars.org/t/behavior-metadata-without-tsv-event-data-related-to-a-neuroimaging-data/6768/25.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:
mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).
aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.
data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].
data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.
data_h2o.tar.gz contains:
full_db.extxyz: The full dataset of 1.5k structures.
iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.
The subfolders in the folders random, uncertain, and atomic contain the training and validation sets for the random and uncertainty-based (local or atomic) active learning loops.
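The structure files above are in extended XYZ format; a minimal sketch for inspecting them, assuming the ASE package is installed, is:

```python
# Sketch: inspect the extended-XYZ structures after extracting data_h2o.tar.gz.
# Assumes the ASE package (pip install ase); file name taken from the listing above.
from ase.io import read

structures = read("full_db.extxyz", index=":")   # read all frames
print(len(structures))                           # expected ~1.5k structures
first = structures[0]
print(first.get_chemical_formula(), first.get_positions().shape)
```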
This dataset consists of mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This is designed to test the mathematical learning and algebraic reasoning skills of learning models.
## Example questions
Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
Answer: 4
Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544
Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30
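Assuming the usual distribution format of plain-text module files in which question and answer lines alternate, pairs like those above can be read as follows (the module path is a placeholder):

```python
# Sketch: read (question, answer) pairs from a module file, assuming the common
# layout of alternating question/answer lines; the path below is a placeholder.
def load_pairs(path):
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return list(zip(lines[0::2], lines[1::2]))  # questions on even lines, answers on odd lines

pairs = load_pairs("train-easy/algebra__linear_1d.txt")
print(pairs[0])
```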
It contains 2 million (question, answer) pairs per module, with questions limited to 160 characters in length, and answers to 30 characters in length. Note the training data for each question type is split into "train-easy", "train-medium", and "train-hard". This allows training models via a curriculum. The data can also be mixed together uniformly from these training datasets to obtain the results reported in the paper. Categories:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of ML in agronomy has been increasing exponentially since the start of the century, including data-driven predictions of crop yields from farm-level information on soil, climate and management. However, little is known about the effect of data partitioning schemes on the actual performance of the models, especially when they are built for yield forecasting. In this study, we explore the effect of the choice of predictive algorithm, amount of data, and data partitioning strategies on predictive performance, using synthetic datasets from biophysical crop models. We simulated sunflower and wheat data using OilcropSun and Ceres-Wheat from DSSAT for the period 2001-2020 in 5 areas of Spain. Simulations were performed in farms differing in soil depth and management. The dataset of simulated farm yields was analyzed with different algorithms (regularized linear models, random forest, artificial neural networks) as a function of seasonal weather, management, and soil. The analysis was performed with Keras for neural networks and R packages for all other algorithms. Data partitioning for training and testing was performed with ordered data (i.e., older data for training, newest data for testing) in order to compare the different algorithms in their ability to predict yields in the future by extrapolating from past data. The Random Forest algorithm had better performance (Root Mean Square Error 35-38%) than artificial neural networks (37-141%) and regularized linear models (64-65%) and was easier to execute. However, even the best models showed a limited advantage over the predictions of a sensible baseline (average yield of the farm in the training set), which showed an RMSE of 42%. Errors in seasonal weather forecasting were not taken into account, so real-world performance is expected to be even closer to the baseline. Application of AI algorithms for yield prediction should always include a comparison with the best guess to evaluate whether the additional cost of data required for the model compensates for the increase in predictive power. Random partitioning of data for training and validation should be avoided in models for yield forecasting. Crop models validated for the region and cultivars of interest may be used before actual data collection to establish the potential advantage, as illustrated in this study.
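To illustrate the ordered (temporal) partitioning recommended above, here is a minimal sketch assuming a table of simulated farm yields with a harvest-year column; the file name and cutoff year are illustrative placeholders only.

```python
# Sketch of an ordered (temporal) train/test split, as argued for above.
# Assumes a table of simulated farm yields with a 'year' column (2001-2020);
# the file name and cutoff year are illustrative placeholders.
import pandas as pd

yields = pd.read_csv("simulated_farm_yields.csv")
cutoff = 2016                                   # older seasons for training
train = yields[yields["year"] <= cutoff]
test = yields[yields["year"] > cutoff]          # newest seasons held out for forecasting
print(len(train), len(test))
```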
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.
Directory is organized into 4 subfolders, each tar'ed and gzipped:
data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage
baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage
chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage
modisco_reports.zip - TF-MoDISco reports from running on the fine-tuned ChromBPNet models
mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Z by HP Unlocked Challenge 3 - Audio Recognition - Special thanks to Hunter Kempf for helping create this challenge! Watch the tutorial video here: https://youtu.be/9Txxl0FJZas
The Challenge is to build a Machine Learning model and code to count the number of Capuchinbird calls within a given clip. This can be done in a variety of ways and we would recommend that you do some research into various methods of audio recognition.
Unlocked is an action-packed interactive film made by Z by HP for data scientists. Sharpen your skills and solve the data driven mystery here: https://www.hp.com/us-en/workstations/industries/data-science/unlocked-challenge.html
The Data is split into Training and Testing Data. For Training Data we have provided enough clips to get a decent model but you can also find, parse, augment and use additional audio clips to improve your model performance.
In order to download and properly build our Training sets we have provided details and some example code for how to interact with the files.
import os
import requests
from multiprocessing.pool import ThreadPool

def url_response(path_url_list):
    # Download a single clip from xeno-canto and write it to disk
    path, url = path_url_list
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

def make_path_and_url(clip_id):
    # Build the local file path and the xeno-canto download URL for a clip id
    file_path = os.path.join("Raw_Capuchinbird_Clips", f"XC{clip_id} - Capuchinbird - Perissocephalus tricolor.mp3")
    url = f"https://xeno-canto.org/{clip_id}/download"
    return file_path, url

clip_ids = ['114131', '114132', '119294', '16803', '16804', '168899', '178167', '178168', '201990', '216010', '216012',
            '22397', '227467', '227468', '227469', '227471', '27881', '27882', '307385', '336661', '3776', '387509',
            '388470', '395129', '395130', '401294', '40355', '433953', '44070', '441733', '441734', '456236', '456314',
            '46077', '46241', '479556', '493092', '495697', '504926', '504928', '513083', '520626', '526106', '574020',
            '574021', '600460', '65195', '65196', '79965', '9221', '98557', '9892', '9893']

os.makedirs("Raw_Capuchinbird_Clips", exist_ok=True)  # output folder must exist before writing
paths_and_urls = list(map(make_path_and_url, clip_ids))
# imap_unordered is lazy, so wrap it in list() to force the downloads to run
list(ThreadPool(4).imap_unordered(url_response, paths_and_urls))
Parsing_Single_Call_Timestamps.csv
- Clip Timestamps where Capuchinbird Calls are audible
import os
import pandas as pd
from multiprocessing.pool import ThreadPool
from pydub import AudioSegment

def parse_capuchinbird_clips(clip_tuple):
    """
    Parses the audio clip described by clip_tuple into the Parsed_Capuchinbird_Clips folder
    """
    clip_id, starts_and_ends = clip_tuple
    ms_to_seconds = 1000
    mp3_filename = os.path.join("Raw_Capuchinbird_Clips", f"XC{clip_id} - Capuchinbird - Perissocephalus tricolor.mp3")
    sound = AudioSegment.from_mp3(mp3_filename)
    count = 0
    for start, end in starts_and_ends:
        # pydub slices audio in milliseconds, so convert the second-based timestamps
        sub_clip = sound[start * ms_to_seconds:end * ms_to_seconds]
        sub_clip_name = f"XC{clip_id}-{count}"
        sub_clip.export(os.path.join("Parsed_Capuchinbird_Clips", f"{sub_clip_name}.wav"), format="wav")
        count += 1

def df_to_list_of_call_tuples(df):
    """
    Extracts a list of (clip_id, [(start, end), ...]) tuples from the provided Parsing_Single_Call_Timestamps csv file
    """
    output = []
    for clip_id in df["id"].unique():
        clip_df = df[df["id"] == clip_id].copy()
        starts = clip_df["start"].tolist()
        ends = clip_df["end"].tolist()
        clip_list = []
        for i in range(len(starts)):
            clip_list.append((starts[i], ends[i]))
        output.append((clip_id, clip_list))
    return output

os.makedirs("Parsed_Capuchinbird_Clips", exist_ok=True)  # output folder must exist before exporting
calls_df = pd.read_csv("Parsing_Single_Call_Timestamps.csv")
calls_list = df_to_list_of_call_tuples(calls_df)
# imap_unordered is lazy, so wrap it in list() to force the parsing to run
list(ThreadPool(4).imap_unordered(parse_capuchinbird_clips, calls_list))
Other_Sound_Urls.csv
- Other Birds, Animals and Forest Noises
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data folder contains all processed data and analysis scripts used for the analyses in the research described in the PNAS paper "The Temporal Dynamics of Sitting Behaviour at Work" by ten Broeke and colleagues (2020). In the paper, sitting behaviour was conceptualised as a continuous chain of sit-to-stand and stand-to-sit transitions, and multilevel time-to-event analysis was used to analyse the timing of these transitions. The data comprise ~30,000 posture transitions during work time from 156 UK-based employees from various work sites, objectively measured by an activPAL monitor that was continuously worn for approximately one week.
For the paper, a split-samples cross-validation procedure was used. Prior to looking at the data, we randomly split the data into two samples of equal size: a training sample (n = 79; 7,316 sit-to-stand and 7,263 stand-to-sit transitions) and a testing sample (n = 77; 7,216 sit-to-stand and 7,158 stand-to-sit transitions). We used the training sample for data exploration and fine-tuning of analyses and analytical decisions. After this, we preregistered our analysis plan for the testing sample and performed these analyses on the testing sample. Unless otherwise specified, in the paper we report results from the preregistered analyses on the testing sample.
A more detailed description of the procedure and all measures is given in the Methodology file. The readme file describes the content and function of all files in the data folder, and all terminology and abbreviations used in the data sets and analysis scripts. The R markdown files and HTML output files contain all R code that was used for data processing, analysis, visualization, and the power simulation.
Dataset belonging to "The temporal dynamics of sitting and standing at work, 2020"
The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled Fashion-MNIST dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, to appear.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
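The nine scale suffixes above (scte0p500 through scte2p000) correspond exactly to 2^(k/4) for k = -4, ..., 4, which can be verified with a one-liner:

```python
# The scale factors 0.500 ... 2.000 in the file names are 2^(k/4) for k in [-4, 4]
print([round(2 ** (k / 4), 3) for k in range(-4, 5)])
# [0.5, 0.595, 0.707, 0.841, 1.0, 1.189, 1.414, 1.682, 2.0]
```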
These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5", "r") as f:  # any of the nine test files listed above
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', '/x_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Training.gov.au (TGA) is the National Register of Vocational Education and Training in Australia and contains authoritative information about Registered Training Organisations (RTOs), Nationally Recognised Training (NRT) and the approved scope of each RTO to deliver NRT as required in national and jurisdictional legislation.
TGA has a web service available to allow external systems to access and utilise information stored in TGA. The TGA web service is exposed through a single interface, and web service users are assigned a data reader role which applies to all data stored in TGA.
The web service can be broadly split into three categories:
1. RTOs and other organisation types;
2. Training components, including Accredited Courses, Accredited Course Modules, Training Packages, Qualifications, Skill Sets and Units of Competency;
3. System metadata, including static data and statistical classifications.
Users gain access to the TGA web service by first passing a user name and password through to the web server. The web server then authenticates the user against the TGA security provider before passing the request to the application that supplies the web services.
There are two web services environments:
1. Production - ws.training.gov.au – National Register production web services
2. Sandbox - ws.sandbox.training.gov.au – National Register sandbox web services.
The National Register sandbox web service is used to test against the current version of the web services, where the functionality will be identical to the current production release. The web service definition and schema of the National Register sandbox database will also be identical to that of the production release at any given point in time. The National Register sandbox database will be cleared down at regular intervals and realigned with the National Register production environment.
Each environment has three configured services:
1. Organisation Service;
2. Training Component Service; and
3. Classification Service.
To access the download area for web services, navigate to http://tga.hsd.com.au and use the name and password below:
Username: WebService.Read (case sensitive)
Password: Asdf098 (case sensitive)
This download area contains various versions of the following artefacts that you may find useful:
• Training.gov.au web service specification document;
• Training.gov.au logical data model and definitions document;
• .NET web service SDK sample app (with source code);
• Java sample client (with source code);
• How to setup web service client in VS 2010 video; and
• Web services WSDLs and XSDs.
For business areas, the specification/definition documents and the sample application are a good place to start, while IT areas will find the sample source code and the video useful to start developing against the TGA web services.
The web services Sandbox end point is: https://ws.sandbox.training.gov.au/Deewr.Tga.Webservices
Once you are ready to access the production web service, please email the TGA team at tgaproject@education.gov.au to obtain a unique user name and password.
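As a rough, unofficial illustration, a Python SOAP client such as zeep can be pointed at the sandbox environment once credentials are issued. The WSDL path and the basic-authentication scheme below are assumptions, so consult the web service specification document and the SDK samples in the download area for the actual service addresses and security configuration.

```python
# Sketch only: connecting a SOAP client to the sandbox environment with zeep.
# The WSDL path and basic-auth scheme are assumptions; see the specification
# document and SDK samples in the download area for the real configuration.
from requests import Session
from requests.auth import HTTPBasicAuth
from zeep import Client
from zeep.transports import Transport

session = Session()
session.auth = HTTPBasicAuth("your_username", "your_password")  # credentials issued by the TGA team

wsdl = "https://ws.sandbox.training.gov.au/Deewr.Tga.Webservices/TrainingComponentService.svc?wsdl"  # placeholder path
client = Client(wsdl, transport=Transport(session=session))
client.wsdl.dump()  # list the available operations and types
```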
The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, to appear.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5", "r") as f:  # any of the nine test files listed above
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5', '/x_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The scripts and the data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are intended as user-friendly starting points for understanding and testing how to implement CORAL. Demo 3 is included mainly for reproducibility.
System requirements
· The software can be used in any operating system where R can be installed.
· We have developed and tested the software in a windows environment with R version 4.3.1.
· Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).
· The use of the software does not require any non-standard hardware.
Installation guide
· The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus does not require any installation other than installing R.
Demo 1: Software demo with simulated data
The software demonstration consists of two R-markdown files:
· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree), and studyDesign (list of sampling units). Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to environmental predictors).
· D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.
Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.
Demo 2: Software demo with a small subset of the data used in the paper
The software demonstration consists of one R-markdown file:
MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence of at least 40 and less than 50, and common species as those with prevalence of at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.
Scripts and data for reproducing the results presented in the paper (Demo 3)
The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:
· S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.
· S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.
· S03_import_posterior – imports the posterior distributions sampled by the initial model.
· S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.
· S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.
· S06_construct_coral_priors – calculate CORAL prior parameters.
The remaining scripts evaluate the model:
· S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.
· S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.
· S09_cross-validate – fit CORAL and the baseline model to the train/test splits and calculate performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = !training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.
· S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R2 produced by cross-validation. Generates Fig. 4 of the paper.
· S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.
· S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.
· S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.
Pre-processing scripts:
· P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.
· P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.
· P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to a regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).
Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
ENA Accession numbers
All raw sequence data are archived on mBRAVE and are publicly available in the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena; project accession number PRJEB86111; run accession numbers ERR15018787-ERR15009869; sample IDs for each accession and download URLs are provided in the file ENA_read_accessions.tsv).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How to cite us
Wyrzykowska, Maria, Gabriel della Maggiora, Nikita Deshpande, Ashkan Mokarian, and Artur Yakimovich. "A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy." bioRxiv (2024): 2024-08.
@article{wyrzykowska2024benchmark,
title={A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy},
author={Wyrzykowska, Maria and della Maggiora, Gabriel and Deshpande, Nikita and Mokarian, Ashkan and Yakimovich, Artur},
journal={bioRxiv},
pages={2024--08},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}
Data sources
Raw data used during the study can be found in corresponding references.
Data organisation
For each virus (HADV, VACV, IAV, RV and HSV) we provide the processed data in a separate directory, divided into three subdirectories: `train`, `val` and `test`, containing the proposed data split. Each of the subfolders contains two npy files: `x.npy` and `y.npy`, where `x.npy` contains the fluorescence or brightfield signal (both for HADV, as separate channels) of the cells or nuclei and `y.npy` contains the viral signal. The data is already processed as described in the Data preparation section.
Additionally, Cellpose masks are made available for the test data in a separate masks directory. For each virus except VACV, there is a subdirectory `test` containing nuclei masks (`nuc.npy`). For HADV, cell masks are also available (`cell.npy`).
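A minimal loading sketch following the layout described above; the virus directory name and the relative paths are placeholders, so adjust them to the actual directory structure after download.

```python
# Sketch: load one data split, following the directory layout described above.
# The virus directory name ("HADV") and the relative paths are placeholders.
import numpy as np

x = np.load("HADV/train/x.npy")                 # fluorescence/brightfield input channels
y = np.load("HADV/train/y.npy")                 # viral signal (virtual staining target)
nuc_masks = np.load("masks/HADV/test/nuc.npy")  # Cellpose nuclei masks for the test split
print(x.shape, y.shape, nuc_masks.shape)
```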
Data preparation
Each of the VACV plaques was imaged to produce 9 files per channel, which need to be stitched to recreate the whole plaque. To achieve this, the multiview-stitcher toolbox was used. The stitching was first performed on the third channel, representing the brightfield microscopy image of the samples; the parameters found for this channel were then used to stitch the remaining channels. The VACV dataset represents a timelapse, from which timesteps 100, 108 and 115 were selected to produce the data used in the experiments. Images were center-cropped to 5948x6048 to match the size of the smallest image in the dataset (rounded down to the closest multiple of 2). The data were additionally manually filtered to remove samples that contained only uninfected cells (C02, C07, D02, D07, E02, E07, F02, F07). The HAdV dataset is also a timelapse, from which only the last timestep (49th) was selected.
For the rest of the datasets (HSV, IAV, RV), only the negative control data was used, selected in the following way: from the data collected at the University of Zürich, only the first 2 columns were selected from the Screen samples, and only the first 12 columns from the ZPlates and prePlates samples. All of the datasets were divided into training, validation and test holdouts in 0.7:0.2:0.1 ratios, using random seed 42 to ensure reproducibility. For the time-lapse data, it was ensured that the same sample from different timesteps only exists in one of the holdouts, to prevent information leakage and ensure fair evaluation. All of the samples were normalised to the [-1, 1] range by subtracting the 3rd percentile and dividing by the difference between the 99.8th and 3rd percentiles, clipping to [0, 1] and scaling to [-1, 1]. For the brightfield channel of HAdV, percentiles 0.1 and 99.9 were used. These cutoff points were selected based on the analysis of the histograms of the values attained by the data, to make the best use of the available data range. Specific values used for the normalization are summarized in Figure 3 of the manuscript in Related/alternate identifiers.
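The percentile normalisation described above can be written compactly as follows; this is a sketch of the described procedure, not the exact code from the VIRVS repository.

```python
# Sketch of the percentile normalisation described above (not the repository's exact code):
# subtract the lower percentile, divide by the (upper - lower) percentile range,
# clip to [0, 1], then rescale to [-1, 1].
import numpy as np

def normalise(image, p_low=3.0, p_high=99.8):
    lo = np.percentile(image, p_low)
    hi = np.percentile(image, p_high)
    scaled = (image - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0) * 2.0 - 1.0

# For the HAdV brightfield channel, percentiles 0.1 and 99.9 were used instead:
# normalise(brightfield, p_low=0.1, p_high=99.9)
```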
To prepare the cell nuclei masks, the Cellpose model with pre-trained cyto3 weights was used on the fluorescence channel. The diameter was set to 7 for all the datasets except HAdV, for which automatic estimation of the diameter was employed. Cell masks were prepared using Cellpose with pre-trained cyto3 weights and a diameter of 70 on brightfield images stacked with the fluorescence nuclei signal. The data preparation can be reproduced by first downloading the datasets and then running the scripts located in the `scripts/data_processing` directory of the [VIRVS repository](https://github.com/casus/virvs), after modifying the paths in them.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SpaceNet LLC is a nonprofit organization dedicated to accelerating open source, artificial intelligence applied research for geospatial applications, specifically foundational mapping (i.e. building footprint & road network detection).
I have been experimenting with SAR image segmentation for the past few months and would like to share this high-quality dataset with the Kaggle community. It is the data from the SpaceNet 6 challenge and is freely available in the AWS Open Data registry. This dataset only contains the training split; if you are interested in the testing split (only SAR) or the expanded SAR and optical dataset, you should follow the steps and download from AWS S3. I share the dataset here to cut out the steps of downloading the data and to utilize Kaggle's powerful cloud computing.
This openly-licensed dataset features a unique combination of half-meter Synthetic Aperture Radar (SAR) imagery from Capella Space and half-meter electro-optical (EO) imagery from Maxar.
SAR data are provided by Capella Space via an aerial-mounted sensor collecting 204 individual image strips from both north- and south-facing look angles. Each of the image strips features four polarizations (HH, HV, VH, and VV) and is preprocessed to display the intensity of backscatter in decibel units at half-meter spatial resolution.
The 48k building footprint annotations are provided by the 3D Basisregistratie Adressen en Gebouwen (3DBAG) dataset with some additional quality control. The annotations also include statistics of building heights derived from a digital elevation model.
Shermeyer, J., Hogan, D., Brown, J., Etten, A.V., Weir, N., Pacifici, F., Hänsch, R., Bastidas, A., Soenen, S., Bacastow, T.M., & Lewis, R. (2020). SpaceNet 6: Multi-Sensor All Weather Mapping Dataset. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 768-777. Arxiv paper
SAR imagery can be an answer for disaster analysis or frequent earth monitoring thanks to its active sensor, imaging day and night and through any cloud coverage. But SAR images come with their own challenges and, unlike optical images, require a trained eye to interpret. Moreover, the launch of new high-resolution SAR satellites will yield massive quantities of earth observation data. Just like with any modern computer vision problem, this looks like a job for a deep learning model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` method for opening.
The dataset includes the following files:
Each file is structured to support the training and evaluation of seismic denoising models.
The HDF5 files named noise contain two main datasets:
Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:
To load these files in a Python environment, use the following approach:
```python
import h5py
import numpy as np
# Open the HDF5 file in read mode
with h5py.File('train_noise.hdf5', 'r') as file:
    # Print all the main keys in the file
    print("Keys in the HDF5 file:", list(file.keys()))

    if 'traces' in file:
        # Access the dataset
        data = file['traces'][:10]  # Load the first 10 traces

    if 'metadata' in file:
        # Access the dataset
        trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
```
Ensure that the path to the file is correctly specified relative to your Python script.
To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:
```bash
pip install numpy
pip install h5py
```
This folder contains multi-frequency Pol-InSAR data acquired by the F-SAR system of the German Aerospace Center (DLR) over Baltrum and corresponding land cover labels.
Data structure:
- data
- FP1 # Flight path 1
- L # Frequency band
- T6 # Pol-InSAR data
- pauli.bmp # Pauli-RGB image of the master scene
- S
- ...
- FP2 # Flight path 2
- ...
- label
- FP1
- label_train.bin
- ...
- FP2
- ...
Data format:
The data is provided as flat-binary raster files (.bin) with an accompanying ASCII header file (*.hdr) in ENVI-format.
For Pol-InSAR data, the real and imaginary components of the diagonal elements and upper-triangle elements of the 6 x 6 coherency matrix are stored in separate files (T11.bin, T12_real.bin, T12_imag.bin, ...). A reading sketch is given after the label mapping below.
Land cover labels contained in label_train.bin and label_test.bin are encoded as integers using the following mapping:
0 - Unassigned
1 - Tidal flat
2 - Water
3 - Coastal shrub
4 - Dense, high vegetation
5 - White dune
6 - Peat bog
7 - Grey dune
8 - Couch grass
9 - Upper saltmarsh
10 - Lower saltmarsh
11 - Sand
12 - Settlement
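As a reading sketch (not part of the dataset), the flat-binary rasters can be loaded by parsing the image dimensions and data type from the accompanying ENVI header; the data-type mapping covers the common ENVI codes, and band interleaving is ignored, which suffices for single-band rasters such as the label maps.

```python
# Sketch: read an ENVI flat-binary raster (e.g. label_train.bin) using its *.hdr file.
# The ENVI data-type codes below cover the common cases; band interleaving is not
# handled, which is sufficient for single-band rasters such as the label maps.
import numpy as np

ENVI_DTYPES = {1: np.uint8, 2: np.int16, 3: np.int32, 4: np.float32,
               5: np.float64, 12: np.uint16, 13: np.uint32}

def read_envi(bin_path, hdr_path):
    header = {}
    with open(hdr_path) as f:
        for line in f:
            if "=" in line:
                key, value = line.split("=", 1)
                header[key.strip().lower()] = value.strip()
    samples = int(header["samples"])
    lines = int(header["lines"])
    dtype = ENVI_DTYPES[int(header["data type"])]
    return np.fromfile(bin_path, dtype=dtype).reshape(lines, samples)

labels = read_envi("label/FP1/label_train.bin", "label/FP1/label_train.hdr")
print(np.unique(labels))  # integer codes 0-12 as listed above
```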