Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description:
Downsized (256x256) camera trap images used for the analyses in "Can CNN-based species classification generalise across variation in habitat within a camera trap survey?", and the dataset composition for each analysis. Note that images tagged as 'human' have been removed from this dataset. Full-size images for the BorneoCam dataset will be made available at LILA.science. The full SAFE camera trap dataset metadata is available at DOI: 10.5281/zenodo.6627707.
Project: This dataset was collected as part of the following SAFE research project: Machine learning and image recognition to monitor spatio-temporal changes in the behaviour and dynamics of species interactions
Funding: These data were collected as part of research funded by:
This dataset is released under the CC-BY 4.0 licence, requiring that you cite the dataset in any outputs, but has the additional condition that you acknowledge the contribution of these funders in any outputs.
XML metadata: GEMINI compliant metadata for this dataset is available here
Files: This dataset consists of 3 files: CT_image_data_info2.xlsx, DN_256x256_image_files.zip, DN_generalisability_code.zip
CT_image_data_info2.xlsx
This file contains dataset metadata and 1 data table:
Dataset Images (described in worksheet Dataset_images)
Description: This worksheet details the composition of each dataset used in the analyses
Number of fields: 69
Number of data rows: 270287
Fields:
This dataset contains a collection of posts from Reddit. The posts have been collected from 3 subreddits: r/teenagers, r/SuicideWatch, and r/depression. There are 140,000 labeled posts for training and 60,000 labeled posts for testing. Both training and testing datasets have an equal split of labels. This dataset is not mine. The original dataset is on Kaggle: https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch/versions/13
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
The data used in this paper are from the 16th data release of SDSS. SDSS-DR16 contains a total of 930,268 photometric images, with 1.2 billion observed sources and tens of millions of spectra. The data used in this paper were downloaded from the official SDSS website; specifically, they were obtained through the SkyServer API by running SQL queries in the CasJobs sub-site. Because the current SDSS photometric table PhotoObj can only classify observed sources as point sources or extended sources, the target sources are better classified into galaxies, stars and quasars through their spectra. We therefore obtained calibrated sources in CasJobs by cross-matching SpecPhoto with the PhotoObj catalogue, and retrieved the target position information (right ascension and declination). Calibrated sources can be distinguished precisely and quickly: each calibrated source is labelled with the parameter "Class" as "galaxy", "star", or "quasar". Observation areas 3462, 3478, 3530 and four other areas in SDSS-DR16 were selected as experimental data, because a large number of sources can be obtained in these areas, providing rich sample data for the experiment. For example, there are 9891 sources in area 3462, including 2790 galaxy sources, 2378 stellar sources and 4723 quasar sources, and 3862 sources in area 3478, including 1759 galaxy sources, 577 stellar sources and 1526 quasar sources. FITS is a data format commonly used in the astronomical community. By cross-matching the catalogue with the FITS files of the corresponding sky regions, we obtained images in the five bands u, g, r, i and z for 12499 galaxy sources, 16914 quasar sources and 16908 star sources as training and testing data.
1.1 Image synthesis. SDSS photometric data comprise images in the five bands u, g, r, i and z, packaged as single-band FITS files. Images in different bands contain different information. Since the g, r and i bands contain more feature information and less noise, astronomical researchers typically map the g, r and i bands to the R, G and B channels to synthesize colour photometric images. In general, different bands cannot be combined directly; if three bands are combined directly, the images from the different bands may not be aligned. This paper therefore uses the RGB multi-band image synthesis software written by He Zhendong et al. to synthesize images from the g, r and i bands, which effectively avoids the alignment problem. Each photometric image in this paper is 2048×1489 pixels.
1.2 Data cropping. We first crop out each target from its image; this can be done with image segmentation tools, and here the process is implemented in Python. During cropping, the right ascension and declination of each source in the catalogue are converted into pixel coordinates on the photometric image through a coordinate conversion formula, and the pixel coordinates determine the exact position of the source. These coordinates are taken as the centre point and the cutout is made as a rectangular box. We found that the input image size affects the experimental results, so, according to the apparent size of the sources, we tested three crop sizes: 40×40, 60×60 and 80×80.
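As an illustration of the cropping step in 1.2, the sketch below converts a catalogued right ascension and declination into pixel coordinates via the frame's WCS and extracts a 40×40 cutout. The file name, source position, and the use of astropy are placeholders and assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch of the RA/Dec -> pixel conversion and 40x40 cropping step.
# File name and catalogue position are placeholders, not the authors' exact pipeline.
import astropy.units as u
from astropy.io import fits
from astropy.wcs import WCS
from astropy.coordinates import SkyCoord
from astropy.nddata import Cutout2D

with fits.open("frame-r-003462-example.fits") as hdul:   # hypothetical frame file
    image = hdul[0].data
    wcs = WCS(hdul[0].header)

# Position of one calibrated source from the cross-matched SpecPhoto/PhotoObj catalogue
source = SkyCoord(ra=180.123 * u.deg, dec=0.456 * u.deg)

# Centre a 40x40-pixel rectangular box on the source, as described above
cutout = Cutout2D(image, position=source, size=(40, 40), wcs=wcs)
print(cutout.data.shape)  # (40, 40)
```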
Through experiment and analysis, we found that the convolutional neural network learns better and achieves higher accuracy on data with a small image size. In the end, we cropped the extended-source galaxies and the point-source quasars and stars at 40×40.
1.3 Division of training and test data. For the algorithm to achieve accurate recognition performance, enough image samples are needed, and the selection of the training, validation and test sets is an important factor affecting the final recognition accuracy. In this paper, the training, validation and test sets are split in the ratio 8:1:1. The validation set is used to tune the algorithm, and the test set is used to evaluate the generalization ability of the final algorithm. Table 1 shows the specific data partitioning. The total sample size is 34,000 source images, including 11543 galaxy sources, 11967 star sources and 10490 quasar sources.
1.4 Data preprocessing. In this experiment, the training and test sets are used as the training and test inputs of the algorithm after data preprocessing. The quantity and quality of the data largely determine the recognition performance of the algorithm. Preprocessing differs between the training and test sets. For the training set, we apply vertical flips, horizontal flips and scaling to the cropped images to enrich the data samples and enhance the generalization ability of the algorithm; since the features of celestial sources are flip-invariant, the labels of galaxies, stars and quasars do not change after these transformations. For the test set, preprocessing is simpler: the input images are only scaled before being used as test input.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
physioDL: A dataset for geomorphic deep learning representing a scene classification task (predict the physiographic region in which a hillshade occurs).
Purpose: Datasets for geomorphic deep learning. Predict the physiographic region of an area based on a hillshade image. Terrain data were derived from the 30 m (1 arc-second) 3DEP product across the entirety of CONUS. Each chip has a spatial resolution of 30 m and 256 rows and columns of pixels; as a result, each chip measures 7,680 meters-by-7,680 meters. Two datasets are provided: chips in the hs folder represent a multidirectional hillshade, while chips in the ths folder represent a tinted multidirectional hillshade. Data are represented in 8-bit (0 to 255 scale, integer values). Data are projected to the Web Mercator projection relative to the WGS84 datum. Data were split into training, test, and validation partitions using stratified random sampling by region: 70% of the samples per region were selected for training, 15% for testing, and 15% for validation. There are a total of 16,325 chips. The following 22 physiographic regions are represented: "ADIRONDACK", "APPALACHIAN PLATEAUS", "BASIN AND RANGE", "BLUE RIDGE", "CASCADE-SIERRA MOUNTAINS", "CENTRAL LOWLAND", "COASTAL PLAIN", "COLORADO PLATEAUS", "COLUMBIA PLATEAU", "GREAT PLAINS", "INTERIOR LOW PLATEAUS", "MIDDLE ROCKY MOUNTAINS", "NEW ENGLAND", "NORTHERN ROCKY MOUNTAINS", "OUACHITA", "OZARK PLATEAUS", "PACIFIC BORDER", "PIEDMONT", "SOUTHERN ROCKY MOUNTAINS", "SUPERIOR UPLAND", "VALLEY AND RIDGE", and "WYOMING BASIN". Input digital terrain models and hillshades are not provided due to the large file size (> 100GB).
Files
physioDL.csv: Table listing all image chips and associated physiographic region (id = unique ID for each chip; region = physiographic region; fnameHS = file name of associated chip in hs folder; fnameTHS = file name of associated chip in ths folder; set = data split (train, test, or validation)).
chipCounts.csv: Number of chips in each data partition per physiographic province.
map.png: Map of data.
makeChips.R: R script used to process the data into image chips and create CSV files.
inputVectors:
chipBounds.shp = square extent of each chip
chipCenters.shp = center coordinate of each chip
provinces.shp = physiographic provinces
provinces10km.shp = physiographic provinces with a 10 km negative buffer
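As a quick sanity check of the stratified 70/15/15 split described above, physioDL.csv can be tabulated with pandas; the column names used below (id, region, set) follow the field list given in the file description.

```python
# Sketch: tabulate chips per physiographic region and data split from physioDL.csv.
# Column names (id, region, set) follow the field list in the file description above.
import pandas as pd

chips = pd.read_csv("physioDL.csv")
counts = chips.pivot_table(index="region", columns="set", values="id", aggfunc="count")
print(counts)              # chips per region per train/test/validation partition
print(counts.sum().sum())  # should total 16,325 chips
```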
Description:
This dataset was created to serve as an easy-to-use image dataset, perfect for experimenting with object detection algorithms. The main goal was to provide a simplified dataset that allows for quick setup and minimal effort in exploratory data analysis (EDA). This dataset is ideal for users who want to test and compare object detection models without spending too much time navigating complex data structures. Unlike datasets like chest x-rays, which require expert interpretation to evaluate model performance, the simplicity of balloon detection enables users to visually verify predictions without domain expertise.
The original Balloon dataset was more complex, as it was split into separate training and testing sets, with annotations stored in two separate JSON files. To streamline the experience, this updated version of the dataset merges all images into a single folder and replaces the JSON annotations with a single, easy-to-use CSV file. This new format ensures that the dataset can be loaded seamlessly with tools like Pandas, simplifying the workflow for researchers and developers.
The dataset contains a total of 74 high-quality JPG images, each featuring one or more balloons in different scenes and contexts. Accompanying the images is a CSV file that provides annotation data, such as bounding box coordinates and labels for each balloon within the images. This structure makes the dataset easily accessible for a range of machine learning and computer vision tasks, including object detection and image classification. The dataset is versatile and can be used to test algorithms like YOLO, Faster R-CNN, SSD, or other popular object detection models.
Key Features:
Image Format: 74 JPG images, ensuring high compatibility with most machine learning frameworks.
Annotations: A single CSV file that contains structured data, including bounding box coordinates, class labels, and image file names, which can be loaded with Python libraries like Pandas (see the sketch after this list).
Simplicity: Designed so that users can quickly start training object detection models without needing to preprocess or deeply explore the dataset.
Variety: The images feature balloons in various sizes, colors, and scenes, making it suitable for testing the robustness of detection models.
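As referenced in the Annotations item, a minimal sketch of loading the CSV and drawing a single bounding box is shown below; the CSV file name and the column names (filename, xmin, ymin, xmax, ymax, label) are assumptions, so adjust them to the actual header of the provided file.

```python
# Minimal sketch: load the annotation CSV and draw one bounding box.
# File and column names below are assumed, not confirmed; check the CSV header first.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

annotations = pd.read_csv("balloon_annotations.csv")   # hypothetical file name
row = annotations.iloc[0]

image = Image.open(row["filename"])
fig, ax = plt.subplots()
ax.imshow(image)
ax.add_patch(patches.Rectangle(
    (row["xmin"], row["ymin"]),
    row["xmax"] - row["xmin"],
    row["ymax"] - row["ymin"],
    fill=False, edgecolor="red", linewidth=2,
))
ax.set_title(row.get("label", "balloon"))
plt.show()
```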
This dataset is sourced from Kaggle.
CC0 1.0 Universal Public Domain Dedication https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Welcome to the resting state EEG dataset collected at the University of San Diego and curated by Alex Rockhill at the University of Oregon.
Please email arockhil@uoregon.edu before submitting a manuscript for publication in a peer-reviewed journal using this data. We wish to ensure that the data are analyzed and interpreted with scientific integrity, so as not to mislead the public about findings that may have clinical relevance. The purpose of this is to be responsible stewards of the data without an "available upon reasonable request" clause, which we feel doesn't fully represent the open-source, reproducible ethos. The data is freely available to download, so we cannot stop your publication if we don't support your methods and interpretation of findings; however, in being good data stewards, we would like to offer suggestions in the pre-publication stage so as to reduce conflict in the published scientific literature. As far as credit, there is precedent for receiving a mention in the acknowledgements section for reading and providing feedback on the paper or, for more involved consulting, being included as an author may be warranted. The purpose of asking for this is not to inflate our number of authorships; we take ethical considerations of the best way to handle intellectual property in the form of manuscripts very seriously, and, again, sharing is at the discretion of the author, although we strongly recommend it. Please be ethical and considerate in your use of this data and all open-source data, and be sure to credit authors by citing them.
An example of an analysis that we would consider problematic, and would strongly advise be corrected before submission for publication, would be using machine learning to classify Parkinson's patients versus healthy controls using this dataset. This is because there are far too few patients for proper statistics. Parkinson's disease presents heterogeneously across patients, and, with a proper test-training split, there would be fewer than 8 patients in the testing set. Statistics on 8 or fewer patients for such a complicated disease would be inaccurate due to too small a sample size. Furthermore, if multiple machine learning algorithms were to be tested, a third split would be required to choose the best method, further lowering the number of patients in the testing set. We strongly advise against any such approach because it would mislead patients and people who are interested in knowing whether they have Parkinson's disease.
Note that UPDRS rating scales were collected by laboratory personnel who had completed online training, not by a board-certified neurologist. Results should be interpreted accordingly; in particular, analyses based largely on these ratings should be treated with the appropriate amount of uncertainty.
In addition to contacting the aforementioned email, please cite the following papers:
Nicko Jackson, Scott R. Cole, Bradley Voytek, Nicole C. Swann. Characteristics of Waveform Shape in Parkinson's Disease Detected with Scalp Electroencephalography. eNeuro 20 May 2019, 6 (3) ENEURO.0151-19.2019; DOI: 10.1523/ENEURO.0151-19.2019.
Swann NC, de Hemptinne C, Aron AR, Ostrem JL, Knight RT, Starr PA. Elevated synchrony in Parkinson disease detected with electroencephalography. Ann Neurol. 2015 Nov;78(5):742-50. doi: 10.1002/ana.24507. Epub 2015 Sep 2. PMID: 26290353; PMCID: PMC4623949.
George JS, Strunk J, Mak-McCully R, Houser M, Poizner H, Aron AR. Dopaminergic therapy in Parkinson's disease decreases cortical beta band coherence in the resting state and increases cortical beta band power during executive control. Neuroimage Clin. 2013 Aug 8;3:261-70. doi: 10.1016/j.nicl.2013.07.013. PMID: 24273711; PMCID: PMC3814961.
Appelhoff, S., Sanderson, M., Brooks, T., Vliet, M., Quentin, R., Holdgraf, C., Chaumon, M., Mikulan, E., Tavabi, K., Höchenberger, R., Welke, D., Brunner, C., Rockhill, A., Larson, E., Gramfort, A. and Jas, M. (2019). MNE-BIDS: Organizing electrophysiological data into the BIDS format and facilitating their analysis. Journal of Open Source Software 4: (1896).
Pernet, C. R., Appelhoff, S., Gorgolewski, K. J., Flandin, G., Phillips, C., Delorme, A., Oostenveld, R. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6, 103. https://doi.org/10.1038/s41597-019-0104-8.
Note: see this discussion on the structure of the JSON files, which is sufficient but not optimal and will hopefully be changed in future versions of BIDS: https://neurostars.org/t/behavior-metadata-without-tsv-event-data-related-to-a-neuroimaging-data/6768/25.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This repository accompanies the manuscript "Spatially resolved uncertainties for machine learning potentials" by E. Heid, J. Schörghuber, R. Wanzenböck, and G. K. H. Madsen. The following files are available:
mc_experiment.ipynb is a Jupyter notebook for the Monte Carlo experiment described in the study (artificial model with only variance as error source).
aggregate_cut_relax.py contains code to cut and relax boxes for the water active learning cycle.
data_t1x.tar.gz contains reaction pathways for 10,073 reactions from a subset of the Transition1x dataset, split into training, validation and test sets. The training and validation sets contain the indices 1, 2, 9, and 10 from a 10-image nudged-elastic band search (40k datapoints), while the test set contains indices 3-8 (60k datapoints). The test set is ordered according to the reaction and index, i.e. rxn1_index3, rxn1_index4, [...] rxn1_index8, rxn2_index3, [...].
data_sto.tar.gz contains surface reconstructions of SrTiO3, randomly split into a training and validation set, as well as a test set.
data_h2o.tar.gz contains:
full_db.extxyz: The full dataset of 1.5k structures.
iter00_train.extxyz and iter00_validation.extxyz: The initial training and validation set for the active learning cycle.
The subfolders in the folders random, uncertain, and atomic contain the training and validation sets for the random and uncertainty-based (local or atomic) active learning loops.
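The structure files above are in extended XYZ format; a minimal sketch for inspecting them, assuming the ASE package is installed, is:

```python
# Sketch: inspect the extended-XYZ structures after extracting data_h2o.tar.gz.
# Assumes the ASE package (pip install ase); file name taken from the listing above.
from ase.io import read

structures = read("full_db.extxyz", index=":")   # read all frames
print(len(structures))                           # expected ~1.5k structures
first = structures[0]
print(first.get_chemical_formula(), first.get_positions().shape)
```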
This dataset consists of mathematical question and answer pairs, from a range of question types at roughly school-level difficulty. This is designed to test the mathematical learning and algebraic reasoning skills of learning models.
## Example questions
Question: Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r.
Answer: 4
Question: Calculate -841880142.544 + 411127.
Answer: -841469015.544
Question: Let x(g) = 9*g + 1. Let q(c) = 2*c + 1. Let f(i) = 3*i - 39. Let w(j) = q(x(j)). Calculate f(w(a)).
Answer: 54*a - 30
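Assuming the usual distribution format of plain-text module files in which question and answer lines alternate, pairs like those above can be read as follows (the module path is a placeholder):

```python
# Sketch: read (question, answer) pairs from a module file, assuming the common
# layout of alternating question/answer lines; the path below is a placeholder.
def load_pairs(path):
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    return list(zip(lines[0::2], lines[1::2]))  # questions on even lines, answers on odd lines

pairs = load_pairs("train-easy/algebra__linear_1d.txt")
print(pairs[0])
```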
It contains 2 million (question, answer) pairs per module, with questions limited to 160 characters in length, and answers to 30 characters in length. Note the training data for each question type is split into "train-easy", "train-medium", and "train-hard". This allows training models via a curriculum. The data can also be mixed together uniformly from these training datasets to obtain the results reported in the paper. Categories:
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The use of ML in agronomy has been increasing exponentially since the start of the century, including data-driven predictions of crop yields from farm-level information on soil, climate and management. However, little is known about the effect of data partitioning schemes on the actual performance of the models, especially when they are built for yield forecasting. In this study, we explore the effect of the choice of predictive algorithm, amount of data, and data partitioning strategies on predictive performance, using synthetic datasets from biophysical crop models. We simulated sunflower and wheat data using OilcropSun and Ceres-Wheat from DSSAT for the period 2001-2020 in 5 areas of Spain. Simulations were performed in farms differing in soil depth and management. The dataset of simulated farm yields was analyzed with different algorithms (regularized linear models, random forest, artificial neural networks) as a function of seasonal weather, management, and soil. The analysis was performed with Keras for neural networks and R packages for all other algorithms. Data partitioning for training and testing was performed with ordered data (i.e., older data for training, newest data for testing) in order to compare the different algorithms in their ability to predict yields in the future by extrapolating from past data. The Random Forest algorithm had better performance (Root Mean Square Error 35-38%) than artificial neural networks (37-141%) and regularized linear models (64-65%) and was easier to execute. However, even the best models showed a limited advantage over the predictions of a sensible baseline (average yield of the farm in the training set), which showed an RMSE of 42%. Errors in seasonal weather forecasting were not taken into account, so real-world performance is expected to be even closer to the baseline. Application of AI algorithms for yield prediction should always include a comparison with the best guess to evaluate whether the additional cost of data required for the model compensates for the increase in predictive power. Random partitioning of data for training and validation should be avoided in models for yield forecasting. Crop models validated for the region and cultivars of interest may be used before actual data collection to establish the potential advantage, as illustrated in this study.
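To illustrate the ordered (temporal) partitioning recommended above, here is a minimal sketch assuming a table of simulated farm yields with a harvest-year column; the file name and cutoff year are illustrative placeholders only.

```python
# Sketch of an ordered (temporal) train/test split, as argued for above.
# Assumes a table of simulated farm yields with a 'year' column (2001-2020);
# the file name and cutoff year are illustrative placeholders.
import pandas as pd

yields = pd.read_csv("simulated_farm_yields.csv")
cutoff = 2016                                   # older seasons for training
train = yields[yields["year"] <= cutoff]
test = yields[yields["year"] > cutoff]          # newest seasons held out for forecasting
print(len(train), len(test))
```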
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Processed data and code for "Transfer learning reveals sequence determinants of the quantitative response to transcription factor dosage," Naqvi et al 2024.
Directory is organized into 4 subfolders, each tar'ed and gzipped:
data_analysis.tar.gz - Processed data for modulation of TWIST1 levels and calculation of RE responsiveness to TWIST1 dosage
baseline_models.tar.gz - Code and data for training baseline models to predict RE responsiveness to SOX9/TWIST1 dosage
chrombpnet_models.tar.gz - Remainder of code, data, and models for fine-tuning and interpreting ChromBPNet models to predict RE responsiveness to SOX9/TWIST1 dosage
modisco_reports.zip - TF-MoDISco reports from running on the fine-tuned ChromBPNet models
mirny_model.tar.gz - Code and data for analyzing and fitting Mirny model of TF-nucleosome competition to observed RE dosage response curves
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Z by HP Unlocked Challenge 3 - Audio Recognition - Special thanks to Hunter Kempf for helping create this challenge! Watch the tutorial video here: https://youtu.be/9Txxl0FJZas
The Challenge is to build a Machine Learning model and code to count the number of Capuchinbird calls within a given clip. This can be done in a variety of ways and we would recommend that you do some research into various methods of audio recognition.
Unlocked is an action-packed interactive film made by Z by HP for data scientists. Sharpen your skills and solve the data driven mystery here: https://www.hp.com/us-en/workstations/industries/data-science/unlocked-challenge.html
The Data is split into Training and Testing Data. For Training Data we have provided enough clips to get a decent model but you can also find, parse, augment and use additional audio clips to improve your model performance.
In order to download and properly build our Training sets we have provided details and some example code for how to interact with the files.
import os
import requests
from multiprocessing.pool import ThreadPool

def url_response(path_url_list):
    # Download a single clip from xeno-canto and write it to disk
    path, url = path_url_list
    r = requests.get(url, stream=True)
    with open(path, 'wb') as f:
        for ch in r:
            f.write(ch)

def make_path_and_url(clip_id):
    # Build the local file path and the xeno-canto download URL for a clip id
    file_path = os.path.join("Raw_Capuchinbird_Clips", f"XC{clip_id} - Capuchinbird - Perissocephalus tricolor.mp3")
    url = f"https://xeno-canto.org/{clip_id}/download"
    return file_path, url

clip_ids = ['114131', '114132', '119294', '16803', '16804', '168899', '178167', '178168', '201990', '216010', '216012',
            '22397', '227467', '227468', '227469', '227471', '27881', '27882', '307385', '336661', '3776', '387509',
            '388470', '395129', '395130', '401294', '40355', '433953', '44070', '441733', '441734', '456236', '456314',
            '46077', '46241', '479556', '493092', '495697', '504926', '504928', '513083', '520626', '526106', '574020',
            '574021', '600460', '65195', '65196', '79965', '9221', '98557', '9892', '9893']

os.makedirs("Raw_Capuchinbird_Clips", exist_ok=True)  # output folder must exist before writing
paths_and_urls = list(map(make_path_and_url, clip_ids))
# imap_unordered is lazy, so wrap it in list() to force the downloads to run
list(ThreadPool(4).imap_unordered(url_response, paths_and_urls))
Parsing_Single_Call_Timestamps.csv
- Clip Timestamps where Capuchinbird Calls are audible
import os
import pandas as pd
from multiprocessing.pool import ThreadPool
from pydub import AudioSegment

def parse_capuchinbird_clips(clip_tuple):
    """
    Parses the audio clip described by clip_tuple into the Parsed_Capuchinbird_Clips folder
    """
    clip_id, starts_and_ends = clip_tuple
    ms_to_seconds = 1000
    mp3_filename = os.path.join("Raw_Capuchinbird_Clips", f"XC{clip_id} - Capuchinbird - Perissocephalus tricolor.mp3")
    sound = AudioSegment.from_mp3(mp3_filename)
    count = 0
    for start, end in starts_and_ends:
        # pydub slices audio in milliseconds, so convert the second-based timestamps
        sub_clip = sound[start * ms_to_seconds:end * ms_to_seconds]
        sub_clip_name = f"XC{clip_id}-{count}"
        sub_clip.export(os.path.join("Parsed_Capuchinbird_Clips", f"{sub_clip_name}.wav"), format="wav")
        count += 1

def df_to_list_of_call_tuples(df):
    """
    Extracts a list of (clip_id, [(start, end), ...]) tuples from the provided Parsing_Single_Call_Timestamps csv file
    """
    output = []
    for clip_id in df["id"].unique():
        clip_df = df[df["id"] == clip_id].copy()
        starts = clip_df["start"].tolist()
        ends = clip_df["end"].tolist()
        clip_list = []
        for i in range(len(starts)):
            clip_list.append((starts[i], ends[i]))
        output.append((clip_id, clip_list))
    return output

os.makedirs("Parsed_Capuchinbird_Clips", exist_ok=True)  # output folder must exist before exporting
calls_df = pd.read_csv("Parsing_Single_Call_Timestamps.csv")
calls_list = df_to_list_of_call_tuples(calls_df)
# imap_unordered is lazy, so wrap it in list() to force the parsing to run
list(ThreadPool(4).imap_unordered(parse_capuchinbird_clips, calls_list))
Other_Sound_Urls.csv
- Other Birds, Animals and Forest Noises
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This data folder contains all processed data and analysis scripts used for the analyses in the research described in the PNAS paper "The Temporal Dynamics of Sitting Behaviour at Work" by ten Broeke and colleagues (2020). In the paper, sitting behaviour was conceptualised as a continuous chain of sit-to-stand and stand-to-sit transitions, and multilevel time-to-event analysis was used to analyse the timing of these transitions. The data comprise ~30,000 posture transitions during work time from 156 UK-based employees from various work sites, objectively measured by an activPAL monitor that was continuously worn for approximately one week.
For the paper, a split-samples cross-validation procedure was used. Prior to looking at the data, we randomly split the data into two samples of equal size: a training sample (n = 79; 7,316 sit-to-stand and 7,263 stand-to-sit transitions) and a testing sample (n = 77; 7,216 sit-to-stand and 7,158 stand-to-sit transitions). We used the training sample for data exploration and fine-tuning of analyses and analytical decisions. After this, we preregistered our analysis plan for the testing sample and performed these analyses on the testing sample. Unless otherwise specified, in the paper we report results from the preregistered analyses on the testing sample.
A more detailed description of the procedure and all measures is given in the Methodology file. The readme file describes the content and function of all files in the data folder, and all terminology and abbreviations used in the data sets and analysis scripts. The R markdown files and HTML output files contain all R code that was used for data processing, analysis, visualization, and the power simulation.
Dataset belonging to "The temporal dynamics of sitting and standing at work, 2020"
The goal of introducing the Rescaled Fashion-MNIST dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled Fashion-MNIST dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, to appear.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled Fashion-MNIST dataset is more challenging than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2.
The Rescaled Fashion-MNIST dataset is provided on the condition that you provide proper citation for the original Fashion-MNIST dataset:
[4] Xiao, H., Rasul, K., and Vollgraf, R. (2017) “Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms”, arXiv preprint arXiv:1708.07747
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled Fashion-MNIST dataset is generated by rescaling 28×28 gray-scale images of clothes from the original Fashion-MNIST dataset [4]. The scale variations are up to a factor of 4, and the images are embedded within black images of size 72×72, with the object in the frame always centred. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 different classes in the dataset: “T-shirt/top”, “trouser”, “pullover”, “dress”, “coat”, “sandal”, “shirt”, “sneaker”, “bag” and “ankle boot”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 50 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 50 000 samples from the original Fashion-MNIST training set. The validation dataset, on the other hand, is formed from the final 10 000 images of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original Fashion-MNIST test set.
The training dataset file (~2.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5
Additionally, for the Rescaled Fashion-MNIST dataset, there are 9 datasets (~415 MB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p500.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p595.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p707.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte0p841.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p189.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p414.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p682.h5
fashionmnist_with_scale_variations_te10000_outsize72-72_scte2p000.h5
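The nine scale suffixes above (scte0p500 through scte2p000) correspond exactly to 2^(k/4) for k = -4, ..., 4, which can be verified with a one-liner:

```python
# The scale factors 0.500 ... 2.000 in the file names are 2^(k/4) for k in [-4, 4]
print([round(2 ** (k / 4), 3) for k in range(-4, 5)])
# [0.5, 0.595, 0.707, 0.841, 1.0, 1.189, 1.414, 1.682, 2.0]
```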
These dataset files were used for the experiments presented in Figures 6, 7, 14, 16, 19 and 23 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("fashionmnist_with_scale_variations_tr50000_vl10000_te10000_outsize72-72_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File("fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5", "r") as f:  # any of the nine test files listed above
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('fashionmnist_with_scale_variations_te10000_outsize72-72_scte1p000.h5', '/x_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
There is also a closely related Fashion-MNIST with translations dataset, which in addition to scaling variations also comprises spatial translations of the objects.
Attribution 3.0 (CC BY 3.0) https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
Training.gov.au (TGA) is the National Register of Vocational Education and Training in Australia and contains authoritative information about Registered Training Organisations (RTOs), Nationally Recognised Training (NRT) and the approved scope of each RTO to deliver NRT as required in national and jurisdictional legislation.
TGA has a web service available to allow external systems to access and utilise information stored in TGA. The TGA web service is exposed through a single interface, and web service users are assigned a data reader role which applies to all data stored in TGA.
The web service can be broadly split into three categories:
1. RTOs and other organisation types;
2. Training components, including Accredited Courses, Accredited Course Modules, Training Packages, Qualifications, Skill Sets and Units of Competency;
3. System metadata, including static data and statistical classifications.
Users gain access to the TGA web service by first passing a user name and password through to the web server. The web server then authenticates the user against the TGA security provider before passing the request to the application that supplies the web services.
There are two web services environments:
1. Production - ws.training.gov.au – National Register production web services
2. Sandbox - ws.sandbox.training.gov.au – National Register sandbox web services.
The National Register sandbox web service is used to test against the current version of the web services, where the functionality will be identical to the current production release. The web service definition and schema of the National Register sandbox database will also be identical to that of the production release at any given point in time. The National Register sandbox database will be cleared down at regular intervals and realigned with the National Register production environment.
Each environment has three configured services:
1. Organisation Service;
2. Training Component Service; and
3. Classification Service.
To access the download area for web services, navigate to http://tga.hsd.com.au and use the name and password below:
Username: WebService.Read (case sensitive)
Password: Asdf098 (case sensitive)
This download area contains various versions of the following artefacts that you may find useful:
• Training.gov.au web service specification document;
• Training.gov.au logical data model and definitions document;
• .NET web service SDK sample app (with source code);
• Java sample client (with source code);
• How to setup web service client in VS 2010 video; and
• Web services WSDLs and XSDs.
For business areas, the specification/definition documents and the sample application are a good place to start, while IT areas will find the sample source code and the video useful to start developing against the TGA web services.
The web services Sandbox end point is: https://ws.sandbox.training.gov.au/Deewr.Tga.Webservices
Once you are ready to access the production web service, please email the TGA team at tgaproject@education.gov.au to obtain a unique user name and password.
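As a rough, unofficial illustration, a Python SOAP client such as zeep can be pointed at the sandbox environment once credentials are issued. The WSDL path and the basic-authentication scheme below are assumptions, so consult the web service specification document and the SDK samples in the download area for the actual service addresses and security configuration.

```python
# Sketch only: connecting a SOAP client to the sandbox environment with zeep.
# The WSDL path and basic-auth scheme are assumptions; see the specification
# document and SDK samples in the download area for the real configuration.
from requests import Session
from requests.auth import HTTPBasicAuth
from zeep import Client
from zeep.transports import Transport

session = Session()
session.auth = HTTPBasicAuth("your_username", "your_password")  # credentials issued by the TGA team

wsdl = "https://ws.sandbox.training.gov.au/Deewr.Tga.Webservices/TrainingComponentService.svc?wsdl"  # placeholder path
client = Client(wsdl, transport=Transport(session=session))
client.wsdl.dump()  # list the available operations and types
```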
The goal of introducing the Rescaled CIFAR-10 dataset is to provide a dataset that contains scale variations (up to a factor of 4), to evaluate the ability of networks to generalise to scales not present in the training data.
The Rescaled CIFAR-10 dataset was introduced in the paper:
[1] A. Perzanowski and T. Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, Journal of Mathematical Imaging and Vision, to appear.
with a pre-print available at arXiv:
[2] Perzanowski and Lindeberg (2024) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations”, arXiv preprint arXiv:2409.11140.
Importantly, the Rescaled CIFAR-10 dataset contains substantially more natural textures and patterns than the MNIST Large Scale dataset, introduced in:
[3] Y. Jansson and T. Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536, https://doi.org/10.1007/s10851-022-01082-2
and is therefore significantly more challenging.
The Rescaled CIFAR-10 dataset is provided on the condition that you provide proper citation for the original CIFAR-10 dataset:
[4] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
and also for this new rescaled version, using the reference [1] above.
The data set is made available on request. If you would be interested in trying out this data set, please make a request in the system below, and we will grant you access as soon as possible.
The Rescaled CIFAR-10 dataset is generated by rescaling 32×32 RGB images of animals and vehicles from the original CIFAR-10 dataset [4]. The scale variations are up to a factor of 4. In order to have all test images have the same resolution, mirror extension is used to extend the images to size 64x64. The imresize() function in Matlab was used for the rescaling, with default anti-aliasing turned on, and bicubic interpolation overshoot removed by clipping to the [0, 255] range. The details of how the dataset was created can be found in [1].
There are 10 distinct classes in the dataset: “airplane”, “automobile”, “bird”, “cat”, “deer”, “dog”, “frog”, “horse”, “ship” and “truck”. In the dataset, these are represented by integer labels in the range [0, 9].
The dataset is split into 40 000 training samples, 10 000 validation samples and 10 000 testing samples. The training dataset is generated using the initial 40 000 samples from the original CIFAR-10 training set. The validation dataset, on the other hand, is formed from the final 10 000 image batch of that same training set. For testing, all test datasets are built from the 10 000 images contained in the original CIFAR-10 test set.
The training dataset file (~5.9 GB) for scale 1, which also contains the corresponding validation and test data for the same scale, is:
cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5
Additionally, for the Rescaled CIFAR-10 dataset, there are 9 datasets (~1 GB each) for testing scale generalisation at scales not present in the training set. Each of these datasets is rescaled using a different image scaling factor, 2^(k/4), with k being integers in the range [-4, 4]:
cifar10_with_scale_variations_te10000_outsize64-64_scte0p500.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p595.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p707.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte0p841.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p189.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p414.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte1p682.h5
cifar10_with_scale_variations_te10000_outsize64-64_scte2p000.h5
These dataset files were used for the experiments presented in Figures 9, 10, 15, 16, 20 and 24 in [1].
The datasets are saved in HDF5 format, with the partitions in the respective h5 files named as
('/x_train', '/x_val', '/x_test', '/y_train', '/y_test', '/y_val'); which ones exist depends on which data split is used.
The training dataset can be loaded in Python as:
import h5py
import numpy as np

with h5py.File("cifar10_with_scale_variations_tr40000_vl10000_te10000_outsize64-64_scte1p000_scte1p000.h5", "r") as f:
    x_train = np.array(f["/x_train"], dtype=np.float32)
    x_val = np.array(f["/x_val"], dtype=np.float32)
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_train = np.array(f["/y_train"], dtype=np.int32)
    y_val = np.array(f["/y_val"], dtype=np.int32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
We also need to permute the data, since PyTorch uses the format [num_samples, channels, width, height], while the data is saved as [num_samples, width, height, channels]:
x_train = np.transpose(x_train, (0, 3, 1, 2))
x_val = np.transpose(x_val, (0, 3, 1, 2))
x_test = np.transpose(x_test, (0, 3, 1, 2))
The test datasets can be loaded in Python as:
with h5py.File("cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5", "r") as f:  # any of the nine test files listed above
    x_test = np.array(f["/x_test"], dtype=np.float32)
    y_test = np.array(f["/y_test"], dtype=np.int32)
The test datasets can be loaded in Matlab as:
x_test = h5read('cifar10_with_scale_variations_te10000_outsize64-64_scte1p000.h5', '/x_test');
The images are stored as [num_samples, x_dim, y_dim, channels] in HDF5 files. The pixel intensity values are not normalised, and are in a [0, 255] range.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The scripts and the data provided in this repository demonstrate how to apply the approach described in the paper "Common to rare transfer learning (CORAL) enables inference and prediction for a quarter million rare Malagasy arthropods" by Ovaskainen et al. Here we summarize (1) how to use the software with a small, simulated dataset, with a running time of less than a minute on a typical laptop (Demo 1); (2) how to apply the analyses presented in the paper to a small subset of the data, with a running time of ca. one hour on a powerful laptop (Demo 2); and (3) how to reproduce the full analyses presented in the paper, with running times of up to several days, depending on the computational resources (Demo 3). Demos 1 and 2 are intended as user-friendly starting points for understanding and testing how to implement CORAL. Demo 3 is included mainly for reproducibility.
System requirements
· The software can be used in any operating system where R can be installed.
· We have developed and tested the software in a windows environment with R version 4.3.1.
· Demo 1 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 2 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0).
· Demo 3 requires the R-packages phytools (2.1-1), MASS (7.3-60), Hmsc (3.3-3), pROC (1.18.5) and MCMCpack (1.7-0), jsonify (1.2.2), buildmer (2.11), colorspace (2.1-0), matlib (0.9.6), vioplot (0.4.0), MLmetrics (1.1.3) and ggplot2 (3.5.0).
· The use of the software does not require any non-standard hardware.
Installation guide
· The CORAL functions are implemented in Hmsc (3.3-3). The software that applies them is presented as an R pipeline and thus does not require any installation other than installing R.
Demo 1: Software demo with simulated data
The software demonstration consists of two R-markdown files:
· D01_software_demo_simulate_data. This script creates a simulated dataset of 100 species on 200 sampling units. The species occurrences are simulated with a probit model that assumes phylogenetically structured responses to two environmental predictors. The pipeline saves all the data needed for data analysis in the file allDataDemo.RData: XData (the first predictor; the second one is not provided in the dataset, as it is assumed to remain unknown to the user), Y (species occurrence data), phy (phylogenetic tree), and studyDesign (list of sampling units). Additionally, the true values used for data generation are saved in the file trueValuesDemo.RData: LF (the second environmental predictor, which will be estimated through a latent factor approach) and beta (species responses to environmental predictors).
· D02_software_demo_apply_CORAL. This script loads the data generated by the script D01 and applies the CORAL approach to it. The script demonstrates the informativeness of the CORAL priors, the higher predictive power of CORAL models than baseline models, and the ability of CORAL to estimate the true values used for data generation.
Both markdown files provide more detailed information and illustrations. The provided html file shows the expected output. The running time of the demonstration is very short, from a few seconds to at most one minute.
Demo 2: Software demo with a small subset of the data used in the paper
The software demonstration consists of one R-markdown file:
MA_small_demo. This script uses the CORAL functions in HMSC to analyze a small subset of the Malagasy arthropod data. In this demo, we define rare species as those with prevalence of at least 40 and less than 50, and common species as those with prevalence of at least 200. This leaves 51 species for the backbone model and 460 rare species modelled through the CORAL approach. The script assesses model fit for CORAL priors, CORAL posteriors, and null models. It further visualizes the responses of both the common and the rare species to the included predictors.
Scripts and data for reproducing the results presented in the paper (Demo 3)
The input data for the script pipeline is the file “allData.RData”. This file includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy). Each file in the pipeline below depends on the outputs of previous files: they must be run in order. The first six files are used for fitting the backbone HMSC model and calculating parameters for the CORAL prior:
· S01_define_Hmsc_model - defines the initial HMSC model with fixed effects and sample- and site-level random effects.
· S02_export_Hmsc_model - prepares the initial model for HPC sampling for fitting with Hmsc-HPC. Fitting of the model can be then done in an HPC environment with the bash file generated by the script. Computationally intensive.
· S03_import_posterior – imports the posterior distributions sampled by the initial model.
· S04_define_second_stage_Hmsc_model - extracts latent factors from the initial model and defines the backbone model. This is then sampled using the same S02 export + S03 import scripts. Computationally intensive.
· S05_visualize_backbone_model – check backbone model quality with visual/numerical summaries. Generates Fig. 2 of the paper.
· S06_construct_coral_priors – calculate CORAL prior parameters.
The remaining scripts evaluate the model:
· S07_evaluate_prior_predictionss – use the CORAL prior to predict rare species presence/absences and evaluate the predictions in terms of AUC. Generates Fig. 3 of the paper.
· S08_make_training_test_split – generate train/test splits for cross-validation ensuring at least 40% of positive samples are in each partition.
· S09_cross-validate – fit CORAL and the baseline model to the train/test splits and calculate performance summaries. Note: we ran this once with the initial train/test split and then again on the inverse split (i.e., training = !training in the code, see comment). The paper presents the average results across these two splits. Computationally intensive.
· S10_show_cross-validation_results – Make plots visualizing AUC/Tjur’s R2 produced by cross-validation. Generates Fig. 4 of the paper.
· S11a_fit_coral_models – Fit the CORAL model to all 250k rare species. Computationally intensive.
· S11b_fit_baseline_models – Fit the baseline model to all 250k rare species. Computationally intensive.
· S12_compare_posterior_inference – compare posterior climate predictions using CORAL and baseline models on selected species, as well as variance reduction for all species. Generates Fig. 5 of the paper.
Pre-processing scripts:
· P01_preprocess_sequence_data.R – Reads in the outputs of the bioinformatics pipeline and converts them into R-objects.
· P02_download_climatic_data.R – Downloads the climatic data from "sis-biodiversity-era5-global” and adds that to metadata.
· P03_construct_Y_matrix.R – Converts the response matrix from a sparse data format to a regular matrix. Saves “allData.RData”, which includes the metadata (meta), the response matrix (Y), and the taxonomical information (taxonomy).
Computationally intensive files had runtimes of 5-24 hours on high-performance machines. Preliminary testing suggests runtimes of over 100 hours on a standard laptop.
ENA Accession numbers
All raw sequence data are archived on mBRAVE and are publicly available in the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena; project accession number PRJEB86111; run accession numbers ERR15018787-ERR15009869; sample IDs for each accession and download URLs are provided in the file ENA_read_accessions.tsv).
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
How to cite us
Wyrzykowska, Maria, Gabriel della Maggiora, Nikita Deshpande, Ashkan Mokarian, and Artur Yakimovich. "A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy." bioRxiv (2024): 2024-08.
@article{wyrzykowska2024benchmark,
title={A Benchmark for Virus Infection Reporter Virtual Staining in Fluorescence and Brightfield Microscopy},
author={Wyrzykowska, Maria and della Maggiora, Gabriel and Deshpande, Nikita and Mokarian, Ashkan and Yakimovich, Artur},
journal={bioRxiv},
pages={2024--08},
year={2024},
publisher={Cold Spring Harbor Laboratory}
}
Data sources
Raw data used during the study can be found in corresponding references.
Data organisation
For each virus (HADV, VACV, IAV, RV and HSV) we provide the processed data in a separate directory, divided into three subdirectories: `train`, `val` and `test`, containing the proposed data split. Each of the subfolders contains two npy files: `x.npy` and `y.npy`, where `x.npy` contains the fluorescence or brightfield signal (both for HADV, as separate channels) of the cells or nuclei and `y.npy` contains the viral signal. The data is already processed as described in the Data preparation section.
Additionally, Cellpose masks are made available for the test data in a separate masks directory. For each virus except VACV, there is a subdirectory `test` containing nuclei masks (`nuc.npy`). For HADV, cell masks are also available (`cell.npy`).
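A minimal loading sketch following the layout described above; the virus directory name and the relative paths are placeholders, so adjust them to the actual directory structure after download.

```python
# Sketch: load one data split, following the directory layout described above.
# The virus directory name ("HADV") and the relative paths are placeholders.
import numpy as np

x = np.load("HADV/train/x.npy")                 # fluorescence/brightfield input channels
y = np.load("HADV/train/y.npy")                 # viral signal (virtual staining target)
nuc_masks = np.load("masks/HADV/test/nuc.npy")  # Cellpose nuclei masks for the test split
print(x.shape, y.shape, nuc_masks.shape)
```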
Data preparation
Each of the VACV plaques was imaged to produce 9 files per channel, which need to be stitched to recreate the whole plaque. To achieve this, the multiview-stitcher toolbox was used. The stitching was first performed on the third channel, representing the brightfield microscopy image of the samples; the parameters found for this channel were then used to stitch the remaining channels. The VACV dataset represents a timelapse, from which timesteps 100, 108 and 115 were selected to produce the data used in the experiments. Images were center-cropped to 5948x6048 to match the size of the smallest image in the dataset (rounded down to the closest multiple of 2). The data were additionally manually filtered to remove samples that contained only uninfected cells (C02, C07, D02, D07, E02, E07, F02, F07). The HAdV dataset is also a timelapse, from which only the last timestep (49th) was selected.
For the rest of the datasets (HSV, IAV, RV), only the negative control data was used, selected in the following way: from the data collected at the University of Zürich, only the first 2 columns were selected from the Screen samples, and only the first 12 columns from the ZPlates and prePlates samples. All of the datasets were divided into training, validation and test holdouts in 0.7:0.2:0.1 ratios, using random seed 42 to ensure reproducibility. For the time-lapse data, it was ensured that the same sample from different timesteps only exists in one of the holdouts, to prevent information leakage and ensure fair evaluation. All of the samples were normalised to the [-1, 1] range by subtracting the 3rd percentile and dividing by the difference between the 99.8th and 3rd percentiles, clipping to [0, 1] and scaling to [-1, 1]. For the brightfield channel of HAdV, percentiles 0.1 and 99.9 were used. These cutoff points were selected based on the analysis of the histograms of the values attained by the data, to make the best use of the available data range. Specific values used for the normalization are summarized in Figure 3 of the manuscript in Related/alternate identifiers.
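The percentile normalisation described above can be written compactly as follows; this is a sketch of the described procedure, not the exact code from the VIRVS repository.

```python
# Sketch of the percentile normalisation described above (not the repository's exact code):
# subtract the lower percentile, divide by the (upper - lower) percentile range,
# clip to [0, 1], then rescale to [-1, 1].
import numpy as np

def normalise(image, p_low=3.0, p_high=99.8):
    lo = np.percentile(image, p_low)
    hi = np.percentile(image, p_high)
    scaled = (image - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0) * 2.0 - 1.0

# For the HAdV brightfield channel, percentiles 0.1 and 99.9 were used instead:
# normalise(brightfield, p_low=0.1, p_high=99.9)
```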
To prepare the cell nuclei masks, the Cellpose model with pre-trained cyto3 weights was used on the fluorescence channel. The diameter was set to 7 for all the datasets except HAdV, for which automatic estimation of the diameter was employed. Cell masks were prepared using Cellpose with pre-trained cyto3 weights and a diameter of 70 on brightfield images stacked with the fluorescence nuclei signal. The data preparation can be reproduced by first downloading the datasets and then running the scripts located in the `scripts/data_processing` directory of the [VIRVS repository](https://github.com/casus/virvs), after modifying the paths in them.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
SpaceNet LLC is a nonprofit organization dedicated to accelerating open source, artificial intelligence applied research for geospatial applications, specifically foundational mapping (i.e. building footprint & road network detection).
I have been experimenting with SAR image segmentation for the past few months and would like to share this high-quality dataset with the Kaggle community. It is the data from the SpaceNet 6 challenge and is freely available in the AWS Open Data registry. This dataset only contains the training split; if you are interested in the testing split (only SAR) or the expanded SAR and optical dataset, you should follow the steps and download from AWS S3. I share the dataset here to cut out the steps of downloading the data and to utilize Kaggle's powerful cloud computing.
This openly-licensed dataset features a unique combination of half-meter Synthetic Aperture Radar (SAR) imagery from Capella Space and half-meter electro-optical (EO) imagery from Maxar.
SAR data are provided by Capella Space via an aerial-mounted sensor collecting 204 individual image strips from both north- and south-facing look angles. Each of the image strips features four polarizations (HH, HV, VH, and VV) and is preprocessed to display the intensity of backscatter in decibel units at half-meter spatial resolution.
The 48k building footprint annotations are provided by the 3D Basisregistratie Adressen en Gebouwen (3DBAG) dataset with some additional quality control. The annotations also include statistics of building heights derived from a digital elevation model.
Shermeyer, J., Hogan, D., Brown, J., Etten, A.V., Weir, N., Pacifici, F., Hänsch, R., Bastidas, A., Soenen, S., Bacastow, T.M., & Lewis, R. (2020). SpaceNet 6: Multi-Sensor All Weather Mapping Dataset. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 768-777. Arxiv paper
SAR imagery can be an answer for disaster analysis or frequent earth monitoring thanks to its active sensor, imaging day and night and through any cloud coverage. But SAR images come with their own challenges and, unlike optical images, require a trained eye to interpret. Moreover, the launch of new high-resolution SAR satellites will yield massive quantities of earth observation data. Just like with any modern computer vision problem, this looks like a job for a deep learning model.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset is a subsampled version of the STEAD dataset, specifically tailored for training our CDiffSD model (Cold Diffusion for Seismic Denoising). It consists of four HDF5 files, each saved in a format that requires Python's `h5py` method for opening.
The dataset includes the following files:
Each file is structured to support the training and evaluation of seismic denoising models.
The HDF5 files named noise contain two main datasets:
Similarly, the train and test files, which contain earthquake data, include the same traces and metadata datasets, but also feature two additional datasets:
To load these files in a Python environment, use the following approach:
```python
import h5py
import numpy as np
# Open the HDF5 file in read mode
with h5py.File('train_noise.hdf5', 'r') as file:
    # Print all the main keys in the file
    print("Keys in the HDF5 file:", list(file.keys()))

    if 'traces' in file:
        # Access the dataset
        data = file['traces'][:10]  # Load the first 10 traces

    if 'metadata' in file:
        # Access the dataset
        trace_name = file['metadata'][:10]  # Load the first 10 metadata entries
```
Ensure that the path to the file is correctly specified relative to your Python script.
To use this dataset, ensure you have Python installed along with the NumPy and h5py libraries, which can be installed via pip if not already available:
```bash
pip install numpy
pip install h5py
```
This folder contains multi-frequency Pol-InSAR data acquired by the F-SAR system of the German Aerospace Center (DLR) over Baltrum and corresponding land cover labels.
Data structure:
- data
- FP1 # Flight path 1
- L # Frequency band
- T6 # Pol-InSAR data
- pauli.bmp # Pauli-RGB image of the master scene
- S
- ...
- FP2 # Flight path 2
- ...
- label
- FP1
- label_train.bin
- ...
- FP2
- ...
Data format:
The data is provided as flat-binary raster files (.bin) with an accompanying ASCII header file (*.hdr) in ENVI-format.
For Pol-InSAR data, the real and imaginary components of the diagonal elements and upper-triangle elements of the 6 x 6 coherency matrix are stored in separate files (T11.bin, T12_real.bin, T12_imag.bin, ...). A reading sketch is given after the label mapping below.
Land cover labels contained in label_train.bin and label_test.bin are encoded as integers using the following mapping:
0 - Unassigned
1 - Tidal flat
2 - Water
3 - Coastal shrub
4 - Dense, high vegetation
5 - White dune
6 - Peat bog
7 - Grey dune
8 - Couch grass
9 - Upper saltmarsh
10 - Lower saltmarsh
11 - Sand
12 - Settlement
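As a reading sketch (not part of the dataset), the flat-binary rasters can be loaded by parsing the image dimensions and data type from the accompanying ENVI header; the data-type mapping covers the common ENVI codes, and band interleaving is ignored, which suffices for single-band rasters such as the label maps.

```python
# Sketch: read an ENVI flat-binary raster (e.g. label_train.bin) using its *.hdr file.
# The ENVI data-type codes below cover the common cases; band interleaving is not
# handled, which is sufficient for single-band rasters such as the label maps.
import numpy as np

ENVI_DTYPES = {1: np.uint8, 2: np.int16, 3: np.int32, 4: np.float32,
               5: np.float64, 12: np.uint16, 13: np.uint32}

def read_envi(bin_path, hdr_path):
    header = {}
    with open(hdr_path) as f:
        for line in f:
            if "=" in line:
                key, value = line.split("=", 1)
                header[key.strip().lower()] = value.strip()
    samples = int(header["samples"])
    lines = int(header["lines"])
    dtype = ENVI_DTYPES[int(header["data type"])]
    return np.fromfile(bin_path, dtype=dtype).reshape(lines, samples)

labels = read_envi("label/FP1/label_train.bin", "label/FP1/label_train.hdr")
print(np.unique(labels))  # integer codes 0-12 as listed above
```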