Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset provides annotated very-high-resolution satellite RGB images extracted from Google Earth to train deep learning models to recognize Juniperus communis L. and Juniperus sabina L. shrubs. All images are from the high mountain of Sierra Nevada in Spain. The dataset contains 2000 images (.jpg) of size 512x512 pixels partitioned into two classes: Shrubs and NoShrubs. We also provide partitioning of the data into Train (1800 images), Test (100 images), and Validation (100 images) subsets.
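A minimal usage sketch, assuming the archive unpacks into Train/Test/Validation folders that each contain Shrubs and NoShrubs subfolders (this layout is an assumption) and that torchvision is installed:
# load the two-class training split with torchvision's ImageFolder (folder layout assumed)
from torchvision import datasets, transforms

transform = transforms.ToTensor()
train_set = datasets.ImageFolder("Train", transform=transform)   # 1800 images
print(train_set.classes)                                          # e.g. ['NoShrubs', 'Shrubs']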
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset contains satellite imagery of 4,454 power plants within the United States. The imagery is provided at two resolutions: 1 m (4-band NAIP imagery with near-infrared) and 30 m (Landsat 8, pansharpened to 15 m). The NAIP imagery is available for the U.S., and Landsat 8 is available globally. This dataset may be of value for computer vision and machine learning work, as well as energy and environmental analyses. Additionally, annotations of the spatial extent of the power plants in each image are provided. These annotations were collected via the crowdsourcing platform Amazon Mechanical Turk, using multiple annotators for each image to ensure quality. Links to the sources of the imagery data, the annotation tool, and the team that created the dataset are included in the "References" section. To read more on these data, please refer to the "Power Plant Satellite Imagery Dataset Overview.pdf" file. To download a sample of the data without downloading the entire dataset, download "sample.zip", which includes two sample power plants and the NAIP, Landsat 8, and binary annotations for each. Note: the NAIP imagery may appear "washed out" when viewed in standard image viewing software because it includes a near-infrared band in addition to the standard RGB data.
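A minimal sketch of producing a natural-colour preview from a 4-band NAIP GeoTIFF, assuming rasterio and Pillow are installed and that the bands are stored in red, green, blue, near-infrared order (the file name is hypothetical):
# drop the near-infrared band so the preview no longer looks "washed out"
import numpy as np
import rasterio
from PIL import Image

with rasterio.open("naip_sample.tif") as src:       # hypothetical file name
    rgb = src.read([1, 2, 3])                       # bands are 1-indexed in rasterio
rgb = np.transpose(rgb, (1, 2, 0))                  # (bands, rows, cols) -> (rows, cols, bands)
Image.fromarray(rgb.astype(np.uint8)).save("naip_rgb_preview.png")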
U.S. Government Works https://www.usa.gov/government-works
License information was derived automatically
Coast Train is a library of images of coastal environments, annotations, and corresponding thematic label masks (or ‘label images’) collated for the purposes of training and evaluating machine learning (ML), deep learning, and other models for image segmentation. It includes image sets from geospatial satellite, aerial, and UAV imagery and orthomosaics, as well as non-geospatial oblique and nadir imagery. Images cover a diverse range of coastal environments from the U.S. Pacific, Gulf of Mexico, Atlantic, and Great Lakes coastlines, consisting of time series of high-resolution (≤1 m) orthomosaics and satellite image tiles (10–30 m). Each image, image annotation, and labelled image is available as a single NPZ zipped file. NPZ files follow the naming convention {datasource}_{numberofclasses}_{threedigitdatasetversion}.zip, where {datasource} is the source of the original images (for example, NAIP, Landsat 8, Sentinel 2), {numberofclasses} is the number of classes us ...
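A minimal sketch of inspecting one of the NPZ files with NumPy; the file name below merely follows the naming convention above, and the stored array keys are not specified here, so they are listed rather than assumed:
# open an extracted .npz file and list the arrays it contains
import numpy as np

data = np.load("naip_4_001.npz", allow_pickle=True)   # hypothetical file name
print(data.files)                                      # names of the stored arrays (image, label image, ...)
for name in data.files:
    print(name, data[name].shape)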
https://www.datainsightsmarket.com/privacy-policy
The data labeling market is experiencing robust growth, projected to reach $3.84 billion in 2025 and maintain a compound annual growth rate (CAGR) of 28.13% from 2025 to 2033. This expansion is fueled by the increasing demand for high-quality training data across various sectors, including healthcare, automotive, and finance, which rely heavily on machine learning and artificial intelligence (AI). The surge in AI adoption, particularly in areas like autonomous vehicles, medical image analysis, and fraud detection, necessitates vast quantities of accurately labeled data. The market is segmented by sourcing type (in-house vs. outsourced), data type (text, image, audio), labeling method (manual, automatic, semi-supervised), and end-user industry. Outsourcing is expected to dominate the sourcing segment due to cost-effectiveness and access to specialized expertise. Similarly, image data labeling is likely to hold a significant share, given the visual nature of many AI applications. The shift towards automation and semi-supervised techniques aims to improve efficiency and reduce labeling costs, though manual labeling will remain crucial for tasks requiring high accuracy and nuanced understanding. Geographical distribution shows strong potential across North America and Europe, with Asia-Pacific emerging as a key growth region driven by increasing technological advancements and digital transformation.
Competition in the data labeling market is intense, with a mix of established players like Amazon Mechanical Turk and Appen alongside emerging specialized companies. The market's future trajectory will likely be shaped by advancements in automation technologies, the development of more efficient labeling techniques, and the increasing need for specialized data labeling services catering to niche applications. Companies are focusing on improving the accuracy and speed of data labeling through innovations in AI-powered tools and techniques. Furthermore, the rise of synthetic data generation offers a promising avenue for supplementing real-world data, potentially addressing data scarcity challenges and reducing labeling costs in certain applications. This will, however, require careful attention to ensure that the synthetic data generated is representative of real-world data so that model accuracy is maintained.
This comprehensive report provides an in-depth analysis of the global data labeling market, offering invaluable insights for businesses, investors, and researchers. The study period covers 2019-2033, with 2025 as the base and estimated year, and a forecast period of 2025-2033. We delve into market size, segmentation, growth drivers, challenges, and emerging trends, examining the impact of technological advancements and regulatory changes on this rapidly evolving sector. The market is projected to reach multi-billion dollar valuations by 2033, fueled by the increasing demand for high-quality data to train sophisticated machine learning models.
Recent developments include: September 2024: The National Geospatial-Intelligence Agency (NGA) is poised to invest heavily in artificial intelligence, earmarking up to USD 700 million for data labeling services over the next five years. This initiative aims to enhance NGA's machine-learning capabilities, particularly in analyzing satellite imagery and other geospatial data. The agency has opted for a multi-vendor indefinite-delivery/indefinite-quantity (IDIQ) contract, emphasizing the importance of annotating raw data, be it images or videos, to render it understandable for machine learning models. For instance, when dealing with satellite imagery, the focus could be on labeling distinct entities such as buildings, roads, or patches of vegetation. October 2023: Refuel.ai unveiled a new platform, Refuel Cloud, and a specialized large language model (LLM) for data labeling. Refuel Cloud harnesses advanced LLMs, including its proprietary model, to automate data cleaning, labeling, and enrichment at scale, catering to diverse industry use cases. Recognizing that clean data underpins modern AI and data-centric software, Refuel Cloud addresses the historical challenge of human labor bottlenecks in data production. With Refuel Cloud, enterprises can swiftly generate the expansive, precise datasets they require in mere minutes, a task that traditionally spanned weeks.
Key drivers for this market are: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology, Advances in Big Data Analytics based on AI and ML. Potential restraints include: Rising Penetration of Connected Cars and Advances in Autonomous Driving Technology, Advances in Big Data Analytics based on AI and ML. Notable trends are: Healthcare is Expected to Witness Remarkable Growth.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
The dataset consists of three categories: image subsets, burned area masks, and quicklooks. The image subsets are derived from Landsat-8 scenes acquired during 2019 and 2021. Each image has a size of 512x512 pixels and consists of 8 multispectral bands. The sequence of band names from band 1 to band 7 of the image subset is the same as in the original Landsat-8 scene, except for band 8 of the image subset, which corresponds to band 9 (cirrus band) of the original Landsat-8 scene. The image subsets are saved in GeoTIFF file format in the latitude/longitude coordinate system with WGS 1984 as the datum. The spatial resolution of the image subsets is 0.00025 degree, and the pixel values are stored as 16-bit unsigned integers with values ranging from 0 to 65535. The dataset totals 227 images containing burned areas surrounded by ecologically diverse backgrounds such as forest, shrub, grassland, waterbody, bare land, settlement, cloud, and cloud shadow. In some cases, the burned areas are partially covered by smoke because the fire was still active. Some image subsets also overlap each other to cover burned scars that are too large for a single subset. The burned area mask is a binary annotation image consisting of two classes: burned area as the foreground and non-burned area as the background. These binary images are saved as 8-bit unsigned integers, where the burned area is indicated by a pixel value of 1 and the non-burned area by 0. The burned area masks in this dataset contain only burned scars and are not contaminated with thick clouds, shadows, or vegetation. Of the 227 images, 206 contain burned areas and 21 contain only background. Most images have a burned-area coverage of between 0 and 10 percent. The dataset also provides a quicklook image as a fast, full-size preview of each image subset that can be viewed without opening the file in GIS software. The quicklook images can also be used for training and evaluating models as a substitute for the image subsets. Their size is 512x512 pixels, the same as the image subsets and annotation images. Each quicklook is a three-band false-color composite combining band 7 (SWIR-2), band 5 (NIR), and band 4 (red). Contrast stretching was applied to these RGB composites to enhance visualization. The quicklook images are stored in GeoTIFF file format as 8-bit unsigned integers.
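A minimal sketch, assuming rasterio is installed, of reading one image subset and its binary mask and computing the burned-area coverage percentage (file names are hypothetical):
# read the 16-bit, 8-band image subset and the 8-bit binary burned-area mask
import rasterio

with rasterio.open("subset_001.tif") as src:    # hypothetical file name
    image = src.read()                          # array of shape (8, 512, 512), uint16
with rasterio.open("mask_001.tif") as src:      # hypothetical file name
    mask = src.read(1)                          # burned area = 1, background = 0

burned_pct = 100.0 * (mask == 1).sum() / mask.size
print(f"burned-area coverage: {burned_pct:.1f}%")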
This work was financed by the Riset Inovatif Produktif (RISPRO) fund through the Prioritas Riset Nasional (PRN) project, grant no. 255/E1/PRN/2020, for the 2020-2021 contract period.
Satellite Image Classification Dataset (RSI-CB256): this dataset has 4 different classes mixed from sensor imagery and Google Map snapshots.
The past years have witnessed great progress on remote sensing (RS) image interpretation and its wide applications. With RS images becoming more accessible than ever before, there is an increasing demand for the automatic interpretation of these images. In this context, benchmark datasets serve as essential prerequisites for developing and testing intelligent interpretation algorithms. After reviewing existing benchmark datasets in the research community of RS image interpretation, this article discusses the problem of how to efficiently prepare a suitable benchmark dataset for RS image interpretation. Specifically, we first analyze the current challenges of developing intelligent algorithms for RS image interpretation with bibliometric investigations. We then present general guidance on creating benchmark datasets in efficient manners. Following the presented guidance, we also provide an example of building an RS image dataset, i.e., Million-AID, a new large-scale benchmark dataset containing a million instances for RS image scene classification. Several challenges and perspectives in RS image annotation are finally discussed to facilitate the research in benchmark dataset construction. We hope this paper will provide the RS community with an overall perspective on constructing large-scale and practical image datasets for further research, especially data-driven research.
Annotated Datasets for RS Image Interpretation
The interpretation of RS images has been playing an increasingly important role in a large diversity of applications, and thus has attracted remarkable research attention. Consequently, various datasets have been built to advance the development of interpretation algorithms for RS images. Covering literature published over the past decade, we perform a systematic review of the existing RS image datasets concerning the current mainstream RS image interpretation tasks, including scene classification, object detection, semantic segmentation, and change detection.
Artificial Intelligence, Computer Vision, Image Processing, Deep Learning, Satellite Image, Remote Sensing
BD-Sat provides a high-resolution dataset that includes pixel-by-pixel LULC annotations for Dhaka metropolitan city and the surrounding rural/urban areas. Following a strict, standardized procedure, the ground truth was created from Bing satellite imagery at a ground sample distance of 2.22 meters/pixel. A well-defined, three-stage annotation process was followed, with support from geographic information system (GIS) experts, to ensure the reliability of the annotations. We perform several experiments to establish benchmark results. The results show that the annotated BD-Sat is sufficient to train large deep-learning models with adequate accuracy on five major LULC classes: forest, farmland, built-up, water, and meadow.
https://www.verifiedmarketresearch.com/privacy-policy/
Data Annotation Tools Market size was valued at USD 0.03 Billion in 2023 and is projected to reach USD 4.04 Billion by 2030, growing at a CAGR of 25.5% during the forecast period 2024 to 2030.
Global Data Annotation Tools Market Drivers
The market drivers for the Data Annotation Tools Market can be influenced by various factors. These may include:
Rapid Growth in AI and Machine Learning: The demand for data annotation tools to label massive datasets for training and validation purposes is driven by the rapid growth of AI and machine learning applications across a variety of industries, including healthcare, automotive, retail, and finance.
Increasing Data Complexity: As data types such as photos, videos, text, and sensor data become more complex, more sophisticated annotation tools are needed to handle a variety of data formats, annotations, and labeling needs, spurring market adoption and innovation.
Quality and Accuracy Requirements: Training accurate and dependable AI models requires high-quality annotated data. Organizations can attain enhanced annotation accuracy and consistency by utilizing data annotation technologies that come with sophisticated annotation algorithms, quality control measures, and human-in-the-loop capabilities.
Applications Specific to Industries: Particular industries, such as autonomous vehicles, medical imaging, satellite imagery analysis, and natural language processing, have distinct regulatory standards and data annotation requirements that prompt the development of specialized annotation tools.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
[Please use version 1.0.1]
The CloudTracks dataset consists of 1,780 MODIS satellite images hand-labeled for the presence of more than 12,000 ship tracks. More information about how the dataset was constructed may be found at github.com/stanfordmlgroup/CloudTracks. The file structure of the dataset is as follows:
CloudTracks/
    full/
        images/
            (sample image name) mod2002121.1920D.png
        jsons/
            (sample json name) mod2002121.1920D.json
The naming convention is as follows:
mod2002121.1920D: the first 3 letters specify which of the sensors on the two MODIS satellites captured the image, mod for Terra and myd for Aqua. This is followed by a 4 digit year (2002) and a 3 digit day of the year (121). The following 4 digits specify the time of day (1920; 24 hour format in the UTC timezone), followed by D or N for Day or Night.
The 1,780 MODIS Terra and Aqua images were collected between 2002 and 2021 inclusive over various stratocumulus cloud regions (such as the East Pacific and East Atlantic) where ship tracks have commonly been observed. Each image has dimension 1354 x 2030 and a spatial resolution of 1km. Of the 36 bands collected by the instruments, we selected channels 1, 20, and 32 to capture useful physical properties of cloud formations.
The labels are found in the corresponding JSON files for each image. The following keys in the json are particularly important:
imagePath: the filename of the image.
shapes: the list of annotations corresponding to the image, where each element of the list is a dictionary corresponding to a single instance annotation. Each dictionary holds the label of the annotation, which is either "shiptrack" or "uncertain", together with a linestrip detailing the ship track path.
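A minimal sketch of decoding the file naming convention and reading one annotation JSON; the per-shape keys "label" and "points" follow the common labelme convention and are an assumption here:
# parse a CloudTracks file name and list the ship track annotations in its JSON
import json

name = "mod2002121.1920D"
sensor = name[:3]          # "mod" = Terra, "myd" = Aqua
year = name[3:7]           # "2002"
day_of_year = name[7:10]   # "121"
time_utc = name[11:15]     # "1920"
day_or_night = name[15]    # "D"

with open(f"CloudTracks/full/jsons/{name}.json") as f:
    annotation = json.load(f)

print(annotation["imagePath"])
for shape in annotation["shapes"]:
    # "label" and "points" are assumed key names (labelme-style)
    print(shape["label"], "with", len(shape["points"]), "linestrip vertices")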
Further pre-processing details may be found at the GitHub link above. If you have any questions about the dataset, contact us at:
mahmedch@stanford.edu, lynakim@stanford.edu, jirvin16@cs.stanford.edu
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MLRSNet provides different perspectives of the world captured from satellites. That is, it is composed of high spatial resolution optical satellite images. MLRSNet contains 109,161 remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1,500 to 3,000. The images have a fixed size of 256×256 pixels with various pixel resolutions (~10m to 0.1m). Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. The dataset can be used for multi-label based image classification, multi-label based image retrieval, and image segmentation.
The dataset includes: 1. Images folder: 46 categories, 109,161 high-spatial-resolution remote sensing images. 2. Labels folder: each category has a .csv file. 3. Categories_names.xlsx: Sheet1 lists the names of the 46 categories, and Sheet2 shows the multi-labels associated with each category.
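A minimal sketch of inspecting one per-category label CSV with pandas; the category file name and the column layout are assumptions, so the code only prints what it finds:
# open a per-category label file and inspect its columns
import pandas as pd

labels = pd.read_csv("Labels/airport.csv")    # hypothetical category file name
print(labels.shape)                           # (number of images, number of columns)
print(labels.columns.tolist())                # inspect which of the 60 predefined labels appear
print(labels.head())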
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
The dataset contains fully annotated electric transmission and distribution infrastructure for approximately 321 sq km of high-resolution satellite and aerial imagery from around the world. The imagery and associated infrastructure annotations span 14 cities and 5 continents, and were selected to represent diversity in human settlement density (i.e. rural vs urban), terrain type, and development index. This dataset may be of particular interest to those looking to train machine learning algorithms to automatically identify energy infrastructure in satellite imagery, or to those working on domain adaptation for computer vision. Automated algorithms for identifying electricity infrastructure in satellite imagery may assist policy makers in identifying the best pathway to electrification for unelectrified areas.
Data Sources
This dataset contains data sourced from the LINZ Data Service, licensed for reuse under CC BY 4.0. This dataset also contains extracts from the SpaceNet dataset: SpaceNet on Amazon Web Services (AWS). “Datasets.” The SpaceNet Catalog. Last modified April 30, 2018 (link below). Other imagery data included in this dataset are from the Connecticut Department of Energy and Environmental Protection and the U.S. Geological Survey. Links to each of the imagery data sources are provided below, as well as the link to the annotation tool and the GitHub repository that provides tools for using these data.
Acknowledgements
This dataset was created as part of the Duke University Data+ project, "Energy Infrastructure Map of the World" (link below), in collaboration with the Information Initiative at Duke and the Duke University Energy Initiative.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset consists of annotated high-resolution aerial imagery of roof materials in Bonn, Germany, in the Ultralytics YOLO instance segmentation dataset format. Aerial imagery was sourced from OpenAerialMap, specifically from the Maxar Open Data Program. Roof material labels and building outlines were sourced from OpenStreetMap. Images and labels are split into training, validation, and test sets, meant for future machine learning models to be trained upon, for both building segmentation and roof type classification. The dataset is intended for applications such as informing studies on thermal efficiency, roof durability, heritage conservation, or socioeconomic analyses. There are six roof material types: roof tiles, tar paper, metal, concrete, gravel, and glass. Note: The data is in a .zip due to file upload limits. Please find a more detailed dataset description in the README.md
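A minimal training sketch, assuming the unzipped dataset includes a YOLO data.yaml describing the splits and classes (the path below is an assumption) and that the ultralytics package is installed:
# fine-tune a small YOLO segmentation model on the roof material dataset
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")                  # pretrained segmentation checkpoint
model.train(data="roof_materials/data.yaml",    # hypothetical path to the dataset config
            epochs=50, imgsz=640)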
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Overview
This dataset comprises cloud masks for 513 subscenes of 1022-by-1022 pixels, at 20 m resolution, sampled at random from the 2018 Level-1C Sentinel-2 archive. The design of this dataset follows from some observations about cloud masking: (i) performance over an entire product is highly correlated, thus subscenes provide more value per pixel than full scenes; (ii) current cloud masking datasets often focus on specific regions, or hand-select the products used, which introduces a bias into the dataset that is not representative of real-world data; (iii) cloud mask performance appears to be highly correlated with surface type and cloud structure, so testing should include analysis of failure modes in relation to these variables.
The data was annotated semi-automatically using the IRIS toolkit, which allows users to dynamically train a Random Forest (implemented using LightGBM), speeding up annotation by iteratively improving its predictions while preserving the annotator's ability to make final manual changes when needed. This hybrid approach allowed us to process many more masks than would have been possible manually, which we felt was vital in creating a dataset large enough to approximate the statistics of the whole Sentinel-2 archive.
In addition to the pixel-wise, 3-class (CLEAR, CLOUD, CLOUD_SHADOW) segmentation masks, we also provide users with binary classification "tags" for each subscene that can be used in testing to determine performance in specific circumstances. These include:
Wherever practical, cloud shadows were also annotated; however, this was sometimes not possible due to high-relief terrain or large ambiguities. In total, 424 subscenes were marked with shadows (if present), and 89 have shadows that were not annotatable due to very ambiguous shadow boundaries or terrain that cast significant shadows. If users wish to train an algorithm specifically for cloud shadow masks, we advise them to remove those 89 images for which shadow annotation was not possible; however, bear in mind that this will systematically reduce the difficulty of the shadow class compared to real-world use, as these contain the most difficult shadow examples.
In addition to the 20m sampled subscenes and masks, we also provide users with shapefiles that define the boundary of the mask on the original Sentinel-2 scene. If users wish to retrieve the L1C bands at their original resolutions, they can use these to do so.
Please see the README for further details on the dataset structure and more.
Contributions & Acknowledgements
The data were collected, annotated, checked, formatted and published by Alistair Francis and John Mrziglod.
Support and advice was provided by Prof. Jan-Peter Muller and Dr. Panagiotis Sidiropoulos, for which we are grateful.
We would like to extend our thanks to Dr. Pierre-Philippe Mathieu and the rest of the team at ESA PhiLab, who provided the environment in which this project was conceived, and continued to give technical support throughout.
Finally, we thank the ESA Network of Resources for sponsoring this project by providing ICT resources.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0) https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
RarePlanes is a unique open-source machine learning dataset from CosmiQ Works and AI.Reverie that incorporates both real and synthetically generated satellite imagery. The RarePlanes dataset specifically focuses on the value of AI.Reverie synthetic data to aid computer vision algorithms in their ability to automatically detect aircraft and their attributes in satellite imagery. Although other synthetic/real combination datasets exist, RarePlanes is the largest openly-available very-high resolution dataset built to test the value of synthetic data from an overhead perspective. Previous research has shown that synthetic data can reduce the amount of real training data needed and potentially improve performance for many tasks in the computer vision domain. The real portion of the dataset consists of 253 Maxar WorldView-3 satellite scenes spanning 112 locations and 2,142 km^2 with 14,700 hand-annotated aircraft. The accompanying synthetic dataset is generated via AI.Reverie’s novel simulation platform and features 50,000 synthetic satellite images with ~630,000 aircraft annotations. Both the real and synthetically generated aircraft feature 10 fine grain attributes including: aircraft length, wingspan, wing-shape, wing-position, wingspan class, propulsion, number of engines, number of vertical-stabilizers, presence of canards, and aircraft role. Finally, we conduct extensive experiments to evaluate the real and synthetic datasets and compare performances. By doing so, we show the value of synthetic data for the task of detecting and classifying aircraft from an overhead perspective.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Sentinel2GlobalLULC is a deep learning-ready dataset of RGB images from the Sentinel-2 satellites designed for global land use and land cover (LULC) mapping. Sentinel2GlobalLULC v2.1 contains 194,877 images in GeoTiff and JPEG format corresponding to 29 broad LULC classes. Each image has 224 x 224 pixels at 10 m spatial resolution and was produced by assigning the 25th percentile of all available observations in the Sentinel-2 collection between June 2015 and October 2020 in order to remove atmospheric effects (i.e., clouds, aerosols, shadows, snow, etc.). A spatial purity value was assigned to each image based on the consensus across 15 different global LULC products available in Google Earth Engine (GEE).
Our dataset is structured into 3 main zip-compressed folders, an Excel file with a dictionary for class names and descriptive statistics per LULC class, and a python script to convert RGB GeoTiff images into JPEG format. The first folder called "Sentinel2LULC_GeoTiff.zip" contains 29 zip-compressed subfolders where each one corresponds to a specific LULC class with hundreds to thousands of GeoTiff Sentinel-2 RGB images. The second folder called "Sentinel2LULC_JPEG.zip" contains 29 zip-compressed subfolders with a JPEG formatted version of the same images provided in the first main folder. The third folder called "Sentinel2LULC_CSV.zip" includes 29 zip-compressed CSV files with as many rows as provided images and with 12 columns containing the following metadata (this same metadata is provided in the image filenames):
For seven LULC classes, we could not export from GEE all images that fulfilled a spatial purity of 100% since there were millions of them. In this case, we exported a stratified random sample of 14,000 images and provided an additional CSV file with the images actually contained in our dataset. That is, for these seven LULC classes, we provide these 2 CSV files:
To clearly state the geographical coverage of the images available in this dataset, we included in version v2.1 a compressed folder called "Geographic_Representativeness.zip". This zip-compressed folder contains a CSV file for each LULC class that provides the complete list of countries represented in that class. Each CSV file has two columns: the first gives the country code and the second gives the number of images provided in that country for that LULC class. In addition to these 29 CSV files, we provide another CSV file that maps each ISO Alpha-2 country code to its original full country name.
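The dataset ships its own python script for converting the GeoTiff images to JPEG; purely as an illustration, here is a minimal sketch of such a conversion, assuming rasterio and Pillow are installed and that the tile stores 8-bit RGB values (the file name is hypothetical):
# convert one 224 x 224 RGB Sentinel-2 GeoTiff tile to JPEG
import numpy as np
import rasterio
from PIL import Image

with rasterio.open("sample_tile.tif") as src:             # hypothetical file name
    rgb = np.transpose(src.read([1, 2, 3]), (1, 2, 0))    # (rows, cols, bands); rescale first if not 8-bit
Image.fromarray(rgb.astype(np.uint8)).save("sample_tile.jpg", quality=95)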
© Sentinel2GlobalLULC Dataset by Yassir Benhammou, Domingo Alcaraz-Segura, Emilio Guirado, Rohaifa Khaldi, Boujemâa Achchab, Francisco Herrera & Siham Tabik is marked with Attribution 4.0 International (CC-BY 4.0)
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Google Earth Pro facilitated the acquisition of satellite imagery to monitor deforestation in Dhaka, Bangladesh. Multiple years of images were systematically captured from specific locations, allowing comprehensive analysis of tree cover reduction. The imagery displays diverse aspect ratios based on satellite perspectives and possesses high resolution, suitable for remote sensing. Each site provided 5 to 35 images annually, accumulating data over a ten-year period. The dataset classifies images into three primary categories: tree cover, deforested regions, and masked images. Organized by year, it comprises both raw and annotated images, each paired with a JSON file containing annotations and segmentation masks. This organization enhances accessibility and temporal analysis. Furthermore, the dataset is conducive to machine learning initiatives, particularly in training models for object detection and segmentation to evaluate environmental alterations.
https://creativecommons.org/publicdomain/zero/1.0/
In this guide, we will cover how to use this dataset and what information can be derived from it.
First, let’s take a look at the columns in the dataset. We have scene name, difficulty level, annotator name, shadows_marked (yes/no), clear percent, cloud percent, shadow percent, and dataset type (WorldView 2 or 3), followed by coverage percentages for forest/jungle, snow/ice, agricultural, urban/developed, coastal, hills/mountains, desert/barren, shrublands/plains, wetland/bog/marsh, open water, and enclosed water, as well as thin cloud %, thick cloud %, low cloud %, high cloud %, and isolated cloud %, along with extended cloud types (altocumulus/stratocumulus, cirrus, haze/fog, ice clouds, and contrails). All of these columns provide detailed percentages for the different types of landcover and the corresponding cloud types, plus other useful information such as the name of the annotator who created the annotation for a particular scene.
The data within each column can then be used to derive further insights about any given Sentinel-2 subscene, including landcover as well as associated meteorological events such as precipitation and wind patterns, which could enable specific decision-making applications like crop monitoring or urban development tracking, in addition to understanding environmental impacts over large areas easily visible through satellite imagery. Furthermore, by analyzing this data combined with other standard atmospheric parameters such as wind speed and direction, it is possible to track storm paths by looking at cyclonic activity predicted from the conditions in previously gathered satellite images, allowing more accurate forecasting.
- Using the geographical attributes associated with each scene, this dataset can be used to categorize cultures based on their characteristics and geography.
- This dataset can be used to better understand climate data, by looking at how cloud formations are distributed in a region and in relation to weather patterns.
- This dataset can also help with machine learning projects related to object detection, as the cloud patterns and layout of the scenes can be seen as objects that algorithms should try to recognize or identify correctly during training.
If you use this dataset in your research, please credit the original authors.
License: CC0 1.0 Universal (CC0 1.0) - Public Domain Dedication No Copyright - You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information.
File: classification_tags.csv

| Column name | Description |
|:---|:---|
| scene | Unique identifier for each subscene. (String) |
| difficulty | Difficulty rating of the subscene. (Integer) |
| annotator | Name of the annotator who classified the subscene. (String) |
| shadows_marked | Whether shadows were marked in the subscene. (Boolean) |
| clear_percent | Percentage of clear sky in the subscene. (Float) |
| cloud_percent | Percentage of clouds in the subscene. (Float) |
| shadow_percent | Percentage of shadows in the subscene. (Float) |
| dataset | Dataset the subscene was taken from. (String) |
| forest/jungle | Percentage of forest/jungle in the subscene. (Float) |
| snow/ice | Percentage of snow/ice in the subscene. (Float) |
| agricultural ... |
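A minimal sketch, assuming pandas is installed, of loading the tags file and filtering subscenes using the columns described above:
# select mostly clear subscenes from the classification tags
import pandas as pd

tags = pd.read_csv("classification_tags.csv")
mostly_clear = tags[tags["clear_percent"] > 80]
print(mostly_clear[["scene", "annotator", "cloud_percent", "shadow_percent"]].head())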
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
We present our dataset containing images with labeled polygons, annotated over Sentinel-2 L1C imagery from snow- and ice-covered regions. We use labels similar to those of the Fmask cloud detection algorithm, i.e., clear-sky land, cloud, shadow, snow, and water. We annotated the labels manually using the QGIS software. The dataset consists of 45 scenes divided into validation (22 scenes) and test (23 scenes) sets. The source images were captured by the satellite between October 2019 and December 2020. We provide the list of '.SAFE' filenames containing the satellite imagery; these files can be downloaded from the Copernicus Open Access Hub. The dataset can be used to test and benchmark deep neural networks for the task of cloud, shadow, and snow segmentation.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description from the SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery GitHub repository. (The "Note" below was added by the Roboflow team.)
This is a single-class dataset consisting of tiles of satellite imagery labeled with potential 'targets'. Labelers were instructed to draw boxes around anything they suspected might be a paraglider wing, which went missing in a remote area of Nevada. Volunteers were shown examples of similar objects already in the environment for comparison. The missing wing, as it was found after 3 weeks, is shown below.
![anomaly](https://michaeltpublic.s3.amazonaws.com/images/anomaly_small.jpg)
The dataset contains the following:
Set | Images | Annotations |
---|---|---|
Train | 1808 | 3048 |
Validate | 490 | 747 |
Test | 254 | 411 |
Total | 2552 | 4206 |
The data is in the COCO format, and is directly compatible with Faster R-CNN as implemented in Facebook's Detectron2.
Download the data here: sarnet.zip
Or follow these steps
# download the dataset
wget https://michaeltpublic.s3.amazonaws.com/sarnet.zip
# extract the files
unzip sarnet.zip
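As a minimal sketch (the annotation and image paths inside sarnet.zip are assumptions; adjust them to the unzipped layout), the COCO annotations can be registered with Detectron2 like this:
# register the COCO-format SaRNet splits so Detectron2 can use them by name
from detectron2.data.datasets import register_coco_instances

register_coco_instances("sarnet_train", {}, "sarnet/train/annotations.json", "sarnet/train/images")
register_coco_instances("sarnet_test", {}, "sarnet/test/annotations.json", "sarnet/test/images")
# the registered names can then be set in cfg.DATASETS.TRAIN and cfg.DATASETS.TEST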
*Note*: with Roboflow, you can download the data here (original, raw images, with annotations): https://universe.roboflow.com/roboflow-public/sarnet-search-and-rescue/ (download v1, original_raw-images). Download the dataset in COCO JSON format, or another format of choice, and import it to Roboflow after unzipping the folder to get started on your project.
Get started with a Faster R-CNN model pretrained on SaRNet: SaRNet_Demo.ipynb
Source code for the paper is located here: SaRNet_train_test.ipynb
@misc{thoreau2021sarnet,
title={SaRNet: A Dataset for Deep Learning Assisted Search and Rescue with Satellite Imagery},
author={Michael Thoreau and Frazer Wilson},
year={2021},
eprint={2107.12469},
archivePrefix={arXiv},
primaryClass={eess.IV}
}
The source data was generously provided by Planet Labs, Airbus Defence and Space, and Maxar Technologies.
Attribution 4.0 (CC BY 4.0) https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Context and Aim
Deep learning in Earth Observation requires large image archives with highly reliable labels for model training and testing. However, a preferable quality standard for forest applications in Europe has not yet been determined. The TreeSatAI consortium investigated numerous sources for annotated datasets as an alternative to manually labeled training datasets.
We found the federal forest inventory of Lower Saxony, Germany represents an unseen treasure of annotated samples for training data generation. The respective 20-cm Color-infrared (CIR) imagery, which is used for forestry management through visual interpretation, constitutes an excellent baseline for deep learning tasks such as image segmentation and classification.
Description
The data archive is highly suitable for benchmarking as it represents the real-world data situation of many German forest management services. On the one hand, it has a high number of samples, which are supported by the high-resolution aerial imagery. On the other hand, this data archive presents challenges, including class label imbalances between the different forest stand types.
The TreeSatAI Benchmark Archive contains:
50,381 image triplets (aerial, Sentinel-1, Sentinel-2)
synchronized time steps and locations
all original spectral bands/polarizations from the sensors
20 species classes (single labels)
12 age classes (single labels)
15 genus classes (multi labels)
60 m and 200 m patches
fixed split for train (90%) and test (10%) data
additional single labels such as English species name, genus, forest stand type, foliage type, land cover
The GeoTIFF and GeoJSON files are readable in any GIS software, such as QGIS. For further information, we refer to the PDF document in the archive and the publications in the reference section.
Version history
v1.0.0 - First release
Citation
Ahlswede et al. (in prep.)
GitHub
Full code examples and pre-trained models from the dataset article (Ahlswede et al. 2022) using the TreeSatAI Benchmark Archive are published on the GitHub repositories of the Remote Sensing Image Analysis (RSiM) Group (https://git.tu-berlin.de/rsim/treesat_benchmark). Code examples for the sampling strategy can be made available by Christian Schulz via email request.
Folder structure
We refer to the proposed folder structure in the PDF file.
Folder “aerial” contains the aerial imagery patches derived from summertime orthophotos of the years 2011 to 2020. Patches are available in 60 x 60 m (304 x 304 pixels). Band order is near-infrared, red, green, and blue. Spatial resolution is 20 cm.
Folder “s1” contains the Sentinel-1 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is VV, VH, and VV/VH ratio. Spatial resolution is 10 m.
Folder “s2” contains the Sentinel-2 imagery patches derived from summertime mosaics of the years 2015 to 2020. Patches are available in 60 x 60 m (6 x 6 pixels) and 200 x 200 m (20 x 20 pixels). Band order is B02, B03, B04, B08, B05, B06, B07, B8A, B11, B12, B01, and B09. Spatial resolution is 10 m.
The folder “labels” contains a JSON string which was used for multi-labeling of the training patches. A code example for an image sample with respective proportions of 94% Abies and 6% Larix is: "Abies_alba_3_834_WEFL_NLF.tif": [["Abies", 0.93771], ["Larix", 0.06229]] (a short parsing sketch is given after the folder descriptions below).
The two files “test_filesnames.lst” and “train_filenames.lst” define the filenames used for train (90%) and test (10%) split. We refer to this fixed split for better reproducibility and comparability.
The folder “geojson” contains geoJSON files with all the samples chosen for the derivation of training patch generation (point, 60 m bounding box, 200 m bounding box).
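A minimal sketch, assuming rasterio is installed, of reading a 60 m Sentinel-2 patch with the band order stated above (B04 = red is band 3, B08 = near-infrared is band 4) and decoding its multi-label entry; the file paths and the 0.05 proportion threshold are assumptions:
# compute NDVI from a Sentinel-2 patch and turn its label proportions into genus labels
import json
import numpy as np
import rasterio

patch = "Abies_alba_3_834_WEFL_NLF.tif"                 # sample name from the labels example above

with rasterio.open(f"s2/60m/{patch}") as src:           # hypothetical folder layout
    red = src.read(3).astype(np.float32)                # B04
    nir = src.read(4).astype(np.float32)                # B08
ndvi = (nir - red) / (nir + red + 1e-6)
print("mean NDVI:", float(ndvi.mean()))

with open("labels/multi_labels.json") as f:             # hypothetical file name
    labels = json.load(f)
proportions = dict(labels[patch])                       # {"Abies": 0.93771, "Larix": 0.06229}
genera = [genus for genus, share in proportions.items() if share >= 0.05]
print(genera)                                           # ['Abies', 'Larix']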
CAUTION: As we could not upload the aerial patches as a single zip file on Zenodo, you need to download the 20 single species files (aerial_60m_…zip) separately. Then, unzip them into a folder named “aerial” with a subfolder named “60m”. This structure is recommended for better reproducibility and comparability to the experimental results of Ahlswede et al. (2022).
Join the archive
Model training, benchmarking, algorithm development… many applications are possible! Feel free to add samples from other regions in Europe or even worldwide. Additional remote sensing data from Lidar, UAVs, or aerial imagery from different time steps are very welcome. This helps the research community in developing better deep learning and machine learning models for forest applications. If you have questions or want to share code, results, or publications using the archive, feel free to contact the authors.
Project description
This work was part of the project TreeSatAI (Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees at Infrastructures, Nature Conservation Sites and Forests). Its overall aim is the development of AI methods for the monitoring of forests and woody features on a local, regional and global scale. Based on freely available geodata from different sources (e.g., remote sensing, administration maps, and social media), prototypes will be developed for the deep learning-based extraction and classification of tree- and tree stand features. These prototypes deal with real cases from the monitoring of managed forests, nature conservation and infrastructures. The development of the resulting services by three enterprises (liveEO, Vision Impulse and LUP Potsdam) will be supported by three research institutes (German Research Center for Artificial Intelligence, TU Remote Sensing Image Analysis Group, TUB Geoinformation in Environmental Planning Lab).
Publications
Ahlswede et al. (2022, in prep.): TreeSatAI Dataset Publication
Ahlswede S., Nimisha, T.M., and Demir, B. (2022, in revision): Embedded Self-Enhancement Maps for Weakly Supervised Tree Species Mapping in Remote Sensing Images. IEEE Trans Geosci Remote Sens
Schulz et al. (2022, in prep.): Phenoprofiling
Conference contributions
S. Ahlswede, N. T. Madam, C. Schulz, B. Kleinschmit and B. Demir, "Weakly Supervised Semantic Segmentation of Remote Sensing Images for Tree Species Classification Based on Explanation Methods", IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, “Exploring the temporal fingerprints of mid-European forest types from Sentinel-1 RVI and Sentinel-2 NDVI time series”, IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.
C. Schulz, M. Förster, S. Vulova and B. Kleinschmit, “The temporal fingerprints of common European forest types from SAR and optical remote sensing data”, AGU Fall Meeting, New Orleans, USA, 2021.
B. Kleinschmit, M. Förster, C. Schulz, F. Arias, B. Demir, S. Ahlswede, A. K. Aksoy, T. Ha Minh, J. Hees, C. Gava, P. Helber, B. Bischke, P. Habelitz, A. Frick, R. Klinke, S. Gey, D. Seidel, S. Przywarra, R. Zondag and B. Odermatt, “Artificial Intelligence with Satellite data and Multi-Source Geodata for Monitoring of Trees and Forests”, Living Planet Symposium, Bonn, Germany, 2022.
C. Schulz, M. Förster, S. Vulova, T. Gränzig and B. Kleinschmit, (2022, submitted): “Exploring the temporal fingerprints of sixteen mid-European forest types from Sentinel-1 and Sentinel-2 time series”, ForestSAT, Berlin, Germany, 2022.