Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.
We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. 'books', 'appliances', etc.)
On December 16, 2021 the IRS announced that it would discontinue updates to the IRS 990 Filings dataset on AWS, starting December 31, 2021.The IRS has requested public inquiries be directed to +1-800-829-1040.Machine-readable data from certain electronic 990 forms filed with the IRS from 2013 to present.
Attribution 4.0 (CC BY 4.0)https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER, where named entities, apart from the surface form, context represents a critical role in distinguishing between the different fine-grained types (e.g. Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where the noise has been applied to both context tokens as well as the entity tokens. The noise includes typing errors at character level based on keyboard layouts in the the different languages.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0)https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support, 1) Exploration of the characteristics of nanopore sequence data. 2) Assessment and reproduction of performance benchmarks 3) Development of tools and methods. The data deposited showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly-available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides masked sentences and multi-token phrases that were masked-out of these sentences. We offer 3 datasets: a general purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from pubmed abstracts. As for the pubmed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows: text- the original sentence span- the span (phrase) which is masked out span_lower- the lowercase version of span range- the range in the text string which will be masked out (this is important because span might appear more than once in text) freq- the corpus frequency of span_lower masked_text- the masked version of text, span is replaced with [MASK] Additinaly, we provide a small (3K) dataset with human annotations.
Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6 days revisit cycle and ability to observe through clouds makes it perfect for sea and land monitoring, emergency response due to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from beginning to the present, converted to cloud-optimized GeoTIFF format.
SpaceNet, launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with data sets from projects like IARPA’s Functional Map of the World (fMoW).
New imagery and features are added quarterly
Various (See here for more details)
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
This dataset provides product related questions, including their textual content and gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its id and title.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Voice-based refinements of product search
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0)https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Original StackExchange answers and their voice-friendly Reformulation.
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile format as well as Apache Parquet with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
This joint NASA/USGS program provides the longest continuous space-based record of
Earth’s land in existence. Every day, Landsat satellites provide essential information
to help land managers and policy makers make wise decisions about our resources and our environment.
Data is provided for Landsats 1, 2, 3, 4, 5, 7, 8, and 9 (excludes Landsat 6).As of June 28, 2023 (announcement),
the previous single SNS topic arn:aws:sns:us-west-2:673253540267:public-c2-notify
was replaced with
three new SNS topics for different types of scenes.
This dataset provides product related questions and answers, including answers' quality labels, as as part of the paper 'IR Evaluation and Learning in the Presence of Forbidden Documents'.
Open Database License (ODbL) v1.0https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is published and openly available under the terms of the World Bank’s open data license.
Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.
CC0 1.0 Universal Public Domain Dedicationhttps://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).
Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provides authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.
Terrestrial CDRs are composed of sensor data that have been improved and quality controlled over time, together with ancillary calibration data.
Apache License, v2.0https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.