100+ datasets found

Registry of Open Data on AWS
registry.opendata.aws
Updated Aug 13, 2021
Amazon Web Services (2021). Registry of Open Data on AWS [Dataset]. https://registry.opendata.aws/registry-open-data/
Dataset updated
Aug 13, 2021
Dataset provided by
Amazon Web Services: https://aws.amazon.com/
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.
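Because the registry is published in ndjson (newline-delimited JSON) form, each line of the file is a standalone JSON object. A minimal sketch of reading such a file in Python follows. The field names `Name` and `Description` and the two sample records are illustrative assumptions, not the registry's confirmed schema.

```python
import io
import json


def read_ndjson(stream):
    """Yield one parsed JSON object per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Hypothetical sample standing in for the registry file (field names are assumed).
sample = io.StringIO(
    '{"Name": "NEXRAD on AWS", "Description": "Real-time and archival radar data."}\n'
    '{"Name": "Sentinel-1", "Description": "Global Sentinel-1 GRD archive."}\n'
)
records = list(read_ndjson(sample))
names = [record["Name"] for record in records]
```

Reading line by line keeps memory use flat, since no record depends on any other line of the file.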
Data from: The Multilingual Amazon Reviews Corpus
registry.opendata.aws
Updated May 28, 2020
Amazon (2020). The Multilingual Amazon Reviews Corpus [Dataset]. https://registry.opendata.aws/amazon-reviews-ml/
Dataset updated
May 28, 2020
Dataset provided by
Amazon.com: http://amazon.com/
Description
We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. 'books', 'appliances').
IRS 990 Filings
registry.opendata.aws
Updated Dec 16, 2021
The Internal Revenue Service (2021). IRS 990 Filings [Dataset]. https://registry.opendata.aws/irs990/
Dataset updated
Dec 16, 2021
Dataset provided by
Internal Revenue Service: http://www.irs.gov/
Description
On December 16, 2021, the IRS announced that it would discontinue updates to the IRS 990 Filings dataset on AWS, starting December 31, 2021. The IRS has requested that public inquiries be directed to +1-800-829-1040. The dataset contains machine-readable data from certain electronic 990 forms filed with the IRS from 2013 to present.
MultiCoNER Datasets
registry.opendata.aws
Updated Mar 26, 2022
Amazon (2022). MultiCoNER Datasets [Dataset]. https://registry.opendata.aws/multiconer/
Dataset updated
Mar 26, 2022
Dataset provided by
Amazon.com: http://amazon.com/
License
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Description
MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER in which, apart from the surface form of a named entity, the context plays a critical role in distinguishing between the different fine-grained types (e.g. Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where the noise has been applied to both the context tokens and the entity tokens. The noise includes character-level typing errors based on keyboard layouts in the different languages.
Oxford Nanopore Technologies Benchmark Datasets
registry.opendata.aws
Updated Sep 29, 2020
Oxford Nanopore Technologies (2020). Oxford Nanopore Technologies Benchmark Datasets [Dataset]. https://registry.opendata.aws/ont-open-data/
Dataset updated
Sep 29, 2020
Dataset provided by
Oxford Nanopore Technologies: http://nanoporetech.com/
License
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
Description
The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; 3) development of tools and methods. The deposited data showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.
Multi Token Completion
registry.opendata.aws
Updated Feb 11, 2023
Amazon (2023). Multi Token Completion [Dataset]. https://registry.opendata.aws/multi-token-completion/
Dataset updated
Feb 11, 2023
Dataset provided by
Amazon.com: http://amazon.com/
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
This dataset provides masked sentences and the multi-token phrases that were masked out of these sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:
text: the original sentence
span: the span (phrase) which is masked out
span_lower: the lowercase version of span
range: the range in the text string which will be masked out (this is important because span might appear more than once in text)
freq: the corpus frequency of span_lower
masked_text: the masked version of text, in which span is replaced with [MASK]
Additionally, we provide a small (3K) dataset with human annotations.
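The relationship between text, span, range, and masked_text can be sketched in Python. This is only an illustration under assumptions: it treats range as a (start, end) pair of character offsets with an exclusive end, which the description does not confirm, and the example sentence and the helper `mask_span` are made up.

```python
def mask_span(text, span_range, mask_token="[MASK]"):
    """Replace the characters in span_range (start inclusive, end exclusive) with mask_token."""
    start, end = span_range
    return text[:start] + mask_token + text[end:]


# Made-up datapoint: "the cat" occurs twice, so span alone is ambiguous.
text = "I saw the cat and the cat saw me."
span = "the cat"
second = text.index(span, text.index(span) + 1)  # offset of the second occurrence
span_range = (second, second + len(span))

masked_text = mask_span(text, span_range)
span_lower = span.lower()
```

Here the range selects the second occurrence, which a plain string replacement of span could not do unambiguously.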
NEXRAD on AWS
registry.opendata.aws
s.cnmilf.com
Updated Apr 19, 2018
Unidata (2018). NEXRAD on AWS [Dataset]. https://registry.opendata.aws/noaa-nexrad/
Dataset updated
Apr 19, 2018
Dataset provided by
Unidata: https://www.unidata.ucar.edu/
Description
Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Sentinel-1
registry.opendata.aws
data.subak.org
Updated Apr 20, 2018
Sinergise (2018). Sentinel-1 [Dataset]. https://registry.opendata.aws/sentinel-1/
Dataset updated
Apr 20, 2018
Dataset provided by
Sinergise: https://www.sinergise.com/
Description
Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from the beginning to the present, converted to cloud-optimized GeoTIFF format.
SpaceNet
data.subak.org
registry.opendata.aws
Updated Feb 16, 2023
Spacenet (2023). SpaceNet [Dataset]. https://data.subak.org/dataset/spacenet
Dataset updated
Feb 16, 2023
Dataset provided by
Spacenet
Description
SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).

Documentation

https://spacenet.ai/

Update Frequency

New imagery and features are added quarterly

License

Various (See here for more details)
Pre- and post-purchase product questions
registry.opendata.aws
Updated May 24, 2021
Amazon (2021). Pre- and post-purchase product questions [Dataset]. https://registry.opendata.aws/pre-post-purchase-questions/
Dataset updated
May 24, 2021
Dataset provided by
Amazon.com: http://amazon.com/
License
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
Description
This dataset provides product-related questions, including their textual content and the gap, in hours, between purchase and posting time. Each question is also associated with related product details, including the product's id and title.
VoiSeR
registry.opendata.aws
Updated Apr 12, 2021
Amazon (2021). VoiSeR [Dataset]. https://registry.opendata.aws/amazon-conversational-product-search/
Dataset updated
Apr 12, 2021
Dataset provided by
Amazon.com: http://amazon.com/
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Voice-based refinements of product search
iNaturalist Licensed Observation Images
registry.opendata.aws
data.subak.org
Updated Mar 23, 2021
iNaturalist is an independent, tax-exempt, 501(c)(3), not-for-profit organization based in the United States of America (EIN/Tax ID: 92-1296468). (2021). iNaturalist Licensed Observation Images [Dataset]. https://registry.opendata.aws/inaturalist-open-data/
Dataset updated
Mar 23, 2021
Dataset provided by
iNaturalist: http://inaturalist.org/
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.
Answer Reformulation
registry.opendata.aws
Updated Apr 10, 2020
Amazon (2020). Answer Reformulation [Dataset]. https://registry.opendata.aws/answer-reformulation/
Dataset updated
Apr 10, 2020
Dataset provided by
Amazon.com: http://amazon.com/
License
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
Description
Original StackExchange answers and their voice-friendly Reformulation.
Speedtest by Ookla Global Fixed and Mobile Network Performance Maps
registry.opendata.aws
Updated Sep 30, 2020
Ookla (2020). Speedtest by Ookla Global Fixed and Mobile Network Performance Maps [Dataset]. https://registry.opendata.aws/speedtest-global-performance/
Dataset updated
Sep 30, 2020
Dataset provided by
Speedtest.net: http://www.speedtest.net/
License
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Description
Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile format as well as Apache Parquet with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
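The quoted tile size can be reproduced from the zoom level: at zoom z the equator is divided into 2^z tiles, so one tile spans the Earth's circumference divided by 2^z. The Python sketch below is a rough check only. It assumes the 610.8 m figure was derived from a spherical mean Earth radius of 6,371 km, which the description does not state; with the WGS84 equatorial radius commonly used for Web Mercator, the result is slightly larger.

```python
import math

MEAN_EARTH_RADIUS_M = 6_371_000.0        # spherical mean radius (assumption)
WGS84_EQUATORIAL_RADIUS_M = 6_378_137.0  # equatorial radius used by Web Mercator


def tile_width_at_equator(zoom, radius_m):
    """Ground width in meters of one tile at the equator for a given zoom level."""
    tiles_across = 2 ** zoom  # number of tiles per row at this zoom level
    return 2 * math.pi * radius_m / tiles_across


width_mean = tile_width_at_equator(16, MEAN_EARTH_RADIUS_M)
width_wgs84 = tile_width_at_equator(16, WGS84_EQUATORIAL_RADIUS_M)
```

Tiles cover less ground away from the equator, roughly in proportion to the cosine of the latitude, which is why the figure is quoted for the equator.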
USGS Landsat
registry.opendata.aws
Updated Apr 19, 2018
United States Geological Survey (2018). USGS Landsat [Dataset]. https://registry.opendata.aws/usgs-landsat/
Dataset updated
Apr 19, 2018
Dataset provided by
United States Geological Survey: http://www.usgs.gov/
Description
This joint NASA/USGS program provides the longest continuous space-based record of Earth’s land in existence. Every day, Landsat satellites provide essential information to help land managers and policy makers make wise decisions about our resources and our environment. Data is provided for Landsats 1, 2, 3, 4, 5, 7, 8, and 9 (excludes Landsat 6). As of June 28, 2023 (announcement), the previous single SNS topic arn:aws:sns:us-west-2:673253540267:public-c2-notify was replaced with three new SNS topics for different types of scenes.
Learning to Rank and Filter - community question answering
registry.opendata.aws
Updated Apr 22, 2022
Amazon (2022). Learning to Rank and Filter - community question answering [Dataset]. https://registry.opendata.aws/ltrf-cqa-dataset/
Dataset updated
Apr 22, 2022
Dataset provided by
Amazon.com: http://amazon.com/
Description
This dataset provides product-related questions and answers, including answers' quality labels, as part of the paper 'IR Evaluation and Learning in the Presence of Forbidden Documents'.
World Bank - Light Every Night
registry.opendata.aws
Updated Jan 21, 2021
World Bank Group (2021). World Bank - Light Every Night [Dataset]. https://registry.opendata.aws/wb-light-every-night/
Dataset updated
Jan 21, 2021
Dataset provided by
World Bank: http://worldbank.org/
License
Open Database License (ODbL) v1.0: https://www.opendatacommons.org/licenses/odbl/1.0/
License information was derived automatically
Description
Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is published and openly available under the terms of the World Bank’s open data license.
Airborne Object Tracking Dataset
registry.opendata.aws
Updated Jun 22, 2021
Airborne Object Tracking Dataset [Dataset]. https://registry.opendata.aws/airborne-object-tracking/
Dataset updated
Jun 22, 2021
Dataset provided by
Amazon.com: http://amazon.com/
Description
Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
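The figures in this description can be cross-checked with simple arithmetic. The Python sketch below is a back-of-the-envelope estimate, not an official breakdown: it assumes every sequence lasts exactly 120 seconds and takes the "3.3M+" annotation count at its lower bound.

```python
sequences = 4_943
seconds_per_sequence = 120   # "around 120 seconds each", treated as exact here
frames_per_second = 10       # collected at 10 Hz

# Approximate total image count implied by the sequence figures.
estimated_images = sequences * seconds_per_sequence * frames_per_second

unlabeled_frames = 3_306_350
labeled_frames = estimated_images - unlabeled_frames

annotations = 3_300_000      # lower bound taken from "3.3M+"
labels_per_labeled_image = annotations / labeled_frames
```

Under these assumptions the estimate comes to 5,931,600 images, in line with the stated 5.9M+, and about 1.3 labels per labeled image once rounded to one decimal place.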
Storm EVent ImageRy (SEVIR)
registry.opendata.aws
data.subak.org
Updated Apr 24, 2020
Mark S. Veillette (2020). Storm EVent ImageRy (SEVIR) [Dataset]. https://registry.opendata.aws/sevir/
Dataset updated
Apr 24, 2020
Dataset provided by
Mark S. Veillette
Description
Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.
NOAA Terrestrial Climate Data Records
registry.opendata.aws
data.subak.org
Updated Jul 17, 2021
NOAA (2021). NOAA Terrestrial Climate Data Records [Dataset]. https://registry.opendata.aws/noaa-cdr-terrestrial/
Dataset updated
Jul 17, 2021
Dataset provided by
National Oceanic and Atmospheric Administration: http://www.noaa.gov/
License
CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
License information was derived automatically
Description
NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).

Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.

Terrestrial CDRs are composed of sensor data that have been improved and quality controlled over time, together with ancillary calibration data.
