100+ datasets found
  1. Registry of Open Data on AWS

    • registry.opendata.aws
    Updated Aug 13, 2021
    Cite
    Amazon Web Services (2021). Registry of Open Data on AWS [Dataset]. https://registry.opendata.aws/registry-open-data/
    Dataset updated
    Aug 13, 2021
    Dataset provided by
    Amazon Web Services (https://aws.amazon.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    The Registry of Open Data on AWS contains publicly available datasets that can be accessed from AWS resources. Note that these datasets are available via AWS resources but are not provided by AWS; they are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.
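
Because only the ndjson (newline-delimited JSON) form is populated, a consumer would typically read it one JSON object per line. The following is a minimal sketch of that pattern; the sample records and the field names `Name` and `Tags` are hypothetical placeholders, not the registry's actual schema.

```python
import json

# Hypothetical sample in ndjson form: one JSON object per line.
# The field names are illustrative assumptions, not the registry's schema.
SAMPLE_NDJSON = """\
{"Name": "Example dataset A", "Tags": ["weather"]}

{"Name": "Example dataset B", "Tags": ["genomics", "benchmark"]}
"""


def parse_ndjson(text):
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_ndjson(SAMPLE_NDJSON)
names = [record["Name"] for record in records]
```

Parsing line by line also means a large file can be streamed rather than loaded whole, which is the usual reason to prefer ndjson over a single JSON array.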

  2. OpenStreetMap on AWS

    • registry.opendata.aws
    Updated May 9, 2025
    Cite
    OpenStreetMap Foundation (OSMF) and Pacific Atlas (2025). OpenStreetMap on AWS [Dataset]. https://registry.opendata.aws/osm/
    Dataset updated
    May 9, 2025
    Dataset provided by
    OpenStreetMap (https://www.openstreetmap.org/)
    Description

    OSM is a free, editable map of the world, created and maintained by volunteers. Regular OSM data archives are made available in Amazon S3 in both standard formats (OSM PBF, XML) and cloud-native formats optimized for analytics workloads.

  3. MultiCoNER Datasets

    • registry.opendata.aws
    Updated Mar 26, 2022
    Cite
    Amazon (2022). MultiCoNER Datasets [Dataset]. https://registry.opendata.aws/multiconer/
    Dataset updated
    Mar 26, 2022
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER, where, apart from the surface form of a named entity, context plays a critical role in distinguishing between the different fine-grained types (e.g. Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where noise has been applied to both the context tokens and the entity tokens. The noise includes character-level typing errors based on keyboard layouts in the different languages.

  4. IRS 990 Filings

    • registry.opendata.aws
    Updated Dec 16, 2021
    Cite
    The Internal Revenue Service (2021). IRS 990 Filings [Dataset]. https://registry.opendata.aws/irs990/
    Dataset updated
    Dec 16, 2021
    Dataset provided by
    Internal Revenue Service (http://www.irs.gov/)
    Description

    On December 16, 2021, the IRS announced that it would discontinue updates to the IRS 990 Filings dataset on AWS, starting December 31, 2021. The IRS has requested that public inquiries be directed to +1-800-829-1040. The dataset contains machine-readable data from certain electronic 990 forms filed with the IRS from 2013 to present.

  5. NAIP on AWS

    • registry.opendata.aws
    Updated Apr 19, 2018
    Cite
    Esri (2018). NAIP on AWS [Dataset]. https://registry.opendata.aws/naip/
    Dataset updated
    Apr 19, 2018
    Dataset provided by
    Esri (https://www.esri.com/en-us/home)
    Description

    The National Agriculture Imagery Program (NAIP) acquires aerial imagery during the agricultural growing seasons in the continental U.S. This "leaf-on" imagery typically ranges from 30 centimeters to 100 centimeters in resolution. It is available from the naip-analytic Amazon S3 bucket as 4-band (RGB + NIR) imagery in MRF format, from the naip-source Amazon S3 bucket as 4-band (RGB + NIR) imagery in uncompressed raw GeoTIFF format, and from the naip-visualization Amazon S3 bucket as 3-band (RGB) imagery in Cloud Optimized GeoTIFF format.

  6. Oxford Nanopore Technologies Benchmark Datasets

    • registry.opendata.aws
    Updated Sep 29, 2020
    Cite
    Oxford Nanopore Technologies (2020). Oxford Nanopore Technologies Benchmark Datasets [Dataset]. https://registry.opendata.aws/ont-open-data/
    Dataset updated
    Sep 29, 2020
    Dataset provided by
    Oxford Nanopore Technologies (http://nanoporetech.com/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; 3) development of tools and methods. The deposited data showcase DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.

  7. Data from: The Multilingual Amazon Reviews Corpus

    • registry.opendata.aws
    Updated May 28, 2020
    Cite
    Amazon (2020). The Multilingual Amazon Reviews Corpus [Dataset]. https://registry.opendata.aws/amazon-reviews-ml/
    Dataset updated
    May 28, 2020
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. 'books', 'appliances').

  8. iNaturalist Licensed Observation Images

    • registry.opendata.aws
    Updated Mar 23, 2021
    Cite
    iNaturalist (2021). iNaturalist Licensed Observation Images [Dataset]. https://registry.opendata.aws/inaturalist-open-data/
    Dataset updated
    Mar 23, 2021
    Dataset provided by
    iNaturalist (http://inaturalist.org/)
    License

    CC0 1.0 Universal Public Domain Dedication: https://creativecommons.org/publicdomain/zero/1.0/
    License information was derived automatically

    Description

    iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.

  9. NEXRAD on AWS

    • registry.opendata.aws
    Updated Apr 19, 2018
    Cite
    Unidata (2018). NEXRAD on AWS [Dataset]. https://registry.opendata.aws/noaa-nexrad/
    Dataset updated
    Apr 19, 2018
    Dataset provided by
    Unidata (https://www.unidata.ucar.edu/)
    Description

    Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.

  10. Amazon Bin Image Dataset File List

    • kaggle.com
    Updated Apr 23, 2022
    Cite
    William Hyun (2022). Amazon Bin Image Dataset File List [Dataset]. https://www.kaggle.com/datasets/williamhyun/amazon-bin-image-dataset-file-list
    Dataset updated
    Apr 23, 2022
    Dataset provided by
    Kaggle (http://kaggle.com/)
    Authors
    William Hyun
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Amazon Bin Image Dataset

    The Amazon Bin Image Dataset contains 536,434 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations. This dataset has many images and the corresponding metadata.

    The image files fall into three groups according to their naming scheme.

    • File names with 1 to 4 digits (1,200 files): 1.jpg to 1200.jpg
    • File names with 5 digits (99,999 files): 00001.jpg to 99999.jpg
    • File names with 6 digits (435,235 files): 100000.jpg to 535234.jpg

    The Amazon Bin Image Dataset File List provides a CSV file containing all file locations and quantities, to help with analysis and distributed learning.
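
The three naming groups can be enumerated to check that they add up to the stated 536,434 images. This is a minimal sketch, assuming that every number in each range corresponds to exactly one file and that the 5-digit group is zero-padded as the example names suggest; the function name `bin_image_filenames` is hypothetical.

```python
def bin_image_filenames():
    """Yield image file names for the three naming groups in the description."""
    # Group 1: 1 to 4 digits, no zero padding (1.jpg to 1200.jpg).
    for i in range(1, 1201):
        yield f"{i}.jpg"
    # Group 2: exactly 5 digits, zero padded (00001.jpg to 99999.jpg).
    for i in range(1, 100000):
        yield f"{i:05d}.jpg"
    # Group 3: 6 digits (100000.jpg to 535234.jpg).
    for i in range(100000, 535235):
        yield f"{i}.jpg"


names = list(bin_image_filenames())
total = len(names)        # 1,200 + 99,999 + 435,235
unique = len(set(names))  # the groups should not collide on file name
```

Under these assumptions the groups do not overlap, because a padded 5-digit name such as 00001.jpg is a different string from 1.jpg.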

  11. SpaceNet

    • registry.opendata.aws
    Updated Aug 15, 2016
    Cite
    SpaceNet (2016). SpaceNet [Dataset]. https://registry.opendata.aws/spacenet/
    Dataset updated
    Aug 15, 2016
    Dataset provided by
    SpaceNet (https://spacenet.ai/)
    Description

    SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).

  12. OpenAQ

    • registry.opendata.aws
    Updated Apr 20, 2018
    Cite
    OpenAQ (2018). OpenAQ [Dataset]. https://registry.opendata.aws/openaq/
    Dataset updated
    Apr 20, 2018
    Dataset provided by
    Openaq Inc.
    Description

    Global, aggregated physical air quality data from public sources, including government, research-grade and other sources. These awesome groups do the hard work of measuring these data and publicly sharing them, and our community makes them more universally accessible to both humans and machines.

  13. Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

    • registry.opendata.aws
    Updated Sep 30, 2020
    Cite
    Ookla (2020). Speedtest by Ookla Global Fixed and Mobile Network Performance Maps [Dataset]. https://registry.opendata.aws/speedtest-global-performance/
    Dataset updated
    Sep 30, 2020
    Dataset provided by
    Ookla (https://www.ookla.com/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
    License information was derived automatically

    Description

    Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile and Apache Parquet formats, with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
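
The quoted tile size can be reproduced from the zoom level: a Web Mercator tile at zoom z spans 1/2^z of the Earth's circumference at the equator. The sketch below is an illustration, not Ookla's method; it assumes a spherical Earth, and the observation that a mean radius of 6,371 km yields about 610.8 m is an inference from the arithmetic rather than something the description states. The function name `tile_edge_m` is hypothetical.

```python
import math


def tile_edge_m(zoom, radius_m, lat_deg=0.0):
    """Ground length in meters of one Web Mercator tile edge at a latitude."""
    circumference = 2 * math.pi * radius_m
    # Tiles shrink on the ground by cos(latitude) away from the equator.
    return circumference * math.cos(math.radians(lat_deg)) / (2 ** zoom)


MEAN_RADIUS_M = 6_371_000.0       # mean Earth radius (assumption)
WGS84_EQUATORIAL_M = 6_378_137.0  # WGS 84 semi-major axis

edge_mean = tile_edge_m(16, MEAN_RADIUS_M)        # about 610.8 m
edge_wgs84 = tile_edge_m(16, WGS84_EQUATORIAL_M)  # about 611.5 m
```

Either radius gives a tile roughly 611 m on a side at the equator; passing a non-zero `lat_deg` shows how much smaller the same tile is on the ground at higher latitudes.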

  14. Data from: Sentinel-2

    • registry.opendata.aws
    Updated Apr 19, 2018
    Cite
    Sinergise (2018). Sentinel-2 [Dataset]. https://registry.opendata.aws/sentinel-2/
    Dataset updated
    Apr 19, 2018
    Dataset provided by
    Sinergise (https://www.sinergise.com/)
    Description

    The Sentinel-2 mission is a land monitoring constellation of two satellites that provide high-resolution optical imagery and continuity for the current SPOT and Landsat missions. The mission provides global coverage of the Earth's land surface every 5 days, making the data of great use in ongoing studies. L1C data are available globally from June 2015. L2A data are available from November 2016 over the Europe region and globally since January 2017.

  15. Sentinel-1

    • registry.opendata.aws
    Updated Apr 20, 2018
    Cite
    Sinergise (2018). Sentinel-1 [Dataset]. https://registry.opendata.aws/sentinel-1/
    Dataset updated
    Apr 20, 2018
    Dataset provided by
    Sinergise (https://www.sinergise.com/)
    Description

    Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited to sea and land monitoring, emergency response to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from the beginning to the present, converted to cloud-optimized GeoTIFF format.

  16. Global Database of Events, Language and Tone (GDELT)

    • registry.opendata.aws
    Updated Apr 19, 2018
    Cite
    Unmanaged (2018). Global Database of Events, Language and Tone (GDELT) [Dataset]. https://registry.opendata.aws/gdelt/
    Dataset updated
    Apr 19, 2018
    Dataset provided by
    Unmanaged
    Description

    This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.

  17. Multi Token Completion

    • registry.opendata.aws
    Updated Feb 11, 2023
    Cite
    Amazon (2023). Multi Token Completion [Dataset]. https://registry.opendata.aws/multi-token-completion/
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides masked sentences and the multi-token phrases that were masked out of them. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. Please be aware that the PubMed data does not reflect the most current or accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:

    • text: the original sentence
    • span: the span (phrase) which is masked out
    • span_lower: the lowercase version of span
    • range: the range in the text string which will be masked out (this is important because span might appear more than once in text)
    • freq: the corpus frequency of span_lower
    • masked_text: the masked version of text, in which span is replaced with [MASK]

    Additionally, we provide a small (3K) dataset with human annotations.
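
The relationship between the text, span, range and masked_text columns can be illustrated with a short sketch. The sample sentence is invented, and the sketch assumes that range is a pair of character offsets with an exclusive end; the description does not specify the encoding, so treat that as a hypothetical choice.

```python
def mask_span(text, span_range, mask_token="[MASK]"):
    """Replace the characters in [start, end) with the mask token."""
    start, end = span_range
    return text[:start] + mask_token + text[end:]


# Invented example record; offsets are assumed to be (start, end-exclusive).
record = {
    "text": "The cat sat on the mat near the cat flap.",
    "span": "cat",
    "range": (32, 35),  # selects the second "cat", not the first
}

start, end = record["range"]
selected = record["text"][start:end]
span_lower = record["span"].lower()
masked_text = mask_span(record["text"], record["range"])
```

Here the span occurs twice in the sentence, so a plain string replacement would be ambiguous; the offsets pick out the intended occurrence, which is why the range column matters.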

  18. Pre- and post-purchase product questions

    • registry.opendata.aws
    Updated May 24, 2021
    Cite
    Amazon (2021). Pre- and post-purchase product questions [Dataset]. https://registry.opendata.aws/pre-post-purchase-questions/
    Dataset updated
    May 24, 2021
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
    License information was derived automatically

    Description

    This dataset provides product-related questions, including their textual content and the gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its ID and title.

  19. Airborne Object Tracking Dataset

    • registry.opendata.aws
    Updated Jun 22, 2021
    Cite
    Amazon (2021). Airborne Object Tracking Dataset [Dataset]. https://registry.opendata.aws/airborne-object-tracking/
    Dataset updated
    Jun 22, 2021
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
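
The figures in this description can be cross-checked with simple arithmetic. The sketch below is only a rough consistency check: the sequence length is "around 120 seconds" and the annotation count is given as "3.3M+", so the values used here are approximations and lower bounds, and the variable names are hypothetical.

```python
SEQUENCES = 4_943
SECONDS_PER_SEQUENCE = 120     # "around 120 seconds", so approximate
FRAME_RATE_HZ = 10
UNLABELLED_FRAMES = 3_306_350
ANNOTATIONS = 3_300_000        # "3.3M+", treated as a lower bound

# Approximate total frame count implied by sequence count, length and rate.
estimated_frames = SEQUENCES * SECONDS_PER_SEQUENCE * FRAME_RATE_HZ

# Frames that carry at least one label, under the same approximation.
estimated_labelled = estimated_frames - UNLABELLED_FRAMES

# Average number of labels per labelled image.
labels_per_labelled_image = ANNOTATIONS / estimated_labelled
```

The estimate of roughly 5.93 million frames matches the stated "5.9M+" images, and the ratio rounds to the stated average of 1.3 labels per labelled image.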

  20. Terrain Tiles

    • registry.opendata.aws
    Updated Apr 19, 2018
    Cite
    Mapzen, a Linux Foundation project (2018). Terrain Tiles [Dataset]. https://registry.opendata.aws/terrain-tiles/
    Dataset updated
    Apr 19, 2018
    Dataset provided by
    Mapzen
    Description

    A global dataset providing bare-earth terrain heights, tiled for easy usage and provided on S3.
