100+ datasets found
  1. Registry of Open Data on AWS

    • registry.opendata.aws
    Updated Aug 13, 2021
    Cite
    Amazon Web Services (2021). Registry of Open Data on AWS [Dataset]. https://registry.opendata.aws/registry-open-data/
    Dataset provided by
    Amazon Web Services (https://aws.amazon.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.
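Since the description mentions that only the ndjson form is populated, here is a minimal sketch of reading newline-delimited JSON with the Python standard library. The sample records and the field names `Name` and `License` are hypothetical and only illustrate the one-object-per-line layout; the actual registry fields may differ.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # NDJSON files may contain blank lines; skip them
            yield json.loads(line)


# Hypothetical sample data, not taken from the registry.
sample = (
    '{"Name": "Example dataset A", "License": "CC0"}\n'
    "\n"
    '{"Name": "Example dataset B", "License": "CC BY 4.0"}\n'
)

records = list(iter_ndjson(sample.splitlines()))
names = [record["Name"] for record in records]
```

For a real file, passing an open file handle to `iter_ndjson` works the same way, because a text file iterates line by line.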

  2. Data from: The Multilingual Amazon Reviews Corpus

    • registry.opendata.aws
    Updated May 28, 2020
    Cite
    Amazon (2020). The Multilingual Amazon Reviews Corpus [Dataset]. https://registry.opendata.aws/amazon-reviews-ml/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    We present a collection of Amazon reviews specifically designed to aid research in multilingual text classification. The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. 'books', 'appliances', etc.).

  3. IRS 990 Filings

    • registry.opendata.aws
    Updated Dec 16, 2021
    Cite
    The Internal Revenue Service (2021). IRS 990 Filings [Dataset]. https://registry.opendata.aws/irs990/
    Dataset provided by
    Internal Revenue Service (http://www.irs.gov/)
    Description

    On December 16, 2021, the IRS announced that it would discontinue updates to the IRS 990 Filings dataset on AWS, starting December 31, 2021. The IRS has requested that public inquiries be directed to +1-800-829-1040. The dataset contains machine-readable data from certain electronic 990 forms filed with the IRS from 2013 to present.

  4. MultiCoNER Datasets

    • registry.opendata.aws
    Updated Mar 26, 2022
    Cite
    Amazon (2022). MultiCoNER Datasets [Dataset]. https://registry.opendata.aws/multiconer/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution 4.0 (CC BY 4.0) (https://creativecommons.org/licenses/by/4.0/)
    License information was derived automatically

    Description

    MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER, where, apart from the surface form of a named entity, context plays a critical role in distinguishing between the different fine-grained types (e.g. Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where the noise has been applied to both the context tokens and the entity tokens. The noise includes character-level typing errors based on keyboard layouts in the different languages.

  5. Oxford Nanopore Technologies Benchmark Datasets

    • registry.opendata.aws
    Updated Sep 29, 2020
    Cite
    Oxford Nanopore Technologies (2020). Oxford Nanopore Technologies Benchmark Datasets [Dataset]. https://registry.opendata.aws/ont-open-data/
    Dataset provided by
    Oxford Nanopore Technologies (http://nanoporetech.com/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/)
    License information was derived automatically

    Description

    The ont-open-data registry provides reference sequencing data from Oxford Nanopore Technologies to support: 1) exploration of the characteristics of nanopore sequence data; 2) assessment and reproduction of performance benchmarks; 3) development of tools and methods. The deposited data showcases DNA sequences from a representative subset of sequencing chemistries. The datasets correspond to publicly available reference samples (e.g. Genome In A Bottle reference cell lines). Raw data are provided with metadata and scripts to describe sample and data provenance.

  6. Multi Token Completion

    • registry.opendata.aws
    Updated Feb 11, 2023
    Cite
    Amazon (2023). Multi Token Completion [Dataset]. https://registry.opendata.aws/multi-token-completion/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    This dataset provides masked sentences and the multi-token phrases that were masked out of these sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:

    • text: the original sentence
    • span: the span (phrase) which is masked out
    • span_lower: the lowercase version of span
    • range: the range in the text string which will be masked out (this is important because span might appear more than once in text)
    • freq: the corpus frequency of span_lower
    • masked_text: the masked version of text, in which span is replaced with [MASK]

    Additionally, we provide a small (3K) dataset with human annotations.
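To illustrate how the columns described above relate to each other, here is a minimal sketch that rebuilds masked_text from text and range. It assumes that range is a pair of half-open character offsets (start, end); the listing does not specify the actual encoding, and the sample sentence is made up.

```python
def mask_span(text: str, start: int, end: int, mask_token: str = "[MASK]") -> str:
    """Replace the characters text[start:end] with a mask token."""
    return text[:start] + mask_token + text[end:]


# Made-up example in which the span occurs twice in the sentence.
text = "the cat sat near the cat flap"
span = "the cat"
span_lower = span.lower()

# Assumed (start, end) offsets pointing at the second occurrence.
start = text.rfind(span)
end = start + len(span)

masked_text = mask_span(text, start, end)
```

Because the span occurs twice in this sentence, a plain string replacement could not tell which occurrence was masked; the offsets resolve that ambiguity.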

  7. NEXRAD on AWS

    • registry.opendata.aws
    • s.cnmilf.com
    Updated Apr 19, 2018
    Cite
    Unidata (2018). NEXRAD on AWS [Dataset]. https://registry.opendata.aws/noaa-nexrad/
    Dataset provided by
    Unidata (https://www.unidata.ucar.edu/)
    Description

    Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.

  8. Sentinel-1

    • registry.opendata.aws
    • data.subak.org
    Updated Apr 20, 2018
    Cite
    Sinergise (2018). Sentinel-1 [Dataset]. https://registry.opendata.aws/sentinel-1/
    Dataset provided by
    Sinergise (https://www.sinergise.com/)
    Description

    Sentinel-1 is a pair of European radar imaging (SAR) satellites launched in 2014 and 2016. Its 6-day revisit cycle and ability to observe through clouds make it well suited for sea and land monitoring, emergency response to environmental disasters, and economic applications. This dataset represents the global Sentinel-1 GRD archive, from the beginning to the present, converted to cloud-optimized GeoTIFF format.

  9. SpaceNet

    • data.subak.org
    • registry.opendata.aws
    Updated Feb 16, 2023
    Cite
    Spacenet (2023). SpaceNet [Dataset]. https://data.subak.org/dataset/spacenet
    Dataset provided by
    Spacenet
    Description

    SpaceNet launched in August 2016 as an open innovation project offering a repository of freely available imagery with co-registered map features. Before SpaceNet, computer vision researchers had minimal options to obtain free, precision-labeled, and high-resolution satellite imagery. Today, SpaceNet hosts datasets developed by its own team, along with datasets from projects like IARPA’s Functional Map of the World (fMoW).

    Documentation

    https://spacenet.ai/

    Update Frequency

    New imagery and features are added quarterly

    License

    Various (See here for more details)

  10. Pre- and post-purchase product questions

    • registry.opendata.aws
    Updated May 24, 2021
    Cite
    Amazon (2021). Pre- and post-purchase product questions [Dataset]. https://registry.opendata.aws/pre-post-purchase-questions/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Apache License, v2.0 (https://www.apache.org/licenses/LICENSE-2.0)
    License information was derived automatically

    Description

    This dataset provides product-related questions, including their textual content and the gap, in hours, between purchase and posting time. Each question is also associated with related product details, including its ID and title.

  11. VoiSeR

    • registry.opendata.aws
    Updated Apr 12, 2021
    Cite
    Amazon (2021). VoiSeR [Dataset]. https://registry.opendata.aws/amazon-conversational-product-search/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Voice-based refinements of product search

  12. iNaturalist Licensed Observation Images

    • registry.opendata.aws
    • data.subak.org
    Updated Mar 23, 2021
    Cite
    iNaturalist (2021). iNaturalist Licensed Observation Images [Dataset]. https://registry.opendata.aws/inaturalist-open-data/
    Dataset provided by
    iNaturalist (http://inaturalist.org/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    iNaturalist is a community science effort in which participants share observations of living organisms that they encounter and document with photographic evidence, location, and date. The community works together reviewing these images to identify these observations to species. This collection represents the licensed images accompanying iNaturalist observations.

  13. Answer Reformulation

    • registry.opendata.aws
    Updated Apr 10, 2020
    Cite
    Amazon (2020). Answer Reformulation [Dataset]. https://registry.opendata.aws/answer-reformulation/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0) (https://creativecommons.org/licenses/by-sa/4.0/)
    License information was derived automatically

    Description

    Original StackExchange answers and their voice-friendly reformulations.

  14. Speedtest by Ookla Global Fixed and Mobile Network Performance Maps

    • registry.opendata.aws
    Updated Sep 30, 2020
    Cite
    Ookla (2020). Speedtest by Ookla Global Fixed and Mobile Network Performance Maps [Dataset]. https://registry.opendata.aws/speedtest-global-performance/
    Dataset provided by
    Speedtest.net (http://www.speedtest.net/)
    License

    Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
    License information was derived automatically

    Description

    Global fixed broadband and mobile (cellular) network performance, allocated to zoom level 16 web mercator tiles (approximately 610.8 meters by 610.8 meters at the equator). Data is provided in both Shapefile and Apache Parquet formats, with geometries represented in Well Known Text (WKT) projected in EPSG:4326. Download speed, upload speed, and latency are collected via the Speedtest by Ookla applications for Android and iOS and averaged for each tile. Measurements are filtered to results containing GPS-quality location accuracy.
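As a rough check on the tile size quoted above, the sketch below divides the Earth's circumference by the number of tiles across at a given zoom level (2 to the power of the zoom). The choice of radius is an assumption: the 610.8 m figure appears to match a mean Earth radius of 6,371 km, whereas the WGS84 equatorial radius used by most web mercator tooling gives a slightly larger value. The description does not state which radius was used.

```python
import math


def tile_edge_m(zoom: int, radius_m: float) -> float:
    """Edge length in meters of a web mercator tile at the equator."""
    circumference_m = 2 * math.pi * radius_m
    return circumference_m / 2 ** zoom


# Assumed radii; the source does not say which one it used.
MEAN_EARTH_RADIUS_M = 6_371_000.0
WGS84_EQUATORIAL_RADIUS_M = 6_378_137.0

edge_mean = tile_edge_m(16, MEAN_EARTH_RADIUS_M)         # about 610.8 m
edge_wgs84 = tile_edge_m(16, WGS84_EQUATORIAL_RADIUS_M)  # about 611.5 m
```

Either way, the tiles shrink with latitude, so the quoted size applies only at the equator.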

  15. USGS Landsat

    • registry.opendata.aws
    Updated Apr 19, 2018
    Cite
    United States Geological Survey (2018). USGS Landsat [Dataset]. https://registry.opendata.aws/usgs-landsat/
    Dataset provided by
    United States Geological Survey (http://www.usgs.gov/)
    Description

    This joint NASA/USGS program provides the longest continuous space-based record of Earth’s land in existence. Every day, Landsat satellites provide essential information to help land managers and policy makers make wise decisions about our resources and our environment. Data is provided for Landsats 1, 2, 3, 4, 5, 7, 8, and 9 (excludes Landsat 6). As of June 28, 2023 (announcement), the previous single SNS topic arn:aws:sns:us-west-2:673253540267:public-c2-notify was replaced with three new SNS topics for different types of scenes.

  16. Learning to Rank and Filter - community question answering

    • registry.opendata.aws
    Updated Apr 22, 2022
    Cite
    Amazon (2022). Learning to Rank and Filter - community question answering [Dataset]. https://registry.opendata.aws/ltrf-cqa-dataset/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    This dataset provides product-related questions and answers, including quality labels for the answers, as part of the paper 'IR Evaluation and Learning in the Presence of Forbidden Documents'.

  17. World Bank - Light Every Night

    • registry.opendata.aws
    Updated Jan 21, 2021
    Cite
    World Bank Group (2021). World Bank - Light Every Night [Dataset]. https://registry.opendata.aws/wb-light-every-night/
    Dataset provided by
    World Bank (http://worldbank.org/)
    License

    Open Database License (ODbL) v1.0 (https://www.opendatacommons.org/licenses/odbl/1.0/)
    License information was derived automatically

    Description

    Light Every Night - World Bank Nighttime Light Data – provides open access to all nightly imagery and data from the Visible Infrared Imaging Radiometer Suite Day-Night Band (VIIRS DNB) from 2012-2020 and the Defense Meteorological Satellite Program Operational Linescan System (DMSP-OLS) from 1992-2013. The underlying data are sourced from the NOAA National Centers for Environmental Information (NCEI) archive. Additional processing by the University of Michigan enables access in Cloud Optimized GeoTIFF format (COG) and search using the Spatial Temporal Asset Catalog (STAC) standard. The data is published and openly available under the terms of the World Bank’s open data license.

  18. Airborne Object Tracking Dataset

    • registry.opendata.aws
    Updated Jun 22, 2021
    Cite
    Airborne Object Tracking Dataset [Dataset]. https://registry.opendata.aws/airborne-object-tracking/
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    Airborne Object Tracking (AOT) is a collection of 4,943 flight sequences of around 120 seconds each, collected at 10 Hz in diverse conditions. There are 5.9M+ images and 3.3M+ 2D annotations of airborne objects in the sequences. There are 3,306,350 frames without labels as they contain no airborne objects. For images with labels, there are on average 1.3 labels per image. All airborne objects in the dataset are labelled.
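The figures quoted above can be sanity-checked with some quick arithmetic. This is only a back-of-the-envelope sketch: it treats every sequence as exactly 120 seconds long, although the description says "around 120 seconds", so the results are estimates and not official counts.

```python
# Figures quoted in the description.
sequences = 4_943
seconds_per_sequence = 120  # "around 120 seconds"; treated as exact here
frame_rate_hz = 10
frames_without_labels = 3_306_350
labels_per_labelled_image = 1.3

# Estimated total image count: sequences x duration x frame rate.
estimated_images = sequences * seconds_per_sequence * frame_rate_hz

# Estimated number of images that carry at least one label.
estimated_labelled_images = estimated_images - frames_without_labels

# Estimated annotation count from the average labels per labelled image.
estimated_annotations = estimated_labelled_images * labels_per_labelled_image

print(f"{estimated_images:,} images")                  # 5,931,600
print(f"{estimated_labelled_images:,} labelled")       # 2,625,250
print(f"{estimated_annotations:,.0f} annotations")
```

The estimates land close to the quoted "5.9M+ images" and "3.3M+ 2D annotations", which suggests the numbers in the description are internally consistent.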

  19. Storm EVent ImageRy (SEVIR)

    • registry.opendata.aws
    • data.subak.org
    Updated Apr 24, 2020
    Cite
    Mark S. Veillette (2020). Storm EVent ImageRy (SEVIR) [Dataset]. https://registry.opendata.aws/sevir/
    Dataset provided by
    Mark S. Veillette
    Description

    Collection of spatially and temporally aligned GOES-16 ABI satellite imagery, NEXRAD radar mosaics, and GOES-16 GLM lightning detections.

  20. NOAA Terrestrial Climate Data Records

    • registry.opendata.aws
    • data.subak.org
    Updated Jul 17, 2021
    Cite
    NOAA (2021). NOAA Terrestrial Climate Data Records [Dataset]. https://registry.opendata.aws/noaa-cdr-terrestrial/
    Dataset provided by
    National Oceanic and Atmospheric Administration (http://www.noaa.gov/)
    License

    CC0 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/)
    License information was derived automatically

    Description

    NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and to what extent the land, oceans, atmosphere and ice sheets are changing. These datasets are thoroughly vetted time series measurements with the longevity, consistency, and continuity to assess and measure climate variability and change. NOAA CDRs are vetted using standards established by the National Research Council (NRC).

    Climate Data Records are created by merging data from surface, atmosphere, and space-based systems across decades. NOAA’s Climate Data Records provide authoritative and traceable long-term climate records. NOAA developed CDRs by applying modern data analysis methods to historical global satellite data. This process can clarify the underlying climate trends within the data and allows researchers and other users to identify economic and scientific value in these records. NCEI maintains and extends CDRs by applying the same methods to present-day and future satellite measurements.

    Terrestrial CDRs are composed of sensor data that have been improved and quality controlled over time, together with ancillary calibration data.
