This dataset includes both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. This dataset is 'Lakehouse Ready', meaning you can query the data in place straight out of the Registry of Open Data S3 bucket. Deploy this dataset's corresponding CloudFormation template to create the AWS Glue Catalog entries in your account in about 30 seconds. That one step enables you to interact with the data through Amazon Athena, Amazon SageMaker, and Amazon EMR, or to join it into your Amazon Redshift clusters. More detail in [the documentation](https://github.com/aws-samples/data-lake-as-code/blob/roda-ml/README.md).
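As a sketch of the "query in place" workflow (not part of the original documentation): once the CloudFormation template has populated your Glue Catalog, a query can be submitted through Athena, for example with boto3. The database and table names below are placeholders; check the Glue Catalog entries the template actually creates in your account.

```python
import boto3

# Hedged sketch: run an Athena query against the Glue Catalog entries
# created by the dataset's CloudFormation template. "yt8m" and
# "video_level_features" are placeholder names, not the real catalog names.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM video_level_features",  # placeholder table
    QueryExecutionContext={"Database": "yt8m"},               # placeholder database
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
print("Started query:", response["QueryExecutionId"])
```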
From website:
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
We present the AWS documentation corpus, an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the authors' interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual, and unambiguous answer within the accompanying documents; we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information. All questions, answers, and accompanying documents in the dataset are annotated by the authors. There are two types of answers: text and yes-no-none (YNN) answers. Text answers range from a few words to a full paragraph, sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none (YNN) answers can be yes, no, or none depending on the type of question. For example, the question “Can I stop a DB instance that has a read replica?” has a clear yes or no answer, but the question “What is the maximum number of rows in a dataset in Amazon Forecast?” is not a yes or no question and therefore has “None” as its YNN answer. 23 questions have ‘Yes’ YNN answers, 10 questions have ‘No’ YNN answers, and 67 questions have ‘None’ YNN answers.
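To make the two answer types concrete, here is a hypothetical datapoint layout; the field names and values are illustrative only, not the released schema.

```python
# Hypothetical datapoint; field names and values are illustrative only.
example = {
    "question": "Can I stop a DB instance that has a read replica?",
    "text_answer": "<a continuous block of words copied from the matched document>",
    "ynn_answer": "Yes",   # one of "Yes", "No", or "None"; value here is a placeholder
    "document": "<one of the 25,175 AWS documentation pages>",
}
assert example["ynn_answer"] in {"Yes", "No", "None"}
```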
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in route planning, building on recent advances in predictive modeling, and using a real-world problem and data. The dataset released for the research challenge includes route-, stop-, and package-level features for 9,184 historical routes performed by Amazon drivers in 2018 in five metropolitan areas in the United States. This real-world dataset excludes any personally identifiable information (all route and package identifiers have been randomly regenerated and related location data have been obfuscated to ensure anonymity). Although multiple synthetic benchmark datasets are available in the literature, the dataset of the 2021 Amazon Last Mile Routing Research Challenge is the first large and publicly available dataset to include instances based on real-world operational routing data. The dataset is fully described and formally introduced in the following Transportation Science article: https://pubsonline.informs.org/doi/10.1287/trsc.2022.1173
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks historical prices for AWS spot prices across all regions. It is updated automatically on the 1st of each month to contain data from the previous month.
Each month of data is stored as a ZStandard-compressed `.tsv.zst` file in the `prices` folder.
The data format matches that returned by AWS's `describe-spot-price-history`, with the exception that availability zones have been replaced by their global ID. For instance, here are some example lines from one capture:
euc1-az2 i4i.8xlarge Linux/UNIX 1.231800 2023-02-28T23:59:57+00:00
euc1-az3 r5b.8xlarge Red Hat Enterprise Linux 0.749600 2023-02-28T23:59:58+00:00
euc1-az3 r5b.8xlarge SUSE Linux 0.744600 2023-02-28T23:59:58+00:00
euc1-az3 r5b.8xlarge Linux/UNIX 0.619600 2023-02-28T23:59:58+00:00
euc1-az3 m5n.4xlarge Red Hat Enterprise Linux 0.476000 2023-02-28T23:59:59+00:00
euc1-az2 m5n.4xlarge Red Hat Enterprise Linux 0.492000 2023-02-28T23:59:59+00:00
euc1-az3 m5n.4xlarge SUSE Linux 0.471000 2023-02-28T23:59:59+00:00
euc1-az2 m5n.4xlarge SUSE Linux 0.487000 2023-02-28T23:59:59+00:00
euc1-az3 m5n.4xlarge Linux/UNIX 0.346000 2023-02-28T23:59:59+00:00
euc1-az2 m5n.4xlarge Linux/UNIX 0.362000 2023-02-28T23:59:59+00:00
When fetching spot instance pricing from AWS, results contain some prices from the previous month so that the price is known at the start of the month. These prices are adjusted in this dataset to be at the exact start of the month UTC:
euw3-az2 g4dn.4xlarge Linux/UNIX 0.558600 2023-01-01T00:00:00+00:00
For data from 2023-01 and before, this data was fetched more than one month at a time. This should have no negative impact unless, for example, an instance type was retired before the month began (and there should therefore be no price). These older files also only contain default regions. Data from 2023-02 and later contains all regions, including opt-in regions.
You can process each month individually. If you need the entire data stream at once, you can concatenate all of the compressed files and decompress them with `zstd` in one pass:
cat prices/*/*.tsv.zst | zstd -d
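As a hedged example of working with this format, one month can be loaded directly with pandas. The column names are assumed from the sample lines above (the files carry no header row), and the file path is hypothetical; match it to the actual layout under the `prices` folder.

```python
import pandas as pd

# Assumed column order, inferred from the example lines above; the files
# themselves have no header row.
columns = ["az_id", "instance_type", "product_description",
           "spot_price", "timestamp"]

df = pd.read_csv(
    "prices/2023-02/part.tsv.zst",   # hypothetical path
    sep="\t",
    names=columns,
    compression="zstd",              # requires the zstandard package
    parse_dates=["timestamp"],
)

# Example: average Linux/UNIX price per instance type for the month.
linux = df[df["product_description"] == "Linux/UNIX"]
print(linux.groupby("instance_type")["spot_price"].mean().head())
```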
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Product Comparison dataset for online shopping is a new, manually annotated dataset with about 15K human-generated sentences that compare related products based on one or more of their attributes (the first such dataset we know of for product comparison). It covers ∼8K product sets, their selected attributes, and comparison texts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive provides daily measurements taken by a public service (cloudping) between all pairs of AWS regions. The collection started in January 2023 and ended in April 2024; the dataset also contains a gap.
Samples are in JSON format. An example of the content of the dataset is given in the file cloudping_20240603_190001.json.
The dataset has 10142 hourly snapshots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER where, beyond the surface form of a named entity, context plays a critical role in distinguishing between the different fine-grained types (e.g., Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where the noise has been applied to both context tokens and entity tokens. The noise includes typing errors at the character level based on keyboard layouts in the different languages.
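As an illustration of the kind of keyboard-layout typing noise described above, here is a minimal sketch; it is not the actual corruption procedure used to build MultiCoNER 2, and the neighbor map is a tiny invented subset of a QWERTY layout.

```python
import random

# Tiny illustrative subset of QWERTY neighbor keys; the real noise model
# covers multiple languages and keyboard layouts.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "n": "bhjm", "o": "iklp",
    "s": "awedxz", "t": "rfgy",
}

def add_keyboard_typos(tokens, rate=0.1, rng=random):
    """Replace characters with a neighboring key at the given rate."""
    noisy = []
    for token in tokens:
        chars = list(token)
        for pos, ch in enumerate(chars):
            if ch in QWERTY_NEIGHBORS and rng.random() < rate:
                chars[pos] = rng.choice(QWERTY_NEIGHBORS[ch])
        noisy.append("".join(chars))
    return noisy

print(add_keyboard_typos(["marie", "curie", "was", "a", "scientist"]))
```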
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Amazon product questions and their answers, along with public product information.
In 2024, Amazon's net revenue from its subscription services segment amounted to 44.37 billion U.S. dollars. Subscription services include Amazon Prime, for which Amazon reported 200 million paying members worldwide at the end of 2020. The AWS category generated 107.56 billion U.S. dollars in annual sales. During the most recently reported fiscal year, the company’s net revenue amounted to 638 billion U.S. dollars.

Amazon revenue segments: Amazon is one of the biggest online companies worldwide. In 2019, the company’s revenue increased by 21 percent, compared to Google’s revenue growth during the same fiscal period, which was just 18 percent. The majority of Amazon’s net sales are generated through its North American business segment, which accounted for 236.3 billion U.S. dollars in 2020. The United States is the company’s leading market, followed by Germany and the United Kingdom.

Business segment Amazon Web Services: Amazon Web Services, commonly referred to as AWS, is one of the strongest-growing business segments of Amazon. AWS is a cloud computing service that provides individuals, companies, and governments with a wide range of computing, networking, storage, database, analytics, and application services, among many others. As of the third quarter of 2020, AWS accounted for approximately 32 percent of the global cloud infrastructure services vendor market.
The operation returns messages related to the stops of all lines. This dataset contains: traffic information about planned works (for example, planned engineering works); events (for example, a European summit); unforeseen real-time disruptions (for example, a disruption because of an accident); and important corporate messages (for example, a STIB-MIVB recruiting event such as a job day). In case of real-time disruptions, there will be a second message when the interruption is finished and the line is working normally again. These messages do not contain dates in the text itself. The data are refreshed every 20 seconds.
The NCEP operational Global Forecast System analysis and forecast grids are on a 0.25 by 0.25 degree global latitude-longitude grid. Grids include analysis and forecast time steps at a 3-hourly interval from 0 to 240 hours, and a 12-hourly interval from 240 to 384 hours. Model forecast runs occur at 00, 06, 12, and 18 UTC daily. For real-time data access, please use the NCEP data server: http://www.nco.ncep.noaa.gov/pmb/products/gfs/.
NOTE: This dataset now has a direct, continuously updating copy located on AWS (https://noaa-gfs-bdp-pds.s3.amazonaws.com/index.html). Therefore, the RDA will stop updating this dataset in early 2025.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we were given the rights to redistribute the content and participants have given explicit consent. Each video has ground-truth annotations including both bounding boxes and tracklet-ids for all the persons in each frame.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides masked sentences and the multi-token phrases that were masked out of those sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:
- `text`: the original sentence
- `span`: the span (phrase) which is masked out
- `span_lower`: the lowercase version of `span`
- `range`: the range in the `text` string which will be masked out (this is important because `span` might appear more than once in `text`)
- `freq`: the corpus frequency of `span_lower`
- `masked_text`: the masked version of `text`, in which `span` is replaced with [MASK]
Additionally, we provide a small (3K) dataset with human annotations.
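For illustration, a single datapoint might look as follows; the values are invented, and only the column names come from the description above.

```python
# Invented example datapoint; only the column names come from the dataset
# description above. The representation of "range" is assumed.
datapoint = {
    "text": "The cell membrane regulates what enters and leaves the cell.",
    "span": "cell membrane",
    "span_lower": "cell membrane",
    "range": (4, 17),   # character offsets of span within text (format assumed)
    "freq": 12345,      # invented corpus frequency of span_lower
    "masked_text": "The [MASK] regulates what enters and leaves the cell.",
}

# Sanity check: the range really covers the span.
start, end = datapoint["range"]
assert datapoint["text"][start:end] == datapoint["span"]
```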
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-READ. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to enhance the TCGA (http://cancergenome.nih.gov/) data set with characterized radiological images. The Cancer Imaging Program (CIP), with the cooperation of several TCGA tissue-contributing institutions, has archived a large portion of the radiological images of the genetically analyzed READ cases.
Please see the TCGA-READ wiki page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, `collection_id-idc_v8-aws.s5cmd` corresponds to the contents of the `collection_id` collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
- `tcga_read-idc_v8-aws.s5cmd`: manifest of files available for download from public IDC Amazon Web Services buckets
- `tcga_read-idc_v8-gcs.s5cmd`: manifest of files available for download from public IDC Google Cloud Storage buckets
- `tcga_read-idc_v8-dcf.dcf`: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

Note that manifest files that end in `-aws.s5cmd` reference files stored in Amazon Web Services (AWS) buckets, while those that end in `-gcs.s5cmd` reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using the `.s5cmd` manifests, first install the `idc-index` package:
pip install --upgrade idc-index
Then download the files referenced by the `.s5cmd` manifest file (for example, `tcga_read-idc_v8-aws.s5cmd`):
idc download manifest.s5cmd
To download the files using the `.dcf` manifest, see the manifest header.
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
Released to the public as part of the Department of Energy's Open Energy Data Initiative, this is the highest resolution publicly available long-term wave hindcast dataset that – when complete – will cover the entire U.S. Exclusive Economic Zone (EEZ).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: Pan-Cancer-Nuclei-Seg-DICOM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
TCGA-OL-A66K-01Z-00-DX1
TCGA-CU-A3QU-01Z-00-DX1
TCGA-A2-A0D1-01Z-00-DX1
TCGA-AQ-A1H2-01Z-00-DX1
TCGA-AQ-A1H3-01Z-00-DX1
TCGA-BH-A0B2-01Z-00-DX1
TCGA-E2-A15E-01Z-00-DX1
TCGA-E2-A1IP-01Z-00-DX1
TCGA-F4-6857-01Z-00-DX1
TCGA-12-0773-01Z-00-DX4
TCGA-35-3621-01Z-00-DX1
TCGA-49-4486-01Z-00-DX1
TCGA-33-4587-01Z-00-DX1
TCGA-D9-A1X3-01Z-00-DX1
TCGA-D9-A1X3-01Z-00-DX2
TCGA-D9-A4Z6-01Z-00-DX1
TCGA-EE-A17Y-01Z-00-DX1
TCGA-EE-A29R-01Z-00-DX1
TCGA-EE-A2A0-01Z-00-DX1
TCGA-EE-A2MS-01Z-00-DX1
TCGA-ER-A199-01Z-00-DX1
TCGA-ER-A1A1-01Z-00-DX1
TCGA-ER-A2NC-01Z-00-DX1
TCGA-FS-A1Z7-06Z-00-DX10
TCGA-FS-A1Z7-06Z-00-DX11
TCGA-FS-A1Z7-06Z-00-DX12
TCGA-FS-A1Z7-06Z-00-DX13
TCGA-FS-A1ZN-01Z-00-DX10
TCGA-FS-A1ZN-01Z-00-DX11
TCGA-FS-A1ZW-06Z-00-DX10
TCGA-FS-A1ZW-06Z-00-DX11
TCGA-GN-A261-01Z-00-DX1
TCGA-GN-A266-01Z-00-DX1
TCGA-GN-A268-01Z-00-DX1
TCGA-GN-A26A-01Z-00-DX1
TCGA-XV-AB01-01Z-00-DX1
TCGA-AJ-A23O-01Z-00-DX1
TCGA-AP-A056-01Z-00-DX1
TCGA-BK-A139-01Z-00-DX1
TCGA-E6-A1M0-01Z-00-DX1
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, `pan_cancer_nuclei_seg_dicom-collection_id-idc_v19-aws.s5cmd` corresponds to the annotations for the images in the `collection_id` collection introduced in IDC data release v19. DICOM binary segmentations were introduced in IDC v20. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
For each of the collections, the following manifest files are provided:
- `pan_cancer_nuclei_seg_dicom-…-aws.s5cmd`: manifest of files available for download from public IDC Amazon Web Services buckets
- `pan_cancer_nuclei_seg_dicom-…-gcs.s5cmd`: manifest of files available for download from public IDC Google Cloud Storage buckets
- `pan_cancer_nuclei_seg_dicom-…-dcf.dcf`: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

Note that manifest files that end in `-aws.s5cmd` reference files stored in Amazon Web Services (AWS) buckets, while those that end in `-gcs.s5cmd` reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using the `.s5cmd` manifests, first install the `idc-index` package:
pip install --upgrade idc-index
Then download the files referenced by the `.s5cmd` manifest file:
idc download manifest.s5cmd
To download the files using the `.dcf` manifest, see the manifest header.
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.