This dataset includes both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. This dataset is 'Lakehouse Ready', meaning you can query the data in place straight out of the Registry of Open Data S3 bucket. Deploy this dataset's corresponding CloudFormation template to create the AWS Glue Catalog entries in your account in about 30 seconds. That one step enables you to interact with the data through Amazon Athena, Amazon SageMaker, and Amazon EMR, or to join it into your Amazon Redshift clusters. More detail in [the documentation](https://github.com/aws-samples/data-lake-as-code/blob/roda-ml/README.md).
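As a sketch of the "query in place" workflow (not part of the original documentation): once the CloudFormation template has populated your Glue Catalog, a query can be submitted through Athena, for example with boto3. The database and table names below are placeholders; check the Glue Catalog entries the template actually creates in your account.

```python
import boto3

# Hedged sketch: run an Athena query against the Glue Catalog entries
# created by the dataset's CloudFormation template. "yt8m" and
# "video_level_features" are placeholder names, not the real catalog names.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM video_level_features",  # placeholder table
    QueryExecutionContext={"Database": "yt8m"},               # placeholder database
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
)
print("Started query:", response["QueryExecutionId"])
```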
From website:
Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
We present the AWS documentation corpus, an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the authors' interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual, and unambiguous answer within the accompanying documents; we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information. All questions, answers, and accompanying documents in the dataset are annotated by the authors. There are two types of answers: text and yes-no-none (YNN) answers. Text answers range from a few words to a full paragraph, sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none (YNN) answers can be yes, no, or none depending on the type of question. For example, the question “Can I stop a DB instance that has a read replica?” has a clear yes or no answer, but the question “What is the maximum number of rows in a dataset in Amazon Forecast?” is not a yes or no question and therefore has “None” as its YNN answer. 23 questions have ‘Yes’ YNN answers, 10 questions have ‘No’ YNN answers, and 67 questions have ‘None’ YNN answers.
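To make the two answer types concrete, here is a hypothetical datapoint layout; the field names and values are illustrative only, not the released schema.

```python
# Hypothetical datapoint; field names and values are illustrative only.
example = {
    "question": "Can I stop a DB instance that has a read replica?",
    "text_answer": "<a continuous block of words copied from the matched document>",
    "ynn_answer": "Yes",   # one of "Yes", "No", or "None"; value here is a placeholder
    "document": "<one of the 25,175 AWS documentation pages>",
}
assert example["ynn_answer"] in {"Yes", "No", "None"}
```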
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in route planning, building on recent advances in predictive modeling, and using a real-world problem and data. The dataset released for the research challenge includes route-, stop-, and package-level features for 9,184 historical routes performed by Amazon drivers in 2018 in five metropolitan areas in the United States. This real-world dataset excludes any personally identifiable information (all route and package identifiers have been randomly regenerated and related location data have been obfuscated to ensure anonymity). Although multiple synthetic benchmark datasets are available in the literature, the dataset of the 2021 Amazon Last Mile Routing Research Challenge is the first large and publicly available dataset to include instances based on real-world operational routing data. The dataset is fully described and formally introduced in the following Transportation Science article: https://pubsonline.informs.org/doi/10.1287/trsc.2022.1173
Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
License information was derived automatically
The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset tracks historical prices for AWS spot prices across all regions. It is updated automatically on the 1st of each month to contain data from the previous month.
Each month of data is stored as a ZStandard-compressed `.tsv.zst` file in the `prices` folder.
The data format matches that returned by AWS's `describe-spot-price-history`, with the exception that availability zones have been replaced by their global ID. For instance, here are some example lines from one capture:
euc1-az2 i4i.8xlarge Linux/UNIX 1.231800 2023-02-28T23:59:57+00:00
euc1-az3 r5b.8xlarge Red Hat Enterprise Linux 0.749600 2023-02-28T23:59:58+00:00
euc1-az3 r5b.8xlarge SUSE Linux 0.744600 2023-02-28T23:59:58+00:00
euc1-az3 r5b.8xlarge Linux/UNIX 0.619600 2023-02-28T23:59:58+00:00
euc1-az3 m5n.4xlarge Red Hat Enterprise Linux 0.476000 2023-02-28T23:59:59+00:00
euc1-az2 m5n.4xlarge Red Hat Enterprise Linux 0.492000 2023-02-28T23:59:59+00:00
euc1-az3 m5n.4xlarge SUSE Linux 0.471000 2023-02-28T23:59:59+00:00
euc1-az2 m5n.4xlarge SUSE Linux 0.487000 2023-02-28T23:59:59+00:00
euc1-az3 m5n.4xlarge Linux/UNIX 0.346000 2023-02-28T23:59:59+00:00
euc1-az2 m5n.4xlarge Linux/UNIX 0.362000 2023-02-28T23:59:59+00:00
When fetching spot instance pricing from AWS, results contain some prices from the previous month so that the price is known at the start of the month. These prices are adjusted in this dataset to be at the exact start of the month UTC:
euw3-az2 g4dn.4xlarge Linux/UNIX 0.558600 2023-01-01T00:00:00+00:00
For data from 2023-01 and before, this data was fetched more than one month at a time. This should have no negative impact unless, for example, an instance type was retired before the month began (and there should therefore be no price). These older files also only contain default regions. Data from 2023-02 and later contains all regions, including opt-in regions.
You can process each month individually. If you need the entire data stream at once, you can concatenate all of the compressed files and decompress them with `zstd` in one pass:
cat prices/*/*.tsv.zst | zstd -d
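As a hedged example of working with this format, one month can be loaded directly with pandas. The column names are assumed from the sample lines above (the files carry no header row), and the file path is hypothetical; match it to the actual layout under the `prices` folder.

```python
import pandas as pd

# Assumed column order, inferred from the example lines above; the files
# themselves have no header row.
columns = ["az_id", "instance_type", "product_description",
           "spot_price", "timestamp"]

df = pd.read_csv(
    "prices/2023-02/part.tsv.zst",   # hypothetical path
    sep="\t",
    names=columns,
    compression="zstd",              # requires the zstandard package
    parse_dates=["timestamp"],
)

# Example: average Linux/UNIX price per instance type for the month.
linux = df[df["product_description"] == "Linux/UNIX"]
print(linux.groupby("instance_type")["spot_price"].mean().head())
```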
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
The Product Comparison dataset for online shopping is a new, manually annotated dataset with about 15K human-generated sentences that compare related products based on one or more of their attributes (the first such dataset we know of for product comparison). It covers ∼8K product sets, their selected attributes, and comparison texts.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This archive provides daily measurements taken by a public service (cloudping) between all pairs of AWS regions. The collection started in January 2023 and ended in April 2024; the dataset also contains a gap.
Samples are in JSON format. An example of the content of the dataset is given in the file cloudping_20240603_190001.json.
The dataset has 10142 hourly snapshots.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER where, beyond the surface form of a named entity, context plays a critical role in distinguishing between the different fine-grained types (e.g., Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where the noise has been applied to both context tokens and entity tokens. The noise includes typing errors at the character level based on keyboard layouts in the different languages.
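As an illustration of the kind of keyboard-layout typing noise described above, here is a minimal sketch; it is not the actual corruption procedure used to build MultiCoNER 2, and the neighbor map is a tiny invented subset of a QWERTY layout.

```python
import random

# Tiny illustrative subset of QWERTY neighbor keys; the real noise model
# covers multiple languages and keyboard layouts.
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "n": "bhjm", "o": "iklp",
    "s": "awedxz", "t": "rfgy",
}

def add_keyboard_typos(tokens, rate=0.1, rng=random):
    """Replace characters with a neighboring key at the given rate."""
    noisy = []
    for token in tokens:
        chars = list(token)
        for pos, ch in enumerate(chars):
            if ch in QWERTY_NEIGHBORS and rng.random() < rate:
                chars[pos] = rng.choice(QWERTY_NEIGHBORS[ch])
        noisy.append("".join(chars))
    return noisy

print(add_keyboard_typos(["marie", "curie", "was", "a", "scientist"]))
```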
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
Amazon product questions and their answers, along with public product information.
In 2024, Amazon's net revenue from its subscription services segment amounted to 44.37 billion U.S. dollars. Subscription services include Amazon Prime, for which Amazon reported 200 million paying members worldwide at the end of 2020. The AWS category generated 107.56 billion U.S. dollars in annual sales. During the most recently reported fiscal year, the company’s net revenue amounted to 638 billion U.S. dollars.

Amazon revenue segments: Amazon is one of the biggest online companies worldwide. In 2019, the company’s revenue increased by 21 percent, compared to Google’s revenue growth during the same fiscal period, which was just 18 percent. The majority of Amazon’s net sales are generated through its North American business segment, which accounted for 236.3 billion U.S. dollars in 2020. The United States is the company’s leading market, followed by Germany and the United Kingdom.

Business segment Amazon Web Services: Amazon Web Services, commonly referred to as AWS, is one of the strongest-growing business segments of Amazon. AWS is a cloud computing service that provides individuals, companies, and governments with a wide range of computing, networking, storage, database, analytics, and application services, among many others. As of the third quarter of 2020, AWS accounted for approximately 32 percent of the global cloud infrastructure services vendor market.
The operation returns messages related to the stops of all lines. This dataset contains: traffic information about planned works (for example, planned engineering works); events (for example, a European summit); unforeseen real-time disruptions (for example, a disruption because of an accident); and important corporate messages (for example, a STIB-MIVB recruiting event such as a job day). In case of real-time disruptions, there will be a second message when the interruption is finished and the line is working normally again. These messages do not contain dates in the text itself. The data are refreshed every 20 seconds.
The NCEP operational Global Forecast System analysis and forecast grids are on a 0.25 by 0.25 degree global latitude-longitude grid. Grids include analysis and forecast time steps at a 3-hourly interval from 0 to 240 hours, and a 12-hourly interval from 240 to 384 hours. Model forecast runs occur at 00, 06, 12, and 18 UTC daily. For real-time data access, please use the NCEP data server: http://www.nco.ncep.noaa.gov/pmb/products/gfs/.
NOTE: This dataset now has a direct, continuously updating copy located on AWS (https://noaa-gfs-bdp-pds.s3.amazonaws.com/index.html). Therefore, the RDA will stop updating this dataset in early 2025.
Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
License information was derived automatically
PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we were given the rights to redistribute the content and participants have given explicit consent. Each video has ground-truth annotations including both bounding boxes and tracklet-ids for all the persons in each frame.
Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
License information was derived automatically
This dataset provides masked sentences and the multi-token phrases that were masked out of those sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:
- `text`: the original sentence
- `span`: the span (phrase) which is masked out
- `span_lower`: the lowercase version of `span`
- `range`: the range in the `text` string which will be masked out (this is important because `span` might appear more than once in `text`)
- `freq`: the corpus frequency of `span_lower`
- `masked_text`: the masked version of `text`, in which `span` is replaced with [MASK]
Additionally, we provide a small (3K) dataset with human annotations.
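For illustration, a single datapoint might look as follows; the values are invented, and only the column names come from the description above.

```python
# Invented example datapoint; only the column names come from the dataset
# description above. The representation of "range" is assumed.
datapoint = {
    "text": "The cell membrane regulates what enters and leaves the cell.",
    "span": "cell membrane",
    "span_lower": "cell membrane",
    "range": (4, 17),   # character offsets of span within text (format assumed)
    "freq": 12345,      # invented corpus frequency of span_lower
    "masked_text": "The [MASK] regulates what enters and leaves the cell.",
}

# Sanity check: the range really covers the span.
start, end = datapoint["range"]
assert datapoint["text"][start:end] == datapoint["span"]
```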
Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-READ. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
The Cancer Genome Atlas Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to enhance the TCGA (http://cancergenome.nih.gov/) data set with characterized radiological images. The Cancer Imaging Program (CIP), with the cooperation of several TCGA tissue-contributing institutions, has archived a large portion of the radiological images of the genetically analyzed READ cases.
Please see the TCGA-READ wiki page to learn more about the images and to obtain any supporting metadata for this collection.
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, `collection_id-idc_v8-aws.s5cmd` corresponds to the contents of the `collection_id` collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
- `tcga_read-idc_v8-aws.s5cmd`: manifest of files available for download from public IDC Amazon Web Services buckets
- `tcga_read-idc_v8-gcs.s5cmd`: manifest of files available for download from public IDC Google Cloud Storage buckets
- `tcga_read-idc_v8-dcf.dcf`: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

Note that manifest files that end in `-aws.s5cmd` reference files stored in Amazon Web Services (AWS) buckets, while those that end in `-gcs.s5cmd` reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using the `.s5cmd` manifests, first install the `idc-index` package:
pip install --upgrade idc-index
Then download the files referenced by the `.s5cmd` manifest file (for example, `tcga_read-idc_v8-aws.s5cmd`):
idc download manifest.s5cmd
To download the files using the `.dcf` manifest, see the manifest header.
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
[1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180
Released to the public as part of the Department of Energy's Open Energy Data Initiative, this is the highest resolution publicly available long-term wave hindcast dataset that – when complete – will cover the entire U.S. Exclusive Economic Zone (EEZ).
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: Pan-Cancer-Nuclei-Seg-DICOM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.
TCGA-OL-A66K-01Z-00-DX1
TCGA-CU-A3QU-01Z-00-DX1
TCGA-A2-A0D1-01Z-00-DX1
TCGA-AQ-A1H2-01Z-00-DX1
TCGA-AQ-A1H3-01Z-00-DX1
TCGA-BH-A0B2-01Z-00-DX1
TCGA-E2-A15E-01Z-00-DX1
TCGA-E2-A1IP-01Z-00-DX1
TCGA-F4-6857-01Z-00-DX1
TCGA-12-0773-01Z-00-DX4
TCGA-35-3621-01Z-00-DX1
TCGA-49-4486-01Z-00-DX1
TCGA-33-4587-01Z-00-DX1
TCGA-D9-A1X3-01Z-00-DX1
TCGA-D9-A1X3-01Z-00-DX2
TCGA-D9-A4Z6-01Z-00-DX1
TCGA-EE-A17Y-01Z-00-DX1
TCGA-EE-A29R-01Z-00-DX1
TCGA-EE-A2A0-01Z-00-DX1
TCGA-EE-A2MS-01Z-00-DX1
TCGA-ER-A199-01Z-00-DX1
TCGA-ER-A1A1-01Z-00-DX1
TCGA-ER-A2NC-01Z-00-DX1
TCGA-FS-A1Z7-06Z-00-DX10
TCGA-FS-A1Z7-06Z-00-DX11
TCGA-FS-A1Z7-06Z-00-DX12
TCGA-FS-A1Z7-06Z-00-DX13
TCGA-FS-A1ZN-01Z-00-DX10
TCGA-FS-A1ZN-01Z-00-DX11
TCGA-FS-A1ZW-06Z-00-DX10
TCGA-FS-A1ZW-06Z-00-DX11
TCGA-GN-A261-01Z-00-DX1
TCGA-GN-A266-01Z-00-DX1
TCGA-GN-A268-01Z-00-DX1
TCGA-GN-A26A-01Z-00-DX1
TCGA-XV-AB01-01Z-00-DX1
TCGA-AJ-A23O-01Z-00-DX1
TCGA-AP-A056-01Z-00-DX1
TCGA-BK-A139-01Z-00-DX1
TCGA-E6-A1M0-01Z-00-DX1
A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, `pan_cancer_nuclei_seg_dicom-collection_id-idc_v19-aws.s5cmd` corresponds to the annotations for the images in the `collection_id` collection introduced in IDC data release v19. DICOM binary segmentations were introduced in IDC v20. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.
For each of the collections, the following manifest files are provided:
- `pan_cancer_nuclei_seg_dicom-…-aws.s5cmd`: manifest of files available for download from public IDC Amazon Web Services buckets
- `pan_cancer_nuclei_seg_dicom-…-gcs.s5cmd`: manifest of files available for download from public IDC Google Cloud Storage buckets
- `pan_cancer_nuclei_seg_dicom-…-dcf.dcf`: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

Note that manifest files that end in `-aws.s5cmd` reference files stored in Amazon Web Services (AWS) buckets, while those that end in `-gcs.s5cmd` reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.
Each of the manifests includes instructions in the header on how to download the included files.
To download the files using the `.s5cmd` manifests, first install the `idc-index` package:
pip install --upgrade idc-index
Then download the files referenced by the `.s5cmd` manifest file:
idc download manifest.s5cmd
To download the files using the `.dcf` manifest, see the manifest header.
The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.
This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.