100+ datasets found
  1. YouTube 8 Million - Data Lakehouse Ready

    • registry.opendata.aws
    Updated Feb 17, 2022
    Cite
    Amazon Web Services (2022). YouTube 8 Million - Data Lakehouse Ready [Dataset]. https://registry.opendata.aws/yt8m/
    Explore at:
    Dataset updated
    Feb 17, 2022
    Dataset provided by
    Amazon Web Services (https://aws.amazon.com/)
    Area covered
    YouTube
    Description

    This dataset includes both the original .tfrecords and a Parquet representation of the YouTube 8 Million dataset. YouTube-8M is a large-scale labeled video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations from a diverse vocabulary of 3,800+ visual entities. It comes with precomputed audio-visual features from billions of frames and audio segments, designed to fit on a single hard disk. This dataset also includes the YouTube-8M Segments data from June 2019. This dataset is 'Lakehouse Ready', meaning you can query the data in place straight out of the Registry of Open Data S3 bucket. Deploy the dataset's corresponding CloudFormation template to create the AWS Glue Catalog entries in your account in about 30 seconds. That one step lets you interact with the data through AWS Athena, AWS SageMaker, or AWS EMR, or join it into your AWS Redshift clusters. More detail is available in the documentation: https://github.com/aws-samples/data-lake-as-code/blob/roda-ml/README.md
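
    As a rough illustration of the "query in place" workflow, here is a minimal sketch that runs an Athena query via boto3 against the Glue Catalog entries the CloudFormation template creates. The database and table names ("yt8m", "videos"), the region, and the results bucket are placeholders, not taken from the dataset documentation; substitute whatever the deployed Glue Catalog actually contains.

    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Start the query; Athena writes results to an S3 bucket you own (placeholder below).
    run = athena.start_query_execution(
        QueryString="SELECT * FROM videos LIMIT 10",            # hypothetical table name
        QueryExecutionContext={"Database": "yt8m"},             # hypothetical database name
        ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},
    )
    query_id = run["QueryExecutionId"]

    # Poll until the query finishes, then print the first page of results.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])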

  2. Amazon Web Services - Public Data Sets

    • data.wu.ac.at
    Updated Oct 10, 2013
    + more versions
    Cite
    Global (2013). Amazon Web Services - Public Data Sets [Dataset]. https://data.wu.ac.at/schema/datahub_io/NTYxNjkxNmYtNmZlNS00N2EwLWJkYTktZjFjZWJkNTM2MTNm
    Explore at:
    Dataset updated
    Oct 10, 2013
    Dataset provided by
    Global
    Description

    About

    From website:

    Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

    Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

  3. AWS Documentation Dataset

    • paperswithcode.com
    Cite
    Sia Gholami; Mehdi Noori, AWS Documentation Dataset [Dataset]. https://paperswithcode.com/dataset/aws-documentation
    Explore at:
    Authors
    Sia Gholami; Mehdi Noori
    Description

    We present the AWS documentation corpus, an open-book QA dataset, which contains 25,175 documents along with 100 matched questions and answers. These questions are inspired by the authors' interactions with real AWS customers and the questions they asked about AWS services. The data was anonymized and aggregated. All questions in the dataset have a valid, factual and unambiguous answer within the accompanying documents; we deliberately avoided questions that are ambiguous, incomprehensible, opinion-seeking, or not clearly a request for factual information. All questions, answers and accompanying documents in the dataset are annotated by the authors. There are two types of answers: text and yes-no-none (YNN) answers. Text answers range from a few words to a full paragraph, sourced from a continuous block of words in a document or from different locations within the same document. Every question in the dataset has a matched text answer. Yes-no-none (YNN) answers can be yes, no, or none depending on the type of question. For example, the question “Can I stop a DB instance that has a read replica?” has a clear yes or no answer, but the question “What is the maximum number of rows in a dataset in Amazon Forecast?” is not a yes or no question and therefore has “None” as the YNN answer. 23 questions have ‘Yes’ YNN answers, 10 questions have ‘No’ YNN answers and 67 questions have ‘None’ YNN answers.

  4. 2021 Amazon Last Mile Routing Research Challenge Dataset

    • registry.opendata.aws
    Updated Sep 16, 2022
    Cite
    Amazon (2022). 2021 Amazon Last Mile Routing Research Challenge Dataset [Dataset]. https://registry.opendata.aws/amazon-last-mile-challenges/
    Explore at:
    Dataset updated
    Sep 16, 2022
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    The 2021 Amazon Last Mile Routing Research Challenge was an innovative research initiative led by Amazon.com and supported by the Massachusetts Institute of Technology’s Center for Transportation and Logistics. Over a period of 4 months, participants were challenged to develop innovative machine learning-based methods to enhance classic optimization-based approaches to solve the travelling salesperson problem, by learning from historical routes executed by Amazon delivery drivers. The primary goal of the Amazon Last Mile Routing Research Challenge was to foster innovative applied research in route planning, building on recent advances in predictive modeling, and using a real-world problem and data. The dataset released for the research challenge includes route-, stop-, and package-level features for 9,184 historical routes performed by Amazon drivers in 2018 in five metropolitan areas in the United States. This real-world dataset excludes any personally identifiable information (all route and package identifiers have been randomly regenerated and related location data have been obfuscated to ensure anonymity). Although multiple synthetic benchmark datasets are available in the literature, the dataset of the 2021 Amazon Last Mile Routing Research Challenge is the first large and publicly available dataset to include instances based on real-world operational routing data. The dataset is fully described and formally introduced in the following Transportation Science article: https://pubsonline.informs.org/doi/10.1287/trsc.2022.1173

  5. Amazon Bin Image Dataset

    • registry.opendata.aws
    Updated Apr 20, 2018
    + more versions
    Cite
    Amazon (2018). Amazon Bin Image Dataset [Dataset]. https://registry.opendata.aws/amazon-bin-imagery/
    Explore at:
    Dataset updated
    Apr 20, 2018
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-NonCommercial-ShareAlike 3.0 (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/
    License information was derived automatically

    Description

    The Amazon Bin Image Dataset contains over 500,000 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations.

  6. AWS Spot Price History

    • zenodo.org
    bin
    Updated Dec 1, 2024
    + more versions
    Cite
    Eric Pauley; Eric Pauley (2024). AWS Spot Price History [Dataset]. http://doi.org/10.5281/zenodo.14198918
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 1, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eric Pauley; Eric Pauley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AWS Spot Price History

    This dataset tracks historical prices for AWS spot prices across all regions. It is updated automatically on the 1st of each month to contain data from the previous month.

    Data format

    Each month of data is stored as a ZStandard-compressed `.tsv.zst` file in the `prices` folder.

    The data format matches that returned by AWS's `describe-spot-price-history`, with the exception that availability zones have been replaced by their global ID. For instance, here are some example lines from one capture:

    euc1-az2 i4i.8xlarge Linux/UNIX 1.231800 2023-02-28T23:59:57+00:00
    euc1-az3 r5b.8xlarge Red Hat Enterprise Linux 0.749600 2023-02-28T23:59:58+00:00
    euc1-az3 r5b.8xlarge SUSE Linux 0.744600 2023-02-28T23:59:58+00:00
    euc1-az3 r5b.8xlarge Linux/UNIX 0.619600 2023-02-28T23:59:58+00:00
    euc1-az3 m5n.4xlarge Red Hat Enterprise Linux 0.476000 2023-02-28T23:59:59+00:00
    euc1-az2 m5n.4xlarge Red Hat Enterprise Linux 0.492000 2023-02-28T23:59:59+00:00
    euc1-az3 m5n.4xlarge SUSE Linux 0.471000 2023-02-28T23:59:59+00:00
    euc1-az2 m5n.4xlarge SUSE Linux 0.487000 2023-02-28T23:59:59+00:00
    euc1-az3 m5n.4xlarge Linux/UNIX 0.346000 2023-02-28T23:59:59+00:00
    euc1-az2 m5n.4xlarge Linux/UNIX 0.362000 2023-02-28T23:59:59+00:00

    When fetching spot instance pricing from AWS, results contain some prices from the previous month so that the price is known at the start of the month. These prices are adjusted in this dataset to be at the exact start of the month UTC:

    euw3-az2 g4dn.4xlarge Linux/UNIX 0.558600 2023-01-01T00:00:00+00:00

    For data from 2023-01 and before, this data was fetched more than one month at a time. This should have no negative impact unless, for example, an instance type was retired before the month began (and there should therefore be no price). These older files also only contain default regions. Data from 2023-02 and later contains all regions, including opt-in regions.

    Using data

    You can process each month individually. If you need the entire data stream at once, you can concatenate all of the files and decompress them as a single zstd stream:

    cat prices/*/*.tsv.zst | zstd -d
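
    For example, one month of data can be loaded into a dataframe roughly as follows. This is a sketch, not part of the dataset documentation: the file path is hypothetical, the column names are inferred from the example rows above, and it requires the pandas and zstandard packages.

    import io

    import pandas as pd
    import zstandard as zstd

    path = "prices/2023/2023-02.tsv.zst"  # hypothetical file name; adjust to your download

    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        df = pd.read_csv(
            io.TextIOWrapper(reader, encoding="utf-8"),
            sep="\t",
            # Column names inferred from the example lines above.
            names=["zone_id", "instance_type", "product", "spot_price", "timestamp"],
            parse_dates=["timestamp"],
        )

    # Example: cheapest instance types by mean Linux/UNIX spot price in this month.
    linux = df[df["product"] == "Linux/UNIX"]
    print(linux.groupby("instance_type")["spot_price"].mean().sort_values().head())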

  7. Product Comparison Dataset for Online Shopping

    • registry.opendata.aws
    Updated Jun 20, 2023
    Cite
    Amazon (2023). Product Comparison Dataset for Online Shopping [Dataset]. https://registry.opendata.aws/prod-comp-shopping/
    Explore at:
    Dataset updated
    Jun 20, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    The Product Comparison dataset for online shopping is a new, manually annotated dataset with about 15K human generated sentences, which compare related products based on one or more of their attributes (the first such data we know of for product comparison). It covers ∼8K product sets, their selected attributes, and comparison texts.

  8. AWS All-Regions Ping dataset

    • zenodo.org
    application/gzip +1
    Updated Apr 10, 2025
    Cite
    Valerio SCHIAVONI; Valerio SCHIAVONI (2025). AWS All-Regions Ping dataset [Dataset]. http://doi.org/10.5281/zenodo.11457020
    Explore at:
    Available download formats: application/gzip, json
    Dataset updated
    Apr 10, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Valerio SCHIAVONI; Valerio SCHIAVONI
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This archive provides daily measurements taken by a public service (cloudping) between all pairs of AWS regions. The collection started in January 2023 and ended in April 2024. A gap in the dataset is also present.

    Samples are in json format. An example of the content of the dataset is given in the file cloudping_20240603_190001.json

    The dataset has 10142 hourly snapshots.

  9. MultiCoNER Datasets

    • registry.opendata.aws
    Updated Mar 26, 2022
    Cite
    Amazon (2022). MultiCoNER Datasets [Dataset]. https://registry.opendata.aws/multiconer/
    Explore at:
    Dataset updated
    Mar 26, 2022
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    MultiCoNER 1 is a large multilingual dataset (11 languages) for Named Entity Recognition. It is designed to represent some of the contemporary challenges in NER, including low-context scenarios (short and uncased text), syntactically complex entities such as movie titles, and long-tail entity distributions. MultiCoNER 2 is a large multilingual dataset (12 languages) for fine-grained Named Entity Recognition. Its fine-grained taxonomy contains 36 NE classes, representing real-world challenges for NER where, apart from the surface form of a named entity, context plays a critical role in distinguishing between the different fine-grained types (e.g. Scientist vs. Athlete). Furthermore, the test data of MultiCoNER 2 contains noisy instances, where the noise has been applied to both the context tokens and the entity tokens. The noise includes typing errors at the character level based on keyboard layouts in the different languages.

  10. AWS Spot Price History

    • zenodo.org
    bin
    Updated Dec 9, 2024
    + more versions
    Cite
    Eric Pauley; Eric Pauley (2024). AWS Spot Price History [Dataset]. http://doi.org/10.5281/zenodo.14254124
    Explore at:
    Available download formats: bin
    Dataset updated
    Dec 9, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Eric Pauley; Eric Pauley
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    AWS Spot Price History

    This dataset tracks historical prices for AWS spot prices across all regions. It is updated automatically on the 1st of each month to contain data from the previous month.

    Data format

    Each month of data is stored as a ZStandard-compressed .tsv.zst file.

    The data format matches that returned by AWS's describe-spot-price-history, with the exception that availability zones have been replaced by their global ID. For instance, here are some example lines from one capture:

    euc1-az2 i4i.8xlarge Linux/UNIX 1.231800 2023-02-28T23:59:57+00:00
    euc1-az3 r5b.8xlarge Red Hat Enterprise Linux 0.749600 2023-02-28T23:59:58+00:00
    euc1-az3 r5b.8xlarge SUSE Linux 0.744600 2023-02-28T23:59:58+00:00
    euc1-az3 r5b.8xlarge Linux/UNIX 0.619600 2023-02-28T23:59:58+00:00
    euc1-az3 m5n.4xlarge Red Hat Enterprise Linux 0.476000 2023-02-28T23:59:59+00:00
    euc1-az2 m5n.4xlarge Red Hat Enterprise Linux 0.492000 2023-02-28T23:59:59+00:00
    euc1-az3 m5n.4xlarge SUSE Linux 0.471000 2023-02-28T23:59:59+00:00
    euc1-az2 m5n.4xlarge SUSE Linux 0.487000 2023-02-28T23:59:59+00:00
    euc1-az3 m5n.4xlarge Linux/UNIX 0.346000 2023-02-28T23:59:59+00:00
    euc1-az2 m5n.4xlarge Linux/UNIX 0.362000 2023-02-28T23:59:59+00:00

    When fetching spot instance pricing from AWS, results contain some prices from the previous month so that the price is known at the start of the month. These prices are adjusted in this dataset to be at the exact start of the month UTC:

    euw3-az2 g4dn.4xlarge Linux/UNIX 0.558600 2023-01-01T00:00:00+00:00

    For data from 2023-01 and before, this data was fetched more than one month at a time. This should have no negative impact unless, for example, an instance type was retired before the month began (and there should therefore be no price). These older files also only contain default regions. Data from 2023-02 and later contains all regions, including opt-in regions.

    Using data

    You can process each month individually. If you need the entire data stream at once, you can concatenate all of the files and decompress them as a single zstd stream:

    cat prices/*/*.tsv.zst | zstd -d
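
    As a complement to the shell pipeline above, here is a sketch (an assumption, not part of the dataset documentation) that streams every monthly file in Python and counts price records per region prefix without writing the decompressed data to disk. It assumes the prices/*/*.tsv.zst layout shown above and requires the zstandard package.

    import glob
    import io

    import zstandard as zstd

    counts = {}
    for path in sorted(glob.glob("prices/*/*.tsv.zst")):  # layout assumed from the example above
        with open(path, "rb") as fh:
            reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
            for line in reader:
                zone = line.split("\t", 1)[0]    # e.g. "euc1-az2"
                region = zone.split("-", 1)[0]   # e.g. "euc1"
                counts[region] = counts.get(region, 0) + 1

    for region, n in sorted(counts.items()):
        print(region, n)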

  11. Amazon-PQA

    • registry.opendata.aws
    • paperswithcode.com
    • +1more
    Updated May 14, 2021
    Cite
    Amazon (2021). Amazon-PQA [Dataset]. https://registry.opendata.aws/amazon-pqa/
    Explore at:
    Dataset updated
    May 14, 2021
    Dataset provided by
    Amazon.com (http://amazon.com/)
    Description

    Amazon product questions and their answers, along with the public product information.

  12. Global net revenue of Amazon 2014-2024, by product group

    • statista.com
    • ai-chatbox.pro
    Updated Feb 24, 2025
    Cite
    Statista (2025). Global net revenue of Amazon 2014-2024, by product group [Dataset]. https://www.statista.com/statistics/672747/amazons-consolidated-net-revenue-by-segment/
    Explore at:
    Dataset updated
    Feb 24, 2025
    Dataset authored and provided by
    Statista (http://statista.com/)
    Area covered
    Worldwide
    Description

    In 2024, Amazon's net revenue from the subscription services segment amounted to 44.37 billion U.S. dollars. Subscription services include Amazon Prime, for which Amazon reported 200 million paying members worldwide at the end of 2020. The AWS category generated 107.56 billion U.S. dollars in annual sales. During the most recently reported fiscal year, the company’s net revenue amounted to 638 billion U.S. dollars.

    Amazon revenue segments. Amazon is one of the biggest online companies worldwide. In 2019, the company’s revenue increased by 21 percent, compared to Google’s revenue growth during the same fiscal period, which was just 18 percent. The majority of Amazon’s net sales are generated through its North American business segment, which accounted for 236.3 billion U.S. dollars in 2020. The United States is the company’s leading market, followed by Germany and the United Kingdom.

    Business segment: Amazon Web Services. Amazon Web Services, commonly referred to as AWS, is one of the strongest-growing business segments of Amazon. AWS is a cloud computing service that provides individuals, companies and governments with a wide range of computing, networking, storage, database, analytics and application services, among many others. As of the third quarter of 2020, AWS accounted for approximately 32 percent of the global cloud infrastructure services vendor market.

  13. Travellers Information (RT)

    • stibmivb.aws-ec2-eu-1.opendatasoft.com
    • data.stib-mivb.brussels
    • +1more
    csv, excel, json
    Updated Aug 26, 2021
    Cite
    (2021). Travellers Information (RT) [Dataset]. https://stibmivb.aws-ec2-eu-1.opendatasoft.com/explore/dataset/travellers-information-rt-production/analyze/
    Explore at:
    Available download formats: json, csv, excel
    Dataset updated
    Aug 26, 2021
    Description

    The operation returns messages related to the stops of all lines. This dataset contains:
    1. Traffic information about planned works (for example: planned engineering works)
    2. Events (for example: a European summit, etc.)
    3. Unforeseen real-time disruptions (for example: a disruption because of an accident)
    4. Important corporate messages (for example: a STIB-MIVB recruiting event such as a job day)
    In case of real-time disruptions, there will be a second message when the interruption is finished and the line is working normally again. These messages do not contain dates in the text itself. The data are refreshed every 20 seconds.

  14. NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive

    • rda.ucar.edu
    • data.ucar.edu
    • +3more
    Updated Jan 26, 2015
    + more versions
    Cite
    National Centers for Environmental Prediction/National Weather Service/NOAA/U.S. Department of Commerce (2015). NCEP GFS 0.25 Degree Global Forecast Grids Historical Archive [Dataset]. http://doi.org/10.5065/D65D8PWK
    Explore at:
    Dataset updated
    Jan 26, 2015
    Dataset provided by
    University Corporation for Atmospheric Research
    Authors
    National Centers for Environmental Prediction/National Weather Service/NOAA/U.S. Department of Commerce
    Time period covered
    Jan 15, 2015 - Jul 14, 2025
    Area covered
    Earth
    Description

    The NCEP operational Global Forecast System analysis and forecast grids are on a 0.25 by 0.25 global latitude longitude grid. Grids include analysis and forecast time steps at a 3 hourly interval from 0 to 240, and a 12 hourly interval from 240 to 384. Model forecast runs occur at 00, 06, 12, and 18 UTC daily. For real-time data access please use the NCEP data server [http://www.nco.ncep.noaa.gov/pmb/products/gfs/].

    NOTE: This dataset now has a direct, continuously updating copy located on AWS (https://noaa-gfs-bdp-pds.s3.amazonaws.com/index.html). Therefore, the RDA will stop updating this dataset in early 2025.
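
    To browse the AWS copy programmatically, a minimal sketch using anonymous (unsigned) S3 access with boto3 might look like the following; the bucket name comes from the URL above, while the assumption that no credentials are required rests on the bucket being publicly listable.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous S3 client: no AWS credentials needed for a public bucket.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

    # List a handful of objects in the public GFS bucket referenced above.
    resp = s3.list_objects_v2(Bucket="noaa-gfs-bdp-pds", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])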

  15. PersonPath22

    • registry.opendata.aws
    • paperswithcode.com
    Updated Sep 23, 2022
    Cite
    Amazon Web Services (2022). PersonPath22 [Dataset]. https://registry.opendata.aws/person-path-22/
    Explore at:
    Dataset updated
    Sep 23, 2022
    Dataset provided by
    Amazon Web Services (https://aws.amazon.com/)
    License

    Attribution-NonCommercial 4.0 (CC BY-NC 4.0): https://creativecommons.org/licenses/by-nc/4.0/
    License information was derived automatically

    Description

    PersonPath22 is a large-scale multi-person tracking dataset containing 236 videos captured mostly from static-mounted cameras, collected from sources where we were given the rights to redistribute the content and participants have given explicit consent. Each video has ground-truth annotations including both bounding boxes and tracklet-ids for all the persons in each frame.

  16. Multi Token Completion

    • registry.opendata.aws
    Updated Feb 11, 2023
    Cite
    Amazon (2023). Multi Token Completion [Dataset]. https://registry.opendata.aws/multi-token-completion/
    Explore at:
    Dataset updated
    Feb 11, 2023
    Dataset provided by
    Amazon.com (http://amazon.com/)
    License

    Attribution-ShareAlike 4.0 (CC BY-SA 4.0): https://creativecommons.org/licenses/by-sa/4.0/
    License information was derived automatically

    Description

    This dataset provides masked sentences and the multi-token phrases that were masked out of these sentences. We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts. As for the PubMed data, please be aware that the dataset does not reflect the most current/accurate data available from NLM (it is not being updated). For these datasets, the columns provided for each datapoint are as follows:
    text - the original sentence
    span - the span (phrase) which is masked out
    span_lower - the lowercase version of span
    range - the range in the text string which will be masked out (this is important because span might appear more than once in text)
    freq - the corpus frequency of span_lower
    masked_text - the masked version of text, in which span is replaced with [MASK]
    Additionally, we provide a small (3K) dataset with human annotations.
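
    For instance, rebuilding masked_text from text and range could look roughly like the sketch below. The file name, the tab-separated layout, and the encoding of range as "start,end" character offsets are assumptions, not taken from the dataset documentation.

    import csv

    with open("multi_token_completion.tsv", encoding="utf-8") as fh:  # hypothetical file name
        for row in csv.DictReader(fh, delimiter="\t"):
            # Assumed encoding: range holds character offsets "start,end" into text.
            start, end = (int(x) for x in row["range"].strip("[]() ").split(","))
            # Mask the span at its exact range, since the span string may occur
            # more than once in the text (which is why range is provided).
            rebuilt = row["text"][:start] + "[MASK]" + row["text"][end:]
            print(rebuilt == row["masked_text"], rebuilt[:80])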

  17. DICOM converted Slide Microscopy images for the TCGA-READ collection

    • zenodo.org
    bin
    Updated Aug 20, 2024
    + more versions
    Cite
    David Clunie; David Clunie; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim; Andrey Fedorov; Andrey Fedorov; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim (2024). DICOM converted Slide Microscopy images for the TCGA-READ collection [Dataset]. http://doi.org/10.5281/zenodo.12689999
    Explore at:
    Available download formats: bin
    Dataset updated
    Aug 20, 2024
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    David Clunie; David Clunie; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim; Andrey Fedorov; Andrey Fedorov; William Clifford; David Pot; Ulrike Wagner; Keyvan Farahani; Erika Kim
    License

    Attribution 3.0 (CC BY 3.0): https://creativecommons.org/licenses/by/3.0/
    License information was derived automatically

    Description

    This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: TCGA-READ. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.

    Collection description

    The Cancer Genome Atlas-Rectum Adenocarcinoma (TCGA-READ) data collection is part of a larger effort to enhance the TCGA http://cancergenome.nih.gov/ data set with characterized radiological images. The Cancer Imaging Program (CIP), with the cooperation of several TCGA tissue-contributing institutions, has archived a large portion of the radiological images of the genetically-analyzed READ cases.


    Please see the TCGA-READ wiki page to learn more about the images and to obtain any supporting metadata for this collection.

    Files included

    A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, collection_id-idc_v8-aws.s5cmd corresponds to the contents of the collection_id collection introduced in IDC data release v8. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.

    1. tcga_read-idc_v8-aws.s5cmd: manifest of files available for download from public IDC Amazon Web Services buckets
    2. tcga_read-idc_v8-gcs.s5cmd: manifest of files available for download from public IDC Google Cloud Storage buckets
    3. tcga_read-idc_v8-dcf.dcf: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

    Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.

    Download instructions

    Each of the manifests includes instructions in the header on how to download the included files.

    To download the files using .s5cmd manifests:

    1. install idc-index package: pip install --upgrade idc-index
    2. download the files referenced by manifests included in this dataset by passing the .s5cmd manifest file: idc download manifest.s5cmd.

    To download the files using .dcf manifest, see manifest header.
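
    The two .s5cmd steps above can also be scripted; the sketch below simply shells out to the documented commands, using the AWS manifest listed under "Files included" (run it from the directory containing the manifest).

    import subprocess
    import sys

    # Step 1 from above: install (or upgrade) the idc-index package.
    subprocess.run([sys.executable, "-m", "pip", "install", "--upgrade", "idc-index"], check=True)

    # Step 2 from above: download the files referenced by the AWS s5cmd manifest.
    subprocess.run(["idc", "download", "tcga_read-idc_v8-aws.s5cmd"], check=True)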

    Acknowledgments

    The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

    References

    [1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence. RadioGraphics (2023). https://doi.org/10.1148/rg.230180

  18. DOE's Water Power Technology Office's (WPTO) US Wave dataset

    • registry.opendata.aws
    Updated Jun 18, 2020
    Cite
    National Renewable Energy Laboratory (2020). DOE's Water Power Technology Office's (WPTO) US Wave dataset [Dataset]. https://registry.opendata.aws/wpto-pds-us-wave/
    Explore at:
    Dataset updated
    Jun 18, 2020
    Dataset provided by
    National Renewable Energy Laboratory (https://www.nrel.gov/)
    Description

    Released to the public as part of the Department of Energy's Open Energy Data Initiative, this is the highest resolution publicly available long-term wave hindcast dataset that – when complete – will cover the entire U.S. Exclusive Economic Zone (EEZ).

  19. Pan-Cancer-Nuclei-Seg-DICOM: DICOM converted Dataset of Segmented Nuclei in...

    • zenodo.org
    • explore.openaire.eu
    bin, csv
    Updated Feb 7, 2025
    Cite
    Christopher Bridge; Markus Herrmann; David Clunie; David Clunie; Andrey Fedorov; Andrey Fedorov; Christopher Bridge; Markus Herrmann (2025). Pan-Cancer-Nuclei-Seg-DICOM: DICOM converted Dataset of Segmented Nuclei in Hematoxylin and Eosin Stained Histopathology Images [Dataset]. http://doi.org/10.5281/zenodo.14009675
    Explore at:
    Available download formats: bin, csv
    Dataset updated
    Feb 7, 2025
    Dataset provided by
    Zenodo (http://zenodo.org/)
    Authors
    Christopher Bridge; Markus Herrmann; David Clunie; David Clunie; Andrey Fedorov; Andrey Fedorov; Christopher Bridge; Markus Herrmann
    License

    Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
    License information was derived automatically

    Description

    This dataset corresponds to a collection of images and/or image-derived data available from National Cancer Institute Imaging Data Commons (IDC) [1]. This dataset was converted into DICOM representation and ingested by the IDC team. You can explore and visualize the corresponding images using IDC Portal here: Pan-Cancer-Nuclei-Seg-DICOM. You can use the manifests included in this Zenodo record to download the content of the collection following the Download instructions below.

    Collection description

    This collection contains automatic nucleus segmentation data of 5,060 whole slide tissue images of 10 cancer types earlier published in [2] (https://doi.org/10.7937/TCIA.2019.4A4DKP9U) stored in DICOM Bulk Annotation and DICOM Segmentation formats.
    In the DICOM Bulk Annotation representation, nuclei annotations are stored as closed polygons along with the area of each nucleus. The DICOM Segmentation version contains binary segmentations obtained by rasterizing the polygon contours.
    The annotations correspond to digital pathology images from the TCGA-BLCA, TCGA-BRCA, TCGA-CESC, TCGA-COAD, TCGA-GBM, TCGA-LUAD, TCGA-LUSC, TCGA-PAAD, TCGA-PRAD, TCGA-READ, TCGA-SKCM, TCGA-STAD, TCGA-UCEC, TCGA-UVM collections available in NCI Imaging Data Commons.
    To learn how these files are organized and how to access the content programmatically, see this documentation page: https://highdicom.readthedocs.io/en/latest/ann.html.
    Conversion of the nuclei segmentations from the original format into DICOM ANN and SEG representations was done using the code available in 10.5281/zenodo.10632181.
    Annotations corresponding to this container ID in the source failed to convert due to the pixel matrix being too large to store: TCGA-OL-A66K-01Z-00-DX1
    The following container IDs from the source annotations have failed due to inability to find the annotated images using the container IDs:
    TCGA-CU-A3QU-01Z-00-DX1
    TCGA-A2-A0D1-01Z-00-DX1
    TCGA-AQ-A1H2-01Z-00-DX1
    TCGA-AQ-A1H2-01Z-00-DX1
    TCGA-AQ-A1H3-01Z-00-DX1
    TCGA-AQ-A1H3-01Z-00-DX1
    TCGA-BH-A0B2-01Z-00-DX1
    TCGA-E2-A15E-01Z-00-DX1
    TCGA-E2-A1IP-01Z-00-DX1
    TCGA-F4-6857-01Z-00-DX1
    TCGA-12-0773-01Z-00-DX4
    TCGA-35-3621-01Z-00-DX1
    TCGA-49-4486-01Z-00-DX1
    TCGA-33-4587-01Z-00-DX1
    TCGA-D9-A1X3-01Z-00-DX1
    TCGA-D9-A1X3-01Z-00-DX2
    TCGA-D9-A4Z6-01Z-00-DX1
    TCGA-EE-A17Y-01Z-00-DX1
    TCGA-EE-A29R-01Z-00-DX1
    TCGA-EE-A2A0-01Z-00-DX1
    TCGA-EE-A2MS-01Z-00-DX1
    TCGA-ER-A199-01Z-00-DX1
    TCGA-ER-A1A1-01Z-00-DX1
    TCGA-ER-A2NC-01Z-00-DX1
    TCGA-FS-A1Z7-06Z-00-DX10
    TCGA-FS-A1Z7-06Z-00-DX11
    TCGA-FS-A1Z7-06Z-00-DX12
    TCGA-FS-A1Z7-06Z-00-DX13
    TCGA-FS-A1ZN-01Z-00-DX10
    TCGA-FS-A1ZN-01Z-00-DX11
    TCGA-FS-A1ZW-06Z-00-DX10
    TCGA-FS-A1ZW-06Z-00-DX11
    TCGA-GN-A261-01Z-00-DX1
    TCGA-GN-A266-01Z-00-DX1
    TCGA-GN-A268-01Z-00-DX1
    TCGA-GN-A26A-01Z-00-DX1
    TCGA-XV-AB01-01Z-00-DX1
    TCGA-AJ-A23O-01Z-00-DX1
    TCGA-AP-A056-01Z-00-DX1
    TCGA-BK-A139-01Z-00-DX1
    TCGA-E6-A1M0-01Z-00-DX1

    Files included

    A manifest file's name indicates the IDC data release in which a version of collection data was first introduced. For example, pan_cancer_nuclei_seg_dicom-collection_id-idc_v19-aws.s5cmd corresponds to the annotations for the images in the collection_id collection introduced in IDC data release v19. DICOM Binary segmentations were introduced in IDC v20. If there is a subsequent version of this Zenodo page, it will indicate when a subsequent version of the corresponding collection was introduced.

    For each of the collections, the following manifest files are provided:

    1. pan_cancer_nuclei_seg_dicom-: manifest of files available for download from public IDC Amazon Web Services buckets
    2. pan_cancer_nuclei_seg_dicom-: manifest of files available for download from public IDC Google Cloud Storage buckets
    3. pan_cancer_nuclei_seg_dicom-: Gen3 manifest (for details see https://learn.canceridc.dev/data/organization-of-data/guids-and-uuids)

    Note that manifest files that end in -aws.s5cmd reference files stored in Amazon Web Services (AWS) buckets, while -gcs.s5cmd reference files in Google Cloud Storage. The actual files are identical and are mirrored between AWS and GCP.

    Download instructions

    Each of the manifests includes instructions in the header on how to download the included files.

    To download the files using .s5cmd manifests:

    1. install idc-index package: pip install --upgrade idc-index
    2. download the files referenced by manifests included in this dataset by passing the .s5cmd manifest file: idc download manifest.s5cmd

    To download the files using .dcf manifest, see manifest header.

    Acknowledgments

    The Imaging Data Commons team has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003l.

    References

    [1] Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S. D., Gibbs, D. L., Bridge, C., Herrmann, M. D., Homeyer, A., Lewis, R., Aerts, H. J. W. L., Krishnaswamy, D., Thiriveedhi, V. K., Ciausu, C., Schacherer, D. P., Bontempi, D., Pihl, T., Wagner, U., Farahani, K., Kim, E. & Kikinis, R. National cancer institute imaging data commons: Toward transparency, reproducibility, and scalability in imaging artificial intelligence. Radiographics 43, (2023).
    [2] Hou, L., Gupta, R., Van Arnam, J. S., Zhang, Y., Sivalenka, K., Samaras, D., Kurc, T., & Saltz, J. H. (2019). Dataset of Segmented Nuclei in Hematoxylin and Eosin Stained Histopathology Images of 10 Cancer Types [Data set]. The Cancer Imaging Archive. https://doi.org/10.7937/TCIA.2019.4A4DKP9U
  20. Global Database of Events, Language and Tone (GDELT)

    • registry.opendata.aws
    Updated Apr 19, 2018
    + more versions
    Cite
    Unmanaged (2018). Global Database of Events, Language and Tone (GDELT) [Dataset]. https://registry.opendata.aws/gdelt/
    Explore at:
    Dataset updated
    Apr 19, 2018
    Dataset provided by
    Unmanaged
    Description

    This project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, counts, themes, sources, emotions, quotes, images and events driving our global society every second of every day.
