Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata:
- We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:
- metadata.zip: The dataset metadata and analysis results as CSV files.
- scripts-and-logs.zip: Scripts and logs of the dataset creation.
- LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
- README.md: This document.
- redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

Metadata

The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

repositories.csv:
- ID (integer): GitHub repository ID
- url (string): GitHub repository URL
- downloaded (boolean): Whether cloning the repository succeeded
- name (string): Repository name
- description (string): Repository description
- licenses (string, list of strings): Repository licenses
- redistributable (boolean): Whether the repository's licenses permit redistribution
- created (string, date & time): Time of the repository's creation
- updated (string, date & time): Time of the last update to the repository
- pushed (string, date & time): Time of the last push to the repository
- fork (boolean): Whether the repository is a fork
- forks (integer): Number of forks
- archive (boolean): Whether the repository is archived
- programs (string, list of strings): Project file path of each IaC program in the repository

programs.csv:
- ID (string): Project file path of the IaC program
- repository (integer): GitHub repository ID of the repository containing the IaC program
- directory (string): Path of the directory containing the IaC program's project file
- solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
- language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
- name (string): IaC program name
- description (string): IaC program description
- runtime (string): Runtime string of the IaC program
- testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- tests (string, list of strings): File paths of IaC program's tests

testing-files.csv:
- file (string): Testing file path
- language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
- techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
- program (string): Project file path of the testing file's IaC program

Dataset Creation

scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

Searching Repositories

The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. GitHub access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).

Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

Limitations

The script uses the GitHub code search API and inherits its limitations:
- Only forks with more stars than the parent repository are included.
- Only the repositories' default branches are considered.
- Only files smaller than 384 KB are searchable.
- Only repositories with fewer than 500,000 files are considered.
- Only repositories that have had activity or have been returned in search results in the last year are considered.

More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

Downloading Repositories

download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
2. Output directory to download the repositories to.
3. Name of the CSV output file.

The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
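As a quick orientation, here is a minimal sketch of combining the metadata tables with pandas; it assumes metadata.zip has been extracted into the working directory and that the columns match the schema described above.

```python
import pandas as pd

# Load the metadata tables described above (paths assume metadata.zip has been
# extracted into the current working directory).
repositories = pd.read_csv("repositories.csv")
programs = pd.read_csv("programs.csv")

# Keep repositories that were successfully cloned and are redistributable.
redistributable = repositories[repositories["downloaded"] & repositories["redistributable"]]

# Attach each IaC program to its repository and count programs per PL-IaC solution.
joined = programs.merge(
    redistributable,
    left_on="repository",
    right_on="ID",
    suffixes=("_program", "_repository"),
)
print(joined.groupby("solution")["ID_program"].count())
```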
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Amazon Bin Image Dataset contains 536,434 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations. The dataset provides the images together with their corresponding metadata.
The image files are divided into three groups according to their naming scheme.
The Amazon Bin Image Dataset File List dataset aims to provide a CSV file containing all file locations and item quantities, to support analysis and distributed learning.
This dataset is part of the paper "McPhraSy: Multi-Context Phrase Similarity and Clustering" by DN Cohen et al. (2022). The purpose of PCD is to evaluate the quality of semantic-based clustering of noun phrases. The phrases were collected from the Amazon Review Dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AMZScout is a website that offers data for helping you grow your small Amazon business. While AMZScout does offer an Excel export service, not only is it locked to their outrageously priced ($27.99+) paid tier but it is also suboptimal for Data Science and ML applications. Not being one to back down from a good old-fashioned web-scraping challenge, I developed a utility for dumping the data out of the browser. The data generated isn't perfect, there's far from enough to actually do any major work with it, and generating it takes the better part of a week (good luck not getting IP banned by both AMZScout & Amazon), but it should be more than enough to whet your appetite on some business data.
This dataset is deprecated in favor of the SmartScout dataset. Kaggle page coming soon.
| Column | Description |
|---|---|
| # | Rank relative to similar products (returned in the same query). Lower is better. |
| Thumbnail Image | Thumbnail image encoded in base64 with Content-Type flag included |
| Product Name | Product name |
| URL | ASIN URL |
| Description | Description |
| About this item | About this item |
| From the manufacturer | From the manufacturer (also known as the aplus description) |
| Brand | Brand |
| Category | Category (top level) |
| Product Score for PL | Arbitrary, see AMZScout |
| Product Score for Reselling | Arbitrary, see AMZScout |
| Number of Sellers | Number of unique sellers selling the product (i.e. "from other sources" on Amazon) |
| Rank | Rank in top-level category by sales |
| Price | Price as seen in the Buy Box |
| FBA Fees | FBA fees |
| Net Margin | Net margin |
| Est. Sales | Estimated sales |
| UPC | UPC |
| Est. Revenue | Estimated revenue |
| # of Reviews | Number of reviews |
| RPR | Lookup "RPR" |
| Rating | Rating |
| LQS | Lookup "LQS" |
| Weight | Weight, in pounds |
| Variants | Number of different variants (i.e. 16GB, 32GB) |
| Available From | Available from |
| Seller Type | Seller type, e.g. FBA |
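As a rough illustration of working with the export, the thumbnail column can be decoded from its data URI form; the file name below is hypothetical, and the exact data URI layout is assumed from the description above.

```python
import base64
import pandas as pd

# Hypothetical file name; use whatever name your own export produced.
df = pd.read_csv("amzscout_export.csv")

# The "Thumbnail Image" column is described as base64 with a Content-Type
# prefix (a data URI), e.g. "data:image/jpeg;base64,<payload>".
def decode_thumbnail(data_uri: str) -> bytes:
    header, _, payload = data_uri.partition(",")
    return base64.b64decode(payload)

first_image = decode_thumbnail(df.loc[0, "Thumbnail Image"])
with open("first_thumbnail.jpg", "wb") as fh:
    fh.write(first_image)
```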
If you use this dataset in your project, please consider citing it using this CITATION.cff. Thank you for your consideration.
If you would like to use the scripts to generate your own version of this dataset, see my GitHub repository https://github.com/regulad/amzscout-scrape. In short, you'll need a computer with any version of Python 3.11 and enough memory to run many headless Chrome, Edge, or Chromium windows.
This is a Parquet representation of the Open Targets Platform's latest export. The Open Targets Platform integrates evidence from genetics, genomics, transcriptomics, drugs, animal models, and scientific literature to score and rank target-disease associations for drug target identification. The Open Targets Platform (https://www.targetvalidation.org) is a freely available resource for the integration of genetics, genomics, and chemical data to aid systematic drug target identification and prioritisation. This dataset is 'Lakehouse Ready', meaning you can query this data in place straight out of the Registry of Open Data S3 bucket. Deploy this dataset's corresponding CloudFormation template to create the AWS Glue catalog entries in your account in about 30 seconds. That one step will enable you to write SQL with AWS Athena, build dashboards and charts with Amazon QuickSight, perform HPC with AWS EMR, or join into your AWS Redshift clusters. More detail in [the documentation](https://github.com/aws-samples/data-lake-as-code/blob/roda/README.md).
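As a rough sketch of querying in place from Python once the Glue catalog entries exist: the database and table names below are placeholders, not the names the template actually creates, and the awswrangler library is one option rather than part of this dataset.

```python
import awswrangler as wr

# Placeholder database/table names: substitute the entries that the dataset's
# CloudFormation template actually creates in your AWS Glue catalog.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM opentargets_associations LIMIT 10",
    database="opentargets",
)
print(df.head())
```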
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Note that you can download it quickly via the CLI. (Kaggle environment: 1 min 36 s, Colab: 1 min)
! kaggle datasets download williamhyun/amazon-bin-image-dataset-536434-images-224x224
The Amazon Bin Image Dataset contains 536,434 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations. The dataset provides the images together with their corresponding metadata.
The image files are divided into three groups according to their naming scheme.
The Amazon Bin Image Dataset (536,434 images, 224x224) dataset aims to provide resized image files and a full metadata SQLite file for Kaggle Kernel environments. You can download a single 4 GB archive file via the Download button on this page.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
I processed the data from the AWS CLI and boto3 repositories using the GitHub API. If the data is processed in a meaningful way, it could be a great way to help the maintainers of those repositories get better insights into the issues their customers are having by clustering them.
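As one hedged illustration of that idea, the issue texts could be grouped with a standard TF-IDF plus k-means pipeline; the file and field names below assume a simple JSON Lines export of issues and are not part of this dataset's documented schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical export: one row per GitHub issue with a free-text "body" column.
issues = pd.read_json("boto3_issues.jsonl", lines=True)

# Turn issue bodies into sparse TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
X = vectorizer.fit_transform(issues["body"].fillna(""))

# Group similar issues; the number of clusters is an arbitrary starting point.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
issues["cluster"] = kmeans.fit_predict(X)

print(issues.groupby("cluster").size().sort_values(ascending=False).head())
```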
Dataset Card for Dataset Name
Original dataset can be found on: https://amazon-reviews-2023.github.io/
Dataset Details
This dataset was downloaded from the link above; it is the meta dataset for the Sports and Outdoors category.
Dataset Description
This dataset is a refined version of the Amazon Sports and Outdoors 2023 meta dataset, which originally contained product metadata for sports and outdoors products that are sold on Amazon. The dataset includes detailed information… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/Amazon_Sports_and_Outdoors_2023.
Dataset Attribution
This dataset is derived from the original PersonPath22 dataset introduced in the ECCV 2022 paper:
Bing Shuai, Alessandro Bergamo, Uta Buechler, Andrew Berneshawi, Alyssa Boden, and Joseph Tighe. "Large Scale Real-world Multi-Person Tracking." European Conference on Computer Vision (ECCV), 2022.
The original dataset, including videos and annotations, is available at the PersonPath22 homepage (amazon-science.github.io).
In this version, I have transformed the dataset from its original video format into image sequences and reorganized it to align with the universal person tracking format, comprising directories such as bounding_box_train, bounding_box_test, and query.
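For a quick sanity check of the reorganized layout, a sketch along these lines could count the images per split; the root directory name and the .jpg extension are assumptions, not part of the dataset description above.

```python
from pathlib import Path

# Hypothetical root of the reorganized dataset; adjust to your local copy.
root = Path("personpath22_reid")

# Count images in each of the splits named above.
for split in ("bounding_box_train", "bounding_box_test", "query"):
    images = list((root / split).glob("*.jpg"))
    print(split, len(images))
```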
License
The original PersonPath22 dataset is distributed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This transformed version adheres to the same licensing terms.
DNA sequence data of UCE loci collected from the world's bird species (n=10,560).
The AWS Public Blockchain Data initiative provides free access to blockchain datasets through collaboration with data providers. The data is optimized for analytics by being transformed into compressed Parquet files, partitioned by date for efficient querying.
- s3://aws-public-blockchain/v1.0/btc/
- s3://aws-public-blockchain/v1.0/eth/
- s3://aws-public-blockchain/v1.1/sonarx/arbitrum/
- s3://aws-public-blockchain/v1.1/sonarx/aptos/
- s3://aws-public-blockchain/v1.1/sonarx/base/
- s3://aws-public-blockchain/v1.1/sonarx/provenance/
- s3://aws-public-blockchain/v1.1/sonarx/xrp/
- s3://aws-public-blockchain/v1.1/stellar/
- s3://aws-public-blockchain/v1.1/ton/
- s3://aws-public-blockchain/v1.1/cronos/

We welcome additional blockchain data providers to join this initiative. If you're interested in contributing datasets to the AWS Public Blockchain Data program, please contact our team at aws-public-blockchain@amazon.com.
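A minimal sketch of browsing one of the prefixes above with boto3, assuming anonymous access to the public bucket (the region is an assumption, not stated in this listing):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The bucket is public, so an unsigned (anonymous) client is sufficient.
# The region is an assumption; adjust if your requests are redirected.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-east-2")

# List a few of the date-partitioned Parquet objects under the Bitcoin prefix.
response = s3.list_objects_v2(
    Bucket="aws-public-blockchain",
    Prefix="v1.0/btc/",
    MaxKeys=10,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```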
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!
Intel Product Reviews Dataset
This dataset contains scraped reviews from Amazon for various Intel products. The data includes details such as product titles, review texts, star ratings, and other relevant attributes. It is ideal for sentiment analysis, natural language processing (NLP), and machine learning tasks related to consumer feedback analysis. Whether you're interested in understanding customer sentiments towards Intel products or building predictive models based on reviews, this dataset provides a rich resource for exploration and analysis.
Features:
- user_id: Unique identifier for the reviewer.
- asin: Amazon Standard Identification Number for the product.
- rating: Star rating given by the reviewer (1 to 5 stars).
- helpful_vote: Number of helpful votes received for the review.
- verified_purchase: Indicates if the purchase was verified (true/false).
- text: Textual content of the review.
Potential Use Cases:
- Sentiment analysis and opinion mining.
- Predictive modeling of customer satisfaction.
- Comparative analysis of different Intel products based on customer feedback.
- Text mining and feature extraction for NLP applications.
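A minimal starting point for such analysis, assuming the data is distributed as JSON Lines with the fields listed above (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; the column names follow the Features list above.
reviews = pd.read_json("intel_reviews.jsonl", lines=True)

# Compare ratings of verified and unverified purchases.
print(reviews.groupby("verified_purchase")["rating"].mean())

# Surface the most helpful reviews per product as a quick qualitative sample.
top_helpful = (
    reviews.sort_values("helpful_vote", ascending=False)
    .groupby("asin")
    .head(3)[["asin", "rating", "helpful_vote", "text"]]
)
print(top_helpful.head())
```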
Acknowledgments:
- The dataset was collected from publicly available reviews on Amazon (https://amazon-reviews-2023.github.io/).
Note: Please respect terms of use and ethical guidelines when using this dataset for research or commercial purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a subset of the cpg0016-jump dataset (Chandrasekaran et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/). The selected subset contains 517 images with four channels in the form of a single multi-channel tiff file. Gaussian noise is applied to each image to simulate detector noise. The GitHub repository with the details about the original dataset is available at https://github.com/jump-cellpainting/datasets. The preprint describing the original dataset is available on bioRxiv. AI4Life has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement number 101057970. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
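For reference, additive Gaussian noise of the kind described above could be simulated along these lines; this is a sketch only, with a hypothetical file name and an arbitrary noise level rather than the values used to produce this subset.

```python
import numpy as np
import tifffile

# Load one multi-channel TIFF (a channels x height x width layout is assumed).
image = tifffile.imread("example_multichannel.tif").astype(np.float32)

# Simulate detector noise with zero-mean additive Gaussian noise.
rng = np.random.default_rng(seed=0)
sigma = 0.01 * image.max()  # arbitrary noise level for illustration
noisy = image + rng.normal(loc=0.0, scale=sigma, size=image.shape)

tifffile.imwrite("example_multichannel_noisy.tif", noisy.astype(np.float32))
```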
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects Sign up for the gnomAD mailing list here. This dataset was derived from summary data from gnomAD release 3.1, available on the Registry of Open Data on AWS for ready enrollment into the Data Lake as Code.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This is a side project for my thesis "Classification/Clustering Techniques for Large Web Data Collections".
My main goal was to provide a new, enriched, ground-truth labeled dataset to the Machine Learning community. All labels have been collected by crawling/scraping Amazon.com over a period of several months. By labels I mean the categories in which the products are classified (see the green underlined labels in the screenshot below).
![Image](http://i.imgur.com/mAiuoO6.png)
Please, if you feel you can make any contribution that will improve this dataset, fork it on github.com.
The Amazon Movies Reviews dataset consists of 7,911,684 reviews left by Amazon users between August 1997 and October 2012.
Data format:
where:
All the collected data (for every ASIN of the SNAP Dataset, ~253k products for ~8m reviews) are stored in a csv file labels.csv in the following format:
The new data format will be:
You can follow the steps mentioned below on how to get the enriched dataset:
Download the original dataset from the SNAP website (~ 3.3 GB compressed) and put it in the root folder of the repository (where you can find also the labels.csv file).
Execute the Python file enrich.py (it is available in the GitHub project) so that the new enriched multi-labeled dataset is exported. The name of the new file should be output.txt.gz.
Notice: Please be patient as the python script will take a while to parse all these reviews.
The Python script generates a new compressed file that is identical to the original one, except for one extra feature (product/categories).
In fact, the Python script maps ASIN values between the two files and adds the product's label data to every review instance as an extra column.
Here is the code:
import gzip
import csv
import ast

def look_up(asin, diction):
    # Return the categories list for an ASIN, or an empty list if it is unknown.
    try:
        return diction[asin]
    except KeyError:
        return []

def load_labels():
    # Build a dictionary mapping each ASIN to its list of category labels.
    labels_dictionary = {}
    with open('labels.csv', mode='r') as infile:
        csvreader = csv.reader(infile)
        next(csvreader)  # skip the header row
        for rows in csvreader:
            labels_dictionary[rows[0]] = ast.literal_eval(rows[1])
    return labels_dictionary

def parse(filename):
    # Stream the SNAP reviews file and yield one review entry at a time,
    # enriched with the product/categories field from labels.csv.
    labels_dict = load_labels()
    f = gzip.open(filename, 'rt')  # text mode so lines are str, not bytes
    entry = {}
    for l in f:
        l = l.strip()
        colonPos = l.find(':')
        if colonPos == -1:
            # A blank line separates reviews; emit the completed entry.
            yield entry
            entry = {}
            continue
        eName = l[:colonPos]
        rest = l[colonPos+2:]
        entry[eName] = rest
        if eName == 'product/productId':
            entry['product/categories'] = look_up(rest, labels_dict)
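A minimal way to drive parse() and write out the enriched archive could look like the following sketch; the input file name is an assumption about the original SNAP archive, and the actual enrich.py may be organized differently.

```python
import gzip

# Hypothetical driver for parse(); adjust the input file name to match the
# SNAP reviews archive you downloaded.
with gzip.open('output.txt.gz', 'wt') as out:
    for entry in parse('movies.txt.gz'):
        for key, value in entry.items():
            # 'product/categories' is a Python list; it is written via its repr here.
            out.write(f"{key}: {value}\n")
        out.write("\n")
```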
NEW GOES-19 Data!! On April 4, 2025 at 1500 UTC, the GOES-19 satellite will be declared the Operational GOES-East satellite. All products and services, including NODD, for GOES-East will transition to GOES-19 data at that time. GOES-19 will operate out of the GOES-East location of 75.2°W starting on April 1, 2025 and through the operational transition. Until the transition time and during the final stretch of Post Launch Product Testing (PLPT), GOES-19 products are considered non-operational regardless of their validation maturity level. Shortly following the transition of GOES-19 to GOES-East, all data distribution from GOES-16 will be turned off. GOES-16 will drift to the storage location at 104.7°W. GOES-19 data should begin flowing again on April 4th once this maneuver is complete.
NEW GOES 16 Reprocess Data!! The reprocessed GOES-16 ABI L1b data mitigates systematic data issues (including data gaps and image artifacts) seen in the Operational products, and improves the stability of both the radiometric and geometric calibration over the course of the entire mission life. These data were produced by recomputing the L1b radiance products from input raw L0 data using improved calibration algorithms and look-up tables, derived from data analysis of the NIST-traceable, on-board sources. In addition, the reprocessed data products contain enhancements to the L1b file format, including limb pixels and pixel timestamps, while maintaining compatibility with the operational products. The datasets currently available span the operational life of GOES-16 ABI, from early 2018 through the end of 2024. The Reprocessed L1b dataset shows improvement over the Operational L1b products but may still contain data gaps or discrepancies. Please provide feedback to Dan Lindsey (dan.lindsey@noaa.gov) and Gary Lin (guoqing.lin-1@nasa.gov). More information can be found in the GOES-R ABI Reprocess User Guide.
NOTICE: As of January 10th 2023, GOES-18 assumed the GOES-West position and all data files are deemed both operational and provisional, so no "preliminary, non-operational" caveat is needed. GOES-17 is now offline, shifted approximately 105 degrees West, where it will be in on-orbit storage. GOES-17 data will no longer flow into the GOES-17 bucket. Operational GOES-West products can be found in the GOES-18 bucket.
GOES satellites (GOES-16, GOES-17, GOES-18 & GOES-19) provide continuous weather imagery and monitoring of meteorological and space environment data across North America. GOES satellites provide the kind of continuous monitoring necessary for intensive data analysis. They hover continuously over one position on the surface. The satellites orbit high enough to allow for a full-disc view of the Earth. Because they stay above a fixed spot on the surface, they provide a constant vigil for the atmospheric "triggers" for severe weather conditions such as tornadoes, flash floods, hailstorms, and hurricanes. When these conditions develop, the GOES satellites are able to monitor storm development and track their movements. SUVI products are available in both NetCDF and FITS formats.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the development of CLIPn, a contrastive-learning framework designed to align heterogeneous high-content screening (HCS) profile datasets. GitHub link: https://github.com/AltschulerWu-Lab/CLIPn

Directory Structure

Folders:
- raw_profiles
  - HCS13/: Contains raw data from 13 high-content screening (HCS) datasets. Each dataset includes meta and feature files.
  - L1000/
    - CDRP_feature_exp.csv: Raw L1000 expression data from the CDRP dataset.
    - CDRP_meta_exp.csv: Metadata associated with the CDRP expression data.
    - LINCS_feature_exp.csv: Raw L1000 expression data from the LINCS dataset.
    - LINCS_meta_exp.csv: Metadata associated with the LINCS expression data.
  - RxRx3/
    - RxRx3_feature_final.csv: Profile data from the RxRx3 dataset.
    - RxRx3_meta_final.csv: Metadata from the RxRx3 dataset.
  - Uncharacterized_compounds/
    - NCI_cpnData.csv: Feature data for uncharacterized compounds from the NCI dataset.
    - NCI_cpnInfo.csv: Information about uncharacterized compounds in the NCI dataset.
    - Prestwick_UTSW_cpnData.csv: Feature data for uncharacterized compounds from the Prestwick UTSW dataset.
    - Prestwick_UTSW_cpnInfo.csv: Information about uncharacterized compounds from the Prestwick UTSW dataset.

Data Reference

For the raw datasets from the 13 HCS databases: data and the analysis pipeline for dataset 1 were obtained from https://www.science.org/doi/suppl/10.1126/science.1100709/suppl_file/perlman.som.zip; for datasets 2-3, data were shared by the authors; for datasets 4-5, analysis code was downloaded from https://static-content.springer.com/esm/art:10.1038/nbt.3419/MediaObjects/41587_2016_BFnbt3419_MOESM21_ESM.zip and data were shared by the authors; for datasets 6-7, the processed dataset was downloaded from AWS following instructions from https://github.com/carpenter-singh-lab/2022_Haghighi_NatureMethods, and replicate_level_cp_normalized.csv.gz features were used; for datasets 8-13, datasets and analysis results were downloaded from https://zenodo.org/records/7352487. For RxRx3, the dataset was obtained from https://www.rxrx.ai/rxrx3. L1000 transcript datasets were downloaded using the same link as datasets 6-7, and the processed transcript data files (named "replicate_level_l1k.csv") were used.
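As a hedged illustration of loading one of the L1000 pairs listed above: the paths follow the directory structure described here, but the join key between the meta and feature tables is not documented in this listing, so rows are assumed to be aligned by position.

```python
import pandas as pd

# Paths follow the directory structure above.
features = pd.read_csv("raw_profiles/L1000/CDRP_feature_exp.csv")
meta = pd.read_csv("raw_profiles/L1000/CDRP_meta_exp.csv")

# Assumption: the meta and feature tables describe the same profiles row by row,
# so they can be placed side by side; adjust if an explicit key column exists.
profiles = pd.concat(
    [meta.reset_index(drop=True), features.reset_index(drop=True)], axis=1
)
print(profiles.shape)
```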