Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
The Registry of Open Data on AWS contains publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals. This dataset contains derived forms of the data in https://github.com/awslabs/open-data-registry that have been transformed for ease of use with machine interfaces. Currently, only the ndjson form of the registry is populated here.
This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs). In addition, this version provides the following features:
More reviews:
New reviews:
Metadata:
- We have added transaction metadata for each review shown on the review page.
If you publish articles based on this dataset, please cite the following paper:
Open Data Commons Attribution License (ODC-By) v1.0: https://www.opendatacommons.org/licenses/by/1.0/
License information was derived automatically
Programming Languages Infrastructure as Code (PL-IaC) enables IaC programs written in general-purpose programming languages like Python and TypeScript. The currently available PL-IaC solutions are Pulumi and the Cloud Development Kits (CDKs) of Amazon Web Services (AWS) and Terraform. This dataset provides metadata and initial analyses of all public GitHub repositories in August 2022 with an IaC program, including their programming languages, applied testing techniques, and licenses. Further, we provide a shallow copy of the head state of those 7104 repositories whose licenses permit redistribution. The dataset is available under the Open Data Commons Attribution License (ODC-By) v1.0. Contents:
- metadata.zip: The dataset metadata and analysis results as CSV files.
- scripts-and-logs.zip: Scripts and logs of the dataset creation.
- LICENSE: The Open Data Commons Attribution License (ODC-By) v1.0 text.
- README.md: This document.
- redistributable-repositiories.zip: Shallow copies of the head state of all redistributable repositories with an IaC program.

This artifact is part of the ProTI Infrastructure as Code testing project: https://proti-iac.github.io.

Metadata

The dataset's metadata comprises three tabular CSV files containing metadata about all analyzed repositories, IaC programs, and testing source code files.

repositories.csv:
- ID (integer): GitHub repository ID
- url (string): GitHub repository URL
- downloaded (boolean): Whether cloning the repository succeeded
- name (string): Repository name
- description (string): Repository description
- licenses (string, list of strings): Repository licenses
- redistributable (boolean): Whether the repository's licenses permit redistribution
- created (string, date & time): Time of the repository's creation
- updated (string, date & time): Time of the last update to the repository
- pushed (string, date & time): Time of the last push to the repository
- fork (boolean): Whether the repository is a fork
- forks (integer): Number of forks
- archive (boolean): Whether the repository is archived
- programs (string, list of strings): Project file path of each IaC program in the repository

programs.csv:
- ID (string): Project file path of the IaC program
- repository (integer): GitHub repository ID of the repository containing the IaC program
- directory (string): Path of the directory containing the IaC program's project file
- solution (string, enum): PL-IaC solution of the IaC program ("AWS CDK", "CDKTF", "Pulumi")
- language (string, enum): Programming language of the IaC program (enum values: "csharp", "go", "haskell", "java", "javascript", "python", "typescript", "yaml")
- name (string): IaC program name
- description (string): IaC program description
- runtime (string): Runtime string of the IaC program
- testing (string, list of enum): Testing techniques of the IaC program (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- tests (string, list of strings): File paths of IaC program's tests

testing-files.csv:
- file (string): Testing file path
- language (string, enum): Programming language of the testing file (enum values: "csharp", "go", "java", "javascript", "python", "typescript")
- techniques (string, list of enum): Testing techniques used in the testing file (enum values: "awscdk", "awscdk_assert", "awscdk_snapshot", "cdktf", "cdktf_snapshot", "cdktf_tf", "pulumi_crossguard", "pulumi_integration", "pulumi_unit", "pulumi_unit_mocking")
- keywords (string, list of enum): Keywords found in the testing file (enum values: "/go/auto", "/testing/integration", "@AfterAll", "@BeforeAll", "@Test", "@aws-cdk", "@aws-cdk/assert", "@pulumi.runtime.test", "@pulumi/", "@pulumi/policy", "@pulumi/pulumi/automation", "Amazon.CDK", "Amazon.CDK.Assertions", "Assertions_", "HashiCorp.Cdktf", "IMocks", "Moq", "NUnit", "PolicyPack(", "ProgramTest", "Pulumi", "Pulumi.Automation", "PulumiTest", "ResourceValidationArgs", "ResourceValidationPolicy", "SnapshotTest()", "StackValidationPolicy", "Testing", "Testing_ToBeValidTerraform(", "ToBeValidTerraform(", "Verifier.Verify(", "WithMocks(", "[Fact]", "[TestClass]", "[TestFixture]", "[TestMethod]", "[Test]", "afterAll(", "assertions", "automation", "aws-cdk-lib", "aws-cdk-lib/assert", "aws_cdk", "aws_cdk.assertions", "awscdk", "beforeAll(", "cdktf", "com.pulumi", "def test_", "describe(", "github.com/aws/aws-cdk-go/awscdk", "github.com/hashicorp/terraform-cdk-go/cdktf", "github.com/pulumi/pulumi", "integration", "junit", "pulumi", "pulumi.runtime.setMocks(", "pulumi.runtime.set_mocks(", "pulumi_policy", "pytest", "setMocks(", "set_mocks(", "snapshot", "software.amazon.awscdk.assertions", "stretchr", "test(", "testing", "toBeValidTerraform(", "toMatchInlineSnapshot(", "toMatchSnapshot(", "to_be_valid_terraform(", "unittest", "withMocks(")
- program (string): Project file path of the testing file's IaC program

Dataset Creation

scripts-and-logs.zip contains all scripts and logs of the creation of this dataset. In it, executions/executions.log documents the commands that generated this dataset in detail. On a high level, the dataset was created as follows:
1. A list of all repositories with a PL-IaC program configuration file was created using search-repositories.py (documented below). The execution took two weeks due to the non-deterministic nature of GitHub's REST API, causing excessive retries.
2. A shallow copy of the head of all repositories was downloaded using download-repositories.py (documented below).
3. Using analysis.ipynb, the repositories were analyzed for the programs' metadata, including the used programming languages and licenses.
4. Based on the analysis, all repositories with at least one IaC program and a redistributable license were packaged into redistributable-repositiories.zip, excluding any node_modules and .git directories.

Searching Repositories

The repositories are searched through search-repositories.py and saved in a CSV file. The script takes these arguments in the following order:
1. GitHub access token.
2. Name of the CSV output file.
3. Filename to search for.
4. File extensions to search for, separated by commas.
5. Min file size for the search (for all files: 0).
6. Max file size for the search or * for unlimited (for all files: *).

Pulumi projects have a Pulumi.yaml or Pulumi.yml (case-sensitive file name) file in their root folder, i.e., (3) is Pulumi and (4) is yml,yaml. https://www.pulumi.com/docs/intro/concepts/project/
AWS CDK projects have a cdk.json (case-sensitive file name) file in their root folder, i.e., (3) is cdk and (4) is json. https://docs.aws.amazon.com/cdk/v2/guide/cli.html
CDK for Terraform (CDKTF) projects have a cdktf.json (case-sensitive file name) file in their root folder, i.e., (3) is cdktf and (4) is json. https://www.terraform.io/cdktf/create-and-deploy/project-setup

Limitations

The script uses the GitHub code search API and inherits its limitations:
- Only forks with more stars than the parent repository are included.
- Only the repositories' default branches are considered.
- Only files smaller than 384 KB are searchable.
- Only repositories with fewer than 500,000 files are considered.
- Only repositories that have had activity or have been returned in search results in the last year are considered.

More details: https://docs.github.com/en/search-github/searching-on-github/searching-code

The results of the GitHub code search API are not stable. However, the generally more robust GraphQL API does not support searching for files in repositories: https://stackoverflow.com/questions/45382069/search-for-code-in-github-using-graphql-v4-api

Downloading Repositories

download-repositories.py downloads all repositories in CSV files generated through search-repositories.py and generates an overview CSV file of the downloads. The script takes these arguments in the following order:
1. Name of the repositories CSV files generated through search-repositories.py, separated by commas.
2. Output directory to download the repositories to.
3. Name of the CSV output file.

The script only downloads a shallow recursive copy of the HEAD of the repo, i.e., only the main branch's most recent state, including submodules, without the rest of the git history. Each repository is downloaded to a subfolder named by the repository's ID.
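As a quick orientation, here is a minimal sketch of combining the metadata tables with pandas; it assumes metadata.zip has been extracted into the working directory and that the columns match the schema described above.

```python
import pandas as pd

# Load the metadata tables described above (paths assume metadata.zip has been
# extracted into the current working directory).
repositories = pd.read_csv("repositories.csv")
programs = pd.read_csv("programs.csv")

# Keep repositories that were successfully cloned and are redistributable.
redistributable = repositories[repositories["downloaded"] & repositories["redistributable"]]

# Attach each IaC program to its repository and count programs per PL-IaC solution.
joined = programs.merge(
    redistributable,
    left_on="repository",
    right_on="ID",
    suffixes=("_program", "_repository"),
)
print(joined.groupby("solution")["ID_program"].count())
```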
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
The Amazon Bin Image Dataset contains 536,434 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations. The dataset provides the images together with their corresponding metadata.
The image files are divided into three groups according to their naming scheme.
The Amazon Bin Image Dataset File List dataset aims to provide a CSV file containing all file locations and item quantities, to support analysis and distributed learning.
This dataset is part of the paper "McPhraSy: Multi-Context Phrase Similarity and Clustering" by DN Cohen et al. (2022). The purpose of PCD is to evaluate the quality of semantic-based clustering of noun phrases. The phrases were collected from the Amazon Review Dataset.
Apache License, v2.0: https://www.apache.org/licenses/LICENSE-2.0
License information was derived automatically
AMZScout is a website that offers data for helping you grow your small Amazon business. While AMZScout does offer an Excel export service, not only is it locked to their outrageously priced ($27.99+) paid tier but it is also suboptimal for Data Science and ML applications. Not being one to back down from a good old-fashioned web-scraping challenge, I developed a utility for dumping the data out of the browser. The data generated isn't perfect, there's far from enough to actually do any major work with it, and generating it takes the better part of a week (good luck not getting IP banned by both AMZScout & Amazon), but it should be more than enough to whet your appetite on some business data.
This dataset is deprecated in favor of the SmartScout dataset. Kaggle page coming soon.
| Column | Description |
|---|---|
| # | Rank relative to similar products (returned in the same query). Lower is better. |
| Thumbnail Image | Thumbnail image encoded in base64 with Content-Type flag included |
| Product Name | Product name |
| URL | ASIN URL |
| Description | Description |
| About this item | About this item |
| From the manufacturer | From the manufacturer (also known as the aplus description) |
| Brand | Brand |
| Category | Category (top level) |
| Product Score for PL | Arbitrary, see AMZScout |
| Product Score for Reselling | Arbitrary, see AMZScout |
| Number of Sellers | Number of unique sellers selling the product (i.e. "from other sources" on Amazon) |
| Rank | Rank in top-level category by sales |
| Price | Price as seen in the Buy Box |
| FBA Fees | FBA fees |
| Net Margin | Net margin |
| Est. Sales | Estimated sales |
| UPC | UPC |
| Est. Revenue | Estimated revenue |
| # of Reviews | Number of reviews |
| RPR | Lookup "RPR" |
| Rating | Rating |
| LQS | Lookup "LQS" |
| Weight | Weight, in pounds |
| Variants | Number of different variants (i.e. 16GB, 32GB) |
| Available From | Available from |
| Seller Type | Seller type, e.g. FBA |
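As a rough illustration of working with the export, the thumbnail column can be decoded from its data URI form; the file name below is hypothetical, and the exact data URI layout is assumed from the description above.

```python
import base64
import pandas as pd

# Hypothetical file name; use whatever name your own export produced.
df = pd.read_csv("amzscout_export.csv")

# The "Thumbnail Image" column is described as base64 with a Content-Type
# prefix (a data URI), e.g. "data:image/jpeg;base64,<payload>".
def decode_thumbnail(data_uri: str) -> bytes:
    header, _, payload = data_uri.partition(",")
    return base64.b64decode(payload)

first_image = decode_thumbnail(df.loc[0, "Thumbnail Image"])
with open("first_thumbnail.jpg", "wb") as fh:
    fh.write(first_image)
```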
If you use this dataset in your project, please consider citing it using this CITATION.cff. Thank you for your consideration.
If you would like to use the scripts to generate your own version of this dataset, see my GitHub repository https://github.com/regulad/amzscout-scrape. In short, you'll need a computer with any version of Python 3.11 and enough memory to run many headless Chrome, Edge, or Chromium windows.
This is a Parquet representation of the Open Targets Platform's latest export. The Open Targets Platform integrates evidence from genetics, genomics, transcriptomics, drugs, animal models, and scientific literature to score and rank target-disease associations for drug target identification. The Open Targets Platform (https://www.targetvalidation.org) is a freely available resource for the integration of genetics, genomics, and chemical data to aid systematic drug target identification and prioritisation. This dataset is 'Lakehouse Ready', meaning you can query this data in place straight out of the Registry of Open Data S3 bucket. Deploy this dataset's corresponding CloudFormation template to create the AWS Glue catalog entries in your account in about 30 seconds. That one step will enable you to write SQL with AWS Athena, build dashboards and charts with Amazon QuickSight, perform HPC with AWS EMR, or join into your AWS Redshift clusters. More detail in [the documentation](https://github.com/aws-samples/data-lake-as-code/blob/roda/README.md).
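As a rough sketch of querying in place from Python once the Glue catalog entries exist: the database and table names below are placeholders, not the names the template actually creates, and the awswrangler library is one option rather than part of this dataset.

```python
import awswrangler as wr

# Placeholder database/table names: substitute the entries that the dataset's
# CloudFormation template actually creates in your AWS Glue catalog.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM opentargets_associations LIMIT 10",
    database="opentargets",
)
print(df.head())
```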
Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
License information was derived automatically
Note that you can download it quickly via the CLI. (Kaggle environment: 1 min 36 s, Colab: 1 min)
! kaggle datasets download williamhyun/amazon-bin-image-dataset-536434-images-224x224
The Amazon Bin Image Dataset contains 536,434 images and metadata from bins of a pod in an operating Amazon Fulfillment Center. The bin images in this dataset are captured as robot units carry pods as part of normal Amazon Fulfillment Center operations. The dataset provides the images together with their corresponding metadata.
The image files are divided into three groups according to their naming scheme.
The Amazon Bin Image Dataset (536,434 images, 224x224) dataset aims to provide resized image files and a full metadata SQLite file for Kaggle Kernel environments. You can download a single 4 GB archive file via the Download button on this page.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
I processed the data from the AWS CLI and boto3 repositories using the GitHub API. If the data is processed in a meaningful way, it could be a great way to help the maintainers of those repositories get better insights into the issues their customers are having by clustering them.
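As one hedged illustration of that idea, the issue texts could be grouped with a standard TF-IDF plus k-means pipeline; the file and field names below assume a simple JSON Lines export of issues and are not part of this dataset's documented schema.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical export: one row per GitHub issue with a free-text "body" column.
issues = pd.read_json("boto3_issues.jsonl", lines=True)

# Turn issue bodies into sparse TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
X = vectorizer.fit_transform(issues["body"].fillna(""))

# Group similar issues; the number of clusters is an arbitrary starting point.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
issues["cluster"] = kmeans.fit_predict(X)

print(issues.groupby("cluster").size().sort_values(ascending=False).head())
```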
Dataset Card for Dataset Name
Original dataset can be found on: https://amazon-reviews-2023.github.io/
Dataset Details
This dataset was downloaded from the link above; it is the meta dataset for the Sports and Outdoors category.
Dataset Description
This dataset is a refined version of the Amazon Sports and Outdoors 2023 meta dataset, which originally contained product metadata for sports and outdoors products that are sold on Amazon. The dataset includes detailed information… See the full description on the dataset page: https://huggingface.co/datasets/smartcat/Amazon_Sports_and_Outdoors_2023.
Dataset Attribution
This dataset is derived from the original PersonPath22 dataset introduced in the ECCV 2022 paper:
Bing Shuai, Alessandro Bergamo, Uta Buechler, Andrew Berneshawi, Alyssa Boden, and Joseph Tighe. "Large Scale Real-world Multi-Person Tracking." European Conference on Computer Vision (ECCV), 2022.
The original dataset, including videos and annotations, is available at the PersonPath22 homepage (amazon-science.github.io).
In this version, I have transformed the dataset from its original video format into image sequences and reorganized it to align with the universal person tracking format, comprising directories such as bounding_box_train, bounding_box_test, and query.
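For a quick sanity check of the reorganized layout, a sketch along these lines could count the images per split; the root directory name and the .jpg extension are assumptions, not part of the dataset description above.

```python
from pathlib import Path

# Hypothetical root of the reorganized dataset; adjust to your local copy.
root = Path("personpath22_reid")

# Count images in each of the splits named above.
for split in ("bounding_box_train", "bounding_box_test", "query"):
    images = list((root / split).glob("*.jpg"))
    print(split, len(images))
```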
License
The original PersonPath22 dataset is distributed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). This transformed version adheres to the same licensing terms.
DNA sequence data of UCE loci collected from the world's bird species (n=10,560).
The AWS Public Blockchain Data initiative provides free access to blockchain datasets through collaboration with data providers. The data is optimized for analytics by being transformed into compressed Parquet files, partitioned by date for efficient querying.
- s3://aws-public-blockchain/v1.0/btc/
- s3://aws-public-blockchain/v1.0/eth/
- s3://aws-public-blockchain/v1.1/sonarx/arbitrum/
- s3://aws-public-blockchain/v1.1/sonarx/aptos/
- s3://aws-public-blockchain/v1.1/sonarx/base/
- s3://aws-public-blockchain/v1.1/sonarx/provenance/
- s3://aws-public-blockchain/v1.1/sonarx/xrp/
- s3://aws-public-blockchain/v1.1/stellar/
- s3://aws-public-blockchain/v1.1/ton/
- s3://aws-public-blockchain/v1.1/cronos/

We welcome additional blockchain data providers to join this initiative. If you're interested in contributing datasets to the AWS Public Blockchain Data program, please contact our team at aws-public-blockchain@amazon.com.
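A minimal sketch of browsing one of the prefixes above with boto3, assuming anonymous access to the public bucket (the region is an assumption, not stated in this listing):

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# The bucket is public, so an unsigned (anonymous) client is sufficient.
# The region is an assumption; adjust if your requests are redirected.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED), region_name="us-east-2")

# List a few of the date-partitioned Parquet objects under the Bitcoin prefix.
response = s3.list_objects_v2(
    Bucket="aws-public-blockchain",
    Prefix="v1.0/btc/",
    MaxKeys=10,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```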
Unlock the power of ready-to-use data sourced from developer communities and repositories with Developer Community and Code Datasets.
Data Sources:
GitHub: Access comprehensive data about GitHub repositories, developer profiles, contributions, issues, social interactions, and more.
StackShare: Receive information about companies, their technology stacks, reviews, tools, services, trends, and more.
DockerHub: Dive into data from container images, repositories, developer profiles, contributions, usage statistics, and more.
Developer Community and Code Datasets are a treasure trove of public data points gathered from tech communities and code repositories across the web.
With our datasets, you'll receive:
Choose from various output formats, storage options, and delivery frequencies:
Why choose our Datasets?
Fresh and accurate data: Access complete, clean, and structured data from scraping professionals, ensuring the highest quality.
Time and resource savings: Let us handle data extraction and processing cost-effectively, freeing your resources for strategic tasks.
Customized solutions: Share your unique data needs, and we'll tailor our data harvesting approach to fit your requirements perfectly.
Legal compliance: Partner with a trusted leader in ethical data collection. Oxylabs is trusted by Fortune 500 companies and adheres to GDPR and CCPA standards.
Pricing Options:
Standard Datasets: choose from various ready-to-use datasets with standardized data schemas, priced from $1,000/month.
Custom Datasets: Tailor datasets from any public web domain to your unique business needs. Contact our sales team for custom pricing.
Experience a seamless journey with Oxylabs:
Empower your data-driven decisions with Oxylabs Developer Community and Code Datasets!
Intel Product Reviews Dataset
This dataset contains scraped reviews from Amazon for various Intel products. The data includes details such as product titles, review texts, star ratings, and other relevant attributes. It is ideal for sentiment analysis, natural language processing (NLP), and machine learning tasks related to consumer feedback analysis. Whether you're interested in understanding customer sentiments towards Intel products or building predictive models based on reviews, this dataset provides a rich resource for exploration and analysis.
Features:
- user_id: Unique identifier for the reviewer.
- asin: Amazon Standard Identification Number for the product.
- rating: Star rating given by the reviewer (1 to 5 stars).
- helpful_vote: Number of helpful votes received for the review.
- verified_purchase: Indicates if the purchase was verified (true/false).
- text: Textual content of the review.
Potential Use Cases:
- Sentiment analysis and opinion mining.
- Predictive modeling of customer satisfaction.
- Comparative analysis of different Intel products based on customer feedback.
- Text mining and feature extraction for NLP applications.
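A minimal starting point for such analysis, assuming the data is distributed as JSON Lines with the fields listed above (the file name is hypothetical):

```python
import pandas as pd

# Hypothetical file name; the column names follow the Features list above.
reviews = pd.read_json("intel_reviews.jsonl", lines=True)

# Compare ratings of verified and unverified purchases.
print(reviews.groupby("verified_purchase")["rating"].mean())

# Surface the most helpful reviews per product as a quick qualitative sample.
top_helpful = (
    reviews.sort_values("helpful_vote", ascending=False)
    .groupby("asin")
    .head(3)[["asin", "rating", "helpful_vote", "text"]]
)
print(top_helpful.head())
```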
Acknowledgments:
- The dataset was collected from publicly available reviews on Amazon (https://amazon-reviews-2023.github.io/).
Note: Please respect terms of use and ethical guidelines when using this dataset for research or commercial purposes.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This is a subset of the cpg0016-jump dataset (Chandrasekaran et al., 2022), available from the Cell Painting Gallery on the Registry of Open Data on AWS (https://registry.opendata.aws/cellpainting-gallery/). The selected subset contains 517 images with four channels in the form of a single multi-channel tiff file. Gaussian noise is applied to each image to simulate detector noise. The GitHub repository with the details about the original dataset is available at https://github.com/jump-cellpainting/datasets. The preprint describing the original dataset is available on bioRxiv. AI4Life has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement number 101057970. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
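For reference, additive Gaussian noise of the kind described above could be simulated along these lines; this is a sketch only, with a hypothetical file name and an arbitrary noise level rather than the values used to produce this subset.

```python
import numpy as np
import tifffile

# Load one multi-channel TIFF (a channels x height x width layout is assumed).
image = tifffile.imread("example_multichannel.tif").astype(np.float32)

# Simulate detector noise with zero-mean additive Gaussian noise.
rng = np.random.default_rng(seed=0)
sigma = 0.01 * image.max()  # arbitrary noise level for illustration
noisy = image + rng.normal(loc=0.0, scale=sigma, size=image.shape)

tifffile.imwrite("example_multichannel_noisy.tif", noisy.astype(np.float32))
```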
MIT License: https://opensource.org/licenses/MIT
License information was derived automatically
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects Sign up for the gnomAD mailing list here. This dataset was derived from summary data from gnomAD release 3.1, available on the Registry of Open Data on AWS for ready enrollment into the Data Lake as Code.
CC0 1.0 Universal (Public Domain Dedication): https://creativecommons.org/publicdomain/zero/1.0/
This is a side project for my thesis "Classification/Clustering Techniques for Large Web Data Collections".
My main goal was to provide a new, enriched, ground-truth labeled dataset to the Machine Learning community. All labels have been collected by crawling/scraping Amazon.com over a period of several months. By labels I mean the categories in which the products are classified (see the green underlined labels in the screenshot below).
![Image](http://i.imgur.com/mAiuoO6.png)
Please, if you feel you can make any contribution that will improve this dataset, fork it on github.com.
The Amazon Movies Reviews dataset consists of 7,911,684 reviews left by Amazon users between August 1997 and October 2012.
Data format:
where:
All the collected data (for every ASIN of the SNAP Dataset, ~253k products for ~8m reviews) are stored in a csv file labels.csv in the following format:
The new data format will be:
You can follow the steps mentioned below on how to get the enriched dataset:
Download the original dataset from the SNAP website (~ 3.3 GB compressed) and put it in the root folder of the repository (where you can find also the labels.csv file).
Execute the Python file enrich.py (it is available in the GitHub project) so that the new enriched multi-labeled dataset is exported. The name of the new file should be output.txt.gz.
Notice: Please be patient as the python script will take a while to parse all these reviews.
The Python script generates a new compressed file that is identical to the original one, except for one extra feature (product/categories).
In fact, the Python script maps ASIN values between the two files and adds the product's label data to every review instance as an extra column.
Here is the code:
import gzip
import csv
import ast

def look_up(asin, diction):
    # Return the categories list for an ASIN, or an empty list if it is unknown.
    try:
        return diction[asin]
    except KeyError:
        return []

def load_labels():
    # Build a dictionary mapping each ASIN to its list of category labels.
    labels_dictionary = {}
    with open('labels.csv', mode='r') as infile:
        csvreader = csv.reader(infile)
        next(csvreader)  # skip the header row
        for rows in csvreader:
            labels_dictionary[rows[0]] = ast.literal_eval(rows[1])
    return labels_dictionary

def parse(filename):
    # Stream the SNAP reviews file and yield one review entry at a time,
    # enriched with the product/categories field from labels.csv.
    labels_dict = load_labels()
    f = gzip.open(filename, 'rt')  # text mode so lines are str, not bytes
    entry = {}
    for l in f:
        l = l.strip()
        colonPos = l.find(':')
        if colonPos == -1:
            # A blank line separates reviews; emit the completed entry.
            yield entry
            entry = {}
            continue
        eName = l[:colonPos]
        rest = l[colonPos+2:]
        entry[eName] = rest
        if eName == 'product/productId':
            entry['product/categories'] = look_up(rest, labels_dict)
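A minimal way to drive parse() and write out the enriched archive could look like the following sketch; the input file name is an assumption about the original SNAP archive, and the actual enrich.py may be organized differently.

```python
import gzip

# Hypothetical driver for parse(); adjust the input file name to match the
# SNAP reviews archive you downloaded.
with gzip.open('output.txt.gz', 'wt') as out:
    for entry in parse('movies.txt.gz'):
        for key, value in entry.items():
            # 'product/categories' is a Python list; it is written via its repr here.
            out.write(f"{key}: {value}\n")
        out.write("\n")
```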
NEW GOES-19 Data!! On April 4, 2025 at 1500 UTC, the GOES-19 satellite will be declared the Operational GOES-East satellite. All products and services, including NODD, for GOES-East will transition to GOES-19 data at that time. GOES-19 will operate out of the GOES-East location of 75.2°W starting on April 1, 2025 and through the operational transition. Until the transition time and during the final stretch of Post Launch Product Testing (PLPT), GOES-19 products are considered non-operational regardless of their validation maturity level. Shortly following the transition of GOES-19 to GOES-East, all data distribution from GOES-16 will be turned off. GOES-16 will drift to the storage location at 104.7°W. GOES-19 data should begin flowing again on April 4th once this maneuver is complete.
NEW GOES 16 Reprocess Data!! The reprocessed GOES-16 ABI L1b data mitigates systematic data issues (including data gaps and image artifacts) seen in the Operational products, and improves the stability of both the radiometric and geometric calibration over the course of the entire mission life. These data were produced by recomputing the L1b radiance products from input raw L0 data using improved calibration algorithms and look-up tables, derived from data analysis of the NIST-traceable, on-board sources. In addition, the reprocessed data products contain enhancements to the L1b file format, including limb pixels and pixel timestamps, while maintaining compatibility with the operational products. The datasets currently available span the operational life of GOES-16 ABI, from early 2018 through the end of 2024. The Reprocessed L1b dataset shows improvement over the Operational L1b products but may still contain data gaps or discrepancies. Please provide feedback to Dan Lindsey (dan.lindsey@noaa.gov) and Gary Lin (guoqing.lin-1@nasa.gov). More information can be found in the GOES-R ABI Reprocess User Guide.
NOTICE: As of January 10th 2023, GOES-18 assumed the GOES-West position and all data files are deemed both operational and provisional, so no "preliminary, non-operational" caveat is needed. GOES-17 is now offline, shifted approximately 105 degrees West, where it will be in on-orbit storage. GOES-17 data will no longer flow into the GOES-17 bucket. Operational GOES-West products can be found in the GOES-18 bucket.
GOES satellites (GOES-16, GOES-17, GOES-18 & GOES-19) provide continuous weather imagery and monitoring of meteorological and space environment data across North America. GOES satellites provide the kind of continuous monitoring necessary for intensive data analysis. They hover continuously over one position on the surface. The satellites orbit high enough to allow for a full-disc view of the Earth. Because they stay above a fixed spot on the surface, they provide a constant vigil for the atmospheric "triggers" for severe weather conditions such as tornadoes, flash floods, hailstorms, and hurricanes. When these conditions develop, the GOES satellites are able to monitor storm development and track their movements. SUVI products are available in both NetCDF and FITS formats.
Attribution 4.0 (CC BY 4.0): https://creativecommons.org/licenses/by/4.0/
License information was derived automatically
This dataset supports the development of CLIPn, a contrastive-learning framework designed to align heterogeneous high-content screening (HCS) profile datasets. GitHub link: https://github.com/AltschulerWu-Lab/CLIPn

Directory Structure

Folders:
- raw_profiles
  - HCS13/: Contains raw data from 13 high-content screening (HCS) datasets. Each dataset includes meta and feature files.
  - L1000/
    - CDRP_feature_exp.csv: Raw L1000 expression data from the CDRP dataset.
    - CDRP_meta_exp.csv: Metadata associated with the CDRP expression data.
    - LINCS_feature_exp.csv: Raw L1000 expression data from the LINCS dataset.
    - LINCS_meta_exp.csv: Metadata associated with the LINCS expression data.
  - RxRx3/
    - RxRx3_feature_final.csv: Profile data from the RxRx3 dataset.
    - RxRx3_meta_final.csv: Metadata from the RxRx3 dataset.
  - Uncharacterized_compounds/
    - NCI_cpnData.csv: Feature data for uncharacterized compounds from the NCI dataset.
    - NCI_cpnInfo.csv: Information about uncharacterized compounds in the NCI dataset.
    - Prestwick_UTSW_cpnData.csv: Feature data for uncharacterized compounds from the Prestwick UTSW dataset.
    - Prestwick_UTSW_cpnInfo.csv: Information about uncharacterized compounds from the Prestwick UTSW dataset.

Data Reference

For the raw datasets from the 13 HCS databases: data and the analysis pipeline for dataset 1 were obtained from https://www.science.org/doi/suppl/10.1126/science.1100709/suppl_file/perlman.som.zip; for datasets 2-3, data were shared by the authors; for datasets 4-5, analysis code was downloaded from https://static-content.springer.com/esm/art:10.1038/nbt.3419/MediaObjects/41587_2016_BFnbt3419_MOESM21_ESM.zip and data were shared by the authors; for datasets 6-7, the processed dataset was downloaded from AWS following instructions from https://github.com/carpenter-singh-lab/2022_Haghighi_NatureMethods, and replicate_level_cp_normalized.csv.gz features were used; for datasets 8-13, datasets and analysis results were downloaded from https://zenodo.org/records/7352487. For RxRx3, the dataset was obtained from https://www.rxrx.ai/rxrx3. L1000 transcript datasets were downloaded using the same link as datasets 6-7, and the processed transcript data files (named "replicate_level_l1k.csv") were used.
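As a hedged illustration of loading one of the L1000 pairs listed above: the paths follow the directory structure described here, but the join key between the meta and feature tables is not documented in this listing, so rows are assumed to be aligned by position.

```python
import pandas as pd

# Paths follow the directory structure above.
features = pd.read_csv("raw_profiles/L1000/CDRP_feature_exp.csv")
meta = pd.read_csv("raw_profiles/L1000/CDRP_meta_exp.csv")

# Assumption: the meta and feature tables describe the same profiles row by row,
# so they can be placed side by side; adjust if an explicit key column exists.
profiles = pd.concat(
    [meta.reset_index(drop=True), features.reset_index(drop=True)], axis=1
)
print(profiles.shape)
```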